[Libreoffice-commits] core.git: sal/qa sal/textenc sw/qa writerfilter/source

Stephan Bergmann sbergman at redhat.com
Sun Sep 24 11:54:36 UTC 2017


 sal/qa/rtl/textenc/rtl_textcvt.cxx             |    1 +
 sal/textenc/tencinfo.cxx                       |    2 ++
 sw/qa/extras/rtfexport/rtfexport2.cxx          |    2 +-
 writerfilter/source/rtftok/rtfdocumentimpl.cxx |    6 +++++-
 4 files changed, 9 insertions(+), 2 deletions(-)

New commits:
commit 53b96765c555146e5c6a3a614420bfeeebc92b58
Author: Stephan Bergmann <sbergman at redhat.com>
Date:   Tue Sep 19 16:26:24 2017 +0200

    Map Windows code page 42 to RTL_TEXTENCODING_SYMBOL
    
    <https://msdn.microsoft.com/en-us/library/windows/desktop/
    dd374130(v=vs.85).aspx> "WideCharToMultiByte function" suggests that there now
    is CP_SYMBOL, "Windows 2000: Symbol code page (42)."  And a little test program
    on Windows indicates that our RTL_TEXTENCODING_SYMBOL is working the same way as
    CP_SYMBOL, where MultiByteToWideChar maps 00..1F to U+0000..1F and 20..FF to
    U+F020..F0FF.
    
    At least CppunitTest_writerfilter_rtftok, when testing
    writerfilter/qa/cppunittests/rtftok/data/pass/EDB-18940-1.rtf, goes into case
    RTF_FCHARSET in RTFDocumentImpl::dispatchValue
    (writerfilter/source/rtftok/rtfdispatchvalue.cxx) with nParam matching
    aRTFEncodings[2] (i.e., a mapping from charset 2 to codepage 42, see
    writerfilter/source/rtftok/rtfcharsets.cxx), then passes 42 into
    rtl_getTextEncodingFromWindowsCodePage and obtains an unhelpful
    RTL_TEXTENCODING_DONTKNOW.
    
    testFdo72031 (sw/qa/extras/rtfexport/rtfexport2.cxx, CppunitTest_sw_rtfexport2)
    needed to be adapted, as the circled plus from the Symbol font is now internally
    represented as U+F0C5, not (somewhat bogusly) as U+00C5 (aka LATIN CAPTIAL
    LETTER A WITH RING ABOVE).  But, when displayed with the Symobl font, the glyph
    that is actually shown remains the circled plus.
    
    Turns out changing rtl_getTextEncodingFromWindowsCodePage would start to make
    CppunitTest_sw_rtfimport fail:
    
      Sep 20 15:49:24 <sberg> vmiklos, with
       <https://gerrit.libreoffice.org/#/c/42477/>, testN823675
       (sw/qa/extras/rtfimport/rtfimport.cxx) fails, the aFont.Name is not "Symbol";
       sw/qa/extras/rtfimport/data/n823675.rtf contains a \fonttbl that specifies
       \f3 to have \fcharset2 (i.e., symbol font) and fontname "Symbol".  However,
       RTFDocumentImpl::checkUnicode
       (writerfilter/source/rtftok/rtfdocumentimpl.cxx)
       converts m_aHexBuffer (containing "Symbol;") with nCurrentEncoding apparently
       being the encoding specified by \fcharset2 (i.e., now RTL_TEXTENCODING_SYMBOL
       instead of old RTL_TEXTENCODING_DONTKNOW), so the resulting OUString is
       garbage
       (instead of the byte-for-byte conversion to Unicode "Symbol;" that
       RTL_TEXTENCODING_DONTKNOW would do there); do you know whether such \fonttbl
       fontnames should actually be interpreted in the given \fcharset?
      Sep 20 15:49:24 <IZBot> gerrit: »Map Windows code page 42 to
       RTL_TEXTENCODING_SYMBOL« by Stephan Bergmann for master [NEW]
      Sep 20 15:51:15 <vmiklos> sberg: let me check if the spec covers that
      Sep 20 15:54:29 <mst_> sberg: i think the name is typically encoded in the
       font's encoding but probably they have to make a (likely undocumented)
       exception for symbol encoding
      Sep 20 15:57:46 <vmiklos> sberg: the spec only says that \fcharset is about
       the encoding of the content using that font, i don't see it described what
       would be the encoding of the font name itself
      Sep 20 15:58:51 <vmiklos> sberg: i'm not sure about if that encoding should or
       should not affect the encoding of the font name in general, but indeed at
       least for 2 (symbol encoding) you're right, Word doesn't encoding the font
       name with that encoding, either.
      Sep 20 15:59:30 <sberg> vmiklos, mst_, at the top of page 14 of
       Word2007RTFSpec9.docx I see "Note that runs of text marked with a particular
       font index (see \fN in the Font Table section) use the codepage for that font
       as given by \cpgN or implied by \fcharsetN, unless they use Unicode RTF
       described in the following section."  Would that match what mst_ says?
      Sep 20 15:59:33 <vmiklos> so if it helps you case to handle at as e.g. ascii,
       just for that encoding, i think there would be no problem with that.
      Sep 20 16:00:07 <vmiklos> sberg: that still talks about the content using the
       font, not the strings (font names) in the font table itself, i think.
      Sep 20 16:01:17 <sberg> vmiklos, what's the control word to select such a
       font, also \fN?  I don't see any such in n823675.rtf
      Sep 20 16:02:16 <mikekaganski> loircbot: e.g. \af3
      Sep 20 16:02:31 <mikekaganski> sberg: ^
      Sep 20 16:02:47 <mst_> 04d5a280beeeb6e056df68395dc9c3b3a674361b
      Sep 20 16:02:50 <IZBot> core - related: fdo#77979: writerfilter RTF import:
       read encoded font name -
       http://cgit.freedesktop.org/libreoffice/core/commit/?id=04d5a280beeeb6e056df68395dc9c3b3a674361b
      Sep 20 16:02:52 <mst_> sberg: ^
      Sep 20 16:04:05 <sberg> mst_, thanks; so there's likely an (implicit?)
       exception for \fcharset2, as you say
      Sep 20 16:04:33 <mst_> that's most plausible, our own font code is full of
       exceptions for "symbol fonts" too
      Sep 20 16:05:19 <sberg> mikekaganski, ENOCONTEXT
      Sep 20 16:05:36 <mikekaganski> sberg: [17:01:16] sberg: vmiklos, what's the
       control word to select such a font, also \fN?  I don't see any such in
       n823675.rtf
      Sep 20 16:06:32 <sberg> mikekaganski, so you say selection is done with \af3
       instead of \f3?
      Sep 20 16:06:40 <mikekaganski> sberg: yes, in that case
      Sep 20 16:07:34 <mst_> i think there are several different keywords that apply
       fonts, but can't remember the whole list
      Sep 20 16:08:10 <mst_> \fN shoudl be one of them though
      Sep 20 16:22:18 <sberg> vmiklos, so who generated that
       sw/qa/extras/rtfimport/data/n823675.rtf, was it manually created and lacks a
       \cpgN before "Symbol"?
      Sep 20 16:29:17 <sberg> vmiklos, (after further reading of the RTF spec):
       disregard the "and lacks a \cpgN before 'Symbol'" part of my above question
      Sep 20 16:30:27 <mst_> sberg: i suggest not reading too much about encoding in
       RTF, it gets pretty lovecraftian pretty fast...
      Sep 20 16:32:58 <vmiklos> sberg: given how short that bugdoc is, i'm pretty
       sure i cut it down manually to something readable from a multi-MB real bugdoc
      Sep 20 16:33:07 <sberg> mst_, do you have a recommendation how I could get
       that "don't use symbol font encoding to read a symbol font's name" into
       writerfilter/source/rtftok/rtfdocumentimpl.cxx?
       RTFDocumentImpl::checkUnicode lacks the context to tell whether it is using
       m_aStates.top().nCurrentEncoding to convert a fontname, and the caller of
       checkUnicode (at the end of RTFDocumentImpl::resolveChars in this case)
       appears to lack the context, too
      Sep 20 16:33:12 <mst_> various Old Ones from The Time Before Unicode and their
       Backward Compatibility Tentacles etc.
      Sep 20 16:34:59 <sberg> vmiklos, anyway, that "so there's likely an
       (implicit?) exception for \fcharset2" hypothesis sounds sane, so we should
       probably implement it (if only you or mst_ can give me a good hint how...)
      Sep 20 16:35:13 <vmiklos> sberg: looking for a code pointer
      Sep 20 16:36:05 <mst_> sberg: m_aStates.top().eDestination ==
       Destination::FONTENTRY should be the relevant check?
      Sep 20 16:36:17 <vmiklos> sberg: RTFDocumentImpl::text() is where the text is
       taken, Destination::FONTENTRY is the state on the parser stack which is a
       font entry in the font table. so to detect "your case" during decoding a byte
       array into a string, m_aStates.top().eDestination == Destination::FONTENTRY
       is what you want
      Sep 20 16:36:35 <vmiklos> ah good, two independent matching hints are
       promising ;)
      Sep 20 16:37:35 <sberg> mst_, vmiklos, ah; but what also looks dodgy is that
       checkUnicode operates there on "Symbol;" including the closing ";" of the
       full <fontinfo>, not just the <fontname> part of the <fontinfo>
      Sep 20 16:39:24 <vmiklos> sberg: i think we already assume that the only
       "token" in the font entry destination that is not bound to a control world
       (\foo) is the font name
      Sep 20 16:40:52 <vmiklos> sberg:
       writerfilter/source/rtftok/rtfdocumentimpl.cxx:1237 is where we simply strip
       away the trailing semicolon, there is no further separation between the font
       name and other character content inside the destination (apart from the
       control words and their arguments)
      Sep 20 16:42:18 <sberg> vmiklos, OK, thanks; I'll just pretend I haven't seen
       those dodgy details :)
    
    ...so I'm switching to (somewhat arbitrarily) RTL_TEXTENCODING_MS_1252 there now
    
    Change-Id: Iebd1bcecb7fa71c489798154d3356062b052775e
    Reviewed-on: https://gerrit.libreoffice.org/42477
    Reviewed-by: Stephan Bergmann <sbergman at redhat.com>
    Tested-by: Stephan Bergmann <sbergman at redhat.com>

diff --git a/sal/qa/rtl/textenc/rtl_textcvt.cxx b/sal/qa/rtl/textenc/rtl_textcvt.cxx
index 3c36852bebfc..0bae6b403088 100644
--- a/sal/qa/rtl/textenc/rtl_textcvt.cxx
+++ b/sal/qa/rtl/textenc/rtl_textcvt.cxx
@@ -3088,6 +3088,7 @@ void Test::testWindows() {
         bool reverse;
     };
     static Data const data[] = {
+        { 42, RTL_TEXTENCODING_SYMBOL, true },
         { 437, RTL_TEXTENCODING_IBM_437, true },
         { 708, RTL_TEXTENCODING_ISO_8859_6, false },
         { 737, RTL_TEXTENCODING_IBM_737, true },
diff --git a/sal/textenc/tencinfo.cxx b/sal/textenc/tencinfo.cxx
index a220d71a4330..94a0cdb1e270 100644
--- a/sal/textenc/tencinfo.cxx
+++ b/sal/textenc/tencinfo.cxx
@@ -826,6 +826,7 @@ rtl_getTextEncodingFromWindowsCodePage(sal_uInt32 nCodePage)
 {
     switch (nCodePage)
     {
+    case 42: return RTL_TEXTENCODING_SYMBOL;
     case 437: return RTL_TEXTENCODING_IBM_437;
     case 708: return RTL_TEXTENCODING_ISO_8859_6;
     case 737: return RTL_TEXTENCODING_IBM_737;
@@ -902,6 +903,7 @@ rtl_getWindowsCodePageFromTextEncoding(rtl_TextEncoding nEncoding)
 {
     switch (nEncoding)
     {
+    case RTL_TEXTENCODING_SYMBOL: return 42;
     case RTL_TEXTENCODING_IBM_437: return 437;
  /* case RTL_TEXTENCODING_ISO_8859_6: return 708; */
     case RTL_TEXTENCODING_IBM_737: return 737;
diff --git a/sw/qa/extras/rtfexport/rtfexport2.cxx b/sw/qa/extras/rtfexport/rtfexport2.cxx
index 7b5d82da3f8e..fba38e6c72cb 100644
--- a/sw/qa/extras/rtfexport/rtfexport2.cxx
+++ b/sw/qa/extras/rtfexport/rtfexport2.cxx
@@ -1507,7 +1507,7 @@ DECLARE_RTFEXPORT_TEST(testFdo85889mac, "fdo85889-mac.rtf")
 
 DECLARE_RTFEXPORT_TEST(testFdo72031, "fdo72031.rtf")
 {
-    OUString aExpected(u"\u00C5");
+    OUString aExpected(u"\uF0C5");
     CPPUNIT_ASSERT_EQUAL(aExpected, getRun(getParagraph(1), 1)->getString());
 }
 
diff --git a/writerfilter/source/rtftok/rtfdocumentimpl.cxx b/writerfilter/source/rtftok/rtfdocumentimpl.cxx
index 5a177bb58ff2..e8483bb43f2f 100644
--- a/writerfilter/source/rtftok/rtfdocumentimpl.cxx
+++ b/writerfilter/source/rtftok/rtfdocumentimpl.cxx
@@ -3209,7 +3209,11 @@ void RTFDocumentImpl::checkUnicode(bool bUnicode, bool bHex)
     }
     if (bHex && !m_aHexBuffer.isEmpty())
     {
-        OUString aString = OStringToOUString(m_aHexBuffer.makeStringAndClear(), m_aStates.top().nCurrentEncoding);
+        OUString aString = OStringToOUString(
+            m_aHexBuffer.makeStringAndClear(),
+            ((m_aStates.top().eDestination == Destination::FONTENTRY
+              && m_aStates.top().nCurrentEncoding == RTL_TEXTENCODING_SYMBOL)
+             ? RTL_TEXTENCODING_MS_1252 : m_aStates.top().nCurrentEncoding));
         text(aString);
     }
 }


More information about the Libreoffice-commits mailing list