[Libreoffice-commits] core.git: Branch 'libreoffice-5-4' - svtools/source

Michael Stahl mstahl at redhat.com
Wed Sep 13 19:42:10 UTC 2017


 svtools/source/svrtf/svparser.cxx |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

New commits:
commit d34e0fdf23f88bb8f8b5db5e4318b3a706bd14fb
Author: Michael Stahl <mstahl at redhat.com>
Date:   Thu Sep 7 23:01:26 2017 +0200

    svtools: HTML import: don't put lone surrogates in OUString
    
    The bytes "ed b3 b5" in fdo67610-1.doc (which, as the name indicates,
    is an HTML file) are converted to the lone UTF-16 surrogate "dcf5",
    which is inserted into SwTextNode and causes asserts later on.
    
    The actual encoding of the HTML document is probably GBK (at least
    Vim doesn't display any missing characters with that), but because
    the document doesn't contain any indication of its encoding, it's
    apparently imported as UTF-8. It is a bit of a surprise that
    ImplConvertUtf8ToUnicode() treats a surrogate code point as valid
    even when the Java-compatible mode RTL_TEXTENCODING_JAVA_UTF8 is
    not specified.
    
    [note: the master commit says "JSON-compatible mode", but I was
     confusing different text encoding perversions there]
    
    Change-Id: Idd788d9d461fed150171dd907439166f3075a834
    (cherry picked from commit fc670f637d4271246691904fd649358ce2e7be59)
    Reviewed-on: https://gerrit.libreoffice.org/42100
    Tested-by: Jenkins <ci at libreoffice.org>
    Reviewed-by: Christian Lohmaier <lohmaier+LibreOffice at googlemail.com>

diff --git a/svtools/source/svrtf/svparser.cxx b/svtools/source/svrtf/svparser.cxx
index 947ef75a98f3..cb7174f519d2 100644
--- a/svtools/source/svrtf/svparser.cxx
+++ b/svtools/source/svrtf/svparser.cxx
@@ -423,7 +423,8 @@ sal_uInt32 SvParser<T>::GetNextChar()
         while( 0 == nChars  && !bErr );
     }
 
-    if ( ! rtl::isUnicodeCodePoint( c ) )
+    // Note: ImplConvertUtf8ToUnicode() may produce a surrogate!
+    if (!rtl::isUnicodeCodePoint(c) || rtl::isHighSurrogate(c) || rtl::isLowSurrogate(c))
         c = '?' ;
 
     if( bErr )
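
To illustrate why the bytes "ed b3 b5" end up as the lone surrogate "dcf5" and how the widened check catches it, here is a minimal standalone sketch. It does not use the LibreOffice headers: the three predicates are local stand-ins that only mirror the names of the rtl:: functions in the patch, and decodeThreeByteLenient() and sanitize() are hypothetical helpers written for this example, not the actual ImplConvertUtf8ToUnicode() or SvParser code.

```cpp
#include <cstdint>

// Local stand-ins for the rtl:: predicates named in the patch.
// Assumption: these are simplified re-implementations for illustration only.
constexpr bool isUnicodeCodePoint(std::uint32_t c) { return c <= 0x10FFFF; }
constexpr bool isHighSurrogate(std::uint32_t c) { return c >= 0xD800 && c <= 0xDBFF; }
constexpr bool isLowSurrogate(std::uint32_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Hypothetical lenient decoder for a 3-byte UTF-8 sequence (1110xxxx 10xxxxxx
// 10xxxxxx). It only combines the payload bits and, like the behaviour the
// commit message describes, does not reject the surrogate range.
constexpr std::uint32_t decodeThreeByteLenient(std::uint8_t b0, std::uint8_t b1,
                                               std::uint8_t b2)
{
    return (static_cast<std::uint32_t>(b0 & 0x0F) << 12)
         | (static_cast<std::uint32_t>(b1 & 0x3F) << 6)
         | static_cast<std::uint32_t>(b2 & 0x3F);
}

// The condition from the patch, wrapped in a hypothetical helper: anything
// that is not a code point, or that is a surrogate, is replaced by '?'.
constexpr std::uint32_t sanitize(std::uint32_t c)
{
    return (!isUnicodeCodePoint(c) || isHighSurrogate(c) || isLowSurrogate(c))
               ? static_cast<std::uint32_t>('?')
               : c;
}

// 0xED -> payload 0xD, 0xB3 -> payload 0x33, 0xB5 -> payload 0x35:
// (0xD << 12) | (0x33 << 6) | 0x35 == 0xDCF5, a low surrogate.
static_assert(decodeThreeByteLenient(0xED, 0xB3, 0xB5) == 0xDCF5,
              "ed b3 b5 decodes to the lone surrogate dcf5");
```

Note that 0xDCF5 passes the stand-in isUnicodeCodePoint() check because it is below 0x10FFFF, which is why the old condition alone let it through; only the added surrogate tests turn it into '?'.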

