mso-dumper: utf16 conversion

jf at dockes.org
Mon Nov 18 01:35:32 PST 2013


Hi,

I am looking into building a PowerPoint text extractor out of mso-dumper.

While doing this, I found that the routine used to convert UTF-16 text out
of the TextCharsAtom records was not working for me.

The new version (first attached patch) was tested on a variety of inputs,
including Chinese, Vietnamese and several European languages, and produces
text matching what LibreOffice displays, instead of outputting "<xx
invalid chars>" messages. See the commit comment for more detailed
explanations.
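
As a rough illustration only (this is not the attached patch, and the
function name decode_text_chars is made up for this example), decoding the
payload of such a record could look like the sketch below. It assumes the
payload is little-endian UTF-16 with no byte order mark, and it replaces
malformed sequences rather than raising.

```python
def decode_text_chars(data: bytes) -> str:
    """Decode a UTF-16 text payload (hypothetical helper, not mso-dumper API).

    Assumption: the bytes are little-endian UTF-16 without a BOM.
    """
    # errors="replace" substitutes U+FFFD for malformed input instead of
    # raising UnicodeDecodeError, so one bad record does not abort a dump.
    return data.decode("utf-16-le", errors="replace")


if __name__ == "__main__":
    sample = "Tiếng Việt, 中文, français".encode("utf-16-le")
    print(decode_text_chars(sample))
```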

Also, the method that processed text out of TextBytesAtom records assumed
that these were ASCII characters, which sometimes also caused problems
(wrong display or exceptions).

The new version decodes from cp1252, and works better on the files I
tried. Also see the commit message for more details about the choice of
encoding.
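
To show the kind of difference involved (again a sketch, not the attached
patch; decode_text_bytes is a name invented for this example), here is
8-bit text decoded as cp1252 instead of ASCII. The choice to replace the
few byte values that cp1252 leaves undefined is an assumption of the
example.

```python
def decode_text_bytes(data: bytes) -> str:
    """Decode 8-bit text as Windows-1252 (hypothetical helper)."""
    # Bytes 0x80-0xFF are not valid ASCII, so an ASCII decode would raise
    # on accented letters or curly quotes. cp1252 maps most of them; the
    # few it leaves undefined (such as 0x81) become U+FFFD here.
    return data.decode("cp1252", errors="replace")


if __name__ == "__main__":
    raw = b"caf\xe9 \x93ok\x94"
    print(decode_text_bytes(raw))
    try:
        raw.decode("ascii")
    except UnicodeDecodeError as exc:
        print("ascii decode fails:", exc.reason)
```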

Cheers,

jf

-------------- next part --------------
A non-text attachment was scrubbed...
Name: utf16conv.diff
Type: application/octet-stream
Size: 1991 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/libreoffice/attachments/20131118/b13294b0/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 8bitbytes.diff
Type: application/octet-stream
Size: 1252 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/libreoffice/attachments/20131118/b13294b0/attachment-0001.obj>

