[poppler] poppler::ustring encoding issue

Jeroen Ooms jeroen at berkeley.edu
Thu Apr 12 08:10:13 UTC 2018


FYI the encoding problems still exist in the master branch today. I am
very interested in this patch by mpsuzuki. What can we do to move this
forward?

On Wed, Mar 28, 2018 at 2:26 PM, suzuki toshiya
<mpsuzuki at hiroshima-u.ac.jp> wrote:
> Dear Adam,
>
> Adam Reichold wrote:
>>> I see. Where is the appropriate place to add documentation for
>>> the poppler::ustring class itself?
>>
>> Personally, I would suggest Doxygen comments in the public header.
>
> Thanks! I am now trying to write them. I also found that the Doxygen
> comments for text_list need improvement.
>
> While checking the existing functions (in order to document them),
> I found a few inconsistencies in how the BOM is handled.
>
> * ustring::to_latin1(): this function does not use iconv(); it
> just casts each unsigned short to char. A BOM cannot be
> converted to Latin-1, but the existence of a BOM is not
> checked. If the stored UTF-16 has a BOM, a broken 8-bit
> character is inserted at the beginning of the result.
>
> * ustring::from_latin1(): this function does not use iconv()
> either. No BOM is inserted at the beginning, so a UTF-16
> string without a BOM is created.
>
> * ustring::to_utf8(): whether a BOM is handled is decided by iconv().
>
> * ustring::from_utf8(): this assumes iconv() returns UTF-16 with a BOM.
>
> I will collect the Debian software packages that depend on
> libpoppler-cpp and check how they use ustring objects. From my
> rough check there are fewer than 10, so checking all of them
> should not be too time-consuming. If there is software that
> always skips the first character of the UTF-16 (on the
> assumption that a ustring is always UTF-16 with a BOM), some
> discussion is needed.
>
> Regards,
> mpsuzuki
>
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/poppler
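
To make the to_latin1() point in the quoted message concrete, here is a
minimal sketch that does not use poppler at all. The names utf16_units,
naive_to_latin1 and bom_aware_to_latin1 are hypothetical stand-ins, not
poppler API; the sketch only assumes that the conversion is a plain
narrowing cast, as described above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the UTF-16 code units stored in a ustring.
using utf16_units = std::vector<unsigned short>;

// U+FEFF, the byte order mark as a single UTF-16 code unit.
const unsigned short BOM = 0xFEFF;

// Assumed behaviour: every 16-bit unit is narrowed to 8 bits and a
// leading BOM is not checked, so it ends up as a stray 0xFF byte.
std::string naive_to_latin1(const utf16_units &s)
{
    std::string out;
    out.reserve(s.size());
    for (unsigned short u : s) {
        out.push_back(static_cast<char>(u & 0xFF));
    }
    return out;
}

// Same narrowing, but a leading BOM is skipped first.
std::string bom_aware_to_latin1(const utf16_units &s)
{
    const std::size_t start = (!s.empty() && s[0] == BOM) ? 1 : 0;
    std::string out;
    out.reserve(s.size() - start);
    for (std::size_t i = start; i < s.size(); ++i) {
        out.push_back(static_cast<char>(s[i] & 0xFF));
    }
    return out;
}
```

With the input {0xFEFF, 'A', 'B'}, the naive version returns three bytes
with a broken first byte, while the BOM-aware version returns "AB".
Input without a BOM gives the same result from both.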

