[poppler] poppler::ustring encoding issue

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Wed Mar 28 12:26:39 UTC 2018


Dear Adam,

Adam Reichold wrote:
>> I see. where is the appropriate place to add a document of
>> poppler::ustring class itself?
> 
> Personally, I would suggest Doxygen comments in the public header.

Thanks! I am now trying to write them. I also found that the Doxygen
comments for text_list need improvement.

While checking the existing functions (in order to document them),
I found a few inconsistencies in how the BOM is handled.

* ustring::to_latin1(): this function does not use iconv();
it just casts between unsigned short and char. A BOM cannot be
converted to Latin-1, but the existence of a BOM is not checked.
If the stored UTF-16 has a BOM, a broken 8-bit character would
be inserted at the beginning of the result.

* ustring::from_latin1(): this function does not use iconv()
either. No BOM is inserted at the beginning, so a UTF-16 string
without a BOM is created.

* ustring::to_utf8(): whether the result has a BOM is decided by iconv().

* ustring::from_utf8(): this assumes that iconv() returns UTF-16 with a BOM.
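
To make the to_latin1()/from_latin1() points concrete, here is a minimal
sketch. It does not use poppler at all: utf16_units, naive_to_latin1(),
bom_aware_to_latin1() and naive_from_latin1() are hypothetical names that
only imitate a cast-only conversion, so please read it as an illustration
of the problem and not as the actual poppler code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the UTF-16 code units held by a ustring.
using utf16_units = std::vector<std::uint16_t>;

constexpr std::uint16_t kBom = 0xFEFF; // U+FEFF BYTE ORDER MARK

// Imitates a cast-only to_latin1(): each 16-bit unit is truncated to
// 8 bits, and a leading BOM is not checked.
std::string naive_to_latin1(const utf16_units &s)
{
    std::string out;
    out.reserve(s.size());
    for (std::uint16_t u : s) {
        out.push_back(static_cast<char>(u));
    }
    return out;
}

// Same truncating conversion, but a leading BOM is skipped first.
std::string bom_aware_to_latin1(const utf16_units &s)
{
    const std::size_t start = (!s.empty() && s.front() == kBom) ? 1 : 0;
    std::string out;
    out.reserve(s.size() - start);
    for (std::size_t i = start; i < s.size(); ++i) {
        out.push_back(static_cast<char>(s[i]));
    }
    return out;
}

// Imitates a cast-only from_latin1(): each byte is widened to 16 bits,
// and no BOM is prepended.
utf16_units naive_from_latin1(const std::string &s)
{
    utf16_units out;
    out.reserve(s.size());
    for (unsigned char c : s) {
        out.push_back(c);
    }
    return out;
}
```

With an input of { 0xFEFF, 'A', 'b' }, the naive conversion produces three
bytes, the first being the truncated BOM, while the BOM-aware variant
produces just "Ab". In the other direction, the round trip through
naive_from_latin1() yields a string without a BOM.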

I plan to collect the Debian software packages that depend on
libpoppler-cpp and check how they use ustring objects. From a rough
check there seem to be fewer than 10, so checking all of them should
not be too time-consuming. If there is software that always skips the
first character of the UTF-16 (on the assumption that a ustring is
always UTF-16 with a BOM), some discussion is needed.
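
The kind of downstream assumption I am worried about could look like the
sketch below. Both helpers are hypothetical (I have not found such code in
any package yet), and std::u16string is used here only as a stand-in for
the UTF-16 data taken out of a ustring.

```cpp
#include <string>

// Hypothetical downstream helper: always drops the first code unit,
// assuming that it is a BOM.
std::u16string skip_first_unit(const std::u16string &s)
{
    return s.empty() ? s : s.substr(1);
}

// Safer variant: drops the first code unit only if it really is a BOM.
std::u16string strip_leading_bom(const std::u16string &s)
{
    if (!s.empty() && s.front() == u'\uFEFF') {
        return s.substr(1);
    }
    return s;
}
```

For a string without a BOM, such as one created by from_latin1(), the
first helper would lose the first real character, while the second would
return the string unchanged.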

Regards,
mpsuzuki


