[poppler] poppler::ustring encoding issue
suzuki toshiya
mpsuzuki at hiroshima-u.ac.jp
Wed Mar 28 12:26:39 UTC 2018
Dear Adam,
Adam Reichold wrote:
>> I see. where is the appropriate place to add a document of
>> poppler::ustring class itself?
>
> Personally, I would suggest Doxygen comments in the public header.
Thanks! I am now trying to write them. I also found that the Doxygen
comments for text_list need improvement.
While checking the existing functions (in order to document them),
I found a few inconsistencies in how the BOM is handled.
* ustring::to_latin1(): this function does not use iconv();
it just casts each element from unsigned short to char.
A BOM cannot be converted to Latin-1, but the existence of a
BOM is not checked. If the stored UTF-16 has a BOM, a broken
8-bit character would be inserted at the beginning of the result.
* ustring::from_latin1(): this function does not use iconv()
either. No BOM is inserted at the beginning, so a UTF-16
string without a BOM is created.
* ustring::to_utf8(): whether a BOM is emitted is decided by iconv().
* ustring::from_utf8(): assumes that iconv() returns UTF-16 with a BOM.
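To make the first two points concrete, here is a minimal sketch of the
cast-only behaviour. It does not use poppler itself: utf16_units,
cast_to_latin1() and cast_from_latin1() are hypothetical stand-ins that
only mimic what I described above, not the actual ustring code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for poppler::ustring: a plain run of UTF-16 code units.
using utf16_units = std::vector<std::uint16_t>;

constexpr std::uint16_t kBom = 0xFEFF; // U+FEFF BYTE ORDER MARK

// Cast-only conversion (no iconv): every 16-bit unit is truncated to 8 bits,
// and a leading BOM is not checked for.
std::string cast_to_latin1(const utf16_units &in)
{
    std::string out;
    out.reserve(in.size());
    for (std::uint16_t u : in) {
        out.push_back(static_cast<char>(u));
    }
    return out;
}

// Cast-only conversion in the other direction: each byte is widened to
// 16 bits, and no BOM is prepended.
utf16_units cast_from_latin1(const std::string &in)
{
    utf16_units out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        out.push_back(c);
    }
    return out;
}

void demo_cast_conversions()
{
    // A BOM in the input survives as one stray 8-bit character at the front.
    const utf16_units with_bom = {kBom, 'H', 'i'};
    const std::string latin1 = cast_to_latin1(with_bom);
    assert(latin1.size() == 3);
    assert(static_cast<unsigned char>(latin1[0]) == 0xFF);

    // The reverse direction yields UTF-16 without a BOM.
    const utf16_units no_bom = cast_from_latin1("Hi");
    assert(no_bom.size() == 2);
    assert(no_bom.front() != kBom);
}
```

With this model, a string with a BOM gains one broken leading byte in the
Latin-1 result, while a string built from Latin-1 never starts with a BOM.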
I will collect the Debian software packages that depend on
libpoppler-cpp and check how they use ustring objects. From a rough
check there seem to be fewer than 10, so checking all of them would
not be very time-consuming. If there is software that always skips
the first character of the UTF-16 data (on the assumption that a
ustring always holds UTF-16 with a BOM), some discussion is
needed.
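For reference, a caller that does not want to rely on that assumption
could strip the BOM conditionally instead of always dropping the first
unit. This is only a sketch: utf16_units and strip_leading_bom() are
hypothetical names, not part of the poppler API.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the UTF-16 data held by a ustring.
using utf16_units = std::vector<std::uint16_t>;

// Returns a copy of the input without a leading U+FEFF, if one is present.
// Input without a BOM (or empty input) is returned unchanged.
utf16_units strip_leading_bom(const utf16_units &in)
{
    if (!in.empty() && in.front() == 0xFEFF) {
        return utf16_units(in.begin() + 1, in.end());
    }
    return in;
}
```

Code written this way gives the same result whether the ustring came
from from_latin1() or from from_utf8().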
Regards,
mpsuzuki