[poppler] poppler::ustring encoding issue

Albert Astals Cid aacid at kde.org
Sun May 6 21:36:37 UTC 2018


On Thursday, 12 April 2018, at 10:33:33 CEST, suzuki toshiya wrote:
> Dear Jeroen,
> 
> Please let me prepare some data for regression tests.
> The data I've tested so far are mainly ASCII or UTF-16BE.
> I should also check PDFEncoding cases (if anybody already has something
> appropriate, please let me know).

I have around 1700 PDFs here, collected from random bugs, so if you give me a 
patch and a test/regression program that outputs something that can be diff'ed, 
I can "easily" compare the before and after.
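
As a rough illustration of "something that can be diff'ed", here is a minimal
sketch that prints the 16-bit code units of a string as one line of hex, so
that a BOM or a changed character shows up in a plain diff. The function name
dump_code_units is hypothetical and the sketch does not call poppler; a real
test program would feed it the code units of each string it extracts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: render 16-bit code units as space-separated
// upper-case hex, e.g. "FEFF 0041", so two runs can be compared with diff.
std::string dump_code_units(const std::vector<unsigned short>& units) {
    std::string line;
    char buf[16];
    for (std::size_t i = 0; i < units.size(); ++i) {
        // %04X pads each code unit to at least four hex digits.
        std::snprintf(buf, sizeof buf, "%04X", static_cast<unsigned>(units[i]));
        if (i != 0) {
            line += ' ';
        }
        line += buf;
    }
    return line;
}
```

Writing one such line per extracted string, before and after the patch, would
give two text files that can be compared directly.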

Cheers,
  Albert

> 
> Regards,
> mpsuzuki
> 
> Jeroen Ooms wrote:
> > FYI the encoding problems still exist in the master branch today. I am
> > very interested in this patch by mpsuzuki, what can we do to move this
> > forward?
> > 
> > On Wed, Mar 28, 2018 at 2:26 PM, suzuki toshiya
> > <mpsuzuki at hiroshima-u.ac.jp> wrote:
> >> Dear Adam,
> >> 
> >> Adam Reichold wrote:
> >>>> I see. Where is the appropriate place to add documentation for
> >>>> the poppler::ustring class itself?
> >>> 
> >>> Personally, I would suggest Doxygen comments in the public header.
> >> 
> >> Thanks! I'm now trying to write them. I also found that the
> >> Doxygen comments for text_list need improvement.
> >> 
> >> While checking the existing functions (in order to document them),
> >> I found a few inconsistencies in how the BOM is handled.
> >> 
> >> * ustring::to_latin1(): this function does not use iconv();
> >> it just casts between unsigned short and char. A BOM cannot
> >> be converted to Latin-1, but the presence of a BOM is not
> >> checked. If the stored UTF-16 has a BOM, a broken 8-bit
> >> character is inserted at the beginning of the result.
> >> 
> >> * ustring::from_latin1(): this function does not use iconv()
> >> either. No BOM is inserted at the beginning, so a BOM-less
> >> UTF-16 string is created.
> >> 
> >> * ustring::to_utf8(): whether a BOM is emitted is decided by iconv().
> >> 
> >> * ustring::from_utf8(): assumes that iconv() returns UTF-16 with a BOM.
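
The cast-only behaviour described for to_latin1() in the list above can be
illustrated without poppler. The sketch below is not poppler's code: both
function names are hypothetical, and it only assumes that the string is a
sequence of unsigned short code units that are narrowed to char one by one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a cast-only conversion: every 16-bit code unit
// is narrowed to one char, with no iconv() call and no BOM check.
std::string narrow_cast_latin1(const std::vector<unsigned short>& utf16) {
    std::string out;
    out.reserve(utf16.size());
    for (unsigned short unit : utf16) {
        // Only the low 8 bits survive, so U+FEFF ends up as the byte 0xFF.
        out.push_back(static_cast<char>(unit & 0xFF));
    }
    return out;
}

// Hypothetical variant that skips a leading U+FEFF before narrowing.
std::string narrow_cast_latin1_skip_bom(const std::vector<unsigned short>& utf16) {
    const std::size_t start = (!utf16.empty() && utf16[0] == 0xFEFF) ? 1 : 0;
    std::string out;
    out.reserve(utf16.size() - start);
    for (std::size_t i = start; i < utf16.size(); ++i) {
        out.push_back(static_cast<char>(utf16[i] & 0xFF));
    }
    return out;
}
```

With the input FEFF 0048 0069, the first function returns three bytes starting
with 0xFF, while the second returns the two bytes "Hi". Whether a BOM should
be skipped, rejected, or never stored in the first place is exactly the open
question in this thread.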
> >> 
> >> I will collect the Debian packages that depend on libpoppler-cpp
> >> and check how they use ustring objects. From a rough check there
> >> are fewer than 10, so checking all of them would not be very
> >> time-consuming. If any software always skips the first
> >> character of the UTF-16 (on the assumption that a ustring
> >> is always UTF-16 with a BOM), some discussion will be
> >> needed.
> >> 
> >> Regards,
> >> mpsuzuki
> >> 
> >> _______________________________________________
> >> poppler mailing list
> >> poppler at lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/poppler





