[poppler] poppler::ustring encoding issue

Albert Astals Cid aacid at kde.org
Mon Mar 26 20:06:27 UTC 2018


El diumenge, 25 de març de 2018, a les 5:39:18 CEST, suzuki toshiya va 
escriure:
> Hi all,
> 
> Finally I think I found the root of issue and I can propose a fix.
> pre-patch situation is like this:
> https://travis-ci.org/mpsuzuki/poppler/builds/357212162
> 
> post-patch situation is like this:
> https://travis-ci.org/mpsuzuki/poppler/builds/357956103
> 
> My fix consists from 2 parts.

Can you post the patches either to bugzilla or as attachments to the mailing 
list? I don't feel confortable using github.

> 
> part 1)
> I replaced all detail::unicode_GooString_to_ustring() by
> ustring::from_utf8(), this was suggested by Adam.
> 
> https://github.com/mpsuzuki/poppler/commit/7404f5effa8e303399e5101d54ff954ee
> 5153e44
> 
> I think this rather simple fix was already reviewed by Albert.
> 
> part 2)
> UTF-16 handling needs some improvement. the issue was reported
> by Jeroen.
> 
> https://github.com/mpsuzuki/poppler/commit/b3230c7098b891da0b92742264d78c9bd
> 86750bd
> 
> 2-a: PDF's document metadata /Info dict provides sometimes
> ASCII (or UTF-8?) string, sometimes UTF-16 string (with BOM).
> current implementation assumes it is always UTF-8. I inserted
> a conditional to check BOM and deal as UTF-16 if BOM is found.
> for ustring object creation by UTF-16, I added a private method
> ustring::from_utf16(). I'm afraid there should be a room for
> discussion about the better design of UTF-16 in ustring(),
> I don't make ustring::from_utf16 at present.
> 
> 2-b: the term "UTF-16" is not always same in various iconv().
> "UTF-16" is sometimes regarded as UTF-16BE, sometimes as UTF-16LE.
> please compare the results of "./encoding ./hello.pdf" by Linux
> and Mac OS X.
> Linux: https://travis-ci.org/mpsuzuki/poppler/jobs/357212163#L1106
> Mac OS X: https://travis-ci.org/mpsuzuki/poppler/jobs/357212164#L2463
> 
> my fix is conditionalization of the name by WORDS_BIGENDIAN
> macro (which is not newly-introduced, it was already used
> in the source CairoOutputDev).
> 
> 2-c: unneeded twiced sized buffer is skipped.
> 
> ========================
> 
> There is a room of the discussion about the design how ustring
> object store UTF-16 data.
> 
> There is a direction thinking as the buffer should include BOM
> always, to clarify the byteorder.
> 
> There is another direction thinking as we should not expect the
> exist of BOM (sometimes it may have, sometimes it may lack),
> because the handling of BOM is more complicated than std::basic_string's
> existing manipulations (like concatenation, splicing, replacing
> etc).
> 
> I took the 2nd direction in my proposed fix, but other experts
> think the 1st direction would be better.
> 
> In fact, I'm unfamiliar with how the cpp-frontend users think
> about a BOM in ustring object. If there are so many existing
> implementations assuming as if ustring always starts with a BOM
> (and they have their own routines for the concatenation, splicing
> and replacing), we should care for that. Please let me hear how
> the users think.

I guess this is not very important, I mean after all it was broken so i guess 
noone could make it really work?

what we need to do is document exactly how it behaves (since it seems to be a 
bit under documented now).

Cheers,
  Albert

> 
> Regards,
> mpsuzuki
> 
> Jeroen Ooms wrote:
> > Thanks everyone for the work on this issue, really appreciate the
> > input. Also excited about mpsuzuki's suggestion to include font data
> > with the text_list, this will be super helpful.
> > 
> > I have updated and cleaned my example code a little bit to make it
> > easier to test these issues. The updated test program and a small pdf
> > file (with expected output) are now here:
> > https://github.com/jeroen/popplertest . This makes it easier to update
> > the example for new features/issues.
> > 
> > It seems there are at least 3 places where encoding is not behaving as
> > expected in 0.63: in the text_list and info_keys and toc. I think
> > mpsuzuki's patch addresses these problems.
> > _______________________________________________
> > poppler mailing list
> > poppler at lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/poppler
> 
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/poppler






More information about the poppler mailing list