[poppler] poppler::ustring encoding issue
suzuki toshiya
mpsuzuki at hiroshima-u.ac.jp
Sun Mar 25 03:39:18 UTC 2018
Hi all,
I think I have finally found the root of the issue, and I can propose a fix.
The pre-patch situation looks like this:
https://travis-ci.org/mpsuzuki/poppler/builds/357212162
The post-patch situation looks like this:
https://travis-ci.org/mpsuzuki/poppler/builds/357956103
My fix consists of 2 parts.
part 1)
I replaced all calls to detail::unicode_GooString_to_ustring() with ustring::from_utf8(),
as Adam suggested.
https://github.com/mpsuzuki/poppler/commit/7404f5effa8e303399e5101d54ff954ee5153e44
I think this rather simple fix was already reviewed by Albert.
part 2)
UTF-16 handling needs some improvement. The issue was reported
by Jeroen.
https://github.com/mpsuzuki/poppler/commit/b3230c7098b891da0b92742264d78c9bd86750bd
2-a: The PDF document metadata in the /Info dict is sometimes an
ASCII (or UTF-8?) string and sometimes a UTF-16 string (with a BOM).
The current implementation assumes it is always UTF-8. I inserted
a conditional that checks for a BOM and treats the string as UTF-16
if one is found. To create a ustring object from UTF-16, I added a
private method ustring::from_utf16(). I'm afraid there is room for
discussion about the better design of UTF-16 handling in ustring,
so I have not made ustring::from_utf16() public at present.
2-b: The name "UTF-16" does not always mean the same thing across iconv() implementations.
"UTF-16" is sometimes treated as UTF-16BE and sometimes as UTF-16LE.
Please compare the results of "./encoding ./hello.pdf" on Linux
and Mac OS X.
Linux: https://travis-ci.org/mpsuzuki/poppler/jobs/357212163#L1106
Mac OS X: https://travis-ci.org/mpsuzuki/poppler/jobs/357212164#L2463
My fix makes the encoding name conditional on the WORDS_BIGENDIAN
macro (which is not newly introduced; it was already used
in the CairoOutputDev source).
2-c: The unneeded double-sized buffer is no longer used.
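
To illustrate 2-a and 2-b, here is a minimal sketch. It is not the code
in the commit above. All three helper names are hypothetical and not part
of the poppler API. It assumes the BOM is the big-endian one (bytes
0xFE 0xFF), and that WORDS_BIGENDIAN is supplied by the build
configuration on big-endian hosts.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not poppler API): true if the raw byte string
// starts with the big-endian UTF-16 byte order mark 0xFE 0xFF.
inline bool has_utf16be_bom(const std::string &bytes)
{
    return bytes.size() >= 2
        && static_cast<unsigned char>(bytes[0]) == 0xFE
        && static_cast<unsigned char>(bytes[1]) == 0xFF;
}

// Hypothetical helper: turn big-endian UTF-16 bytes into UTF-16 code
// units, skipping a leading BOM if present. A trailing odd byte is ignored.
inline std::u16string utf16be_bytes_to_units(const std::string &bytes)
{
    std::size_t i = has_utf16be_bom(bytes) ? 2 : 0;
    std::u16string units;
    for (; i + 1 < bytes.size(); i += 2) {
        const unsigned hi = static_cast<unsigned char>(bytes[i]);
        const unsigned lo = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return units;
}

// Hypothetical helper: choose an explicit iconv encoding name for UTF-16
// in host byte order, instead of the ambiguous "UTF-16".
// Assumption: WORDS_BIGENDIAN is defined only on big-endian builds.
inline const char *host_utf16_iconv_name()
{
#ifdef WORDS_BIGENDIAN
    return "UTF-16BE";
#else
    return "UTF-16LE";
#endif
}
```

A caller would check has_utf16be_bom() first and fall back to the
existing UTF-8 path when no BOM is found. The name returned by
host_utf16_iconv_name() would be passed to iconv_open() where "UTF-16"
was used before.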
========================
There is room for discussion about how a ustring
object should store UTF-16 data.
One direction is that the buffer should always include a BOM,
to make the byte order explicit.
Another direction is that we should not rely on the
presence of a BOM (sometimes there may be one, sometimes not),
because handling a BOM complicates std::basic_string's
existing manipulations (such as concatenation, splicing, replacing,
etc.).
I took the 2nd direction in my proposed fix, but other experts
think the 1st direction would be better.
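
The concatenation concern can be seen in a small example. This is only
a sketch using std::u16string in place of ustring. The helper names
without_bom() and concat_keep_one_bom() are hypothetical and do not
exist in poppler.

```cpp
#include <string>

// Hypothetical helper: drop a leading U+FEFF if there is one.
inline std::u16string without_bom(const std::u16string &s)
{
    return (!s.empty() && s[0] == u'\uFEFF') ? s.substr(1) : s;
}

// Hypothetical helper: concatenate two strings that may each start
// with a BOM, keeping only the BOM (if any) of the first one.
inline std::u16string concat_keep_one_bom(const std::u16string &a,
                                          const std::u16string &b)
{
    return a + without_bom(b);
}
```

With a = BOM + "ab" and b = BOM + "cd", the plain a + b leaves a second
U+FEFF between "ab" and "cd", while concat_keep_one_bom(a, b) gives
BOM + "abcd". If the buffer must always start with a BOM, every such
operation needs a wrapper like this.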
In fact, I'm unfamiliar with how users of the cpp frontend think
about a BOM in a ustring object. If many existing
implementations assume that a ustring always starts with a BOM
(and have their own routines for concatenation, splicing,
and replacing), we should take that into account. Please let me know what
the users think.
Regards,
mpsuzuki
Jeroen Ooms wrote:
> Thanks everyone for the work on this issue, really appreciate the
> input. Also excited about mpsuzuki's suggestion to include font data
> with the text_list, this will be super helpful.
>
> I have updated and cleaned my example code a little bit to make it
> easier to test these issues. The updated test program and a small pdf
> file (with expected output) are now here:
> https://github.com/jeroen/popplertest . This makes it easier to update
> the example for new features/issues.
>
> It seems there are at least 3 places where encoding is not behaving as
> expected in 0.63: in the text_list and info_keys and toc. I think
> mpsuzuki's patch addresses these problems.