[poppler] poppler::ustring encoding issue

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Wed Mar 14 14:14:02 UTC 2018


Oops, I'm very sorry for my mistake, which may have given the impression
that my changes are in github.com/freedesktop. The correct locations are:

sample PDF file
https://github.com/mpsuzuki/poppler/blob/cpp-textlist-encoding-issue/cpp/tests/HereIsUSASCII.pdf

the easiest (and oversimplified) fix for this issue
https://github.com/mpsuzuki/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
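
To illustrate why a US-ASCII-only sample does not show garbling while
Cyrillic text does, here is a minimal sketch. It is not poppler code:
utf8_to_utf16(), append_utf16() and widen_bytes() are hypothetical
helpers, and the byte-widening path is only my assumption about one way
UTF-8 bytes could end up in a UTF-16 container without being decoded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helpers for illustration only; none of this is poppler API.

// Append one Unicode code point as UTF-16, using a surrogate pair for
// code points above U+FFFF.
void append_utf16(std::vector<std::uint16_t> &out, std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp >= 0x10000) {
        cp -= 0x10000;
        out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
    } else {
        out.push_back(static_cast<std::uint16_t>(cp));
    }
}

// Decode UTF-8 into UTF-16 code units. Assumes well-formed input:
// continuation bytes are not validated. A bad lead byte, a truncated
// sequence, or an out-of-range value becomes U+FFFD.
std::vector<std::uint16_t> utf8_to_utf16(const std::string &in)
{
    std::vector<std::uint16_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0xFFFD;
        std::size_t len = 1;
        if (b0 < 0x80) { cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        if (i + len > in.size()) { out.push_back(0xFFFD); break; }
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        if (cp > 0x10FFFF) { cp = 0xFFFD; }
        append_utf16(out, cp);
        i += len;
    }
    return out;
}

// Assumed mis-conversion: copy each UTF-8 byte into its own 16-bit unit
// without decoding it.
std::vector<std::uint16_t> widen_bytes(const std::string &in)
{
    std::vector<std::uint16_t> out;
    for (const char c : in) {
        out.push_back(static_cast<unsigned char>(c));
    }
    return out;
}
```

With these helpers, "ASCII" gives identical code units on both paths,
because every US-ASCII byte equals its own code point. The Cyrillic
letter U+042F (UTF-8 bytes D0 AF) gives 0x042F when decoded but
0x00D0 0x00AF when widened, and a character outside the BMP such as
U+1F600 needs the surrogate pair D83D DE00 in UTF-16.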

Regards,
mpsuzuki

suzuki toshiya wrote:
> Dear Jeroen, Adam,
> 
> Sorry for the long delay on this issue. I will try to draft
> the solutions suggested by Adam.
> 
> Yet I'm not sure whether what I'm seeing now is the same problem as
> yours. In my case, the test PDF is:
> https://rawgit.com/freedesktop/poppler/1fb325f4b0a92ca28c110d46dc7a9dba2ad94367/cpp/tests/HereIsUSASCII.pdf
> (maybe I should provide a PDF with characters that need surrogate
> pairs, to clarify the difference between UTF-8 & UTF-16)
> I see that your test code shows the same output for ASCII, but
> different output for Cyrillic etc. So the encodings used by text()
> and textlist() are different, although their types are the same
> (ustring). This should be fixed. However, US-ASCII characters
> are not garbled. If this is different from the problem you're
> seeing, please let me know.
> 
> The easiest solution, using ustring::from_utf8(), is now drafted:
> https://github.com/freedesktop/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
> Please check if it works for you. I think it works well in my
> environment.
> 
> I will proceed to the next one: implementing something like
> ustring::from_utf8() that respects GlobalParams::textEncoding.
> 
> Regards,
> mpsuzuki
> 
> 
> Adam Reichold wrote:
>> Hello Jeroen,
>>
>> Am 06.03.2018 um 12:59 schrieb Jeroen Ooms:
>>> On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold
>>> <adam.reichold at t-online.de> wrote:
>>>> Hello mpsuzuki,
>>>>
>>>> from a glance at the code, it seems page::text uses ustring::from_utf8
>>>> to convert Poppler's GooString into ustring which seems correct if
>>>> GlobalParams::textEncoding has its default value of "UTF-8".
>>> I don't understand this part. Why is textEncoding a global property?
>>> Shouldn't this be a property of a single PDF document? Is there some way
>>> I can read a document's encoding from the C++ API (without including
>>> GlobalParams.h)?
>>>
>>> The pdf spec states that different strings may have different
>>> encodings. Perhaps it would be possible to expose an encoding field in
>>> the ustring class? If there would be a way to know the encoding of a
>>> ustring, I can get the raw data and convert it to a suitable encoding
>>> myself. This would be much better than making assumptions.
>> This is not the encoding of the text in the PDF document, but the
>> encoding of the GooStrings that are returned by the internal Poppler API.
>> Also I think the ustring class is intended to always store UTF-16
>> encoded data.
>>
>> Best regards, Adam.
>>
>>
> 
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/poppler


