[poppler] poppler::ustring encoding issue

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Wed Mar 14 13:41:34 UTC 2018


Dear Jeroen, Adam,

Sorry for the long delay on this issue. I will try to draft
the solutions suggested by Adam.

However, I'm not sure that what I'm seeing is the same problem
as yours. In my case, the test PDF is:
https://rawgit.com/freedesktop/poppler/1fb325f4b0a92ca28c110d46dc7a9dba2ad94367/cpp/tests/HereIsUSASCII.pdf
(maybe I should provide a PDF containing surrogate characters to
clarify the difference between UTF-8 and UTF-16).
I see that your test code shows the same output for ASCII, but
different output for Cyrillic etc. So the encodings produced by
text() and textlist() differ, although their types are the same
(ustring). This should be fixed. However, US-ASCII characters
are not garbled. If this differs from the problem you're
seeing, please let me know.
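
To illustrate why ASCII can look fine while Cyrillic does not, here
is a minimal sketch of a UTF-8 to UTF-16 conversion. It is not
poppler's code: utf8_to_utf16 is a hypothetical helper written only
for this example, and its validation is deliberately minimal.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of poppler): decode a UTF-8 byte string
// into UTF-16 code units. Overlong forms and encoded surrogates are not
// rejected; this is an illustration, not a validating decoder.
std::u16string utf8_to_utf16(const std::string &in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of Unicode range");

        if (cp < 0x10000) {
            // Basic Multilingual Plane: one UTF-16 code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary plane: split into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For "A", the UTF-8 byte 0x41 and the UTF-16 unit 0x0041 have the same
value, so mixing up the two encodings goes unnoticed. U+042F is two
bytes (D0 AF) in UTF-8 but a single unit (042F) in UTF-16, and U+1F600
is four bytes (F0 9F 98 80) in UTF-8 but a surrogate pair (D83D DE00)
in UTF-16, so any mismatch shows up as soon as the text leaves ASCII.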

The easiest solution, using ustring::from_utf8(), is now drafted:
https://github.com/freedesktop/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
Please check whether it works for you. I think it works well in my
environment.

I will proceed to the next one: implementing something like
ustring::from_utf8() that respects GlobalParams::textEncoding.

Regards,
mpsuzuki


Adam Reichold wrote:
> Hello Jeroen,
> 
> Am 06.03.2018 um 12:59 schrieb Jeroen Ooms:
>> On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold
>> <adam.reichold at t-online.de> wrote:
>>> Hello mpsuzuki,
>>>
>>> from a glance at the code, it seems page::text uses ustring::from_utf8
>>> to convert Poppler's GooString into ustring which seems correct if
>>> GlobalParams::textEncoding has its default value of "UTF-8".
>> I don't understand this part. Why is textEncoding a global property?
>> Shouldn't this be a property of single pdf document? Is there some way
>> I can read a document's encoding from the C++ api (without including
>> GlobalParams.h).
>>
>> The pdf spec states that different strings may have different
>> encodings. Perhaps it would be possible to expose an encoding field in
>> the ustring class? If there would be a way to know the encoding of a
>> ustring, I can get the raw data and convert it to a suitable encoding
>> myself. This would be much better than making assumptions.
> 
> This is not the encoding of the text in the PDF document, but the
> encoding of the GooStrings that are returned by the internal Poppler API.
> Also I think the ustring class is intended to always store UTF-16
> encoded data.
> 
> Best regards, Adam.
> 
> 


