[poppler] poppler::ustring encoding issue

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Wed Mar 14 17:17:47 UTC 2018


Dear Adam,

Would the second option, the iconv + GlobalParams::textEncoding solution,
look something like this?
https://github.com/mpsuzuki/poppler/commit/80f088c4a94b151ddb90e05147a8dc4565e01d89
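
To make the idea concrete, here is a minimal standalone sketch. It is not
the code in the commit above and it is not Poppler API: decode_to_utf16()
is a hypothetical helper that picks the decoder from an encoding name, the
way a textEncoding-aware counterpart of ustring::from_utf8() would have to.
Only "UTF-8" and "ISO-8859-1" are handled by hand here; I assume a real
implementation would pass any other name to iconv.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Poppler): decode `bytes` into UTF-16
// code units according to the encoding name `encoding`.
std::u16string decode_to_utf16(const std::string& bytes, const std::string& encoding) {
    std::u16string out;
    if (encoding == "ISO-8859-1") {
        // Latin-1: every byte is the code point of the same value.
        for (unsigned char c : bytes) out.push_back(static_cast<char16_t>(c));
        return out;
    }
    if (encoding != "UTF-8") {
        // Assumption: a real implementation would try iconv here.
        throw std::invalid_argument("unsupported encoding: " + encoding);
    }
    // Simplified UTF-8 decoder: checks sequence structure only, it does not
    // reject overlong forms or encoded surrogates.
    for (std::size_t i = 0; i < bytes.size();) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > bytes.size()) throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above U+FFFF become a surrogate pair in UTF-16.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With this helper, an ASCII string such as "Here" decodes to the same code
units under both names, while the UTF-8 bytes D0 96 (U+0416) give one code
unit as "UTF-8" but two as "ISO-8859-1". That is the kind of ASCII-same,
Cyrillic-different mismatch described in the quoted mail below.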

Regards,
mpsuzuki

suzuki toshiya wrote:
> Oops, I'm sorry for my mistake, which may have given the impression
> that my changes are in github.com/freedesktop. The correct locations are:
> 
> sample PDF file
> https://github.com/mpsuzuki/poppler/blob/cpp-textlist-encoding-issue/cpp/tests/HereIsUSASCII.pdf
> 
> the easiest (and oversimplified) fix for this issue
> https://github.com/mpsuzuki/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
> 
> Regards,
> mpsuzuki
> 
> suzuki toshiya wrote:
>> Dear Jeroen, Adam,
>>
>> Sorry for the long delay on this issue. I will try to draft
>> the solutions suggested by Adam.
>>
>> However, I'm not sure whether what I'm seeing is the same problem
>> you are seeing. In my case, the test PDF is:
>> https://rawgit.com/freedesktop/poppler/1fb325f4b0a92ca28c110d46dc7a9dba2ad94367/cpp/tests/HereIsUSASCII.pdf
>> (maybe I should provide a PDF with characters that need surrogate
>> pairs, to clarify the difference between UTF-8 and UTF-16).
>> I see that your test code shows the same output for ASCII, but
>> different output for Cyrillic etc. So the encodings used by text()
>> and textlist() are different, although their types are the same
>> (ustring). That should be fixed. However, US-ASCII characters
>> are not garbled. If this is different from the problem you are
>> seeing, please let me know.
>>
>> The easiest solution, using ustring::from_utf8(), is now drafted:
>> https://github.com/freedesktop/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
>> Please check whether it works for you. I think it works well in my
>> environment.
>>
>> I will proceed to the next one: implementing something like
>> ustring::from_utf8() that respects GlobalParams::textEncoding.
>>
>> Regards,
>> mpsuzuki
>>
>>
>> Adam Reichold wrote:
>>> Hello Jeroen,
>>>
>>> On 06.03.2018 at 12:59, Jeroen Ooms wrote:
>>>> On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold
>>>> <adam.reichold at t-online.de> wrote:
>>>>> Hello mpsuzuki,
>>>>>
>>>>> from a glance at the code, it seems page::text uses ustring::from_utf8
>>>>> to convert Poppler's GooString into ustring, which seems correct if
>>>>> GlobalParams::textEncoding has its default value of "UTF-8".
>>>> I don't understand this part. Why is textEncoding a global property?
>>>> Shouldn't this be a property of a single PDF document? Is there some way
>>>> I can read a document's encoding from the C++ API (without including
>>>> GlobalParams.h)?
>>>>
>>>> The PDF spec states that different strings may have different
>>>> encodings. Perhaps it would be possible to expose an encoding field in
>>>> the ustring class? If there were a way to know the encoding of a
>>>> ustring, I could get the raw data and convert it to a suitable encoding
>>>> myself. This would be much better than making assumptions.
>>> This is not the encoding of the text in the PDF document, but the
>>> encoding of the GooStrings that are returned by the internal Poppler API.
>>> Also, I think the ustring class is intended to always store UTF-16
>>> encoded data.
>>>
>>> Best regards, Adam.
>>>
>>>
>> _______________________________________________
>> poppler mailing list
>> poppler at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/poppler
> 


