[poppler] poppler::ustring encoding issue

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Sun Mar 18 12:18:22 UTC 2018

Dear Albert,

Please let me confirm what you meant.

>> Maybe the consideration of GlobalParams::textEncoding
>> would be discussed in future when cpp frontend introduces an API
>> to modify it to non-Unicode values.
> Honestly i don't think that makes any sense, why would you want that?

Do you mean that "for the cpp frontend, there is no need to care about
the cases where a non-Unicode encoding is specified in
GlobalParams::textEncoding"?

If so, is the reason "because text(), text_list(), etc. return
the text as ustring objects, so even if clients could set
GlobalParams::textEncoding to a preferred non-Unicode encoding, they
could not retrieve the text in that encoding.
Therefore, there is no need to expose GlobalParams::setTextEncoding()
via the cpp frontend"?

If this is what you meant, I agree that there is no need to care about
the cases where GlobalParams::textEncoding is set to a non-Unicode encoding.
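
To illustrate why the conversion into ustring matters, here is a minimal,
self-contained sketch. It is not poppler code: widen_bytes and utf8_to_utf16
are hypothetical helper names, and widen_bytes is only my assumption of what
a byte-wise conversion does with UTF-8 input that carries no BOM. The decoder
assumes well-formed UTF-8 and does not validate continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: copy every byte into its own 16-bit unit, with no decoding.
std::vector<std::uint16_t> widen_bytes(const std::string &bytes) {
    std::vector<std::uint16_t> out;
    for (unsigned char c : bytes)
        out.push_back(c);
    return out;
}

// Hypothetical helper: decode UTF-8 into UTF-16 code units.
// Assumes well-formed input; continuation bytes are not validated.
std::vector<std::uint16_t> utf8_to_utf16(const std::string &bytes) {
    std::vector<std::uint16_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char c = static_cast<unsigned char>(bytes[i++]);
        std::uint32_t cp;
        int extra;
        if (c < 0x80)                { cp = c;        extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else                         { cp = 0xFFFD;   extra = 0; } // invalid lead byte
        for (int k = 0; k < extra && i < bytes.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        if (cp >= 0x10000) {
            // Outside the BMP: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<std::uint16_t>(cp));
        }
    }
    return out;
}
```

For pure US-ASCII input both helpers give the same code units, but for
Cyrillic input (for example the bytes D0 AF for U+042F) widen_bytes yields
two unrelated units while utf8_to_utf16 yields one. That would be consistent
with the symptom reported earlier in this thread, where ASCII matched and
Cyrillic did not.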

The reason I tried to handle such cases was that some utilities (like
pdftotext) allow users to specify a non-Unicode encoding, so I was
wondering whether something similar would be added to the cpp frontend
in the future. If there is no such plan, that is good news for me.

Sorry for the lengthy confirmation!


Albert Astals Cid wrote:
> On Thursday, 15 March 2018, at 14:20:52 CET, suzuki toshiya wrote:
>> Dear Albert,
>> Thank you, I'm glad to hear that one of the directions could be
>> acceptable. Maybe the consideration of GlobalParams::textEncoding
>> would be discussed in future when cpp frontend introduces an API
>> to modify it to non-Unicode values.
> Honestly i don't think that makes any sense, why would you want that?
> Cheers,
>   Albert
>> Now I'm discussing with Jeroen how to fix other metadata
>> (not related to the text_list() API), please wait a while.
>> Regards,
>> mpsuzuki
>> Albert Astals Cid wrote:
>>> On Wednesday, 14 March 2018, at 18:17:47 CET, suzuki toshiya wrote:
>>>> Dear Adam,
>>>> The 2nd option, iconv + GlobalParams::textEncoding solution might be
>>>> something like:
>>>> https://github.com/mpsuzuki/poppler/commit/80f088c4a94b151ddb90e05147a8dc4565e01d89 ?
>>> Seems a bit too much to me.
>>> I've personally had no time to test the other solution you sent
>>> (replacing unicode_GooString_to_ustring with from_utf8), but if that one
>>> works, it seems much simpler and more straightforward and I'd like to
>>> commit that.
>>> Cheers,
>>>   Albert
>>>> Regards,
>>>> mpsuzuki
>>>> suzuki toshiya wrote:
>>>>> Oops, I'm quite sorry for my mistake, which may have confused people as
>>>>> if my bits were in github.com/freedesktop. The right places are:
>>>>> sample PDF file
>>>>> https://github.com/mpsuzuki/poppler/blob/cpp-textlist-encoding-issue/cpp/tests/HereIsUSASCII.pdf
>>>>> the easiest (and oversimplified) fix for this issue
>>>>> https://github.com/mpsuzuki/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
>>>>> Regards,
>>>>> mpsuzuki
>>>>> suzuki toshiya wrote:
>>>>>> Dear Jeroen, Adam,
>>>>>> Sorry for long latency about this issue. I would try to draft
>>>>>> the solutions suggested by Adam.
>>>>>> Yet I'm not sure whether what I'm seeing now is the same problem as
>>>>>> yours. In my case, the testing PDF is:
>>>>>> https://rawgit.com/freedesktop/poppler/1fb325f4b0a92ca28c110d46dc7a9dba2ad94367/cpp/tests/HereIsUSASCII.pdf
>>>>>> (maybe I should provide a PDF showing surrogate characters to
>>>>>> clarify the difference between UTF-8 and UTF-16)
>>>>>> I see your testing code shows the same outputs for ASCII, but
>>>>>> different outputs for Cyrillic etc. So, the encodings used by text()
>>>>>> and text_list() are different, although their types are the same
>>>>>> (ustring). It should be fixed. However, US-ASCII characters
>>>>>> are not garbled. If this is different from the problem you're
>>>>>> seeing, please let me know.
>>>>>> Now the easiest solution, using ustring::from_utf8(), is drafted:
>>>>>> https://github.com/freedesktop/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
>>>>>> Please check if it works for you. I think it works well in my
>>>>>> environment.
>>>>>> I would proceed to the next one, implementing something like
>>>>>> ustring::from_utf8() which reflects GlobalParams::textEncoding.
>>>>>> Regards,
>>>>>> mpsuzuki
>>>>>> Adam Reichold wrote:
>>>>>>> Hello Jeroen,
>>>>>>> On 06.03.2018 at 12:59, Jeroen Ooms wrote:
>>>>>>>> On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold
>>>>>>>> <adam.reichold at t-online.de> wrote:
>>>>>>>>> Hello mpsuzuki,
>>>>>>>>> from a glance at the code, it seems page::text uses
>>>>>>>>> ustring::from_utf8 to convert Poppler's GooString into ustring,
>>>>>>>>> which seems correct if GlobalParams::textEncoding has its default
>>>>>>>>> value of "UTF-8".
>>>>>>>> I don't understand this part. Why is textEncoding a global property?
>>>>>>>> Shouldn't this be a property of a single pdf document? Is there some
>>>>>>>> way I can read a document's encoding from the C++ api (without
>>>>>>>> including GlobalParams.h)?
>>>>>>>> The pdf spec states that different strings may have different
>>>>>>>> encodings. Perhaps it would be possible to expose an encoding field
>>>>>>>> in the ustring class? If there were a way to know the encoding of a
>>>>>>>> ustring, I could get the raw data and convert it to a suitable
>>>>>>>> encoding myself. This would be much better than making assumptions.
>>>>>>> This is not the encoding of the text in the PDF document, but the
>>>>>>> encoding of the GooStrings that are returned by the internal Poppler
>>>>>>> API.
>>>>>>> Also, I think the ustring class is intended to always store UTF-16
>>>>>>> encoded data.
>>>>>>> Best regards, Adam.
>>>>>> _______________________________________________
>>>>>> poppler mailing list
>>>>>> poppler at lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler