[poppler] poppler::ustring encoding issue

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Thu Mar 15 13:20:52 UTC 2018


Dear Albert,

Thank you, I'm glad to hear that one of the directions could be
acceptable. Handling of GlobalParams::textEncoding can perhaps be
discussed in the future, when the cpp frontend introduces an API
to change it to non-Unicode values.

I'm now discussing with Jeroen how to fix the other metadata
(not related to the text_list() API); please wait a while.

Regards,
mpsuzuki

Albert Astals Cid wrote:
> On Wednesday, 14 March 2018, at 18:17:47 CET, suzuki toshiya wrote:
>> Dear Adam,
>>
>> The 2nd option, the iconv + GlobalParams::textEncoding solution, might
>> look something like this:
>> https://github.com/mpsuzuki/poppler/commit/80f088c4a94b151ddb90e05147a8dc4565e01d89
> 
> Seems a bit too much to me.
> 
> Personally, I've had no time to test the other solution you sent (replacing 
> unicode_GooString_to_ustring with from_utf8), but if that one works, it seems 
> much simpler and more straightforward, and I'd like to commit that.
> 
> Cheers,
>   Albert
> 
>> Regards,
>> mpsuzuki
>>
>> suzuki toshiya wrote:
>>> Oops, I'm very sorry for my mistake, which may have confused people into
>>> thinking my changes are in github.com/freedesktop. The right places are:
>>>
>>> the sample PDF file
>>> https://github.com/mpsuzuki/poppler/blob/cpp-textlist-encoding-issue/cpp/tests/HereIsUSASCII.pdf
>>>
>>> the easiest (and oversimplified) fix for this issue
>>> https://github.com/mpsuzuki/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
>>>
>>> Regards,
>>> mpsuzuki
>>>
>>> suzuki toshiya wrote:
>>>> Dear Jeroen, Adam,
>>>>
>>>> Sorry for the long delay on this issue. I will try to draft
>>>> the solutions suggested by Adam.
>>>>
>>>> I'm still not sure that what I'm seeing is the same trouble as yours.
>>>> In my case, the test PDF is:
>>>> https://rawgit.com/freedesktop/poppler/1fb325f4b0a92ca28c110d46dc7a9dba2ad94367/cpp/tests/HereIsUSASCII.pdf
>>>> (maybe I should provide a PDF containing surrogate characters to
>>>> clarify the difference between UTF-8 and UTF-16).
>>>> I see that your test code shows the same output for ASCII, but
>>>> different output for Cyrillic etc. So the encodings returned by text()
>>>> and text_list() are different, although their types are the same
>>>> (ustring). That should be fixed. However, US-ASCII characters
>>>> are not garbled. If this is different from the trouble you're
>>>> seeing, please let me know.
>>>>
>>>> The easiest solution, using ustring::from_utf8(), is now drafted:
>>>> https://github.com/freedesktop/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
>>>> Please check whether it works for you. I think it works well in my
>>>> environment.
>>>>
>>>> I will proceed to the next one, implementing something like
>>>> ustring::from_utf8() that respects GlobalParams::textEncoding.
>>>>
>>>> Regards,
>>>> mpsuzuki
>>>>
>>>> Adam Reichold wrote:
>>>>> Hello Jeroen,
>>>>>
>>>>> On 06.03.2018 at 12:59, Jeroen Ooms wrote:
>>>>>> On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold
>>>>>> <adam.reichold at t-online.de> wrote:
>>>>>>> Hello mpsuzuki,
>>>>>>>
>>>>>>> from a glance at the code, it seems page::text uses ustring::from_utf8
>>>>>>> to convert Poppler's GooString into ustring, which seems correct if
>>>>>>> GlobalParams::textEncoding has its default value of "UTF-8".
>>>>>> I don't understand this part. Why is textEncoding a global property?
>>>>>> Shouldn't this be a property of a single PDF document? Is there some way
>>>>>> I can read a document's encoding from the C++ API (without including
>>>>>> GlobalParams.h)?
>>>>>>
>>>>>> The PDF spec states that different strings may have different
>>>>>> encodings. Perhaps it would be possible to expose an encoding field in
>>>>>> the ustring class? If there were a way to know the encoding of a
>>>>>> ustring, I could get the raw data and convert it to a suitable encoding
>>>>>> myself. This would be much better than making assumptions.
>>>>> This is not the encoding of the text in the PDF document, but the
>>>>> encoding of the GooStrings that are returned by the internal Poppler API.
>>>>> Also, I think the ustring class is intended to always store UTF-16
>>>>> encoded data.
>>>>>
>>>>> Best regards, Adam.
>>>> _______________________________________________
>>>> poppler mailing list
>>>> poppler at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
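
The symptom described in the thread (ASCII comes out identical from
text() and text_list(), while Cyrillic does not) is what one would expect
if one path decodes UTF-8 bytes into UTF-16 code units and the other
copies each byte into its own 16-bit unit. The following is a minimal,
standalone sketch of that difference. It does not use poppler at all:
utf8_to_utf16(), widen_bytes() and self_check() are hypothetical helper
names chosen for illustration, not the actual bodies of
ustring::from_utf8() or unicode_GooString_to_ustring(), and whether the
second path really behaves like widen_bytes() is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into UTF-16 code units.
// Assumes mostly well-formed input: an invalid lead byte or a truncated
// sequence yields U+FFFD, but continuation bytes are not validated.
std::u16string utf8_to_utf16(const std::string &in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else { out.push_back(0xFFFD); ++i; continue; }   // invalid lead byte
        if (i + len > in.size()) { out.push_back(0xFFFD); break; } // truncated
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP need a surrogate pair in UTF-16.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical helper: the naive conversion, one 16-bit unit per input byte.
std::u16string widen_bytes(const std::string &in)
{
    std::u16string out;
    for (char c : in)
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    return out;
}

// Small demonstration of where the two conversions agree and differ.
void self_check()
{
    // Pure US-ASCII: both conversions give the same code units.
    assert(utf8_to_utf16("Here") == widen_bytes("Here"));

    // Cyrillic U+042F is the two UTF-8 bytes D0 AF: one unit when decoded,
    // two unrelated Latin-1 range units when widened byte by byte.
    const std::string ya = "\xD0\xAF";
    assert(utf8_to_utf16(ya) == std::u16string(1, char16_t(0x042F)));
    assert(widen_bytes(ya).size() == 2);

    // U+1F600 (UTF-8 bytes F0 9F 98 80) becomes the pair D83D DE00.
    const std::u16string pair = utf8_to_utf16("\xF0\x9F\x98\x80");
    assert(pair.size() == 2);
    assert(pair[0] == char16_t(0xD83D) && pair[1] == char16_t(0xDE00));
}
```

Calling self_check() from a test program exercises the three cases. This
is also why a US-ASCII-only sample PDF cannot reveal the problem, and why
a sample containing characters outside the BMP would additionally
exercise the surrogate-pair branch.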