[poppler] poppler::ustring encoding issue

Adam Reichold adam.reichold at t-online.de
Sun Mar 18 18:39:41 UTC 2018


Hello Albert,

On 18.03.2018 at 18:52, Albert Astals Cid wrote:
> On Sunday, 18 March 2018, at 15:46:42 CET, Adam Reichold wrote:
>> Hello mpsuzuki,
>>
>> On 18.03.2018 at 13:18, suzuki toshiya wrote:
>>> Dear Albert,
>>>
>>> please let me confirm your thought.
>>>
>>>>> Maybe the consideration of GlobalParams::textEncoding
>>>>> would be discussed in future when cpp frontend introduces an API
>>>>> to modify it to non-Unicode values.
>>>>
>>>> Honestly i don't think that makes any sense, why would you want that?
>>>
>>> do you mean that "for cpp frontend, no need to care the cases that
>>> non-Unicode encoding is specified in GlobalParams::textEncoding" ?
>>>
>>> if so, its reason would be "because text(), text_list(), etc return
>>> the texts by ustring objects, thus, even if the clients can set
>>> GlobalParams::textEncoding to preferred non-Unicode encoding, they
>>> cannot retrieve the text in the preferred non-Unicode encoding.
>>> therefore, no need to expose GlobalParams::setTextEncoding() via
>>> cpp frontend" ?
>>>
>>> if this is what you meant, I agree that no need to care the cases
>>> that non-Unicode encoding in GlobalParams::textEncoding.
>>>
>>> The reason why I tried to care such cases was: some utils (like
>>> pdftotext) allow users to specify non-Unicode encoding, so I was
>>> wondering whether something similar would be added to cpp frontend
>>> in future. If there's no such, it's good news for me.
>>>
>>> Sorry for lengthy confirmation!
>>
>> I think you might be confusing two distinct interfaces:
>>
>> * The CPP frontend and the actual user application: There should be no
>> mention of GlobalParams here, since this is an internal
>> implementation detail (that we want to get rid of if at all possible) and
>> the user application should not know or care about it. So we
>> definitely should not expose GlobalParams directly or
>> GlobalParams::setTextEncoding indirectly.
>>
>> * The internal Poppler API and the CPP frontend: The CPP frontend
>> currently assumes that GlobalParams::textEncoding is "UTF-8" which is
>> almost alright as it does not expose GlobalParams, and hence the user
>> application cannot change it and relying on the default value is fine.
>> This should only break if the default value changes (and hence the CPP
>> frontend needs to be adjusted) or the user application circumvents the
>> CPP frontend by using the internal API directly (but this seems to be
>> its own fault IMHO).
> 
> Exactly, as far as the cpp frontend is concerned GlobalParams::textEncoding is 
> always "UTF-8" so that's all you need to care about.
> 
>>
>> Of course, ideally we would not have GlobalParams and the CPP frontend
>> would pass in the desired encoding everywhere text is extracted using
>> the Poppler API. 
> 
> I disagree, there's no point in letting the user of poppler choose which
> encoding the strings should be returned in. If she wants to use a different
> encoding, she can do the conversion on the application side.

I did not mean that the end-user application should decide, just the
frontend, i.e. the CPP frontend seems to have decided that it will
always present text as "ustring", which is UTF-16 encoded.
Hence it would be more efficient to just ask the Poppler core to
return UTF-16 encoded data within GooString, instead of returning UTF-8
and then converting it to UTF-16 before giving it to the application.
(The part about specifying the desired encoding whenever text is
extracted is only about avoiding global state as much as possible, which
is IMHO desirable in any case.)

Best regards, Adam.

> It is slightly different for pdftotext since that's an end user application, so
> it makes sense to let the user specify the output she wants, but for the cpp
> API there's going to be code on top of it, so if further conversion is needed it
> can be done there.
> 
> Cheers,
>   Albert
> 
>> It could then also just request UTF-16 encoding for its
>> ustring representation instead of always converting UTF-8 to UTF-16
>> before passing it to the user application.
>>
>> Best regards, Adam.
>>
>>> Regards,
>>> mpsuzuki
>>>
>>> Albert Astals Cid wrote:
>>>> On Thursday, 15 March 2018, at 14:20:52 CET, suzuki toshiya wrote:
>>>>> Dear Albert,
>>>>>
>>>>> Thank you, I'm glad to hear that one of the direction could be
>>>>> acceptable. Maybe the consideration of GlobalParams::textEncoding
>>>>> would be discussed in future when cpp frontend introduces an API
>>>>> to modify it to non-Unicode values.
>>>>
>>>> Honestly i don't think that makes any sense, why would you want that?
>>>>
>>>> Cheers,
>>>>   Albert
>>>>
>>>>> Now I'm discussing with Jeroen about how to fix other metadata
>>>>> (not related with text_list() API), please wait a while.
>>>>>
>>>>> Regards,
>>>>> mpsuzuki
>>>>>
>>>>> Albert Astals Cid wrote:
>>>>>> On Wednesday, 14 March 2018, at 18:17:47 CET, suzuki toshiya wrote:
>>>>>>> Dear Adam,
>>>>>>>
>>>>>>> The 2nd option, iconv + GlobalParams::textEncoding solution might be
>>>>>>> something like:
>>>>>>> https://github.com/mpsuzuki/poppler/commit/80f088c4a94b151ddb90e05147a8dc4565e01d89
>>>>>>> ?
>>>>>>
>>>>>> Seems a bit too much to me.
>>>>>>
>>>>>> I've personally had no time to test the other solution you sent
>>>>>> (replacing unicode_GooString_to_ustring with from_utf8), but if that
>>>>>> one works, it seems much simpler and more straightforward and I'd
>>>>>> like to commit that.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>>   Albert
>>>>>>
>>>>>>> Regards,
>>>>>>> mpsuzuki
>>>>>>>
>>>>>>> suzuki toshiya wrote:
>>>>>>>> Oops, I'm quite sorry for my mistake, which confused people into
>>>>>>>> thinking my bits are in github.com/freedesktop. The right places are:
>>>>>>>>
>>>>>>>> sample PDF file
>>>>>>>> https://github.com/mpsuzuki/poppler/blob/cpp-textlist-encoding-issue/cpp/tests/HereIsUSASCII.pdf
>>>>>>>>
>>>>>>>> an easiest (and oversimplified) fix for this issue
>>>>>>>> https://github.com/mpsuzuki/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> mpsuzuki
>>>>>>>>
>>>>>>>> suzuki toshiya wrote:
>>>>>>>>> Dear Jeroen, Adam,
>>>>>>>>>
>>>>>>>>> Sorry for long latency about this issue. I would try to draft
>>>>>>>>> the solutions suggested by Adam.
>>>>>>>>>
>>>>>>>>> Yet I'm not sure whether what I'm seeing now is the same trouble
>>>>>>>>> as yours. In my case, the testing PDF is:
>>>>>>>>> https://rawgit.com/freedesktop/poppler/1fb325f4b0a92ca28c110d46dc7a9dba2ad94367/cpp/tests/HereIsUSASCII.pdf
>>>>>>>>> (maybe I should provide a PDF showing surrogate characters to
>>>>>>>>> clarify the difference of UTF-8 & UTF-16)
>>>>>>>>> I see your testing code shows the same outputs for ASCII, but
>>>>>>>>> different outputs for Cyrillic etc. So, the encodings by text()
>>>>>>>>> and text_list() are different, although their types are the same
>>>>>>>>> (ustring). It should be fixed. However, US-ASCII characters
>>>>>>>>> are not garbled. If it's different from the trouble you're
>>>>>>>>> seeing, please let me know.
>>>>>>>>>
>>>>>>>>> Now the easiest solution, using ustring::from_utf8(), is drafted:
>>>>>>>>> https://github.com/freedesktop/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
>>>>>>>>> Please check if it works for you. I think it works well in my
>>>>>>>>> environment.
>>>>>>>>>
>>>>>>>>> I would proceed to the next one, implementing something like
>>>>>>>>> ustring::from_utf8() which reflects GlobalParams::textEncoding.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> mpsuzuki
>>>>>>>>>
>>>>>>>>> Adam Reichold wrote:
>>>>>>>>>> Hello Jeroen,
>>>>>>>>>>
>>>>>>>>>> On 06.03.2018 at 12:59, Jeroen Ooms wrote:
>>>>>>>>>>> On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold
>>>>>>>>>>> <adam.reichold at t-online.de> wrote:
>>>>>>>>>>>> Hello mpsuzuki,
>>>>>>>>>>>>
>>>>>>>>>>>> from a glance at the code, it seems page::text uses
>>>>>>>>>>>> ustring::from_utf8 to convert Poppler's GooString into ustring,
>>>>>>>>>>>> which seems correct if GlobalParams::textEncoding has its default
>>>>>>>>>>>> value of "UTF-8".
>>>>>>>>>>>
>>>>>>>>>>> I don't understand this part. Why is textEncoding a global
>>>>>>>>>>> property? Shouldn't this be a property of a single pdf document?
>>>>>>>>>>> Is there some way I can read a document's encoding from the C++
>>>>>>>>>>> api (without including GlobalParams.h)?
>>>>>>>>>>>
>>>>>>>>>>> The pdf spec states that different strings may have different
>>>>>>>>>>> encodings. Perhaps it would be possible to expose an encoding
>>>>>>>>>>> field in the ustring class? If there would be a way to know the
>>>>>>>>>>> encoding of a ustring, I can get the raw data and convert it to
>>>>>>>>>>> a suitable encoding myself. This would be much better than
>>>>>>>>>>> making assumptions.
>>>>>>>>>>
>>>>>>>>>> This is not the encoding of the text in the PDF document, but the
>>>>>>>>>> encoding of the GooStrings that are returned by the internal
>>>>>>>>>> Poppler API. Also I think the ustring class is intended to always
>>>>>>>>>> store UTF-16 encoded data.
>>>>>>>>>>
>>>>>>>>>> Best regards, Adam.
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> poppler mailing list
>>>>>>>>> poppler at lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
