[poppler] poppler::ustring encoding issue

Adam Reichold adam.reichold at t-online.de
Tue Mar 6 09:31:57 UTC 2018


Hello mpsuzuki,

From a glance at the code, it seems page::text uses ustring::from_utf8
to convert Poppler's GooString into a ustring, which is correct as long
as GlobalParams::textEncoding has its default value of "UTF-8".
page::text_list, on the other hand, uses
detail::unicode_GooString_to_ustring, which seems to try to guess the
source encoding based on byte order markers.
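
As an aside, this would also explain the "random Chinese characters"
symptom reported below: whenever two bytes of plain ASCII/UTF-8 input
end up being paired into a single UTF-16 code unit, the result usually
falls into the CJK range. A minimal sketch of that effect (my reading
of the symptom, not a confirmed trace through the code; the manual
pairing below is purely illustrative):

    #include <cstdio>
    #include <string>

    int main()
    {
        // Plain ASCII input, which is also valid UTF-8.
        const std::string bytes = "ab";

        // Pair the two bytes into one big-endian UTF-16 code unit, as
        // a decoder would if it wrongly assumed UTF-16BE input.
        const char16_t unit =
            (static_cast<unsigned char>(bytes[0]) << 8) |
             static_cast<unsigned char>(bytes[1]);

        // Prints U+6162, which is a CJK ideograph.
        std::printf("U+%04X\n", static_cast<unsigned>(unit));
        return 0;
    }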

Personally, I see a few possibilities to fix things:

* Always assume GlobalParams::textEncoding == "UTF-8" for the cpp
frontend and use ustring::from_utf8.

* Implement something similar to ustring::from_utf8 based on the
capabilities of iconv, and use the actual value of
GlobalParams::textEncoding to specify the source encoding (a sketch
follows below).

* Adjust the Poppler core so that the places which consult
GlobalParams::textEncoding and are used by the cpp frontend take an
explicit textEncoding parameter, allowing the cpp frontend to specify
which encoding it wants (UTF-8, or I guess even UTF-16 directly).

IMHO, this order is one of increasing effort, but also of increasing
long-term maintainability.
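
For the second option, a rough sketch of what an iconv-based conversion
could look like (the function name and error handling are mine, not
existing Poppler API, and the "UTF-16LE" target assumes a little-endian
host):

    #include <iconv.h>
    #include <stdexcept>
    #include <string>

    // Hypothetical helper: decode bytes in the encoding named by
    // GlobalParams::textEncoding into UTF-16, i.e. the representation
    // a ustring could be built from.
    std::u16string decode_text(const std::string &bytes,
                               const char *from_encoding)
    {
        iconv_t cd = iconv_open("UTF-16LE", from_encoding);
        if (cd == reinterpret_cast<iconv_t>(-1))
            throw std::runtime_error("unsupported encoding");

        // One code unit per input byte is always enough here: every
        // source code point occupies at least as many bytes as the
        // UTF-16 code units it decodes to.
        std::u16string out(bytes.size(), u'\0');
        char *in = const_cast<char *>(bytes.data());
        size_t in_left = bytes.size();
        char *out_buf = reinterpret_cast<char *>(&out[0]);
        size_t out_left = out.size() * sizeof(char16_t);

        const size_t rc = iconv(cd, &in, &in_left, &out_buf, &out_left);
        iconv_close(cd);
        if (rc == static_cast<size_t>(-1))
            throw std::runtime_error("conversion failed");

        // Trim the buffer to the number of code units produced.
        out.resize(out.size() - out_left / sizeof(char16_t));
        return out;
    }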

Best regards, Adam.

Am 06.03.2018 um 09:00 schrieb suzuki toshiya:
> Oh, I should take a look. Do you think any change to the public API
> of the cpp frontend is needed?
> 
> Regards,
> mpsuzuki
> 
> On 3/6/2018 12:29 AM, Jeroen Ooms wrote:
>> A minimal example of this in a simple C++ program: https://git.io/vAQFW
>>
>> When running the example on a simple English PDF file, page->text()
>> gets printed correctly; however, the metadata fields as well as the
>> words from page->text_list() seem to get the wrong encoding. What am
>> I doing wrong here?
>>
>> On Mon, Mar 5, 2018 at 3:10 PM, Jeroen Ooms <jeroen at berkeley.edu> wrote:
>>> I'm testing the new page::text_list() function but I run into an old
>>> problem where the conversion of the ustring to UTF-8 doesn't do what I
>>> expect:
>>>
>>>    byte_array buf = x.to_utf8();
>>>    std::string y(buf.begin(), buf.end());
>>>    const char * str = y.c_str();
>>>
>>> The resulting char * is not UTF-8. It contains random Chinese
>>> characters for PDF files with plain English ASCII text. I can work
>>> around the problem by using x.to_latin1(), which gives the correct
>>> text, mostly, but obviously that doesn't work for non-English text.
>>>
>>> I remember running into this before, for example when reading
>>> toc_item->title() or document->info_key(): the conversion to UTF-8
>>> also doesn't seem to work. Perhaps I am misunderstanding how this
>>> works. Is there some limitation on PDFs or ustrings that limits
>>> their ability to be converted to UTF-8?
>>>
>>> Somehow I am not getting this problem for ustrings from the
>>> page->text() method.
