[poppler] poppler::ustring encoding issue

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Tue Mar 6 08:00:08 UTC 2018


Oh, I should take a look. Do you think any change to the public API
of the cpp frontend is needed?

Regards,
mpsuzuki

On 3/6/2018 12:29 AM, Jeroen Ooms wrote:
> A minimal example of this in a simple C++ program: https://git.io/vAQFW
> 
> When running the example on a simple English PDF file, the
> page->text() output is printed correctly, but the metadata fields as
> well as the words from page->text_list() seem to get the wrong
> encoding. What am I doing wrong here?
> 
> On Mon, Mar 5, 2018 at 3:10 PM, Jeroen Ooms <jeroen at berkeley.edu> wrote:
>> I'm testing the new page::text_list() function but I run into an old
>> problem where the conversion of the ustring to UTF-8 doesn't do what I
>> expect:
>>
>>    byte_array buf = x.to_utf8();
>>    std::string y(buf.begin(), buf.end());
>>    const char * str = y.c_str();
>>
>> The resulting char * is not UTF-8. It contains random Chinese
>> characters for PDF files with plain English ASCII text. I can work
>> around the problem by using x.to_latin1(), which mostly gives the
>> correct text, but obviously it doesn't work for non-English text.
>>
>> I remember running into this before, for example when reading a
>> toc_item->title() or document->info_key(): the conversion to UTF-8 also
>> doesn't seem to work. Perhaps I am misunderstanding how this works. Is
>> there some limitation on PDFs or ustrings that limits their ability to
>> be converted to UTF-8?
>>
>> Somehow I am not getting this problem for ustrings from the page->text() method.
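
One possible reading of the "random Chinese characters" symptom is a byte-order mix-up in the 16-bit code units held by the ustring. The thread does not confirm this, so the following is only a sketch of that assumption. It uses no poppler calls, and the helper names (swap_bytes, bmp_to_utf8, is_cjk) are hypothetical. It shows that ASCII code units such as 0x0041 fall into CJK ranges once their bytes are swapped.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: swap the two bytes of a 16-bit code unit,
// i.e. what happens when UTF-16 data is read with the wrong endianness.
static std::uint16_t swap_bytes(std::uint16_t u) {
    return static_cast<std::uint16_t>((u << 8) | (u >> 8));
}

// Hypothetical helper: encode BMP code units as UTF-8.
// Surrogate pairs are deliberately not handled in this sketch.
static std::string bmp_to_utf8(const std::vector<std::uint16_t>& units) {
    std::string out;
    for (std::uint16_t u : units) {
        if (u < 0x80) {
            out += static_cast<char>(u);
        } else if (u < 0x800) {
            out += static_cast<char>(0xC0 | (u >> 6));
            out += static_cast<char>(0x80 | (u & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (u >> 12));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (u & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: true if the code unit lies in CJK Unified
// Ideographs Extension A (U+3400..U+4DBF) or the main CJK Unified
// Ideographs block (U+4E00..U+9FFF).
static bool is_cjk(std::uint16_t u) {
    return (u >= 0x3400 && u <= 0x4DBF) || (u >= 0x4E00 && u <= 0x9FFF);
}
```

With these helpers, the code units for "Ab" (0x0041, 0x0062) convert to the UTF-8 string "Ab" as expected, while their byte-swapped forms (0x4100, 0x6200) both satisfy is_cjk(). If the byte-order assumption holds, that would also fit with to_latin1() appearing to work on English text.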