[poppler] poppler::ustring encoding issue

Albert Astals Cid aacid at kde.org
Wed Mar 14 23:04:48 UTC 2018


On Wednesday, 14 March 2018, at 18:17:47 CET, suzuki toshiya wrote:
> Dear Adam,
> 
> The 2nd option, iconv + GlobalParams::textEncoding solution might be
> something like:
> https://github.com/mpsuzuki/poppler/commit/80f088c4a94b151ddb90e05147a8dc4565e01d89 ?

Seems a bit too much to me.

I've personally had no time to test the other solution you sent (replacing 
unicode_GooString_to_ustring with from_utf8), but if that one works, it seems 
much simpler and more straightforward, and I'd like to commit that.
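
For readers following the thread, here is a minimal sketch of why the two code paths can agree on US-ASCII yet differ on Cyrillic. It assumes the garbling comes from widening each UTF-8 byte into one 16-bit unit instead of decoding it. widen_bytes and utf8_to_utf16 are hypothetical stand-ins written for illustration, not poppler functions, and the decoder assumes well-formed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical: copy every byte into its own 16-bit unit, with no decoding.
std::vector<std::uint16_t> widen_bytes(const std::string &bytes) {
    std::vector<std::uint16_t> out;
    out.reserve(bytes.size());
    for (unsigned char c : bytes) {
        out.push_back(c);
    }
    return out;
}

// Hypothetical: decode UTF-8 into UTF-16 code units.
// Assumes well-formed UTF-8; malformed input is not detected.
std::vector<std::uint16_t> utf8_to_utf16(const std::string &bytes) {
    std::vector<std::uint16_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        std::uint32_t cp = 0;
        int extra = 0;
        if (lead < 0x80) {
            cp = lead;                       // 1-byte sequence (US-ASCII)
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1;     // 2-byte sequence
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2;     // 3-byte sequence
        } else {
            cp = lead & 0x07; extra = 3;     // 4-byte sequence
        }
        for (int k = 0; k < extra && i < bytes.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        if (cp >= 0x10000) {
            // Code points outside the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<std::uint16_t>(cp));
        }
    }
    return out;
}
```

For "Here" both helpers return the same units, which would match the observation that US-ASCII is not garbled. For "\xD0\x96" (U+0416 in UTF-8) the decoder returns one unit while the byte-wise version returns two unrelated ones.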

Cheers,
  Albert

> 
> Regards,
> mpsuzuki
> 
> suzuki toshiya wrote:
> > Oops, I'm quite sorry for my mistake, which made it look as
> > if my changes are in github.com/freedesktop. The right places are:
> > 
> > sample PDF file
> > https://github.com/mpsuzuki/poppler/blob/cpp-textlist-encoding-issue/cpp/tests/HereIsUSASCII.pdf
> > 
> > the easiest (and oversimplified) fix for this issue
> > https://github.com/mpsuzuki/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
> > 
> > Regards,
> > mpsuzuki
> > 
> > suzuki toshiya wrote:
> >> Dear Jeroen, Adam,
> >> 
> >> Sorry for the long delay on this issue. I will try to draft
> >> the solutions suggested by Adam.
> >> 
> >> Yet I'm not sure that what I'm seeing now is the same trouble as yours.
> >> In my case, the test PDF is:
> >> https://rawgit.com/freedesktop/poppler/1fb325f4b0a92ca28c110d46dc7a9dba2ad94367/cpp/tests/HereIsUSASCII.pdf
> >> (maybe I should provide a PDF showing surrogate characters to
> >> clarify the difference between UTF-8 and UTF-16)
> >> I see your test code shows the same output for ASCII, but
> >> different output for Cyrillic etc. So the encodings used by text()
> >> and textlist() are different, although their types are the same
> >> (ustring). That should be fixed. However, US-ASCII characters
> >> are not garbled. If this is different from the trouble you're
> >> seeing, please let me know.
> >> 
> >> Now the easiest solution, using ustring::from_utf8(), is drafted:
> >> https://github.com/freedesktop/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
> >> Please check if it works for you. I think it works well in my
> >> environment.
> >> 
> >> I will proceed to the next one, implementing something like
> >> ustring::from_utf8() that respects GlobalParams::textEncoding.
> >> 
> >> Regards,
> >> mpsuzuki
> >> 
> >> Adam Reichold wrote:
> >>> Hello Jeroen,
> >>> 
> >>> Am 06.03.2018 um 12:59 schrieb Jeroen Ooms:
> >>>> On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold
> >>>> 
> >>>> <adam.reichold at t-online.de> wrote:
> >>>>> Hello mpsuzuki,
> >>>>> 
> >>>>> from a glance at the code, it seems page::text uses ustring::from_utf8
> >>>>> to convert Poppler's GooString into ustring, which seems correct if
> >>>>> GlobalParams::textEncoding has its default value of "UTF-8".
> >>>> 
> >>>> I don't understand this part. Why is textEncoding a global property?
> >>>> Shouldn't this be a property of a single pdf document? Is there some way
> >>>> I can read a document's encoding from the C++ api (without including
> >>>> GlobalParams.h)?
> >>>> 
> >>>> The pdf spec states that different strings may have different
> >>>> encodings. Perhaps it would be possible to expose an encoding field in
> >>>> the ustring class? If there would be a way to know the encoding of a
> >>>> ustring, I can get the raw data and convert it to a suitable encoding
> >>>> myself. This would be much better than making assumptions.
> >>> 
> >>> This is not the encoding of the text in the PDF document, but the
> >>> encoding of the GooStrings that are returned by the internal Poppler API.
> >>> Also, I think the ustring class is intended to always store UTF-16
> >>> encoded data.
> >>> 
> >>> Best regards, Adam.
> >> 
> >> _______________________________________________
> >> poppler mailing list
> >> poppler at lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/poppler
