[poppler] poppler::ustring encoding issue

Albert Astals Cid aacid at kde.org
Sun Mar 18 11:29:42 UTC 2018


On Thursday, 15 March 2018, at 14:20:52 CET, suzuki toshiya wrote:
> Dear Albert,
> 
> Thank you, I'm glad to hear that one of the directions could be
> acceptable. Perhaps GlobalParams::textEncoding could be
> considered in the future, if the cpp frontend introduces an API
> to set it to non-Unicode values.

Honestly, I don't think that makes any sense. Why would you want that?

Cheers,
  Albert

> 
> I'm now discussing with Jeroen how to fix the other metadata
> (not related to the text_list() API), so please wait a while.
> 
> Regards,
> mpsuzuki
> 
> Albert Astals Cid wrote:
> > On Wednesday, 14 March 2018, at 18:17:47 CET, suzuki toshiya wrote:
> >> Dear Adam,
> >> 
> >> Would the 2nd option, the iconv + GlobalParams::textEncoding solution,
> >> be something like the following?
> >> https://github.com/mpsuzuki/poppler/commit/80f088c4a94b151ddb90e05147a8dc4565e01d89
> > 
> > Seems a bit too much to me.
> > 
> > I've personally had no time to test the other solution you sent
> > (replacing unicode_GooString_to_ustring with from_utf8), but if that one
> > works, it seems much simpler and more straightforward, and I'd like to
> > commit that.
> > 
> > Cheers,
> > 
> >   Albert
> >> 
> >> Regards,
> >> mpsuzuki
> >> 
> >> suzuki toshiya wrote:
> >>> Oops, I'm sorry for my mistake, which may have led people to think
> >>> that my changes are in github.com/freedesktop. The right places are:
> >>> 
> >>> sample PDF file
> >>> https://github.com/mpsuzuki/poppler/blob/cpp-textlist-encoding-issue/cpp/tests/HereIsUSASCII.pdf
> >>> 
> >>> the easiest (and oversimplified) fix for this issue
> >>> https://github.com/mpsuzuki/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
> >>> 
> >>> Regards,
> >>> mpsuzuki
> >>> 
> >>> suzuki toshiya wrote:
> >>>> Dear Jeroen, Adam,
> >>>> 
> >>>> Sorry for the long delay on this issue. I will try to draft
> >>>> the solutions suggested by Adam.
> >>>> 
> >>>> However, I'm not sure that what I'm seeing is the same problem you have.
> >>>> In my case, the test PDF is:
> >>>> https://rawgit.com/freedesktop/poppler/1fb325f4b0a92ca28c110d46dc7a9dba2ad94367/cpp/tests/HereIsUSASCII.pdf
> >>>> (maybe I should provide a PDF showing surrogate characters to
> >>>> clarify the difference between UTF-8 and UTF-16).
> >>>> I see that your test code shows the same output for ASCII but
> >>>> different output for Cyrillic etc. So the encodings used by text()
> >>>> and text_list() differ, although their types are the same
> >>>> (ustring). This should be fixed. However, US-ASCII characters
> >>>> are not garbled. If this differs from the problem you're
> >>>> seeing, please let me know.
> >>>> 
> >>>> The easiest solution, using ustring::from_utf8(), is now drafted:
> >>>> https://github.com/freedesktop/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
> >>>> Please check whether it works for you. I think it works well in my
> >>>> environment.
> >>>> 
> >>>> I will proceed to the next one, implementing something like
> >>>> ustring::from_utf8() that respects GlobalParams::textEncoding.
> >>>> 
> >>>> Regards,
> >>>> mpsuzuki
> >>>> 
> >>>> Adam Reichold wrote:
> >>>>> Hello Jeroen,
> >>>>> 
> >>>>> On 06.03.2018 at 12:59, Jeroen Ooms wrote:
> >>>>>> On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold <adam.reichold at t-online.de> wrote:
> >>>>>>> Hello mpsuzuki,
> >>>>>>> 
> >>>>>>> From a glance at the code, it seems page::text uses
> >>>>>>> ustring::from_utf8 to convert Poppler's GooString into ustring,
> >>>>>>> which seems correct if GlobalParams::textEncoding has its default
> >>>>>>> value of "UTF-8".
> >>>>>> 
> >>>>>> I don't understand this part. Why is textEncoding a global property?
> >>>>>> Shouldn't this be a property of a single PDF document? Is there some
> >>>>>> way I can read a document's encoding from the C++ API (without
> >>>>>> including GlobalParams.h)?
> >>>>>> 
> >>>>>> The PDF spec states that different strings may have different
> >>>>>> encodings. Perhaps it would be possible to expose an encoding field
> >>>>>> in the ustring class? If there were a way to know the encoding of a
> >>>>>> ustring, I could get the raw data and convert it to a suitable
> >>>>>> encoding myself. This would be much better than making assumptions.
> >>>>> 
> >>>>> This is not the encoding of the text in the PDF document, but the
> >>>>> encoding of the GooStrings that are returned by the internal Poppler
> >>>>> API. Also, I think the ustring class is intended to always store
> >>>>> UTF-16 encoded data.
> >>>>> 
> >>>>> Best regards, Adam.
> >>>> 
> >>>> _______________________________________________
> >>>> poppler mailing list
> >>>> poppler at lists.freedesktop.org
> >>>> https://lists.freedesktop.org/mailman/listinfo/poppler

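The symptom described in the quoted thread (identical output for ASCII,
different output for Cyrillic) can be illustrated without Poppler at all.
The sketch below is not Poppler code: widen_bytes and decode_utf8 are
hypothetical helpers written only for this illustration. It assumes the
garbling comes from a conversion that turns each byte of a UTF-8 string into
its own 16-bit code unit, whereas a from_utf8-style conversion decodes
multi-byte sequences into UTF-16 (using a surrogate pair above U+FFFF).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: treats every byte as its own 16-bit code unit.
// ASCII survives this unchanged; multi-byte UTF-8 sequences do not.
std::u16string widen_bytes(const std::string &bytes)
{
    std::u16string out;
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char16_t>(b));
    }
    return out;
}

// Hypothetical helper: decodes UTF-8 into UTF-16 code units.
// Minimal sketch: assumes well-formed input and performs no validation.
std::u16string decode_utf8(const std::string &bytes)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;   // code point being assembled
        std::size_t extra = 0;  // number of continuation bytes to read
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 0; k < extra && i < bytes.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        if (cp >= 0x10000) {
            // Outside the BMP: emit a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

For a pure ASCII string such as "Here", both helpers return the same code
units, which matches the observation that US-ASCII is not garbled. For the
Cyrillic letter U+0416 (UTF-8 bytes D0 96), widen_bytes yields the two units
00D0 0096 while decode_utf8 yields the single unit 0416.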