[poppler] poppler::ustring encoding issue

Albert Astals Cid aacid at kde.org
Sun Mar 18 19:18:44 UTC 2018


On Sunday, 18 March 2018, at 19:39:41 CET, Adam Reichold wrote:
> Hello Albert,
> 
> On 18.03.2018 at 18:52, Albert Astals Cid wrote:
> > On Sunday, 18 March 2018, at 15:46:42 CET, Adam Reichold wrote:
> >> Hello mpsuzuki,
> >> 
> >> On 18.03.2018 at 13:18, suzuki toshiya wrote:
> >>> Dear Albert,
> >>> 
> >>> Please let me confirm what you meant.
> >>> 
> >>>>> Maybe the consideration of GlobalParams::textEncoding
> >>>>> would be discussed in future when cpp frontend introduces an API
> >>>>> to modify it to non-Unicode values.
> >>>> 
> >>>> Honestly I don't think that makes any sense, why would you want that?
> >>> 
> >>> Do you mean that "for the cpp frontend, there is no need to handle
> >>> the case where a non-Unicode encoding is specified in
> >>> GlobalParams::textEncoding"?
> >>> 
> >>> If so, the reason would be: "text(), text_list(), etc. return text
> >>> as ustring objects, so even if clients could set
> >>> GlobalParams::textEncoding to a preferred non-Unicode encoding, they
> >>> could not retrieve the text in that encoding. Therefore there is no
> >>> need to expose GlobalParams::setTextEncoding() via the cpp frontend"?
> >>> 
> >>> If this is what you meant, I agree that there is no need to handle
> >>> a non-Unicode encoding in GlobalParams::textEncoding.
> >>> 
> >>> The reason I tried to handle such cases was that some utils (like
> >>> pdftotext) allow users to specify a non-Unicode encoding, so I was
> >>> wondering whether something similar would be added to the cpp
> >>> frontend in the future. If there is no such plan, that's good news
> >>> for me.
> >>> 
> >>> Sorry for the lengthy confirmation!
> >> 
> >> I think you might be confusing two distinct interfaces:
> >> 
> >> * The CPP frontend and the actual user application: There should be no
> >> mention of GlobalParams here, since this is an internal
> >> implementation detail (that we want to get rid of if at all possible)
> >> and the user application should not know or care about it. So we
> >> definitely should not expose GlobalParams directly or
> >> GlobalParams::setTextEncoding indirectly.
> >> 
> >> * The internal Poppler API and the CPP frontend: The CPP frontend
> >> currently assumes that GlobalParams::textEncoding is "UTF-8", which is
> >> almost alright: it does not expose GlobalParams, hence the user
> >> application cannot change it and relying on the default value is fine.
> >> This should only break if the default value changes (and hence the CPP
> >> frontend needs to be adjusted) or the user application circumvents the
> >> CPP frontend by using the internal API directly (but that seems to be
> >> its own fault IMHO).
> > 
> > Exactly, as far as the cpp frontend is concerned
> > GlobalParams::textEncoding is always "UTF-8" so that's all you need to
> > care about.
> > 
> >> Of course, ideally we would not have GlobalParams and the CPP frontend
> >> would pass in the desired encoding everywhere text is extracted using
> >> the Poppler API.
> > 
> > I disagree, there's no point in letting the user of poppler choose
> > which encoding the strings should be returned in; if she wants to use a
> > different encoding, she can do the conversion on the application side.
> 
> I did not mean that the end-user application should decide, just the
> frontend, i.e. the CPP frontend seems to have decided that it will
> always present text as "ustring", which is UTF-16 encoded.
> Hence it would be more efficient to just request the Poppler core to
> return UTF-16 encoded data within GooString, instead of returning UTF-8
> and then converting to UTF-16 before giving it to the application.
> (The part about specifying the desired encoding whenever text is
> extracted is only about avoiding global state as much as possible and
> is IMHO desirable in any case.)

Ah, OK, that makes some sense. On the other hand, it means the cpp
frontend would be using a less commonly used GlobalParams::textEncoding
value and might hit unique bugs because of that. But yes, ideally we would
not have bugs, and what you suggest would be somewhat more efficient.
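
For illustration, the extra conversion step under discussion can be
sketched as below. This is a minimal standalone sketch, not poppler's
code: the function name utf8_to_utf16 is hypothetical, std::u16string
merely stands in for ustring (assuming, as said earlier in this thread,
that ustring holds UTF-16 code units), and replacing malformed input with
U+FFFD is an assumption about the desired error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (NOT poppler's ustring::from_utf8): decodes UTF-8
// and re-encodes it as UTF-16. Malformed sequences become U+FFFD.
std::u16string utf8_to_utf16(const std::string &in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t extra = 0;     // continuation bytes expected
        std::uint32_t cp = lead;   // code point being assembled
        std::uint32_t min_cp = 0;  // smallest legal value (rejects overlong forms)
        bool ok = true;

        if (lead < 0x80) {
            // plain ASCII, nothing more to read
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; min_cp = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; min_cp = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; min_cp = 0x10000;
        } else {
            ok = false;  // stray continuation byte or invalid lead byte
        }

        // Read the continuation bytes; 'used' counts bytes consumed.
        std::size_t used = 1;
        while (ok && used <= extra) {
            if (i + used >= in.size()) { ok = false; break; }  // truncated
            const unsigned char c = static_cast<unsigned char>(in[i + used]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }     // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
            ++used;
        }

        // Reject overlong forms, out-of-range values and encoded surrogates.
        if (ok && (cp < min_cp || cp > 0x10FFFF ||
                   (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = false;
        }
        if (!ok) {
            cp = 0xFFFD;
        }
        i += used;

        assert(cp <= 0x10FFFF);
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Outside the BMP: emit a surrogate pair (two UTF-16 code units).
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

ASCII maps to the same values in both encodings, which is why US-ASCII
text is not garbled when the two are mixed up; Cyrillic and characters
outside the BMP are where the difference shows.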

Cheers,
  Albert

> 
> Best regards, Adam.
> 
> > It is slightly different for pdftotext since that's an end-user
> > application, so it makes sense to let the user specify the output she
> > wants, but for the cpp API there's going to be code on top of it, so if
> > further conversion is needed it can be done there.
> > 
> > Cheers,
> > 
> >   Albert
> >> 
> >> It could then also just request UTF-16 encoding for its
> >> ustring representation instead of always converting UTF-8 to UTF-16
> >> before passing it to the user application.
> >> 
> >> Best regards, Adam.
> >> 
> >>> Regards,
> >>> mpsuzuki
> >>> 
> >>> Albert Astals Cid wrote:
> >>>> On Thursday, 15 March 2018, at 14:20:52 CET, suzuki toshiya wrote:
> >>>>> Dear Albert,
> >>>>> 
> >>>>> Thank you, I'm glad to hear that one of the directions could be
> >>>>> acceptable. Maybe the consideration of GlobalParams::textEncoding
> >>>>> would be discussed in future when cpp frontend introduces an API
> >>>>> to modify it to non-Unicode values.
> >>>> 
> >>>> Honestly I don't think that makes any sense, why would you want that?
> >>>> 
> >>>> Cheers,
> >>>> 
> >>>>   Albert
> >>>>> 
> >>>>> I'm now discussing with Jeroen how to fix other metadata
> >>>>> (not related to the text_list() API); please wait a while.
> >>>>> 
> >>>>> Regards,
> >>>>> mpsuzuki
> >>>>> 
> >>>>> Albert Astals Cid wrote:
> >>>>>> On Wednesday, 14 March 2018, at 18:17:47 CET, suzuki toshiya wrote:
> >>>>>>> Dear Adam,
> >>>>>>> 
> >>>>>>> The 2nd option, the iconv + GlobalParams::textEncoding solution,
> >>>>>>> might be something like:
> >>>>>>> https://github.com/mpsuzuki/poppler/commit/80f088c4a94b151ddb90e05147a8dc4565e01d89 ?
> >>>>>> 
> >>>>>> Seems a bit too much to me.
> >>>>>> 
> >>>>>> I've personally had no time to test the other solution you sent
> >>>>>> (replacing unicode_GooString_to_ustring with from_utf8), but if that
> >>>>>> one works, it seems much simpler and more straightforward and I'd
> >>>>>> like to commit that.
> >>>>>> 
> >>>>>> Cheers,
> >>>>>> 
> >>>>>>   Albert
> >>>>>>> 
> >>>>>>> Regards,
> >>>>>>> mpsuzuki
> >>>>>>> 
> >>>>>>> suzuki toshiya wrote:
> >>>>>>>> Oops, I'm quite sorry for my mistake, which may have confused
> >>>>>>>> people into thinking my bits are in github.com/freedesktop. The
> >>>>>>>> right places are:
> >>>>>>>> 
> >>>>>>>> sample PDF file
> >>>>>>>> https://github.com/mpsuzuki/poppler/blob/cpp-textlist-encoding-issue/cpp/tests/HereIsUSASCII.pdf
> >>>>>>>> 
> >>>>>>>> the easiest (and oversimplified) fix for this issue
> >>>>>>>> https://github.com/mpsuzuki/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
> >>>>>>>> 
> >>>>>>>> Regards,
> >>>>>>>> mpsuzuki
> >>>>>>>> 
> >>>>>>>> suzuki toshiya wrote:
> >>>>>>>>> Dear Jeroen, Adam,
> >>>>>>>>> 
> >>>>>>>>> Sorry for the long delay on this issue. I will try to draft
> >>>>>>>>> the solutions suggested by Adam.
> >>>>>>>>> 
> >>>>>>>>> I'm not yet sure whether what I'm seeing is the same problem you
> >>>>>>>>> are seeing. In my case, the testing PDF is:
> >>>>>>>>> https://rawgit.com/freedesktop/poppler/1fb325f4b0a92ca28c110d46dc7a9dba2ad94367/cpp/tests/HereIsUSASCII.pdf
> >>>>>>>>> (maybe I should provide a PDF showing surrogate characters to
> >>>>>>>>> clarify the difference between UTF-8 & UTF-16)
> >>>>>>>>> I see your testing code shows the same outputs for ASCII, but
> >>>>>>>>> different outputs for Cyrillic etc. So, the encodings used by
> >>>>>>>>> text() and text_list() are different, although their types are
> >>>>>>>>> the same (ustring). It should be fixed. However, US-ASCII
> >>>>>>>>> characters are not garbled. If it's different from the trouble
> >>>>>>>>> you're seeing, please let me know.
> >>>>>>>>> 
> >>>>>>>>> Now the easiest solution, using ustring::from_utf8(), is drafted:
> >>>>>>>>> https://github.com/freedesktop/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
> >>>>>>>>> Please check if it works for you. I think it works well in my
> >>>>>>>>> environment.
> >>>>>>>>> 
> >>>>>>>>> I will proceed to the next one, implementing something like
> >>>>>>>>> ustring::from_utf8() which reflects GlobalParams::textEncoding.
> >>>>>>>>> 
> >>>>>>>>> Regards,
> >>>>>>>>> mpsuzuki
> >>>>>>>>> 
> >>>>>>>>> Adam Reichold wrote:
> >>>>>>>>>> Hello Jeroen,
> >>>>>>>>>> 
> >>>>>>>>>> On 06.03.2018 at 12:59, Jeroen Ooms wrote:
> >>>>>>>>>>> On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold
> >>>>>>>>>>> <adam.reichold at t-online.de> wrote:
> >>>>>>>>>>>> Hello mpsuzuki,
> >>>>>>>>>>>> 
> >>>>>>>>>>>> from a glance at the code, it seems page::text uses
> >>>>>>>>>>>> ustring::from_utf8 to convert Poppler's GooString into ustring,
> >>>>>>>>>>>> which seems correct if GlobalParams::textEncoding has its
> >>>>>>>>>>>> default value of "UTF-8".
> >>>>>>>>>>> 
> >>>>>>>>>>> I don't understand this part. Why is textEncoding a global
> >>>>>>>>>>> property? Shouldn't this be a property of a single PDF
> >>>>>>>>>>> document? Is there some way I can read a document's encoding
> >>>>>>>>>>> from the C++ API (without including GlobalParams.h)?
> >>>>>>>>>>> 
> >>>>>>>>>>> The PDF spec states that different strings may have different
> >>>>>>>>>>> encodings. Perhaps it would be possible to expose an encoding
> >>>>>>>>>>> field in the ustring class? If there were a way to know the
> >>>>>>>>>>> encoding of a ustring, I could get the raw data and convert it
> >>>>>>>>>>> to a suitable encoding myself. This would be much better than
> >>>>>>>>>>> making assumptions.
> >>>>>>>>>> 
> >>>>>>>>>> This is not the encoding of the text in the PDF document, but the
> >>>>>>>>>> encoding of the GooStrings that are returned by the internal
> >>>>>>>>>> Poppler API.
> >>>>>>>>>> Also I think the ustring class is intended to always store UTF-16
> >>>>>>>>>> encoded data.
> >>>>>>>>>> 
> >>>>>>>>>> Best regards, Adam.
> >>>>>>>>> 
> >>>>>>>>> _______________________________________________
> >>>>>>>>> poppler mailing list
> >>>>>>>>> poppler at lists.freedesktop.org
> >>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/poppler
