[poppler] c++ ustring encoding still completely broken

Mon Dec 3 16:35:55 UTC 2018

On Sun, Dec 2, 2018 at 12:51 PM Adam Reichold <adam.reichold at t-online.de> wrote:
>
> Hello,
>
> Am 02.12.18 um 00:06 schrieb Albert Astals Cid:
> > El dissabte, 1 de desembre de 2018, a les 23:20:46 CET, Jeroen Ooms va escriure:
> >> I maintain the poppler bindings for the R programming language and get
> >> a lot of bug reports about corrupted text extracted with poppler.
> >> Below is a minimal example that illustrates the problem:
> >>
> >>   git clone https://github.com/jeroen/popplertest
> >>   cd popplertest
> >>   g++ -std=c++11 encoding.cpp -o encoding $(pkg-config --cflags --libs poppler-cpp)
> >>   ./encoding hello.pdf
> >>
> >> The output shows a lot of Chinese characters, which is incorrect (all
> >> text in the PDF is English).
> >>
> >> Back in March 2018, Suzuki Toshiya had posted a patch with at least a
> >> partial solution:
> >> https://lists.freedesktop.org/archives/poppler/2018-March/012962.html
> >> . I hope we can revisit this.
> >
> > Can someone please post a patch to the new gitlab merge requests? It's muuuuuch easier to keep track of what needs reviewing if we have it all there.
>
> Created !129 [1]. Probably a big improvement but I am not completely
> convinced that this is all there is to do.

Thank you for reviving this!

I tested your branch with my example program [1] and I can confirm that
it now extracts the correct text for all my example PDF files. I have
tested both plain English documents and PDF files with Chinese text. I
am not sure this covers every edge case of obscure PDF encodings, but it
is already an enormous improvement over the current situation (in which
ustrings contain only gibberish). Hopefully this can be merged soon and
we can tweak the details in further iterations.
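
For anyone wondering why plain English text can come out as Chinese
characters: below is a small standalone sketch of one plausible mechanism,
namely single-byte text being read as big-endian 16-bit code units. This is
my assumption for illustration only; it is not taken from the patch and it
does not use poppler at all. The function names misread_as_utf16be and
is_cjk_ideograph are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Pack consecutive byte pairs into big-endian 16-bit code units, the way a
// decoder would if it wrongly assumed the input was UTF-16BE.
// A trailing odd byte is dropped.
std::vector<std::uint16_t> misread_as_utf16be(const std::string& bytes) {
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const auto hi = static_cast<unsigned char>(bytes[i]);
        const auto lo = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<std::uint16_t>((hi << 8) | lo));
    }
    return units;
}

// True if the code unit lies in the CJK Unified Ideographs block
// (U+4E00 to U+9FFF).
bool is_cjk_ideograph(std::uint16_t u) {
    return u >= 0x4E00 && u <= 0x9FFF;
}
```

With the input "hello", the pairs "he" and "ll" become 0x6865 and 0x6C6C,
both of which fall inside the CJK Unified Ideographs block. Lowercase ASCII
letters occupy 0x61 to 0x7A, so any pair of them ends up in that range when
misread this way.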

[1] https://github.com/jeroen/popplertest/blob/master/encoding.cpp