[poppler] Incompatible number of glyphs from glib get_text{, layout}

Peter Waller peter at scraperwiki.com
Tue May 26 04:53:57 PDT 2015


I've made a bit (punintended - you'll see) of progress.

The document in question: http://pwaller.net/sw/2014-01-17-broken.pdf

If you try to copy and paste *all the text* from this document, the
result is truncated after the top-right part. Copying and pasting just
the bottom-left part works.

There is some text which doesn't appear unless you select it
("Swartzville, etc"). It cannot be pasted into an external
application; instead, pasting gives a few control codes (much shorter
than the selected string).

I instrumented TextSelectionDumper::getText and TextPage::dumpFragment
to print the integer value of each unicode character as it was being
appended to `s`. The output is at the end of this message.

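It is interesting to note that (A) characters which appear lower case
visually in the document come out upper case, and digits and
punctuation come out as control codes (bit 6 flipped, mask 0x20, in
both cases), and (B) upper case characters come out with bits 6 and 7
flipped (mask 0x60). Bits are counted from 1 here.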
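
Both observations are consistent with a single subtraction of 0x20 from
each character code: in the ranges 0x20-0x3f and 0x60-0x7f that clears
one bit, and in 0x40-0x5f it changes two. A minimal sketch to check
this against the codes listed at the end of this message follows. The
offset is an assumption inferred from the dump, not something taken
from poppler, and the names LINE_32, LINE_33 and shift_up are made up
for illustration.

```python
# Codes copied from the "Line 32" and "Line 33" dumps at the end of the message.
LINE_32 = [0x17, 0x16, 0x10, 0x00, 0x37, 0x0e, 0x00, 0x33, 0x57, 0x41, 0x52,
           0x54, 0x5a, 0x56, 0x49, 0x4c, 0x4c, 0x45, 0x00, 0x32, 0x44, 0x0e]
LINE_33 = [0x00, 0x00, 0x00, 0x32, 0x45, 0x49, 0x4e, 0x48, 0x4f, 0x4c, 0x44,
           0x53, 0x0c, 0x00, 0x30, 0x21, 0x00, 0x11, 0x17, 0x15, 0x16, 0x19]

# Assumption: every code is 0x20 too low. This is only a guess from the data.
OFFSET = 0x20


def shift_up(codes, offset=OFFSET):
    """Add a fixed offset to every code and return the resulting string."""
    return "".join(chr(code + offset) for code in codes)


if __name__ == "__main__":
    # repr() keeps leading spaces visible in the output.
    print(repr(shift_up(LINE_32)))
    print(repr(shift_up(LINE_33)))
```

If the assumption holds, the shifted strings should read as the text
that is visible in the document when selected. This only describes the
symptom; it does not say where the offset is introduced.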

I'm going down the rabbit hole here, so any advice is appreciated.
Where might this brokenness be entering? The text is rendered
correctly, so my intuition is that the correct string must be
available somehow.

Thanks!

- Peter

Line 32 nwords=1
0x17 c=X
0x16 c=X
0x10 c=X
0x00 c=X
0x37 c=7
0x0e c=X
0x00 c=X
0x33 c=3
0x57 c=W
0x41 c=A
0x52 c=R
0x54 c=T
0x5a c=Z
0x56 c=V
0x49 c=I
0x4c c=L
0x4c c=L
0x45 c=E
0x00 c=X
0x32 c=2
0x44 c=D
0x0e c=X

Line 33 nwords=1
0x00 c=X
0x00 c=X
0x00 c=X
0x32 c=2
0x45 c=E
0x49 c=I
0x4e c=N
0x48 c=H
0x4f c=O
0x4c c=L
0x44 c=D
0x53 c=S
0x0c c=X
0x00 c=X
0x30 c=0
0x21 c=!
0x00 c=X
0x11 c=X
0x17 c=X
0x15 c=X
0x16 c=X
0x19 c=X

