[poppler] Confusion about poppler_page_get_text and poppler_page_get_text_layout

Rupert Swarbrick rswarbrick at gmail.com
Mon Dec 19 16:56:47 PST 2011


I've spent a little more time with the question, and made a C-based
tester to avoid any influence from lispyness. I've attached the code
below.

On a PDF of the source code for the tester, which I've also attached, I
get the expected output (I think). Both lengths are 1544, and the
numbers increase in an obvious manner.

However, the Microchip datasheet on which I was originally testing my
code still fails weirdly. You can get it from [1] (I'm not sure whether
I should be attaching it to a mailing list post), and the first few
lines of output look like:

String length: 1541
Rectangles:    1477

M | #<RECTANGLE (258.839000  68.019700) (288.823400 108.195700)>
C | #<RECTANGLE (288.823400  68.019700) (314.811800 108.195700)>
P | #<RECTANGLE (314.811800  68.019700) (338.820200 108.195700)>
6 | #<RECTANGLE (338.820200  68.019700) (358.832600 108.195700)>
3 | #<RECTANGLE (358.832600  68.019700) (378.845000 108.195700)>
1 | #<RECTANGLE (378.845000  68.019700) (398.857400 108.195700)>
/ | #<RECTANGLE (398.857400  68.019700) (408.861800 108.195700)>
2 | #<RECTANGLE (408.861800  68.019700) (428.874200 108.195700)>
/ | #<RECTANGLE (428.874200  68.019700) (438.878600 108.195700)>
3 | #<RECTANGLE (438.878600  68.019700) (458.891000 108.195700)>
/ | #<RECTANGLE (458.891000  68.019700) (468.895400 108.195700)>
4 | #<RECTANGLE (468.895400  68.019700) (488.907800 108.195700)>
/ | #<RECTANGLE (488.907800  68.019700) (498.912200 108.195700)>
5 | #<RECTANGLE (498.912200  68.019700) (518.924600 108.195700)>

<snip some unsurprising lines>

D | #<RECTANGLE (332.999000 159.477100) (340.933472 171.663820)>
e | #<RECTANGLE (340.933472 159.477100) (347.055224 171.663820)>
s | #<RECTANGLE (347.055224 159.477100) (353.176976 171.663820)>
c | #<RECTANGLE (353.176976 159.477100) (359.298728 171.663820)>
r | #<RECTANGLE (359.298728 159.477100) (363.497468 171.663820)>
i | #<RECTANGLE (363.497468 159.477100) (366.583460 171.663820)>
p | #<RECTANGLE (366.583460 159.477100) (373.305812 171.663820)>
t | #<RECTANGLE (373.305812 159.477100) (376.906136 171.663820)>
i | #<RECTANGLE (376.906136 159.477100) (380.028164 171.663820)>
o | #<RECTANGLE (380.028164 159.477100) (386.750516 171.663820)>
n | #<RECTANGLE (386.750516 159.477100) (393.472868 171.663820)>
: | #<RECTANGLE (393.472868 159.477100) (397.109228 171.663820)>

 | #<RECTANGLE (397.109228 171.663820) (397.109228 171.663820)>
- | #<RECTANGLE ( 80.999100 178.854700) ( 84.149100 188.898700)>
- | #<RECTANGLE ( 84.149100 188.898700) ( 84.149100 188.898700)>
- | #<RECTANGLE ( 80.999100 191.934400) ( 84.149100 201.978400)>
  | #<RECTANGLE ( 84.149100 201.978400) ( 84.149100 201.978400)>
T | #<RECTANGLE ( 80.999100 204.894400) ( 84.149100 214.938400)>
h | #<RECTANGLE ( 84.149100 214.938400) ( 84.149100 214.938400)>
e | #<RECTANGLE ( 80.999100 217.854400) ( 84.149100 227.898400)>
  | #<RECTANGLE ( 84.149100 227.898400) ( 84.149100 227.898400)>
M | #<RECTANGLE ( 80.999100 230.934100) ( 84.149100 240.978100)>
i | #<RECTANGLE ( 84.149100 240.978100) ( 84.149100 240.978100)>
c | #<RECTANGLE ( 80.999100 243.894100) ( 84.149100 253.938100)>
r | #<RECTANGLE ( 84.149100 253.938100) ( 84.149100 253.938100)>
o | #<RECTANGLE ( 80.999100 256.854100) ( 84.149100 266.898100)>
c | #<RECTANGLE ( 84.149100 266.898100) ( 84.149100 266.898100)>
h | #<RECTANGLE ( 80.999100 269.933800) ( 84.149100 279.977800)>
i | #<RECTANGLE ( 84.149100 279.977800) ( 84.149100 279.977800)>
p | #<RECTANGLE (332.999000 178.854700) (338.530400 188.898700)>
  | #<RECTANGLE (338.530400 178.854700) (343.448900 188.898700)>
T | #<RECTANGLE (343.448900 178.854700) (348.452900 188.898700)>
e | #<RECTANGLE (348.452900 178.854700) (350.406800 188.898700)>
c | #<RECTANGLE (350.406800 178.854700) (357.936200 188.898700)>
h | #<RECTANGLE (357.936200 178.854700) (359.855000 188.898700)>
n | #<RECTANGLE (359.855000 178.854700) (364.387400 188.898700)>
o | #<RECTANGLE (364.387400 178.854700) (367.416800 188.898700)>
l | #<RECTANGLE (367.416800 178.854700) (372.335300 188.898700)>
o | #<RECTANGLE (372.335300 178.854700) (376.895600 188.898700)>
g | #<RECTANGLE (376.895600 178.854700) (381.932000 188.898700)>
y | #<RECTANGLE (381.932000 178.854700) (383.850800 188.898700)>
, | #<RECTANGLE (383.850800 178.854700) (388.854800 188.898700)>
  | #<RECTANGLE (388.854800 178.854700) (390.808700 188.898700)>
I | #<RECTANGLE (390.808700 178.854700) (395.369900 188.898700)>


Notice the weird run where x oscillates between 80.999 and 84.149, and
where every second rectangle has zero width and height. Then there is a
sudden jump to the right: I wonder whether the 80/84 lines are bullet
points, and "The Microchip" really starts at the 332.999000 line? The
document displays fine in Evince, though, and selecting the relevant
text doesn't behave strangely.

I'm compiling this against the Debian package, version
0.16.7-2+b1. I'll try against a git master version next, I think.

So I'm still confused, but at least I'm now certain it's not Lisp. Has
anyone any ideas?


Rupert



[1] http://ww1.microchip.com/downloads/en/DeviceDoc/22197B.pdf

-------------- next part --------------
A non-text attachment was scrubbed...
Name: tmp.pdf
Type: application/pdf
Size: 4837 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20111220/5da6e244/attachment.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: popplertest.c
Type: text/x-csrc
Size: 1932 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20111220/5da6e244/attachment.c>

