[poppler] Confusion about poppler_page_get_text and poppler_page_get_text_layout

Rupert Swarbrick rswarbrick at gmail.com
Tue Dec 20 04:23:59 PST 2011


Rupert Swarbrick <rswarbrick at gmail.com> writes:
> Notice the weird 80.999 / 84.149 oscillating thing. And then the sudden
> jump to the right: I wonder whether the 80/84 lines are bullet points
> and then "The Microchip" starts with the 332.999000 line? The document
> displays fine with Evince, though, and selecting the relevant text
> doesn't behave strangely.

I hunted further. I get the same behaviour with the current code from
git, so I added some g_printf () calls to that. With the document [1]
from before, the call to poppler_page_get_text_layout first outputs the
title and a couple of expected lines, then outputs a column of bullet
points.

Ahah! That explains the "oscillation" I saw before. So basically what's
going on is that the text output by poppler_page_get_text_layout is not
in the same order as that output by poppler_page_get_text.
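
For anyone who wants to spot this sort of jump without reading through
g_printf output by eye, here is a minimal sketch. It assumes the
rectangles have already been copied out of whatever
poppler_page_get_text_layout returned. Rect is only a stand-in with the
same four fields as PopplerRectangle, and first_left_edge_jump and
max_step are names I made up for illustration: none of them are part of
poppler.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PopplerRectangle (same four fields), so that this
 * sketch compiles without poppler-glib. Not a poppler type. */
typedef struct {
    double x1, y1, x2, y2;
} Rect;

/* Hypothetical helper, not a poppler function.
 *
 * Walks the rectangles in the order they were emitted and returns the
 * index of the first one whose left edge (x1) differs from that of its
 * predecessor by more than max_step. Returns n if there is no such
 * rectangle. */
size_t first_left_edge_jump(const Rect *rects, size_t n, double max_step)
{
    size_t i;

    assert(rects != NULL || n == 0);

    for (i = 1; i < n; i++) {
        double step = rects[i].x1 - rects[i - 1].x1;

        if (step < 0)
            step = -step;   /* absolute difference, avoids needing libm */

        if (step > max_step)
            return i;
    }

    return n;
}
```

With left edges of 80.999, 84.149, 80.999, 84.149 and then 332.999, a
max_step of 10 ignores the small oscillation and reports index 4, the
entry where the sudden jump to the right happens. What threshold is
sensible will depend on the document.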

The latter works using TextPage, rather than brute-force working through
the word list, and there seems to be some clever logic to put things in
a sensible order.

As such, I think the fact that these come out in a different order from
each other must be intentional: am I right? If so, is there currently
any way to use the glib interface to get a list of characters on the
page, along with their bounding boxes? I can't work out how to match up
the indices from poppler_page_get_text_layout with anything else.

Another point is that the relationship between the two should probably
be clarified in the documentation shipped with the source: I'll happily
provide a patch, but I can't really do that until I understand what's
going on...

Any help greatly appreciated!

Rupert

[1] http://ww1.microchip.com/downloads/en/DeviceDoc/22197B.pdf