[poppler] Confusion about poppler_page_get_text and poppler_page_get_text_layout
rswarbrick at gmail.com
Tue Dec 20 04:23:59 PST 2011
Rupert Swarbrick <rswarbrick at gmail.com> writes:
> Notice the weird 80.999 / 84.149 oscillating thing. And then the sudden
> jump to the right: I wonder whether the 80/84 lines are bullet points
> and then "The Microchip" starts with the 332.999000 line? The document
> displays fine with Evince, though, and selecting the relevant text
> doesn't behave strangely.
I hunted further. I get the same behaviour with the current code from
git, so I added some g_printf () calls to that. With the document
from before, the call to poppler_page_get_text_layout first outputs the
title and a couple of expected lines, then outputs a column of bullet
Ahah! That explains the "oscillation" I saw before. So basically what's
going on is that the text output by poppler_page_get_text_layout is not
in the same order as that output by poppler_page_get_text.
The latter works using TextPage, rather than brute-force working through
the word list, and there seems to be some clever logic in there to put
things in a sensible order.
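To make the mismatch easier to spot than by eyeballing printed coordinates, here is a rough sketch of a check that pairs character i of the text with rectangle i and counts suspicious leftward jumps. It is only a heuristic and rests on assumptions that are mine, not poppler's: the Rect struct is a hypothetical stand-in for PopplerRectangle (same x1, y1, x2, y2 fields) so that it builds without poppler, count_backward_jumps is a made-up helper name, the text is treated as one byte per character, and the script is assumed to run left to right.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for PopplerRectangle, so this builds without poppler. */
typedef struct {
    double x1, y1, x2, y2;
} Rect;

/*
 * Pair text[i] with rects[i] and count the places where the box moves
 * left by more than tol although the previous character was not a
 * newline. In left-to-right text whose boxes are in the same order as
 * the characters this should stay at (or near) zero; a large count
 * suggests the two sequences are ordered differently.
 *
 * Assumes one byte per character, which is not true for UTF-8 in general.
 */
size_t
count_backward_jumps (const char *text, const Rect *rects, size_t n, double tol)
{
    size_t jumps = 0;

    assert (text != NULL && rects != NULL);

    for (size_t i = 1; i < n; i++) {
        /* After a newline the next box may legitimately start further left. */
        if (text[i - 1] == '\n')
            continue;
        if (rects[i].x1 + tol < rects[i - 1].x1)
            jumps++;
    }
    return jumps;
}
```

Feeding it the string from poppler_page_get_text and the array from poppler_page_get_text_layout (copied into Rect) would then give a single number to compare between documents. With boxes that really follow the text order the count is zero; the 80.999 / 84.149 style of oscillation would show up as a count that grows with the length of the page.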
As such, I think the fact that these come out in a different order from
each other must be intentional: am I right? If so, is there currently
any way to use the glib interface to get a list of characters on the
page, along with their bounding boxes? I can't work out how to match up
the indices from poppler_page_get_text_layout with anything else.
Another point is that the relationship between the two should probably
be clarified in the documentation shipped with the source: I'll happily
provide a patch, but I can't really do that until I understand what's
going on.
Any help greatly appreciated!