[poppler] Extracting word and image position from PDF (Albert Astals Cid)

Thu Feb 16 12:51:10 PST 2012

>> I've been looking for ways to extract image and word positions (also
>> how words form sentences and paragraphs would be useful) from a PDF.
>> I'd like to get maps of words/images to rectangles (position, width,
>> height).
>>
>> Also, it would really be great if I could get the positions and
>> hierarchy for every object on a page (sorry about my vague terminology
>> when it comes to PDF, I've never worked with it). I tried looking at
>> the code but there don't seem to be many comments and I can't find any
>> documentation...
>>
>> Could you please point me in the right direction?
>
> Poppler::Page::textList seems to be what you want
>
> http://people.freedesktop.org/~aacid/docs/qt4/classPoppler_1_1Page.html#a75dea3bf58f339f224239b757b4c1bb2
>
> Albert

Thanks for the quick reply!

Yes, that seems to be exactly what I'm looking for, but there doesn't
seem to be a corresponding method for images.
Actually, there doesn't seem to be any dedicated image class (well,
besides QImage), and I can't figure out how to get images from a
Page... I can see that there is support for rendering part of a page
to a QImage though.
Looking through the utils/ folder, I did find some image generating
code in ImageOutputDev, but that seems to be using Xpdf directly and I
can't find any documentation for that either.
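
Going back to the word positions: to make concrete what I mean by a
map of words to rectangles, and by working out how words form lines,
here is a rough sketch in plain C++. It makes no Poppler calls at all.
The WordBox struct and the groupIntoLines helper are names I made up,
standing in for whatever a word list such as textList would give me,
and the coordinate convention (origin at the top left, y growing
downward) and the tolerance parameter are assumptions on my part.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of a word list: the word plus its
// bounding rectangle (assumed origin top-left, y growing downward).
struct WordBox {
    std::string text;
    double x, y, width, height;
};

// Vertical centre of a word's rectangle.
static double centreY(const WordBox& w) { return w.y + w.height / 2.0; }

// Group words into lines. A word joins the current line when its vertical
// centre is within `tolerance` of the centre of that line's first word.
// Lines come back top to bottom, words within a line left to right.
std::vector<std::vector<WordBox>> groupIntoLines(std::vector<WordBox> words,
                                                 double tolerance) {
    assert(tolerance >= 0.0);

    // Order words top to bottom by vertical centre.
    std::stable_sort(words.begin(), words.end(),
                     [](const WordBox& a, const WordBox& b) {
                         return centreY(a) < centreY(b);
                     });

    std::vector<std::vector<WordBox>> lines;
    for (const WordBox& w : words) {
        if (!lines.empty() &&
            std::fabs(centreY(w) - centreY(lines.back().front())) <= tolerance) {
            lines.back().push_back(w);  // same line as the previous word
        } else {
            lines.push_back(std::vector<WordBox>{w});  // start a new line
        }
    }

    // Within each line, order words left to right.
    for (std::vector<WordBox>& line : lines) {
        std::stable_sort(line.begin(), line.end(),
                         [](const WordBox& a, const WordBox& b) {
                             return a.x < b.x;
                         });
    }
    return lines;
}
```

Something like groupIntoLines(words, 3.0) would then give me one vector
per line of text. This only handles simple horizontal text, so rotated
text or multiple columns would need something smarter.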

Also, having cloned the Poppler repo, I'm not sure where to look first.
From what I gather, Poppler has multiple backends (Cairo, Splash) and
multiple frontends (Qt4, GLib and a vanilla C++ one). Which of these
should I use?
I'd like to keep dependencies minimal, but I've used Qt4 in the past
and liked it.

Which of these should I look at first (and how do they all fit
together)?

Sorry for being really noob-ish, but I just can't find any info :(

Thanks!
Dan