[poppler] Extracting word and image position from PDF

Adrian Johnson ajohnson at redneon.com
Sat Feb 18 21:33:08 PST 2012


On 18/02/12 04:31, Albert Astals Cid wrote:
> On Thursday, 16 February 2012, at 22:51:10, Dan Filimon wrote:
>>>> I've been looking for ways to extract image and word positions (also
>>>> how words form sentences and paragraphs would be useful) from a PDF.
>>>> I'd like to get maps of words/images to rectangles (position, width,
>>>> height).
>>>>
>>>> Also, it would really be great if I could get the positions and
>>>> hierarchy for every object on a page (sorry about my vague terminology
>>>> when it comes to PDF, I've never worked with it). I tried looking at
>>>> the code but there don't seem to be many comments and I can't find any
>>>> documentation...
>>>>
>>>> Could you please point me in the right direction?
>>>
>>> Poppler::Page::textList seems to be what you want
>>>
>>> http://people.freedesktop.org/~aacid/docs/qt4/classPoppler_1_1Page.html#a75dea3bf58f339f224239b757b4c1bb2
>>>
>>> Albert
>>
>> Thanks for the quick reply!
>>
>> Yes, that seems to be exactly what I'm looking for, but there doesn't
>> seem to be a corresponding one for images.
>> Actually, there doesn't seem to be any dedicated image class (well,
>> besides QImage), and I can't seem to figure out how to get images from
>> a Page... I can see that there is support for rendering part of a page
>> to a QImage though.
>> I've managed to find some image generating code looking through the
>> utils/ folder in ImageOutputDev, but that seems to be using XPdf
>> directly and I can't find any documentation for that either.
> 
> From what I remember none of the "public" frontends export the Image 
> information. 

The glib frontend can export the images together with their positions,
via poppler_page_get_image_mapping():

http://people.freedesktop.org/~ajohnson/docs/poppler-glib/PopplerPage.html#poppler-page-get-image-mapping
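
Something along these lines might be a starting point. It is only a rough
sketch: it assumes the poppler-glib GObject-introspection bindings are
reachable from Python through PyGObject, and Box, to_box() and image_boxes()
are names made up for this example, not part of poppler. Each mapping carries
an area rectangle given as two corners (x1, y1, x2, y2), so the helper turns
that into the position/width/height form you asked about.

```python
from collections import namedtuple
from pathlib import Path

# Hypothetical helper type for this sketch (not a poppler type).
Box = namedtuple("Box", "x y width height")


def to_box(x1, y1, x2, y2):
    """Turn two opposite corners into a position plus width and height."""
    return Box(min(x1, x2), min(y1, y2), abs(x2 - x1), abs(y2 - y1))


def image_boxes(pdf_path):
    """Return (page_index, image_id, Box) for every image in the PDF."""
    # Imported lazily so to_box() stays usable without poppler installed.
    import gi
    gi.require_version("Poppler", "0.18")
    from gi.repository import Poppler

    # poppler-glib opens documents by URI, not by plain file path.
    uri = Path(pdf_path).resolve().as_uri()
    document = Poppler.Document.new_from_file(uri, None)  # None: no password

    found = []
    for index in range(document.get_n_pages()):
        page = document.get_page(index)
        for mapping in page.get_image_mapping():
            area = mapping.area  # PopplerRectangle: x1, y1, x2, y2
            found.append((index, mapping.image_id,
                          to_box(area.x1, area.y1, area.x2, area.y2)))
    return found
```

Calling image_boxes("some.pdf") should then give one entry per image. The
image_id can be passed to poppler_page_get_image() if you need the pixels as
well. I have not stated the origin or units of the rectangle here, so check
them against your pages before combining these boxes with text positions.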

> 
>>
>> Also, after having cloned the Poppler repo, I'm not sure where to look
>> first. What I gather is that there are multiple backends and frontends for
>> Poppler. Backends like Cairo, Splash and frontends like Qt4, GLib and a
>> vanilla C++ one. Which of these should I use?
> 
> The one you like better :D
> 
>> I'd kind of like minimal dependencies, but I've used Qt4 in the past
>> and liked it.
>>
>> Which of these should I look at first (and actually, how do they all
>> fit together)?
> 
> The Qt4 and cpp frontends use the splash backend; the glib one uses the
> cairo backend.
> 
> Albert
> 
>>
>> Sorry for being really noob-ish, but I just can't find any info :(
>>
>> Thanks!
>> Dan
>> _______________________________________________
>> poppler mailing list
>> poppler at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/poppler


