[poppler] Retrieve all objects from a PDF file

Nedim Srndic nedim.sh at gmail.com
Wed Nov 2 06:00:39 PDT 2011


I am researching the structure of PDF files. I wrote a utility that
converts the object (i.e., dictionary) structure of a PDF into XML so
that I can query it with XPath or a similar query language. I do care
about the context; it can be rebuilt from the resulting XML when
necessary.

Nedim
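
To illustrate the dictionary-to-XML idea described in the message above,
here is a minimal sketch in Python. It is not the author's utility and
it does not use Poppler. The element names (pdf, object, dict, entry,
array, ref, value), the Ref tuple, and the SAMPLE_OBJECTS data are all
assumptions made up for the example. It only shows that once indirect
references are kept as elements, the context (who points at whom) can be
recovered with a path query.

```python
from collections import namedtuple
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an indirect reference such as "2 0 R".
Ref = namedtuple("Ref", ["num", "gen"])


def value_to_xml(parent, value):
    """Append one PDF-like value (dict, list, Ref, or scalar) to parent."""
    if isinstance(value, Ref):
        ET.SubElement(parent, "ref", num=str(value.num), gen=str(value.gen))
    elif isinstance(value, dict):
        node = ET.SubElement(parent, "dict")
        for key, item in value.items():
            entry = ET.SubElement(node, "entry", key=key)
            value_to_xml(entry, item)
    elif isinstance(value, list):
        node = ET.SubElement(parent, "array")
        for item in value:
            value_to_xml(node, item)
    else:
        scalar = ET.SubElement(parent, "value", type=type(value).__name__)
        scalar.text = str(value)


def objects_to_xml(objects):
    """Build one <object> element per (num, gen) key, in sorted order."""
    root = ET.Element("pdf")
    for (num, gen), value in sorted(objects.items()):
        obj = ET.SubElement(root, "object", num=str(num), gen=str(gen))
        value_to_xml(obj, value)
    return root


def objects_referencing(root, num):
    """Return the numbers of all objects that contain a reference to num."""
    query = ".//ref[@num='%d']" % num
    return [int(obj.get("num")) for obj in root.findall("object")
            if obj.find(query) is not None]


# Made-up object table for illustration only, not read from a real file.
SAMPLE_OBJECTS = {
    (1, 0): {"Type": "Catalog", "Pages": Ref(2, 0)},
    (2, 0): {"Type": "Pages", "Kids": [Ref(3, 0)], "Count": 1},
    (3, 0): {"Type": "Page", "Parent": Ref(2, 0)},
}

root = objects_to_xml(SAMPLE_OBJECTS)
referrers = objects_referencing(root, 2)
print(ET.tostring(root, encoding="unicode"))
print(referrers)
```

Keeping references as separate ref elements, rather than inlining the
target object, keeps the XML finite even when the object graph has
cycles (a Page pointing back at its Parent, for example).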

On Tue, 2011-11-01 at 05:26 -0700, Leonard Rosenthol wrote:
> Why would you iterate over the objects w/o any understanding of their
> context?  Wouldn't it make MUCH MORE sense to "walk the tree" - starting
> at the Catalog/Root and then simply recursing down the object tree based
> on known relationships?
> 
> What use are the objects w/o context?
> 
> Leonard
> 
> On 11/1/11 7:55 AM, "Nedim Srndic" <nedim.sh at gmail.com> wrote:
> 
> >I'm sorry, I see now that I wasn't clear enough. I would like to
> >enumerate every PDF dictionary from a given PDF file, including but not
> >limited to the Catalog, Pages, Actions, Annotations, Name tree -
> >everything. Currently I can successfully do that for all dictionaries
> >that can be located using XRef, but it seems that indirect objects
> >inside object streams cannot be found this way. I could obviously test
> >if any of the objects pointed to by the XRef is an object stream and get
> >all the objects from the stream, but I'm wondering if Poppler has a more
> >elegant solution. 
> >
> >Nedim
> >
> >On Mon, 2011-10-31 at 11:12 -0700, Josh Richardson wrote:
> >> What kinds of objects are you interested in?  I have a version of
> >> pdftohtml which I believe is not yet merged into the master repo that
> >> extracts images and fonts.
> >> 
> >> --josh
> >> 
> >> On 10/31/11 9:16 AM, "Nedim Srndic" <nedim.sh at gmail.com> wrote:
> >> 
> >> >Dear list, 
> >> >
> >> >I am using the Poppler library (in the src/poppler folder, no bindings,
> >> >version 7 from the Ubuntu 10.10 repos) and would like to retrieve all
> >> >objects from a PDF file. Currently, I am running a loop on XRef and
> >> >getting all the non-null objects from it, but it doesn't seem to
> >> >retrieve objects from object streams. What solution would you propose
> >> >for this problem?
> >> >
> >> >Thanks, 
> >> >Nedim Srndic
> >> >
> >> >_______________________________________________
> >> >poppler mailing list
> >> >poppler at lists.freedesktop.org
> >> >http://lists.freedesktop.org/mailman/listinfo/poppler
> >> >
> >> 
> >
> 
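
The fallback mentioned in the thread (check whether an object reached
through the XRef is an object stream, then pull the objects out of it)
can be sketched as follows. This is Python rather than Poppler code and
makes several assumptions: the stream has already been located, its only
filter is FlateDecode with no predictor, and the /N and /First values
have been read from the stream dictionary. The function name
split_object_stream and the sample data are hypothetical. The layout it
relies on comes from the PDF specification: the decoded stream starts
with N pairs of integers (object number, byte offset relative to
/First), followed by the objects themselves without obj/endobj keywords.

```python
import zlib


def split_object_stream(decoded, n, first):
    """Split decoded object-stream data into {object number: raw bytes}.

    decoded: the stream contents after all filters have been removed
    n:       value of /N in the stream dictionary (number of objects)
    first:   value of /First (byte offset of the first object)
    """
    header = decoded[:first].split()
    if len(header) < 2 * n:
        raise ValueError("object stream header is shorter than /N implies")
    pairs = [(int(header[2 * i]), int(header[2 * i + 1])) for i in range(n)]

    objects = {}
    for index, (num, offset) in enumerate(pairs):
        start = first + offset
        # Each object runs up to the start of the next one (or to the end).
        if index + 1 < n:
            end = first + pairs[index + 1][1]
        else:
            end = len(decoded)
        objects[num] = decoded[start:end].strip()
    return objects


# Made-up stream holding objects 11 and 12, built only for illustration.
first_body = b"<< /S /URI >>"
second_body = b"[1 2 3]"
header = b"11 0 12 %d " % (len(first_body) + 1)
decoded = header + first_body + b" " + second_body
compressed = zlib.compress(decoded)  # what a FlateDecode stream would hold

objects = split_object_stream(zlib.decompress(compressed), n=2,
                              first=len(header))
print(objects)
```

Objects stored this way always have generation number 0, so the object
number alone identifies them. The raw bytes returned here would still
need to be parsed as PDF objects.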



