[poppler] Retrieve all objects from a PDF file

Nedim Srndic nedim.sh at gmail.com
Wed Nov 2 08:40:40 PDT 2011


On Wed, 2011-11-02 at 06:14 -0700, Leonard Rosenthol wrote:
> What about non-dictionary objects with indirect IDs?   Remember that ANY
> PDF object (null, number, string, etc. ) can be indirect.
> 

I treat all objects the same regardless of type, i.e., I take every entry
in XRef and convert it to XML. I also convert indirect references as
references, so that I can recreate the document's tree structure. In
effect, I preserve all information from the PDF that is referenced in the
XRef. This excludes the xref table/stream itself, the trailer dictionary,
the file header, the EOF marker, and any old-generation objects that may
be present.

> And how do you handle recursion?  Or do you simply treat each indirect
> object as unique and not related?
> 

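I keep the flat physical representation of the PDF, just like in the PDF
format itself: each indirect object is converted on its own, and
references are kept as references rather than followed, so recursion does
not arise during conversion.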
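To illustrate why a flat layout sidesteps recursion, here is a small
sketch under assumed conventions: the XML element names are hypothetical,
and generation numbers are ignored for brevity. The cycle between a Pages
node and its child's Parent entry only matters when the tree is rebuilt
afterwards, and a visited set is enough to handle it then.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat output: one <object> per indirect object, references
# stored as <ref> elements. Objects 2 and 3 point at each other.
FLAT = """
<pdf>
  <object num="1"><dict><entry key="Pages"><ref num="2"/></entry></dict></object>
  <object num="2"><dict><entry key="Kids"><array><ref num="3"/></array></entry></dict></object>
  <object num="3"><dict><entry key="Parent"><ref num="2"/></entry></dict></object>
  <object num="4"><number>7</number></object>
</pdf>
"""

def reachable(root, start):
    """Return the numbers of all objects reachable from `start` via <ref>."""
    by_num = {obj.get("num"): obj for obj in root.findall("object")}
    seen, stack = set(), [start]
    while stack:
        num = stack.pop()
        if num in seen or num not in by_num:
            continue  # already visited (a cycle) or a dangling reference
        seen.add(num)
        stack.extend(ref.get("num") for ref in by_num[num].iter("ref"))
    return seen

root = ET.fromstring(FLAT)
visited = reachable(root, "1")
print(sorted(visited))
```

Object 4 is never referenced, so a walk from the Catalog does not reach
it, while a flat enumeration still includes it.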

> Leonard
> 
> On 11/2/11 9:00 AM, "Nedim Srndic" <nedim.sh at gmail.com> wrote:
> 
> >I am doing some research on the structure of PDF files. I wrote a
> >utility to convert the object (i.e., dictionary) structure of PDFs into
> >XML so that I can query the structure using XPath or similar query
> >languages. I also care about the context, and the context can be rebuilt
> >from the resulting XML when necessary.
> >
> >Nedim
> >
> >On Tue, 2011-11-01 at 05:26 -0700, Leonard Rosenthol wrote:
> >> Why would you iterate over the objects w/o any understanding of their
> >> context?  Wouldn't it make MUCH MORE sense to "walk the tree" - starting
> >> at the Catalog/Root and then simply recursing down the object tree based
> >> on known relationships?
> >> 
> >> What use are the objects w/o context?
> >> 
> >> Leonard
> >> 
> >> On 11/1/11 7:55 AM, "Nedim Srndic" <nedim.sh at gmail.com> wrote:
> >> 
> >> >I'm sorry, I see now that I wasn't clear enough. I would like to
> >> >enumerate every PDF dictionary from a given PDF file, including but not
> >> >limited to the Catalog, Pages, Actions, Annotations, Name tree -
> >> >everything. Currently I can successfully do that for all dictionaries
> >> >that can be located using XRef, but it seems that indirect objects
> >> >inside object streams cannot be found this way. I could obviously test
> >> >if any of the objects pointed to by the XRef is an object stream and
> >> >get all the objects from the stream, but I'm wondering if Poppler has
> >> >a more elegant solution.
> >> >
> >> >Nedim
> >> >
> >> >On Mon, 2011-10-31 at 11:12 -0700, Josh Richardson wrote:
> >> >> What kinds of objects are you interested in?  I have a version of
> >> >> pdftohtml which I believe is not yet merged into the master repo that
> >> >> extracts images and fonts.
> >> >> 
> >> >> --josh
> >> >> 
> >> >> On 10/31/11 9:16 AM, "Nedim Srndic" <nedim.sh at gmail.com> wrote:
> >> >> 
> >> >> >Dear list, 
> >> >> >
> >> >> >I am using the Poppler library (in the src/poppler folder, no
> >> >> >bindings, version 7 from the Ubuntu 10.10 repos) and would like to
> >> >> >retrieve all objects from a PDF file. Currently, I am running a loop
> >> >> >on XRef and getting all the non-null objects from it, but it doesn't
> >> >> >seem to retrieve objects from object streams. What solution would
> >> >> >you propose for this problem?
> >> >> >
> >> >> >Thanks, 
> >> >> >Nedim Srndic
> >> >> >
> >> >> >_______________________________________________
> >> >> >poppler mailing list
> >> >> >poppler at lists.freedesktop.org
> >> >> >http://lists.freedesktop.org/mailman/listinfo/poppler
> >> >> >
> >> >> 
> >> >
> >> >
> >> 
> >
> >
> 



