[poppler] Retrieve all objects from a PDF file

Leonard Rosenthol lrosenth at adobe.com
Wed Nov 2 06:14:51 PDT 2011


What about non-dictionary objects with indirect IDs? Remember that ANY
PDF object (null, number, string, etc.) can be indirect.
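
To make that concrete, here is a minimal sketch in Python of an XML dump
that keeps every indirect object, whatever its type. Everything in it is
an assumption for illustration: the `objects` table is a toy stand-in for
a parsed file, the ("ref", n) tuples stand in for indirect references,
and the element names are made up. None of it is Poppler's API.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for a parsed PDF: object number -> value (hypothetical model).
objects = {
    1: {"Type": "Catalog", "Pages": ("ref", 2)},
    2: {"Type": "Pages", "Count": ("ref", 3)},
    3: 1,        # an indirect integer
    4: None,     # an indirect null
    5: "hello",  # an indirect string
}

def to_xml(value):
    """Convert one toy PDF value into an XML element."""
    if isinstance(value, dict):
        elem = ET.Element("dict")
        for key, child in value.items():
            entry = ET.SubElement(elem, "entry", key=key)
            entry.append(to_xml(child))
        return elem
    if isinstance(value, tuple) and value[0] == "ref":
        # Keep references as references instead of inlining the target.
        return ET.Element("ref", num=str(value[1]))
    if value is None:
        return ET.Element("null")
    elem = ET.Element(type(value).__name__)  # "int", "str", ...
    elem.text = str(value)
    return elem

def document_to_xml(objects):
    """Emit one <obj> element per indirect object, dictionary or not."""
    root = ET.Element("pdf")
    for num in sorted(objects):
        obj = ET.SubElement(root, "obj", num=str(num))
        obj.append(to_xml(objects[num]))
    return root

root = document_to_xml(objects)
# The indirect integer is queryable like any dictionary.
print(root.find("obj[@num='3']/int").text)
```

Keeping references as <ref> elements rather than inlining their targets
is one possible design choice; it keeps the output finite even when the
objects point at each other.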

And how do you handle recursion?  Or do you simply treat each indirect
object as unique and not related?
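
A rough sketch of one way to deal with the recursion, again in Python and
on a toy model: walk from the Root and remember which object numbers were
already visited. `Ref`, `refs_in` and `reachable` are hypothetical names
for this example only, not Poppler types or functions.

```python
from collections import namedtuple

# Hypothetical stand-in for an indirect reference; not Poppler's Ref type.
Ref = namedtuple("Ref", "num")

# Toy object table: object number -> value.
objects = {
    1: {"Type": "Catalog", "Pages": Ref(2)},
    2: {"Type": "Pages", "Kids": [Ref(3)], "Count": Ref(5)},
    3: {"Type": "Page", "Parent": Ref(2)},  # points back up: a cycle
    4: {"Type": "Orphan"},                  # never referenced from the Root
    5: 1,                                   # an indirect integer
}

def refs_in(value):
    """Yield the object numbers referenced anywhere inside a value."""
    if isinstance(value, Ref):
        yield value.num
    elif isinstance(value, dict):
        for child in value.values():
            yield from refs_in(child)
    elif isinstance(value, list):
        for child in value:
            yield from refs_in(child)

def reachable(objects, root_num):
    """Return the set of object numbers reachable from root_num."""
    visited = set()
    stack = [root_num]
    while stack:
        num = stack.pop()
        if num in visited or num not in objects:
            continue  # already seen (cycle) or dangling reference
        visited.add(num)
        stack.extend(refs_in(objects[num]))
    return visited

print(sorted(reachable(objects, 1)))
```

The Parent/Kids cycle terminates because each object is visited once, and
object 4 is never reported because nothing reachable from the Root refers
to it. That is the trade-off against a flat loop over the XRef.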

Leonard

On 11/2/11 9:00 AM, "Nedim Srndic" <nedim.sh at gmail.com> wrote:

>I am doing some research on the structure of PDF files. I wrote a
>utility to convert the object (i.e., dictionary) structure of PDFs into
>XML so that I can query the structure using XPath or similar query
>languages. I also care about the context, and the context can be rebuilt
>from the resulting XML when necessary.
>
>Nedim
>
>On Tue, 2011-11-01 at 05:26 -0700, Leonard Rosenthol wrote:
>> Why would you iterate over the objects w/o any understanding of their
>> context?  Wouldn't it make MUCH MORE sense to "walk the tree" - starting
>> at the Catalog/Root and then simply recursing down the object tree based
>> on known relationships?
>> 
>> What use are the objects w/o context?
>> 
>> Leonard
>> 
>> On 11/1/11 7:55 AM, "Nedim Srndic" <nedim.sh at gmail.com> wrote:
>> 
>> >I'm sorry, I see now that I wasn't clear enough. I would like to
>> >enumerate every PDF dictionary from a given PDF file, including but not
>> >limited to the Catalog, Pages, Actions, Annotations, Name tree -
>> >everything. Currently I can successfully do that for all dictionaries
>> >that can be located using XRef, but it seems that indirect objects
>> >inside object streams cannot be found this way. I could obviously test
>> >if any of the objects pointed to by the XRef is an object stream and
>> >get all the objects from the stream, but I'm wondering if Poppler has
>> >a more elegant solution.
>> >
>> >Nedim
>> >
>> >On Mon, 2011-10-31 at 11:12 -0700, Josh Richardson wrote:
>> >> What kinds of objects are you interested in?  I have a version of
>> >> pdftohtml which I believe is not yet merged into the master repo that
>> >> extracts images and fonts.
>> >> 
>> >> --josh
>> >> 
>> >> On 10/31/11 9:16 AM, "Nedim Srndic" <nedim.sh at gmail.com> wrote:
>> >> 
>> >> >Dear list, 
>> >> >
>> >> >I am using the Poppler library (in the src/poppler folder, no
>> >> >bindings, version 7 from the Ubuntu 10.10 repos) and would like to
>> >> >retrieve all objects from a PDF file. Currently, I am running a loop
>> >> >on XRef and getting all the non-null objects from it, but it doesn't
>> >> >seem to retrieve objects from object streams. What solution would
>> >> >you propose for this problem?
>> >> >
>> >> >Thanks, 
>> >> >Nedim Srndic
>> >> >
>> >> >_______________________________________________
>> >> >poppler mailing list
>> >> >poppler at lists.freedesktop.org
>> >> >http://lists.freedesktop.org/mailman/listinfo/poppler
>> >> >
>> >> 

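On the manual approach mentioned in the thread (checking whether an
object is an object stream and pulling the objects out of it), here is a
minimal Python sketch of the splitting step. It assumes the stream has
already been decoded, that `n` and `first` come from the stream
dictionary's /N and /First entries, and that the offsets in the header
are in increasing order. The function names are hypothetical and this is
not how Poppler does it internally.

```python
def split_object_stream(decoded: bytes, n: int, first: int) -> dict:
    """Split a decoded object stream into {object number: raw bytes}.

    The first `first` bytes hold n pairs of integers: an object number
    and that object's byte offset relative to `first`.
    """
    header = decoded[:first].split()
    pairs = [(int(header[2 * i]), int(header[2 * i + 1])) for i in range(n)]
    body = decoded[first:]
    result = {}
    for i, (num, offset) in enumerate(pairs):
        # Assumption: offsets increase, so the next offset ends this object.
        end = pairs[i + 1][1] if i + 1 < n else len(body)
        result[num] = body[offset:end].strip()
    return result

def build_object_stream(objs):
    """Build a toy decoded stream for testing; returns (decoded, n, first)."""
    body = b""
    pairs = []
    for num, data in objs:
        pairs.append((num, len(body)))
        body += data + b" "
    header = b" ".join(b"%d %d" % pair for pair in pairs) + b" "
    return header + body, len(pairs), len(header)

# Round-trip two made-up objects through the toy stream.
decoded, n, first = build_object_stream(
    [(11, b"<< /Type /Font >>"), (12, b"(hello)")]
)
for num, raw in split_object_stream(decoded, n, first).items():
    print(num, raw)
```

The raw bytes for each object would still have to go through a real PDF
parser; the sketch only covers locating them inside the stream.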

