[poppler] Reverse-engineering an XML file generated by pdftohtml -xml back into the PDF?

Leonard Rosenthol lrosenth at adobe.com
Tue Nov 15 12:28:58 PST 2011


For proper structure, you will need to match your structure information
to the corresponding elements in each page's content stream, and then
modify the stream accordingly (and add the relevant dictionaries, etc.)
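
To make the matching step concrete, here is a minimal sketch in Python. It
assumes the usual pdftohtml -xml layout (<page> elements containing <text>
elements with top/left/width/height attributes). The sample XML, the
classify() heuristic and the function names are hypothetical illustrations,
not part of poppler. It only plans which marked-content operators would
bracket each text run (/P <</MCID n>> BDC ... EMC for real content,
/Artifact BMC ... EMC for headers and footers). Rewriting the content stream
itself, and adding the structure tree dictionaries that refer to those
MCIDs, would still have to be done with a PDF library and is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape pdftohtml -xml produces.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="40" left="100" width="300" height="14" font="0">My Book Title</text>
<text top="100" left="100" width="500" height="12" font="1">Body text of the page.</text>
<text top="1150" left="450" width="10" height="12" font="1">1</text>
</page>
</pdf2xml>"""


def classify(text_el, page_height, margin=0.05):
    """Assumed heuristic: text in the top or bottom 5% of the page is a
    running header/footer (an artifact); everything else is a paragraph."""
    top = float(text_el.get("top"))
    if top < page_height * margin or top > page_height * (1 - margin):
        return "Artifact"
    return "P"


def plan_marked_content(xml_string):
    """Return (page number, opening operator, text) for every text run.

    MCIDs restart at 0 on each page, since they are scoped to the page's
    content stream. Artifacts get no MCID because they are not part of
    the logical structure.
    """
    root = ET.fromstring(xml_string)
    plan = []
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        page_height = float(page.get("height"))
        mcid = 0
        for text_el in page.iter("text"):
            content = "".join(text_el.itertext())
            tag = classify(text_el, page_height)
            if tag == "Artifact":
                opener = "/Artifact BMC"
            else:
                opener = f"/{tag} <</MCID {mcid}>> BDC"
                mcid += 1
            plan.append((page_number, opener, content))
    return plan


if __name__ == "__main__":
    for page_number, opener, content in plan_marked_content(SAMPLE_XML):
        # Each run would be closed with EMC in the rewritten stream.
        print(page_number, opener, repr(content), "EMC")
```

The hard part remains locating the text-showing operators in the stream
that correspond to each <text> element; the coordinates in the XML are the
most likely key for that match.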

On 11/15/11 12:23 PM, "Josh Richardson" <jric at chegg.com> wrote:

>Someone on the list may have a better idea, but I would almost certainly
>start with the PDFDoc created by reading the original document, and inject
>back in the meta-data that you have collected -- I believe this was
>Leonard's recommendation as well.
>
>--josh
>
>On 11/14/11 10:42 PM, "Alec Taylor" <alec.taylor6 at gmail.com> wrote:
>
>>Good afternoon,
>>
>>How would I go about reverse-engineering an XML file generated by
>>pdftohtml -xml back into the [same] PDF?
>>
>>I have been spending a long time extending the XML output to include
>>proper page numbers and header/footer detection.
>>
>>It would be extremely useful if I could push the additional logical
>>structure information and page numbers back into the PDF the XML was
>>generated from.
>>
>>How would I go about doing this?
>>
>>Thanks for all suggestions,
>>
>>Alec Taylor
>>
>>PS: T-9 days (or less!) until PATCH :)
>>_______________________________________________
>>poppler mailing list
>>poppler at lists.freedesktop.org
>>http://lists.freedesktop.org/mailman/listinfo/poppler
>>
