[poppler] Reverse-engineering an XML file generated by pdftohtml -xml back into the PDF?

Josh Richardson jric at chegg.com
Tue Nov 15 12:32:32 PST 2011


*Wistfully*: If only there were a PDF library to make such things
simple....

On 11/15/11 12:28 PM, "Leonard Rosenthol" <lrosenth at adobe.com> wrote:

>For proper structure, you are going to need to find a way to match the
>structure information with the elements in the content stream and then
>somehow modify the stream accordingly (and add the relevant dictionaries,
>etc.)
>
>On 11/15/11 12:23 PM, "Josh Richardson" <jric at chegg.com> wrote:
>
>>Someone on the list may have a better idea, but I would almost certainly
>>start with the PDFDoc created by reading the original document, and
>>inject the metadata that you have collected back into it -- I believe
>>this was Leonard's recommendation as well.
>>
>>--josh
>>
>>On 11/14/11 10:42 PM, "Alec Taylor" <alec.taylor6 at gmail.com> wrote:
>>
>>>Good afternoon,
>>>
>>>How would I go about reverse-engineering an XML file generated by
>>>pdftohtml -xml back into the [same] PDF?
>>>
>>>I have been spending a long time extending the XML output to include
>>>proper page numbers and header/footer detection.
>>>
>>>It would be extremely useful if I could push the additional logical
>>>structure information and page numbers back into the PDF the XML was
>>>generated from.
>>>
>>>How would I go about doing this?
>>>
>>>Thanks for all suggestions,
>>>
>>>Alec Taylor
>>>
>>>PS: T-9 days (or less!) until PATCH :)
>>>_______________________________________________
>>>poppler mailing list
>>>poppler at lists.freedesktop.org
>>>http://lists.freedesktop.org/mailman/listinfo/poppler
>>>
>

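As a first step toward the matching Leonard describes, the annotated XML
can be indexed by page so that each text run's bounding box is available
for comparison against the original document. The following is only a
minimal Python sketch, not anything poppler provides. It assumes the usual
pdftohtml -xml layout (a pdf2xml root, page elements with a number
attribute, and text elements carrying top/left/width/height); the function
name index_text_by_page is hypothetical, and writing the structure back
into the PDF is not covered here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_text_by_page(xml_string):
    """Group text runs by page number (hypothetical helper).

    Assumes the pdftohtml -xml layout: <page number="..."> elements
    containing <text top=".." left=".." width=".." height=".."> runs.
    """
    root = ET.fromstring(xml_string)
    pages = defaultdict(list)
    for page in root.iter("page"):
        number = int(page.get("number"))
        for text in page.iter("text"):
            # Bounding box as (left, top, width, height) in the XML's units.
            bbox = tuple(
                float(text.get(key))
                for key in ("left", "top", "width", "height")
            )
            pages[number].append({
                "bbox": bbox,
                # itertext() also collects text nested in <b>/<i>/<a>.
                "content": "".join(text.itertext()),
            })
    return dict(pages)
```

Each entry can then be compared, page by page, with the positions of the
text in the original PDF to decide which structure information belongs to
which piece of content.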

