[poppler] Reverse-engineering an XML file generated by pdftohtml -xml back into the PDF?

Alec Taylor alec.taylor6 at gmail.com
Tue Nov 15 21:06:02 PST 2011


> I believe someone mentioned a pdftopdf utility for Poppler.  That's a
> start!  It would be best if it were built on a library foundation.

^That was me!

Hmm, we could "steal" (fork) some code from GPL and LGPL
projects such as Hummus PDF (http://pdfhummus.com/) and maybe even
MuPDF (http://www.mupdf.com/).

But from the looks of things, it would make more sense to create a new
PDF document from the XML, which would require extending the XML
output to include much more information.

So maybe add a verbosity level for the XML output?

But this is all quite extensive...

It would be great if I could just repair the page numbers for now.
Which dictionary do I need to parse, and where is it?
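
On the XML side only, here is a minimal sketch of what "repairing" the
numbers could look like. It assumes the pdftohtml -xml output wraps each
page in a <page number="..."> element; the sample document is hand-written
for illustration, and renumber_pages is a hypothetical helper, not part of
Poppler. It does not touch the PDF itself.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the general shape of `pdftohtml -xml` output.
# The element and attribute names are assumptions for illustration.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="1100" left="450" width="10" height="12" font="0">i</text>
</page>
<page number="3" position="absolute" top="0" left="0" height="1188" width="918">
<text top="1100" left="450" width="10" height="12" font="0">ii</text>
</page>
<page number="7" position="absolute" top="0" left="0" height="1188" width="918">
<text top="1100" left="450" width="10" height="12" font="0">1</text>
</page>
</pdf2xml>"""


def renumber_pages(xml_text, start=1):
    """Rewrite each <page> element's number attribute sequentially.

    Returns the parsed root element and a list of (old, new) number pairs.
    """
    root = ET.fromstring(xml_text)
    mapping = []
    for new_number, page in enumerate(root.iter("page"), start=start):
        old_number = page.get("number")
        page.set("number", str(new_number))
        mapping.append((old_number, str(new_number)))
    return root, mapping


if __name__ == "__main__":
    root, mapping = renumber_pages(SAMPLE)
    for old_number, new_number in mapping:
        print(old_number, "->", new_number)
```

The (old, new) pairs are kept so that anything referring to the old
numbers (a ToC, for example) could be remapped later.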

(Future plans include linking the ToC and logically separating the
header and footer from the rest of the page; the header/footer
extraction is already 98% complete.)

On Wed, Nov 16, 2011 at 8:26 AM, Josh Richardson <jric at chegg.com> wrote:
> Sure, my bad for attempting humor.  :-)  My point is that I hope someone
> will take up the cause to make Poppler such a library, because this (and
> the XPDF) community have put so much effort into making Poppler the best
> open-source parser and rendering engine out there -- it would be great to
> be able to leverage Poppler for easy PDF file manipulation as well.  I've
> had less success using some of those "other" solutions on the variety of
> files that Poppler can handle, including Adobe's own products.
>
> If I get a chance, I may start delving into some of that.  Anyone think
> I'm crazy?  I'd love to know.
>
> I believe someone mentioned a pdftopdf utility for Poppler.  That's a
> start!  It would be best if it were built on a library foundation.
>
> --josh
>
> On 11/15/11 1:14 PM, "Leonard Rosenthol" <lrosenth at adobe.com> wrote:
>
>>On 11/15/11 12:32 PM, "Josh Richardson" <jric at chegg.com> wrote:
>>
>>>*Wistfully*:  If only there were a PDF library to make such things
>>>simple....
>>
>>There are a whole bunch of them, ranging from open source to commercial in
>>languages from Java to Python to C++ and more...
>>
>>Leonard
>>
>>
>
>

