[poppler] pdftohtml -xml custom build (image tags but no PNGs extracted)

Kai Fritsch kai.m.fritsch at gmail.com
Mon May 7 07:07:59 UTC 2018


Hey,

I'm using the XML output of pdftohtml to classify PDFs. I wondered whether it
would be easy to add an option to a custom build that keeps the image tags
in the XML without extracting the images themselves. I have to classify a
lot of PDFs, and some of them are PowerPoint presentations with lots of
small images (e.g. 26000 per page), which take several hours to extract. I
need the image tags for some of my classification features.
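
To illustrate why only the tags matter, here is a minimal sketch of how
per-page image features might be read from the XML. The element and
attribute names (page, image, number, width, height, src) are assumptions
based on typical pdftohtml -xml output, and the sample document and the
image_features helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample resembling pdftohtml -xml output; the element and
# attribute names are assumptions, not taken from the poppler source.
SAMPLE_XML = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <text top="10" left="10" width="100" height="12" font="0">Title</text>
    <image top="100" left="50" width="20" height="10" src="doc-1_1.png"/>
    <image top="200" left="50" width="30" height="10" src="doc-1_2.png"/>
  </page>
  <page number="2" top="0" left="0" height="1188" width="918">
    <text top="10" left="10" width="100" height="12" font="0">Body</text>
  </page>
</pdf2xml>"""


def image_features(xml_text):
    """Return one dict per page with the image count and total image area."""
    root = ET.fromstring(xml_text)
    features = []
    for page in root.iter("page"):
        images = page.findall("image")
        # Only the tag attributes are read; the files named in src are never opened.
        area = sum(
            int(img.get("width", 0)) * int(img.get("height", 0))
            for img in images
        )
        features.append(
            {
                "page": int(page.get("number")),
                "image_count": len(images),
                "image_area": area,
            }
        )
    return features


if __name__ == "__main__":
    for row in image_features(SAMPLE_XML):
        print(row)
```

Nothing in this sketch touches the extracted PNG files, which is why
emitting the tags alone would be enough for this kind of feature.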

If someone could point me to the place in the code where I could make that
change, that would be very much appreciated. Otherwise I will have to dig
through the code myself.

Many Thanks,
Kai