[poppler] images in pdftohtml -xml mode
Igor Slepchin
igor.slepchin at gmail.com
Mon Nov 14 16:19:46 PST 2011
I know that dumping images when running pdftohtml with the -xml flag has
been brought up before, and it seems the devs said they would accept
a patch; however, it looks like nothing has made it into the source tree
so far. I figured I could give this a try too, so please take a look at
my proposed changes if there is still interest in this
functionality: https://github.com/igors/poppler/tree/xml_images
The first commit in the above branch fixes up pdf2xml.dtd to match what
pdftohtml generates; the second commit adds support for images in -xml
mode. With this patch applied, pdftohtml -xml will dump all image files
just like it does in html mode and will add image elements at the
beginning of each page that has images, i.e., you'll see something like
the following in the generated xml:
<page number="51" position="absolute" top="0" left="0"
height="896" width="572">
<image top="45" left="26" width="523" height="373" src="filename.jpg"/>
<text top="534" left="81" width="17" height="15" font="18">In </text>
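For anyone who wants to consume the new elements, here is a minimal sketch
of reading them back with Python's standard library. It assumes the
attribute names shown in the excerpt above; the closing </page> tag and the
helper name list_images are added for illustration only and are not part of
the patch.

```python
import xml.etree.ElementTree as ET

# Sample modeled on the excerpt above. The closing </page> tag is added here
# (an assumption) so that the fragment parses as well-formed XML.
SAMPLE = """<page number="51" position="absolute" top="0" left="0" height="896" width="572">
<image top="45" left="26" width="523" height="373" src="filename.jpg"/>
<text top="534" left="81" width="17" height="15" font="18">In </text>
</page>"""


def list_images(page_xml):
    """Return (src, top, left, width, height) for each <image> child of a page."""
    page = ET.fromstring(page_xml)
    return [
        (
            img.get("src"),
            int(img.get("top")),
            int(img.get("left")),
            int(img.get("width")),
            int(img.get("height")),
        )
        for img in page.findall("image")
    ]


print(list_images(SAMPLE))  # [('filename.jpg', 45, 26, 523, 373)]
```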
With the patch, the -xml switch now processes images by default; adding
the -i option restores the old behavior.
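As a rough sketch, the two invocations described above could be built from
a script like this. The helper name pdftohtml_xml_command and the file
names are hypothetical, and the image behavior noted in the comments is
what the proposed patch is described as doing, not something verified here.

```python
def pdftohtml_xml_command(pdf_path, output_base, skip_images=False):
    """Build an argument list for running pdftohtml in -xml mode.

    With the proposed patch, plain -xml is described as dumping images;
    passing -i is described as restoring the old, image-free output.
    """
    cmd = ["pdftohtml", "-xml"]
    if skip_images:
        cmd.append("-i")  # -i: ignore images
    cmd += [pdf_path, output_base]
    return cmd


# Hypothetical file names, for illustration only.
print(pdftohtml_xml_command("input.pdf", "output"))
print(pdftohtml_xml_command("input.pdf", "output", skip_images=True))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a
machine where pdftohtml is installed.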
The change is small enough that I hope it won't be very controversial
but comments are certainly appreciated.
Thanks,
Igor