[poppler] images in pdftohtml -xml mode

Igor Slepchin igor.slepchin at gmail.com
Mon Nov 14 16:19:46 PST 2011


I know that dumping images when running pdftohtml with the -xml flag has
been brought up before, and it seems the devs said they would accept
a patch; however, nothing appears to have made it into the source tree
so far. I figured I could give this a try as well, so please take a look at
my proposed changes if there is still interest in this
functionality: https://github.com/igors/poppler/tree/xml_images

The first commit in the above branch fixes up pdf2xml.dtd to match what
pdftohtml generates; the second commit adds support for images in -xml
mode. With this change applied, pdftohtml -xml dumps all image files
just as it does in HTML mode and adds image elements at the
beginning of each page that has images, i.e., you'll see something like
the following in the generated XML:

<page number="51" position="absolute" top="0" left="0"
       height="896" width="572">
<image top="45" left="26" width="523" height="373" src="filename.jpg"/>
<text top="534" left="81" width="17" height="15" font="18">In </text>
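
For anyone who wants to consume the new elements, here is a minimal Python sketch of how the image entries could be read back from the generated XML. It is only an illustration, not part of the patch: the <pdf2xml> wrapper and the closing tags in SAMPLE are added so the excerpt above parses as a complete document, and list_images is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Sample shaped like the excerpt above. The <pdf2xml> wrapper and the
# closing tags are assumptions added so the fragment is well-formed.
SAMPLE = """<pdf2xml>
<page number="51" position="absolute" top="0" left="0"
       height="896" width="572">
<image top="45" left="26" width="523" height="373" src="filename.jpg"/>
<text top="534" left="81" width="17" height="15" font="18">In </text>
</page>
</pdf2xml>"""


def list_images(xml_text):
    """Return one dict per <image> element, tagged with its page number."""
    root = ET.fromstring(xml_text)
    images = []
    # iter("page") does not depend on the name of the root element.
    for page in root.iter("page"):
        for image in page.iter("image"):
            images.append({
                "page": int(page.get("number")),
                "src": image.get("src"),
                "top": int(image.get("top")),
                "left": int(image.get("left")),
                "width": int(image.get("width")),
                "height": int(image.get("height")),
            })
    return images


if __name__ == "__main__":
    for entry in list_images(SAMPLE):
        print(entry)
```

The coordinates are converted to int because the excerpt shows whole numbers; if the real output ever carries fractional values, float would be the safer conversion.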

With the -xml switch, the default behavior is now to process images; adding
the -i option restores the old behavior.
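
As a rough sketch of the two invocations, assuming pdftohtml is on the PATH and the patch is applied, a script could build the command line as below. The helper names build_command and run, and the file names in the usage note, are hypothetical placeholders.

```python
import subprocess


def build_command(pdf_path, output_base, with_images=True):
    """Build a pdftohtml -xml command line; -i asks it to ignore images."""
    cmd = ["pdftohtml", "-xml"]
    if not with_images:
        # With the proposed change, -i restores the old image-free output.
        cmd.append("-i")
    cmd += [pdf_path, output_base]
    return cmd


def run(pdf_path, output_base, with_images=True):
    """Run pdftohtml; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_command(pdf_path, output_base, with_images), check=True)
```

For example, build_command("in.pdf", "out") corresponds to "pdftohtml -xml in.pdf out", and passing with_images=False corresponds to "pdftohtml -xml -i in.pdf out".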

The change is small enough that I hope it won't be very controversial,
but comments are certainly appreciated.

Thanks,
Igor

