[poppler] Output images from pdftohtml in xml mode

Mike Tonks fluffymike at googlemail.com
Thu Apr 15 02:27:13 PDT 2010


Hi,

I'm new here and I'm considering a patch to pdftohtml (well,
HtmlOutputDev).  I'm coming from a Perl background so I may not get it
right the first time, but I'll do my best!  It will be my first patch,
so any help is appreciated.

Changes:

1) Include the images in xml mode unless -i (ignore images) is specified.

2) Include the top, left, width, height data in img tags, where
appropriate depending on mode.  This is not applicable to complex mode.
In html mode, height and width are probably useful; positioning would
be great but can be added later if required, e.g. left, right or
position relative to text.  In xml mode, just output all available data.
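
As a rough sketch of what I have in mind for xml mode (the struct, the
helper name and the <image> element name are all my own assumptions,
nothing here exists in poppler yet):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical record for one image on a page; not an existing poppler type.
struct ImageBox {
    int top;
    int left;
    int width;
    int height;
    std::string src; // image file name as written to disk
};

// Hypothetical helper: builds one element carrying all available data.
// The element and attribute names are an assumption. Note that src is
// written as-is, so a real patch would need to escape it.
std::string imageToXml(const ImageBox &img)
{
    assert(img.width >= 0 && img.height >= 0);
    std::ostringstream out;
    out << "<image top=\"" << img.top << "\" left=\"" << img.left
        << "\" width=\"" << img.width << "\" height=\"" << img.height
        << "\" src=\"" << img.src << "\"/>";
    return out.str();
}
```

Html mode could reuse the same record and only emit width and height.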

Use Case: I'm post-processing the xml and I do need the image data to
be output.  It's part of a workflow to produce EPUB ebooks from
pdf.

I've had a look at the code and it seems fairly straightforward, as
the images are already output in other modes.  Currently only the
image src attribute is passed through, so I guess there needs to be a
new HtmlImage class (plus HtmlImages / HtmlImageAccu to handle the
iteration).  It looks like I can base this on the HtmlFont and HtmlLink
code, so I'll just follow the existing patterns there.
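
Very roughly, and only as a sketch of the shape I'm thinking of (the
member names and the use of std::vector are my assumptions, the real
patch would follow whatever the existing code does):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical: one image with its file name and bounding box.
struct HtmlImage {
    std::string fileName;
    double xMin, yMin, xMax, yMax;

    double width() const { return xMax - xMin; }
    double height() const { return yMax - yMin; }
};

// Hypothetical accumulator: collects the images seen on a page so the
// output code can iterate over them later.
class HtmlImageAccu {
public:
    // Stores a copy of the image and returns its index.
    std::size_t addImage(const HtmlImage &img)
    {
        assert(img.xMax >= img.xMin && img.yMax >= img.yMin);
        images.push_back(img);
        return images.size() - 1;
    }

    std::size_t size() const { return images.size(); }

    // Bounds-checked access; throws std::out_of_range on a bad index.
    const HtmlImage &get(std::size_t i) const { return images.at(i); }

private:
    std::vector<HtmlImage> images;
};
```

The output device would then call addImage() wherever it currently
records just the src, and the page dump would loop from 0 to size().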


Would you be likely to accept this patch once I get it working?  Any
suggestions?

cheers,

mike

