[poppler] smaller HTML images

Josh Richardson jric at chegg.com
Wed Jun 22 21:23:30 PDT 2011

Currently pdftohtml creates one large background image for each HTML page it renders.  To reduce the size of the HTML file bundles and to improve the semantic value of the HTML, Stephen and I would like to extract and use only the portions of that background image that are not plain white background.

To accomplish this, our idea is to add hooks to SplashOutputDevNoText that catch painting operations and record the bounding box of each one.  After recording each bounding box, we'll merge it with any boxes it is contiguous with, replacing them with a single box that encloses them all.  Once we have a list of non-contiguous bounding boxes covering all graphics operations that have occurred on the page, we'll use those boxes to extract only the relevant regions from the large background image, save each region as a separate file, and reference the files from the HTML.
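
To make the merging step concrete, here is a minimal sketch in Python (not poppler's C++, and not code from any patch).  It assumes each box is an (x_min, y_min, x_max, y_max) tuple in pixel coordinates and treats boxes that overlap or share an edge as contiguous; the names touches, union, and merge_boxes are made up for illustration.

```python
from typing import List, Tuple

# Assumed representation: (x_min, y_min, x_max, y_max) in page pixels.
Box = Tuple[int, int, int, int]


def touches(a: Box, b: Box) -> bool:
    """Return True when two boxes overlap or share an edge or corner."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def union(a: Box, b: Box) -> Box:
    """Return the smallest box that encloses both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def merge_boxes(boxes: List[Box]) -> List[Box]:
    """Combine contiguous boxes until no two boxes in the result touch."""
    merged: List[Box] = []
    for box in boxes:
        # Absorb every existing box the new one touches.  The union can
        # grow and reach further boxes, so rescan until nothing changes.
        changed = True
        while changed:
            changed = False
            for i, other in enumerate(merged):
                if touches(box, other):
                    box = union(box, other)
                    del merged[i]
                    changed = True
                    break
        merged.append(box)
    return merged


if __name__ == "__main__":
    painted = [(0, 0, 10, 10), (5, 5, 20, 20),
               (100, 100, 110, 110), (20, 0, 30, 5)]
    for region in merge_boxes(painted):
        print(region)
```

Each box that comes out of merge_boxes would then be one rectangle to crop from the page bitmap and save as its own image file.  The rescan is quadratic in the number of boxes, which seems acceptable for the handful of painting operations on a typical page, but that is an assumption rather than something we have measured.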

Since we're extending the output device, we'll rename it from SplashOutputDevNoText to SplashOutputDevHtmlImages to better capture its new role.  If you think we should retain the old behavior behind a switch, please let me know; I don't see a significant benefit to it.

As always, any comments are appreciated.
