[poppler] pdftohtml using Poppler

Josh Richardson jric at chegg.com
Wed Jul 20 11:17:21 PDT 2011


You should have read the list sooner - you could have saved some time.  :-)

Rotated text is solved, but hasn't been committed to the freedesktop repository.  See https://bugs.freedesktop.org/show_bug.cgi?id=38586 .

Fonts are mostly solved, see https://bugs.freedesktop.org/show_bug.cgi?id=39385 .

It works, but you have to extract the fonts separately.  Ideally, we would include the code to extract the fonts with Poppler.  Right now you can use pdfextract from MuPDF to extract the fonts, and FontForge operates on them just fine.  MuPDF doesn't seem to be as forgiving as Poppler when it comes to ill-formed PDF documents.

Z-index is a problem I've thought about too, but I haven't had any use cases yet, so we haven't tackled it.  I believe that what should be done, as you guessed, is that the text and graphics output devices need to be combined into a single device.  What makes this a bit tricky is that I believe we have to support the "HAVE_SPLASH" preprocessor flag.  So, in order to derive the HtmlOutputDev from the SplashOutputDev, it would probably have to be done conditionally.  Ok, so you combine them.  Then what?

If you look closely at the patches in the second bug referenced above, you'll see that we're keeping track of the bounding box of each drawing operation, using a new ImageProperties class.  Then we coalesce them to find the regions of the "big background image" to extract into individual HTML images.  Now, if you were to extend that to also keep track of text-showing operations, you could use the order of those overlapping regions in the list as the z-index.
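To make the idea concrete, here is a rough sketch in Python.  It is not the code from the patches: the Region class, the overlaps/union helpers and the coalesce function are all hypothetical names, and the merge rule (only merge into an earlier region of the same kind if the grown box would not cover anything painted after it) is one assumed way to keep paint order intact.  The position of each region in the returned list is what you would emit as its z-index.

```python
from dataclasses import dataclass


@dataclass
class Region:
    # Bounding box of one or more paint operations (hypothetical type,
    # standing in for something like the ImageProperties class).
    x0: float
    y0: float
    x1: float
    y1: float
    kind: str  # "image" or "text"


def overlaps(a, b):
    # True when the two boxes share some area.
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def union(a, b):
    # Smallest box containing both; both boxes have the same kind here.
    return Region(min(a.x0, b.x0), min(a.y0, b.y0),
                  max(a.x1, b.x1), max(a.y1, b.y1), a.kind)


def coalesce(ops):
    """ops: list of ((x0, y0, x1, y1), kind) in paint order.

    Returns regions in paint order; the list index is the z-index.
    """
    regions = []
    for box, kind in ops:
        new = Region(*box, kind)
        merged_in = False
        # Walk back from the most recently painted region.
        for i in range(len(regions) - 1, -1, -1):
            r = regions[i]
            if r.kind == kind and overlaps(r, new):
                grown = union(r, new)
                # Assumed rule: merging must not cover a later region.
                if not any(overlaps(grown, later) for later in regions[i + 1:]):
                    regions[i] = grown
                    merged_in = True
                break
            if overlaps(r, new):
                # A different kind was painted under this operation,
                # so it needs its own layer on top.
                break
        if not merged_in:
            regions.append(new)
    return regions


layers = coalesce([
    ((0, 0, 10, 10), "image"),
    ((5, 5, 15, 15), "image"),   # merges with the first image
    ((8, 8, 12, 12), "text"),    # drawn over the merged image
    ((9, 9, 11, 11), "image"),   # drawn over the text, so a new layer
])
for z, region in enumerate(layers):
    print(z, region)
```

In this example the first two image operations collapse into one region, while the last image stays separate because text was painted between them.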

Currently, the image extraction algorithm isn't using the alpha channel.  You would obviously need to fix that.  I don't know if the underlying Poppler library starts with a blank canvas (alpha = 0) or with a blank white canvas (alpha = 1, color = 0xFFFFFF), but I believe you would need the former, not the latter.
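To illustrate why the starting canvas matters, here is a small sketch in plain Python.  It does not use Poppler or Splash; the source_over function is a hypothetical helper implementing the standard "source over" compositing rule on non-premultiplied RGBA values in the range 0..1.

```python
def source_over(src, dst):
    # Composite one RGBA pixel (src) over another (dst).
    # Colors are non-premultiplied floats in 0..1.
    sr, sg, sb, sa = src
    dr, dg, db, da = dst
    out_a = sa + da * (1.0 - sa)
    if out_a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)

    def blend(s, d):
        return (s * sa + d * da * (1.0 - sa)) / out_a

    return (blend(sr, dr), blend(sg, dg), blend(sb, db), out_a)


TRANSPARENT = (0.0, 0.0, 0.0, 0.0)  # blank canvas, alpha = 0
WHITE = (1.0, 1.0, 1.0, 1.0)        # blank white canvas, alpha = 1

half_red = (1.0, 0.0, 0.0, 0.5)     # a 50% transparent red paint

# On the transparent canvas the paint keeps its own color and alpha,
# so HTML text placed underneath can still show through.
print(source_over(half_red, TRANSPARENT))

# On the white canvas the result is opaque pink; the information that
# the paint was translucent is lost, and untouched pixels are opaque too.
print(source_over(half_red, WHITE))
```

With the transparent canvas, pixels that no drawing operation touched stay at alpha = 0, which is what lets an extracted image sit above text without hiding it.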

Hope it helps.  Let us know what you're working on, so that we don't duplicate effort and create competing solutions to the same problems.  Btw., what Stephen and I are working on now is to fix text spacing.  Now that we're using the right font, we're much closer, but there are still a bunch of issues.  Stephen's already made some great progress, and we'll be submitting patches soon.

We look forward to working with you!

Best, --josh

From: Akash Agrawal <akash.agrawal84 at gmail.com>
Date: Wed, 20 Jul 2011 05:36:58 -0700
To: "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
Subject: [poppler] pdftohtml using Poppler

Hi All,

My name is Akash Agrawal and I am working on producing a full-fledged PDF to HTML solution. I investigated poppler and made a lot of custom changes for my requirements. I got your reference from the revision log in the pdftohtml source files. I would appreciate it if you could address my queries. I am currently stuck on 2 issues:

 1.  z-index
 2.  Fonts

z-index: In its current solution, poppler's pdftohtml puts all the non-text data into an image and uses this image as a background image in the HTML. But at times there are PDFs that have images/graphics over the text, and the current solution fails in such cases. I looked into the Gfx and OutputDevice code and didn't reach a good workable solution for this case. I would be highly indebted if you could suggest some pointers.

Fonts: Fonts are the biggest problem here. I saw that it currently outputs all fonts as Times (the default font), so I fixed that with exact font names (with a tag, because multiple versions of the same font might be present in a PDF). I also made non-horizontal text part of the image, because rotating the glyphs did not seem like a good idea given the time in hand. I am also able to extract font data, but I am facing difficulties extracting encoding info like the cmap etc. Your pointers on the same will be very much appreciated. FYI I am using fontforge to convert the extracted fonts to a common format (ttf in my case). I am thinking of applying cmaps using fontforge. Please let me know if you suggest otherwise.

I am waiting for a positive response from your side regarding the same. Looking forward to a strong technical relationship.

Regards,
Akash Agrawal
http://tech-queries.blogspot.com/
