[poppler] pdftohtml using Poppler

Wed Jul 20 05:36:58 PDT 2011

Hi All,

My name is Akash Agrawal and I am working on a full-fledged PDF-to-HTML
solution. I investigated poppler and made a lot of custom changes for my
requirements. I found your names in the revision log of the pdftohtml source
files. I would appreciate it if you could address my queries. I am currently
stuck on 2 issues:

   1. z-index
   2. Fonts

*z-index:* In its current implementation, poppler's pdftohtml renders all
the non-text data into a single image and uses that image as the background
image in the HTML. However, some PDFs have images or graphics drawn over the
text, and the current approach fails in such cases. I looked into the Gfx
and OutputDev code but did not reach a workable solution for this case. I
would be very grateful if you could suggest some pointers.
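
To make the layering problem concrete, here is a minimal sketch in Python.
It is not poppler code: the item list, its field names and layer_items() are
hypothetical, and it assumes the drawing operations of a page are already
available in paint order. Each item gets a z-index equal to its position in
that order, so an image painted after the text ends up above it.

```python
from html import escape

def layer_items(items):
    """Emit one absolutely positioned element per item.

    `items` is a hypothetical list of dicts in paint order; the z-index
    of each element is simply its index in that list.
    """
    parts = []
    for z, item in enumerate(items):
        style = (f"position:absolute; left:{item['x']}px; "
                 f"top:{item['y']}px; z-index:{z}")
        if item["kind"] == "text":
            parts.append(f'<div style="{style}">{escape(item["text"])}</div>')
        else:
            # Anything that is not text is treated as an image layer.
            parts.append(f'<img style="{style}" src="{escape(item["src"])}">')
    return "\n".join(parts)

# Paint order: background, then text, then a graphic drawn over the text.
page = [
    {"kind": "image", "x": 0, "y": 0, "src": "bg.png"},
    {"kind": "text", "x": 40, "y": 60, "text": "Hello"},
    {"kind": "image", "x": 30, "y": 50, "src": "overlay.png"},
]
html_out = layer_items(page)
```

With a single background image this ordering is lost, which is why the
overlay case fails; the sketch only shows the target HTML structure, not
how to obtain the paint order from Gfx.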

*Fonts:* Fonts are the biggest problem here. pdftohtml currently outputs all
fonts as Times (the default font), so I changed it to use the exact font
names (including the tag, because multiple versions of the same font may be
present in a PDF). I also render non-horizontal text as part of the image,
because rotating the glyphs did not seem like a good idea given the time
available. I am able to extract the font data, but I am having difficulty
extracting encoding information such as the cmap. Your pointers on this
would be very much appreciated. FYI, I am using fontforge to convert the
extracted fonts to a common format (TTF in my case). I am thinking of
applying the cmaps using fontforge. Please let me know if you would suggest
otherwise.
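
As an illustration of the kind of encoding information in question, here is
a minimal sketch in Python. It assumes the font carries a ToUnicode CMap and
that the stream has already been extracted and decompressed into text; the
sample data and parse_bfchar() are hypothetical, and the sketch handles only
bfchar entries, not bfrange.

```python
import re

# A bfchar section looks like: "N beginbfchar <src> <dst> ... endbfchar".
BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(cmap_text):
    """Map character codes to Unicode strings from bfchar sections."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in PAIR.findall(block):
            # The destination is a hex string of UTF-16BE code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

# Hypothetical sample: code 0x01 maps to "A", code 0x02 to the pair "fi".
sample = """
2 beginbfchar
<01> <0041>
<02> <00660069>
endbfchar
"""
mapping = parse_bfchar(sample)
```

A table like this could then be used to assign Unicode values to glyphs when
the converted font is written out, for example through fontforge.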

I look forward to your response and to a strong technical relationship.

Regards,
Akash Agrawal
http://tech-queries.blogspot.com/