[poppler] pdftohtml using Poppler

Albert Astals Cid aacid at kde.org
Wed Jul 20 11:03:45 PDT 2011


On Wednesday, 20 July 2011, Akash Agrawal wrote:
> Hi All,

Hi

> 
> My name is Akash Agrawal and I am working on producing a full-fledged pdf to
> html solution. I investigated poppler and made a lot of custom changes for
> my requirement. I found your name in the revision log of the pdftohtml source
> files.

No one on this list is among the original programmers of pdftohtml, so there is no one with much knowledge of it (I, for one, 
am basically unaware of most of the things it does or tries to do).

> I will appreciate if you can address my queries. I am stuck at 2
> issues currently:
> 
>    1. z-index
>    2. Fonts
> 
> *z-index:* In its current solution, poppler's pdftohtml puts all the
> non-text data into an image and uses this image as a background image in
> the HTML. But some PDFs have images/graphics over the text, and the
> current solution fails in such cases. I looked into the Gfx and OutputDevice
> code and did not find a workable solution for this case. I would be
> very grateful if you could suggest some pointers.

The guys from pdf.js render everything into an image and plan to expose the text to the user via some advanced 
HTML5/CSS3 trickery.
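
To illustrate the general idea only (this is not pdf.js code, and nothing here is a poppler API), here is a minimal Python sketch. It assumes the whole page has already been rendered to an image and that the position and size of each word are known in CSS pixels. The function name `overlay_page` and the `(text, x, y, font_size)` tuples are hypothetical. The image carries the visual appearance, so z-order between text and graphics no longer matters; the text is layered on top as transparent, absolutely positioned spans so it stays selectable and searchable.

```python
from html import escape


def overlay_page(image_url, width, height, words):
    """Return an HTML fragment for one page.

    The rendered page image is the background of a positioned <div>;
    each word becomes a transparent, absolutely positioned <span>.

    words: iterable of (text, x, y, font_size) tuples in CSS pixels
           (hypothetical layout data, not produced by any poppler call).
    """
    spans = []
    for text, x, y, size in words:
        spans.append(
            '<span style="position:absolute;left:{}px;top:{}px;'
            'font-size:{}px;color:transparent">{}</span>'.format(
                x, y, size, escape(text)
            )
        )
    return (
        '<div style="position:relative;width:{}px;height:{}px;'
        "background:url('{}') no-repeat\">{}</div>".format(
            width, height, escape(image_url), "".join(spans)
        )
    )


if __name__ == "__main__":
    print(overlay_page("page1.png", 600, 800, [("Hello", 10, 20, 12)]))
```

Matching the span widths to the rendered glyphs is the hard part and is left out here.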

> 
> *Fonts:* Fonts are the biggest problem here. I saw that it currently
> outputs all fonts as Times (the default font), so I fixed that with exact font
> names (with the tag, because multiple versions of the same font might be present
> in the PDF). I also made non-horizontal text part of the image, because rotating
> the glyphs did not seem a good idea to me given the time available. I am also
> able to extract font data, but I am having difficulty extracting encoding info
> such as the cmap.

CMaps are extracted in the CMap.cc file. You might also want to have a look at FoFiTrueType::writeTTF, which is supposed to write a 
"corrected" TTF file to disk; it dates from when we did not use the FreeType memory functions.
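
If what you need is the code-to-Unicode mapping, and assuming the font has a ToUnicode CMap whose stream you have already extracted and decompressed, a rough Python sketch of reading its `beginbfchar` ... `endbfchar` sections could look like the following. This is not what CMap.cc does; the function name `parse_bfchar` is hypothetical, and `bfrange` sections and everything else in the CMap are ignored.

```python
import re


def parse_bfchar(cmap_text):
    """Map character codes to Unicode strings.

    Only the beginbfchar/endbfchar sections of a ToUnicode CMap are read;
    bfrange sections are ignored in this sketch.
    """
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        # Each entry is a pair of hex strings: <source code> <destination>.
        for src, dst in re.findall(
            r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section
        ):
            # The destination is UTF-16BE text written as hex digits.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


if __name__ == "__main__":
    sample = "2 beginbfchar\n<0001> <0041>\n<0002> <00660069>\nendbfchar\n"
    print(parse_bfchar(sample))
```

Note that this PDF-level CMap is a different thing from the `cmap` table inside a TrueType font file.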

> Your pointers on the same will be very much appreciated. FYI,
> I am using fontforge to convert the extracted fonts to a common format (ttf in
> my case). I am thinking of applying the cmaps using fontforge. Please let me know if
> you suggest otherwise.

Have a look at the list archives; there was already a discussion about extracting fonts from PDF files, and some people suggested you might get 
sued if you do that.

On the other hand, I wonder whether you should not just be helping the Mozilla people implementing pdf.js, since that will mean PDF 
viewing in browsers, which is what you seem to want.

Albert

> 
> I am waiting for a positive response from your side regarding the same.
> Looking forward to a strong technical relationship.
> 
> Regards,
> Akash Agrawal
> http://tech-queries.blogspot.com/

