[poppler] pdftohtml using Poppler

Josh Richardson jric at chegg.com
Wed Jul 20 11:39:51 PDT 2011


Pdf.js is an interesting project, but it's light-years behind Poppler in
its ability to draw a PDF accurately, and in its support for a variety of
browsers.

Poppler's solution also has some other advantages:

  * Accessible environment.  If you're planning to interact with or parse
    any of the "elements" in the rendered PDF, everyone knows how to do
    that with HTML.  With a canvas-rendered PDF, I don't think that's
    going to be as easy.
  * Pre-formatted HTML.  Having a pre-rendered HTML document is always
    going to be faster to view than spinning up a JavaScript engine to
    render everything "on the fly".
  * No JS needed.  A lot of people have JavaScript turned off for security
    reasons.  I was told by a credible source that around 30% of American
    workplaces do not allow JS-enabled web-browsing.
  * Added semantics.  Poppler's "coalescence" functionality creates
    meaning in some cases where there is none in the underlying PDF.
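
To illustrate the first point, here is a minimal sketch of pulling
positioned text back out of HTML using only Python's standard library.
The SAMPLE markup is an assumption: it is hand-written to resemble
absolutely positioned paragraphs and is not actual pdftohtml output, and
the names PositionedText and extract_text are made up for this example.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: absolutely positioned <p> elements inside a page div.
# Real converter output may use different tags, attributes, or units.
SAMPLE = """
<div id="page1-div" style="position:relative;width:892px;height:1263px;">
<p style="position:absolute;top:140px;left:50px;white-space:nowrap" class="ft01">Hello, <b>world</b></p>
<p style="position:absolute;top:100px;left:50px;white-space:nowrap" class="ft00">Chapter 1</p>
</div>
"""


class PositionedText(HTMLParser):
    """Collects (top, left, text) for every <p> with top/left in its style."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._pos = None   # (top, left) of the <p> currently being read
        self._buf = []     # text fragments seen inside that <p>

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        style = dict(attrs).get("style") or ""
        top = re.search(r"top:\s*(\d+)px", style)
        left = re.search(r"left:\s*(\d+)px", style)
        if top and left:
            self._pos = (int(top.group(1)), int(left.group(1)))
            self._buf = []

    def handle_data(self, data):
        if self._pos is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._pos is not None:
            self.items.append((self._pos[0], self._pos[1], "".join(self._buf)))
            self._pos = None


def extract_text(html):
    """Return positioned text sorted top-to-bottom, then left-to-right."""
    parser = PositionedText()
    parser.feed(html)
    parser.close()
    return sorted(parser.items)


if __name__ == "__main__":
    for top, left, text in extract_text(SAMPLE):
        print(top, left, text)
```

Nothing comparable is available for a canvas, which only holds pixels
once the page has been drawn.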





On 7/20/11 11:03 AM, "Albert Astals Cid" <aacid at kde.org> wrote:

>On Wednesday, 20 July 2011, Akash Agrawal wrote:
>> Hi All,
>
>Hi
>
>> 
>> My name is Akash Agrawal and I am working on producing a full-fledged
>> pdf to html solution. I investigated poppler and made a lot of custom
>> changes for my requirements. I got your reference from the revision
>> log in the pdftohtml source files.
>
>No one on this list is among the original programmers of pdftohtml, so
>there is no one with much knowledge of it (I, for one,
>basically ignore most of the things it does or tries to do)
>
>> I would appreciate it if you could address my queries. I am currently
>> stuck on 2 issues:
>> 
>>    1. z-index
>>    2. Fonts
>> 
>> *z-index:* In its current solution, poppler's pdftohtml puts all the
>> non-text data into an image and uses this image as a background image
>> in the html. But at times there are pdfs which have images/graphics
>> over the text, and the current solution fails in such cases. I looked
>> into the Gfx and OutputDevice code and didn't reach a good workable
>> solution for this case. I would be highly indebted if you could
>> suggest some pointers.
>
>The guys from pdf.js render everything into an image and then they are
>planning on exposing the text to the user via some advanced
>html5/css3 trickery.
>
>> 
>> *Fonts:* Fonts are the biggest problem here. I saw that currently it
>> outputs all fonts as Times (the default font), so I fixed that with
>> exact font names (with a tag, because multiple versions of the same
>> font might be present in a pdf). I also made non-horizontal text part
>> of the image, because rotating the glyphs did not seem like a good
>> idea to me given the time in hand. I am also able to extract font
>> data but am facing difficulties extracting encoding info like cmap
>> etc.
>
>CMaps are extracted in the CMap.cc file. You might also want to have a
>look at FoFiTrueType::writeTTF that is supposed to write a
>"corrected" TTF file to disk from back when we did not use FreeType
>memory functions.
>
>> Your pointers on the same will be very much appreciated. FYI, I am
>> using fontforge to convert extracted fonts to a common format (ttf in
>> my case). I am thinking of applying cmaps using fontforge. Please let
>> me know if you suggest otherwise.
>
>Have a look at the list, there was a discussion already on extracting
>fonts from PDF files and some people suggested you might get
>sued if you do that.
>
>On the other hand, I wonder if you guys should not just be helping the
>mozilla dudes that implement pdf.js, since that would mean pdf
>viewing in browsers, which is what you seem to want.
>
>Albert
>
>> 
>> I am waiting for a positive response from your side regarding the same.
>> Looking forward to a strong technical relationship.
>> 
>> Regards,
>> Akash Agrawal
>> http://tech-queries.blogspot.com/
>_______________________________________________
>poppler mailing list
>poppler at lists.freedesktop.org
>http://lists.freedesktop.org/mailman/listinfo/poppler



