[poppler] poppler util pdftohtml

Thu Sep 22 12:03:18 PDT 2011

Justine, I have done some recent work on the pdftohtml utility.  I recommend grabbing the latest version from Git and building it; it's not too hard.

The latest version of pdftohtml generates valid XHTML and has the width and height in the body div.
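
Before sending the output to the W3C validator, a quick sanity check is to see whether it parses as XML at all. Here is a minimal Python sketch of that check. The function name is hypothetical and the sample strings are hand-written, not actual pdftohtml output. Note that this only tests well-formedness, which is weaker than validity, and that this parser will also reject pages using named entities such as &nbsp; unless they are declared in the document itself.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML (well-formed, not necessarily valid)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Hand-written samples (assumptions, not real pdftohtml output).
    good = "<html><body><div>text</div></body></html>"
    bad = "<html><body><br></body></html>"  # unclosed <br>, as in HTML 4 style
    print(is_well_formed(good))
    print(is_well_formed(bad))
```

Anything that fails this check will certainly fail XHTML validation, so it is a cheap first filter.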

Your problem with Arabic may also be fixed in the latest version (make sure you use the "complex" option, -c). I don't see how backwards rendering could happen with that version, but if it does, I'd love to know more about it.  I would expect it to write out the characters in the same order in which they are printed on the page, from left to right.  This could cause issues with ligatures, however.
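
To illustrate why writing characters in page order can look "reversed" for Arabic: the text is stored logically from first letter to last, but drawn right to left, so reading the glyphs off the page from left to right yields the logical string backwards. The following Python sketch shows the idea. It is not what pdftohtml does internally; the function name is hypothetical, and the approach is deliberately naive (it only handles runs that are entirely right-to-left, and it does nothing about ligatures or presentation forms).

```python
import unicodedata

STRONG_RTL = {"R", "AL"}       # Hebrew-like and Arabic letters
STRONG = STRONG_RTL | {"L"}    # all strongly directional classes


def visual_to_logical(run: str) -> str:
    """Reverse a run given in left-to-right page order if it is purely right-to-left.

    Runs containing any strong left-to-right character are returned unchanged,
    since mixed-direction text needs the full Unicode bidirectional algorithm.
    """
    classes = [unicodedata.bidirectional(ch) for ch in run]
    strong = [c for c in classes if c in STRONG]
    if strong and all(c in STRONG_RTL for c in strong):
        return run[::-1]
    return run


if __name__ == "__main__":
    logical = "\u0633\u0644\u0627\u0645"   # an Arabic word in logical order
    page_order = logical[::-1]             # the same letters read left to right off the page
    print(visual_to_logical(page_order) == logical)
    print(visual_to_logical("abc"))
```

If the HTML you get looks like page_order above, that would match the symptom you describe.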

--josh

From: Justine Guillaumont <justine.guillaumont at gmail.com>
Date: Thu, 22 Sep 2011 04:23:02 -0700
To: poppler at lists.freedesktop.org
Subject: [poppler] poppler util pdftohtml

Hi all,

My name is Justine Guillaumont, and I am completing my engineering studies with a 6-month internship. I am working on the open-source project WebLab (http://weblab-project.org/).
I am currently using poppler-0.16.7 (I tried to install poppler-0.17.4 but libpoppler.so.17 is missing).
One of the purposes of my internship is to transform PDF files into XHTML files that give the same structured display. To do this, I use pdftohtml -nodrm -p -s (to obtain HTML) and then a script and XSL (to obtain XHTML). I encountered several problems with pdftohtml that I would like to share to get your opinion.

1) Would it be possible to have the width and height on the DIV tag in the BODY?
I noticed that we have them with pdftohtml -xml (in the TEXT tags) but not with pdftohtml -nodrm -p -s. I tried to modify your code (HtmlOutputDev.cc), but I only succeeded in collecting the width and height of the first word of the DIV.

2) The HTML generated by pdftohtml does not pass the W3C validator (http://validator.w3.org/).
That is a pity, because not much needs to change to obtain valid HTML 4 or XHTML. If you like, I can send you the XSL I wrote to transform the HTML generated by pdftohtml -p -s into valid HTML 4.

3) With Arabic PDFs, pdftohtml seems to read the PDF correctly (from right to left) but writes the HTML backwards (from left to right). All words are reversed. Will that be corrected soon?
Please find attached an example of this problem.

Regards,

Justine Guillaumont