[poppler] poppler util pdftohtml

Justine Guillaumont justine.guillaumont at gmail.com
Thu Sep 22 04:23:02 PDT 2011


Hi all,

My name is Justine Guillaumont, and I am completing my engineering studies
with a six-month internship. I am working on the open-source project WebLab
(weblab-project.org).
I am currently using poppler-0.16.7 (I tried to install poppler-0.17.4, but
libpoppler.so.17 is missing).
One of the goals of my internship is to transform PDF files into XHTML
files that reproduce the same structured display. To do this, I use
pdftohtml -nodrm -p -s (to obtain HTML) and then a script and XSL (to
obtain XHTML). I encountered several problems with pdftohtml that I would
like to share in order to get your opinion.
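
For reference, the first step of this pipeline can be sketched in Python as
follows. This is only a minimal sketch: the function names and the file names
are hypothetical, and it assumes pdftohtml is on the PATH and accepts an
output base name after the PDF file. Only the three flags come from the
command above.

```python
import subprocess


def build_pdftohtml_command(pdf_path, output_base):
    """Return the argument list for the pdftohtml call described above.

    Both arguments are placeholders chosen for this sketch.
    """
    # -nodrm, -p and -s are the flags quoted in this message.
    return ["pdftohtml", "-nodrm", "-p", "-s", pdf_path, output_base]


def convert(pdf_path, output_base):
    """Run pdftohtml; the script and XSL step would follow separately."""
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_pdftohtml_command(pdf_path, output_base), check=True)
```

Calling convert("input.pdf", "output") would then produce the HTML that the
script and the XSL take as input.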

1) Would it be possible to have the width and height of the DIV tags in the
BODY?
I noticed that we have them with pdftohtml -xml (in the TEXT tags) but not
with pdftohtml -nodrm -p -s. I tried to modify your code (HtmlOutputDev.cc),
but I only succeeded in collecting the width and height of the first word of
the DIV.

2) The HTML generated by pdftohtml does not pass W3C validation (
http://validator.w3.org/).
This is a pity, because little needs to change to obtain valid HTML 4 or
XHTML. If you like, I can send you the XSL I wrote to transform the HTML
generated by pdftohtml -p -s into valid HTML 4.

3) With Arabic PDFs, pdftohtml seems to read the PDF correctly (from right
to left) but writes the HTML backwards (from left to right).
All words are reversed. Will this be corrected soon?
Please find attached an example of this problem.
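
To make point 1 concrete, here is a minimal Python sketch of the values I
would like on each DIV: the box enclosing all the text pieces of a block, not
only the first word. It assumes the top, left, width and height attributes
that pdftohtml -xml writes on its TEXT tags. The sample XML and the function
name are made up for illustration, and this is not how HtmlOutputDev.cc works.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like pdftohtml -xml output; the values are invented.
SAMPLE_XML = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <text top="100" left="50" width="40" height="12" font="0">Hello</text>
    <text top="100" left="95" width="60" height="12" font="0">world</text>
    <text top="115" left="50" width="80" height="14" font="0">again</text>
  </page>
</pdf2xml>"""


def block_bounding_box(text_elements):
    """Return (left, top, width, height) enclosing all given text elements."""
    boxes = [
        (int(e.get("left")), int(e.get("top")),
         int(e.get("width")), int(e.get("height")))
        for e in text_elements
    ]
    if not boxes:
        return None
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    # The right and bottom edges are the furthest extent of any piece.
    right = max(b[0] + b[2] for b in boxes)
    bottom = max(b[1] + b[3] for b in boxes)
    return (left, top, right - left, bottom - top)


root = ET.fromstring(SAMPLE_XML)
print(block_bounding_box(root.iter("text")))
```

Here the three pieces are treated as one block for simplicity; deciding which
pieces belong to the same DIV is the part that would have to happen inside
pdftohtml.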

Regards,

Justine Guillaumont
-------------- next part --------------
A non-text attachment was scrubbed...
Name: arabic_pdf_example.tar.gz
Type: application/x-gzip
Size: 116482 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20110922/744ca817/attachment-0001.bin>