Hi all,<br><br>My name is Justine Guillaumont, and I am completing my
engineering studies with a 6-month internship. I am working on the
open-source project WebLab (<a href="http://weblab-project.org/" target="_blank">weblab-project.org</a>).<br>
I am currently using poppler-0.16.7 (I tried to install poppler-0.17.4 but libpoppler.so.17 is missing).<br>
One of the purposes of my internship is to transform PDF files into
XHTML files that give the same structured display. To
do this, I use pdftohtml -nodrm -p -s (to obtain HTML) and then a
script and XSL (to obtain XHTML). <span lang="en"><span>I encountered</span> <span>several problems</span></span> with pdftohtml that <span lang="en"><span>I</span> <span>would like to share in order to get your opinion.<br>
<br>1) Would it be possible to have the width and height on the DIV tags in the BODY?<br>I noticed that we have them with pdftohtml -xml (in the TEXT tags) but not with pdftohtml </span></span>
-nodrm -p -s. I tried to modify your code (HtmlOutputDev.cc), but I
only succeeded in collecting the width and height of the first word of the
DIV.<br>
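To illustrate the information I am after, here is a minimal Python sketch (not my actual script) that reads the box attributes from -xml output. The sample XML is a made-up fragment that only assumes the top, left, width and height attributes I see on the TEXT tags; the function name text_boxes is hypothetical.<br>

```python
import xml.etree.ElementTree as ET

# Made-up fragment shaped like `pdftohtml -xml` output; all values are illustrative.
SAMPLE = """<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
    <text top="100" left="50" width="200" height="20" font="0">First line</text>
    <text top="125" left="50" width="180" height="20" font="0">Second line</text>
  </page>
</pdf2xml>"""


def text_boxes(xml_string):
    """Return one dict of integer box attributes per <text> element."""
    root = ET.fromstring(xml_string)
    keys = ("top", "left", "width", "height")
    return [{k: int(el.get(k)) for k in keys} for el in root.iter("text")]


if __name__ == "__main__":
    for box in text_boxes(SAMPLE):
        print(box)
```

Having the same four values on each DIV of the -p -s output is what I would need.<br>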
<br>2) The HTML generated by pdftohtml does not pass the W3C validator (<a href="http://validator.w3.org/" target="_blank">http://validator.w3.org/</a>).<br>This
is a pity, because little would need to change to obtain valid HTML 4 or
XHTML. If you like, I can send you the XSL I wrote to transform the HTML
generated by pdftohtml -p -s into valid HTML 4.<br>
<br>
3) With Arabic PDFs, pdftohtml seems to read the PDF correctly (from
right to left) but to write the HTML backwards (from left
to right). <span lang="en"><span>All words</span> <span>are reversed</span></span>. Could this be corrected soon?<br>
Please find attached an example of this problem.<br><br>Regards,<br><br>Justine Guillaumont<font color="#888888"><br>
</font>