<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: rgb(0, 0, 0); font-size: 14px; font-family: Calibri, sans-serif; "><div>Justine, I have done some recent work on the pdftohtml utility. I recommend grabbing the latest version from Git and building it. It's not too hard.</div><div><br></div><div>The latest version of pdftohtml generates valid XHTML and includes the width and height in the body div.</div><div><br></div><div>Your problem with Arabic may also be fixed in the latest version (make sure you use the "complex" option). I don't see how backwards rendering could happen with that version, but if it does, I'd love to know more about it. I would expect it to write out the characters in the same order in which they are printed on the page, from left to right. This could cause issues with ligatures, however.</div><div><br></div><div>--josh</div><div><br></div><span id="OLK_SRC_BODY_SECTION"><div style="font-family:Calibri; font-size:11pt; text-align:left; color:black; BORDER-BOTTOM: medium none; BORDER-LEFT: medium none; PADDING-BOTTOM: 0in; PADDING-LEFT: 0in; PADDING-RIGHT: 0in; BORDER-TOP: #b5c4df 1pt solid; BORDER-RIGHT: medium none; PADDING-TOP: 3pt"><span style="font-weight:bold">From: </span> Justine Guillaumont <<a href="mailto:justine.guillaumont@gmail.com">justine.guillaumont@gmail.com</a>><br><span style="font-weight:bold">Date: </span> Thu, 22 Sep 2011 04:23:02 -0700<br><span style="font-weight:bold">To: </span> "<a href="mailto:poppler@lists.freedesktop.org">poppler@lists.freedesktop.org</a>" <<a href="mailto:poppler@lists.freedesktop.org">poppler@lists.freedesktop.org</a>><br><span style="font-weight:bold">Subject: </span> [poppler] poppler util pdftohtml<br></div><div><br></div>Hi all,<br><br>My name is Justine Guillaumont, and I am completing my
engineering studies with a six-month internship. I am working on the
open-source project WebLab (<a href="http://weblab-project.org/" target="_blank">weblab-project.org</a>).<br>
I am currently using poppler-0.16.7 (I tried to install poppler-0.17.4, but libpoppler.so.17 was missing).<br>
One of the goals of my internship is to transform PDF files into
XHTML files that give the same structured display. To
do this, I use pdftohtml -nodrm -p -s (to obtain HTML) and then a
script and XSL (to obtain XHTML). <span lang="en"><span>I encountered</span> <span>several problems</span></span> with pdftohtml that <span lang="en"><span>I</span> <span>would like to share in order to get your opinion.<br><br>1) Would it be possible to have the width and height of the DIV tag in the BODY?<br>I noticed that we have it with pdftohtml -xml (in the TEXT tags) but not with pdftohtml </span></span>
-nodrm -p -s. I tried to modify your code (HtmlOutputDev.cc), but I
only managed to collect the width and height of the first word of the
DIV.<br><br>2) The HTML generated by pdftohtml does not validate with the W3C validator (<a href="http://validator.w3.org/" target="_blank">http://validator.w3.org/</a>).<br>That
is a pity, because little needs to change to obtain valid HTML 4 or
XHTML. If you like, I can send you the XSL I wrote to transform the HTML
generated by pdftohtml -p -s into valid HTML 4.<br><br>
3) With Arabic PDFs, pdftohtml seems to read the PDF correctly (from
right to left) but to write the HTML backwards (from left
to right). <span lang="en"><span>All words</span> <span>are reversed</span></span>. Could that be corrected soon?<br>
Please find attached an example of this problem.<br><br>Regards,<br><br>Justine Guillaumont<font color="#888888"><br></font></span></body></html>