1) I compared the rendering of pdftohtml with [-c], [-s] and [-c -s]. <br>Using -c -s together does not produce valid xhtml, because the output file contains several &lt;html&gt; elements (as with -s alone). You could merge the contents of all the &lt;style&gt; elements, do the same for the contents of the &lt;body&gt; elements, and the result would be a single xhtml file. <br>
I tried to modify your code to do this but did not manage to, which is why I wrote an XSL stylesheet rather than a patch. Maybe you will have more success. Would it be easy for you?<br><br>Did you notice that -c splits the text differently (more precisely) and renders font-size, font-color and text-align better? Is that expected?<br>
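<br>A minimal sketch of the merge described in 1), written in Python rather than XSL. It is not pdftohtml code: the names <code>Merger</code> and <code>merge_pages</code> are hypothetical, and it assumes the input is simply several complete html documents concatenated in one file, with plain CSS text inside each style element.<br>
<pre>
```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have an end tag in HTML (partial list, enough for a sketch).
VOID = {"br", "hr", "img", "meta", "link", "input"}


class Merger(HTMLParser):
    """Collects every style text and every body child from concatenated pages."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.styles = []                 # text of each style element seen
        self.body = ET.Element("body")   # single merged body
        self.stack = [self.body]         # open elements inside the current body
        self.in_style = False
        self.in_body = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True
        elif tag == "body":
            self.in_body = True
            self.stack = [self.body]     # each page appends to the same body
        elif self.in_body:
            el = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
            if tag not in VOID:
                self.stack.append(el)

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
        elif tag == "body":
            self.in_body = False
        elif self.in_body and len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.in_style:
            self.styles.append(data)
        elif self.in_body:
            parent = self.stack[-1]
            if len(parent):              # text after a child belongs in its tail
                parent[-1].tail = (parent[-1].tail or "") + data
            else:
                parent.text = (parent.text or "") + data


def merge_pages(markup: str) -> str:
    """Return one XML-serialised document with a single head and body."""
    merger = Merger()
    merger.feed(markup)
    merger.close()
    html = ET.Element("html", {"xmlns": "http://www.w3.org/1999/xhtml"})
    head = ET.SubElement(html, "head")
    style = ET.SubElement(head, "style", {"type": "text/css"})
    style.text = "\n".join(merger.styles)
    html.append(merger.body)
    return ET.tostring(html, encoding="unicode")
```
</pre>
Reading the pdftohtml output into a string and passing it to merge_pages returns one document with a single head and a single body; void elements such as br come out self-closed, so the result parses as XML.<br>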
<br>2) I looked carefully at your file and it suits me well! I'm looking forward to using your next stable version!<br><br>3) As I said in 1), I don't have a good grasp of your code, but I hope you will find a way to handle right-to-left text!<br>
<br>Justine<br><br><br><div class="gmail_quote">2011/9/24 Josh Richardson <span dir="ltr"><<a href="mailto:jric@chegg.com">jric@chegg.com</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div style="word-wrap:break-word;color:rgb(0, 0, 0);font-size:14px;font-family:Calibri, sans-serif"><div>Sorry for the delay — been on an airplane all day — and had a lot of emails to read on the list. ;-)</div><div><br>
</div><div>1) You can use both -s and -c at the same time.</div><div>2) Ok, it was worth a shot. I've lost track a little of where the code base is. I haven't yet contributed everything back, just because it takes time to format the patches. I definitely have code that embeds the size of each paragraph; well, at least I think it's what you want. I've attached a sample file, so let me know.</div>
<div>3) I'm a little surprised, but yes, I confirmed that the Arabic shows up in the wrong direction even in my version. Looks like we'll need to do some work to make it handle right-to-left text correctly. If you want to write the patch, contact me off-list and I'll try and help you do it.</div>
<div><br></div><div>--josh</div><div><br></div><span><div style="font-family:Calibri;font-size:11pt;text-align:left;color:black;border-bottom:medium none;border-left:medium none;padding-bottom:0in;padding-left:0in;padding-right:0in;border-top:#b5c4df 1pt solid;border-right:medium none;padding-top:3pt">
<span style="font-weight:bold">From: </span> Justine Guillaumont <<a href="mailto:justine.guillaumont@gmail.com" target="_blank">justine.guillaumont@gmail.com</a>><br><span style="font-weight:bold">Date: </span> Fri, 23 Sep 2011 04:35:52 -0700<br>
<span style="font-weight:bold">To: </span> "<a href="mailto:poppler@lists.freedesktop.org" target="_blank">poppler@lists.freedesktop.org</a>" <<a href="mailto:poppler@lists.freedesktop.org" target="_blank">poppler@lists.freedesktop.org</a>><br>
<span style="font-weight:bold">Subject: </span> [poppler] pdftohtml (width-height and Arabic pdf)<br></div><div><div></div><div class="h5"><div><br></div>Hi,<br><br>It seems that the thread from my first email has drifted off topic...
I am opening this new thread to let you finish your conversation in the
other one.<br><br>Thank you for your advice, Josh. I finally managed to build the latest version from Git! But my problems are the same...<br><br>1)
pdftohtml -c does indeed generate xhtml, but I prefer the output of
pdftohtml -s (all the pages in one html file). I will keep (and modify) my
XSL to obtain xhtml from the pdftohtml -s output.<br><br>2) The &lt;div&gt; I was talking about (in version 0.16.7) has been
replaced by &lt;p&gt; in the latest version, and these don't contain
width and height either...<br>Example: &lt;P style="position:absolute;top:2187px;left:364px;white-space:nowrap" class="ft01"&gt;<br><div><div><br>3) I tried several Arabic PDFs with the latest version and
got the same results (with pdftohtml -c and pdftohtml -s): all the
text is backwards (see enclosure). Do you have an Arabic PDF that
renders correctly?<br><br>Justine<br></div></div></div></div></span></div>
</blockquote></div><br>