[poppler] pdftohtml (width-height and Arabic pdf)

Justine Guillaumont justine.guillaumont at gmail.com
Mon Sep 26 03:14:04 PDT 2011

1) I compared the output of pdftohtml with -c, with -s, and with -c -s.
Combining -c and -s does not produce XHTML, because the output file then
contains several <html> elements (as it does with -s alone). You could merge
the contents of the <style> elements, do the same for the contents of the
<body> elements, and obtain a single XHTML document.
I tried to modify your code to do this, but I could not manage it... That is
why I wrote an XSL stylesheet rather than a patch. Maybe you will manage it.
Would it be easy for you?
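
The merge described in 1) can be sketched outside of poppler as a small
post-processing step. This is only an illustration, not the XSL mentioned
above and not poppler code: the function name merge_html_documents is
hypothetical, and it assumes the input is a plain concatenation of complete
<html> documents, each with its CSS in <style> elements and its content in
one <body>. The result still needs a doctype, a namespace and well-formed
tags before it is valid XHTML.

```python
import re

# Capture the text between each pair of <style> or <body> tags.
# IGNORECASE because the tags may be written in upper case.
STYLE_RE = re.compile(r"<style[^>]*>(.*?)</style>", re.IGNORECASE | re.DOTALL)
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)


def merge_html_documents(text):
    """Merge every <style> and <body> found in `text` into one document."""
    styles = [s.strip() for s in STYLE_RE.findall(text)]
    bodies = [b.strip() for b in BODY_RE.findall(text)]
    return (
        '<html>\n<head>\n<style type="text/css">\n'
        + "\n".join(styles)
        + "\n</style>\n</head>\n<body>\n"
        + "\n".join(bodies)
        + "\n</body>\n</html>\n"
    )
```

Attributes on the original <body> tags (for example a background color) are
dropped in this sketch; a real fix would have to decide how to carry them
over.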

Did you notice that using -c gives a different (more precise) splitting of
the text and a better rendering of font-size, font color and text-align?
Is that expected?

2) I looked carefully at your file and it suits me well! I'm looking
forward to using your next stable version!

3) As I said in 1), I could not find my way around your code, but I hope you
will find a way to handle right-to-left text!
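
To make the right-to-left problem in 3) concrete, here is a minimal sketch of
one possible post-processing approach. It is not how poppler handles text:
the function names is_rtl_run and visual_to_logical are hypothetical, and the
sketch assumes the backwards text is a pure right-to-left run stored in
visual (left-to-right) order. It would mishandle combining marks and embedded
numbers or Latin words, which need the full Unicode bidirectional algorithm.

```python
import unicodedata

# Bidirectional classes of strong right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def is_rtl_run(text):
    """Return True if strong RTL characters outnumber strong LTR ones."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    rtl = sum(1 for c in classes if c in RTL_CLASSES)
    ltr = sum(1 for c in classes if c == "L")
    return rtl > ltr


def visual_to_logical(text):
    """Reverse a run that looks right-to-left; leave other text unchanged."""
    return text[::-1] if is_rtl_run(text) else text
```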


2011/9/24 Josh Richardson <jric at chegg.com>

> Sorry for the delay — been on an airplane all day — and had a lot of emails
> to read on the list.  ;-)
> 1) You can use both -s and -c at the same time.
> 2) Ok, was worth a shot.  I've lost track a little bit where the code base
> is — I haven't yet contributed back everything, just because it takes time
> to format the patches.  I definitely have code that embeds the size of each
> paragraph — well, at least I think it's what you want.  I've attached a
> sample file — let me know.
> 3) I'm a little surprised, but yes, I confirmed that the Arabic shows up in
> the wrong direction even in my version.  Looks like we'll need to do some
> work to make it handle right-to-left text correctly.  If you want to write
> the patch, contact me off-list and I'll try and help you do it.
> --josh
> From: Justine Guillaumont <justine.guillaumont at gmail.com>
> Date: Fri, 23 Sep 2011 04:35:52 -0700
> To: "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
> Subject: [poppler] pdftohtml (width-height and Arabic pdf)
> Hi,
> It seems that the thread from my first email has diverged... I am opening
> this new thread to let you finish your conversation on the other one.
> Thank you for your advice, Josh. I finally succeeded in building the latest
> version from Git! But my problems are the same...
> 1) pdftohtml -c does indeed generate XHTML, but I prefer the output of
> pdftohtml -s (all the pages in one HTML file). I will keep (and modify) my
> XSL to obtain XHTML from pdftohtml -s.
> 2) The <div> I was talking about (in version 0.16.7) has been replaced by
> <p> in the latest version, and it does not contain width and height
> either...
> Example : <P style="position:absolute;top:2187px;left:364px;white-space:
> nowrap" class="ft01">
> 3) I tried several Arabic PDFs with the latest version and obtained
> the same results (with pdftohtml -c and pdftohtml -s): all the text is
> backwards (see enclosure). Do you have an Arabic PDF that renders correctly?
> Justine