[poppler] poppler util pdftohtml

Thu Sep 22 12:20:45 PDT 2011

On 22 Sep 2011, at 20:03, Josh Richardson wrote:

> Justine, I have done some recent work on the pdftohtml utility.  I recommend grabbing the latest version from Git and building it.  It's not too hard.
> 
> The latest version of pdftohtml generates valid XHTML and has the width and height in the body div.
> 
> It may also be that your problem with Arabic is fixed in the latest version (make sure you use the "complex" option). I don't see how backwards rendering could happen with that version, but if it does, I'd love to know more about it.  I would expect it to write out the characters in the same order in which they are printed on the page, from left to right.  This could cause issues with ligatures, however.
> 

Outputting Arabic characters in left-to-right order is a serious problem, because the script is written from right to left. If you emit the characters in LTR order, as you describe, then the text *will* be "backwards", and will display incorrectly in any application that actually supports RTL text. (Unless you surround them with Unicode directional-override characters or with markup that forces LTR directionality - but that would make for a very awkward-to-use document for most purposes.)
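
To make the ordering issue concrete, here is a minimal Python sketch. It is not pdftohtml's code; the helper names (is_rtl, visual_ltr_to_logical) are hypothetical, and it assumes a run made up only of strong right-to-left characters, with no digits, combining marks or mixed-direction text, all of which a real converter would have to handle.

```python
import unicodedata

LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def is_rtl(ch):
    """True for characters with a strong right-to-left bidi class."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def visual_ltr_to_logical(run):
    """Reverse a run that was emitted in left-to-right visual order.

    Simplifying assumption: only runs made up entirely of strong RTL
    characters are reversed; anything else is returned unchanged.
    """
    if run and all(is_rtl(c) for c in run):
        return run[::-1]
    return run


# Logical (reading) order of a four-letter Arabic word: seen, lam, alef, meem.
logical = "\u0633\u0644\u0627\u0645"

# What a converter would emit if it wrote glyphs as they sit on the page,
# scanning from left to right.
visual_ltr = logical[::-1]

# The stored string differs from the real text, so a bidi-aware viewer
# shows it backwards; searching and copying break as well.
print(visual_ltr == logical)

# Restoring logical order for this simple case.
print(visual_ltr_to_logical(visual_ltr) == logical)

# The workaround mentioned above: force LTR display with an override.
# It looks right on screen, but the underlying text is still reversed.
forced = LRO + visual_ltr + PDF
print(len(forced))
```

The override variant shows why that workaround is awkward: the characters remain in the wrong order for any consumer that reads the text rather than just displaying it.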

More generally, it is not possible to recreate useful XHTML (or similar) documents from arbitrary PDF files with anything like 100% reliability, because many PDF files do not contain adequate information to accurately map the rendered glyphs back to correct Unicode text, or to reliably reconstruct the proper flow of text. Constructs such as ActualText may help, but are often lacking from real-world PDF documents.

*If* you can carefully control the details of the PDF generation process, it may be possible to do such a reconstruction (but if you control the PDF generation, why don't you just keep a copy of the original "source" document on hand?); but if you want to accept *any* PDF and generate XHTML, then the results will be of variable quality, and cannot be regarded as more than a "rough draft" that requires careful review.

JK

> --josh
> 
> From: Justine Guillaumont <justine.guillaumont at gmail.com>
> Date: Thu, 22 Sep 2011 04:23:02 -0700
> To: "poppler at lists.freedesktop.org" <poppler at lists.freedesktop.org>
> Subject: [poppler] poppler util pdftohtml
> 
> Hi all,
> 
> My name is Justine Guillaumont, and I am completing my engineering studies with a 6-month internship. I am working on the open-source project WebLab (weblab-project.org).
> I am currently using poppler-0.16.7 (I tried to install poppler-0.17.4 but libpoppler.so.17 is missing).
> One of the purposes of my internship is to transform PDF files into XHTML files that will give the same structured display. In order to do this, I use pdftohtml -nodrm -p -s (to obtain HTML) and then a script and XSL (to obtain XHTML). I encountered several problems with pdftohtml that I would like to share in order to have your opinion.
> 
> 1) Would it be possible to have the width and height of the DIV tag in the BODY?
> I noticed that we have it with pdftohtml -xml (in the TEXT tags) but not with pdftohtml -nodrm -p -s. I tried to modify your code (HtmlOutputDev.cc) but I only managed to collect the width and height of the first word of the DIV.
> 
> 2) The HTML generated by pdftohtml is not validated by W3C (http://validator.w3.org/)
> This is a pity, because not much would need to change to obtain valid HTML 4 or XHTML. If you like, I can send you the XSL I made to transform the HTML generated by pdftohtml -p -s into valid HTML 4.
> 
> 3) With Arabic PDFs, pdftohtml seems to read the PDF correctly (from right to left) but writes the HTML backwards (from left to right). All words are reversed. Will this be corrected soon?
> Please find attached an example of this problem.
> 
> Regards,
> 
> Justine Guillaumont