[poppler] poppler util pdftohtml

Fri Sep 23 03:38:30 PDT 2011

On 22 Sep 2011, at 22:51, Josh Richardson wrote:

> On 9/22/11 12:20 PM, "Jonathan Kew" <jfkthame at googlemail.com> wrote:
>> More generally, it is not possible to recreate useful XHTML (or similar)
>> documents from arbitrary PDF files with anything like 100% reliability,
>> because many PDF files do not contain adequate information to accurately
>> map the rendered glyphs back to correct Unicode text, or to reliably
>> reconstruct the proper flow of text. Constructs such as ActualText may
>> help, but are often lacking from real-world PDF documents.
> 
> W.r.t. rendering glyphs, we get around the problem of missing Unicode
> mappings by taking any glyph without a Unicode mapping and assigning it a
> code point in the Private Use Area of Unicode.  This produces the correct
> visual result in the XHTML, but not a full semantic representation.

In such cases the XHTML isn't really any more useful, for most purposes, than the original PDF. The content isn't usefully searchable, editable, or interoperable with pretty much anything...
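
To make the point concrete, here is a minimal sketch of the kind of Private Use Area assignment described above. It is not pdftohtml's actual code; the function name `assign_pua`, the integer glyph IDs, and the `known` table are all hypothetical, chosen only for illustration.

```python
import unicodedata

# Private Use Area of the Basic Multilingual Plane.
PUA_START, PUA_END = 0xE000, 0xF8FF

def assign_pua(glyph_ids, known):
    """Map glyph IDs to characters (hypothetical helper, illustration only).

    Glyphs found in `known` keep their real Unicode character; every other
    glyph gets the next unused Private Use Area code point.
    """
    mapping = {}
    next_cp = PUA_START
    for gid in glyph_ids:
        if gid in mapping:
            continue  # already handled this glyph
        if gid in known:
            mapping[gid] = known[gid]
        else:
            if next_cp > PUA_END:
                raise ValueError("ran out of Private Use Area code points")
            mapping[gid] = chr(next_cp)
            next_cp += 1
    return mapping

# Glyph 3 has a known mapping; glyphs 7 and 9 do not.
mapping = assign_pua([3, 7, 3, 9], {3: "a"})

# The assigned characters have general category "Co" (private use): they
# carry no agreed meaning outside this one document and its embedded font.
categories = {gid: unicodedata.category(ch) for gid, ch in mapping.items()}
```

A search for ordinary text will never match the private-use characters, which is why the output renders correctly yet is not usefully searchable.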

>  If someone's
> interested, they could get the semantics right too by pattern-matching the
> glyph against an appropriate Unicode font.

Not in general; there's far too much variation in glyph appearance, and far too many "visually confusable" characters.
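
As a small illustration of "visually confusable" characters (three well-known lookalikes picked for this example, not an exhaustive list), the capital letters below are typically drawn with identical or near-identical glyphs, yet they are distinct characters in three different scripts:

```python
import unicodedata

# Latin A, Greek Alpha, Cyrillic A: usually indistinguishable on the page.
lookalikes = ["\u0041", "\u0391", "\u0410"]

# Their Unicode names show they are three different characters.
names = [unicodedata.name(ch) for ch in lookalikes]

for ch, name in zip(lookalikes, names):
    print(f"U+{ord(ch):04X} {name}")
```

A matcher that sees only the glyph outline has no basis for choosing among them; the right answer depends on the language of the surrounding text, which is exactly the information that is missing.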

> 
> W.r.t. the flow of text, there have been other threads on this topic, but
> pdftohtml does make some attempt, and I believe it's possible to do this
> to a high degree of accuracy, maybe >99% -- that said, no one has done it
> yet, so either it's harder than I think, or no one has cared enough to
> really try (and I still fall into that camp.)

I suspect it's harder than you think. For example, given a single line that contains some English and some Hebrew, it is impossible to unambiguously reconstruct the order of the underlying text from the "visual order" that is all you may have to work with. If the PDF simply displays

    english WERBEH

where the uppercase letters represent Hebrew glyphs, should this be read as the Latin-script, left-to-right word "english" followed by the Hebrew-script, right-to-left word "HEBREW", or vice versa? You cannot tell without higher-level structural information, such as the base direction of the paragraph. That information has no visual representation and therefore may well not be present at all in the PDF data stream, which (depending on how it was generated) is likely to be a "flattened", strictly left-to-right representation of the final glyph layout, not of the logical flow of the text.
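
The ambiguity in the "english WERBEH" example can be shown with a toy layout model. This is a deliberate simplification, not the full Unicode Bidirectional Algorithm: it assumes one word per directional run, and the function `visual_order` and the "L"/"R" direction tags are names invented for this sketch.

```python
from typing import List, Tuple

# A run is (text in logical order, direction), where direction is "L" or "R".
Run = Tuple[str, str]

def visual_order(runs: List[Run], base: str) -> str:
    """Return the left-to-right sequence of glyphs as it appears on the page."""
    # Within a right-to-left run, the glyphs appear reversed on the page.
    shown = [text[::-1] if direction == "R" else text for text, direction in runs]
    # In a right-to-left paragraph, the runs themselves are placed right to left.
    if base == "R":
        shown.reverse()
    return " ".join(shown)

# Reading A: an LTR paragraph, "english" first, then the Hebrew word.
reading_a = visual_order([("english", "L"), ("HEBREW", "R")], base="L")

# Reading B: an RTL paragraph, the Hebrew word first, then "english".
reading_b = visual_order([("HEBREW", "R"), ("english", "L")], base="R")

print(reading_a)
print(reading_b)
```

Both readings produce the same visual line, so the mapping from logical text to glyph layout cannot be inverted from the layout alone.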

Once you start dealing with whole paragraphs, multiple columns, table cells, etc., things only get worse. You may get good results for a limited class of documents (e.g. unidirectional LTR text, fairly simple block layouts), but the general problem for arbitrary PDF documents is MUCH harder.

JK