[poppler] poppler util pdftohtml

Thu Sep 22 14:51:59 PDT 2011

On 9/22/11 12:20 PM, "Jonathan Kew" <jfkthame at googlemail.com> wrote:
>More generally, it is not possible to recreate useful XHTML (or similar)
>documents from arbitrary PDF files with anything like 100% reliability,
>because many PDF files do not contain adequate information to accurately
>map the rendered glyphs back to correct Unicode text, or to reliably
>reconstruct the proper flow of text. Constructs such as ActualText may
>help, but are often lacking from real-world PDF documents.

W.r.t. rendering glyphs, we get around the problem of missing Unicode
mappings by taking any glyph without a Unicode mapping and assigning it a
code point in the Private Use Area of Unicode.  This produces the correct visual
result in the XHTML, but not a full semantic representation.  If someone's
interested, they could get the semantics right too by pattern-matching the
glyph against an appropriate Unicode font.
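
To make the idea concrete, here is a minimal sketch in Python of the
private-use assignment.  It is not the actual pdftohtml code: the
GlyphMapper class, its to_text method, and the (font, glyph id) key are
hypothetical names chosen for illustration, and I'm assuming the BMP
Private Use Area (U+E000 to U+F8FF) is the range used.

```python
PUA_START = 0xE000  # first code point of the BMP Private Use Area
PUA_END = 0xF8FF    # last code point of the BMP Private Use Area


class GlyphMapper:
    """Hypothetical helper: gives each unmapped glyph a stable PUA code point."""

    def __init__(self):
        self._assigned = {}      # (font name, glyph id) -> PUA code point
        self._next = PUA_START   # next unused PUA code point

    def to_text(self, font, glyph_id, unicode_value=None):
        # Glyphs that already have a Unicode mapping pass through unchanged.
        if unicode_value is not None:
            return chr(unicode_value)
        key = (font, glyph_id)
        if key not in self._assigned:
            if self._next > PUA_END:
                raise OverflowError("BMP Private Use Area exhausted")
            self._assigned[key] = self._next
            self._next += 1
        # The same glyph always gets the same PUA character back.
        return chr(self._assigned[key])
```

For the visual result to come out right, the font used by the XHTML
output would also have to map each assigned code point to the original
glyph shape; that part is assumed here and not shown.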

W.r.t. the flow of text, there have been other threads on this topic, but
pdftohtml does make some attempt, and I believe it's possible to do this
to a high degree of accuracy, maybe >99% -- that said, no one has done it
yet, so either it's harder than I think, or no one has cared enough to
really try (and I still fall into that camp).

Best, --josh