[poppler] poppler util pdftohtml

Thu Sep 22 15:08:22 PDT 2011

I can't recall what you said about this in the past, but I was just
dealing with it today, so I'll ask.

What do you do about embedded fonts?

As my company (Adobe) sells/creates fonts, I want to make sure that
pdftohtml won't be violating our IP/licenses.

Thanks in advance,
Leonard

On 9/22/11 5:51 PM, "Josh Richardson" <jric at chegg.com> wrote:

>On 9/22/11 12:20 PM, "Jonathan Kew" <jfkthame at googlemail.com> wrote:
>>More generally, it is not possible to recreate useful XHTML (or similar)
>>documents from arbitrary PDF files with anything like 100% reliability,
>>because many PDF files do not contain adequate information to accurately
>>map the rendered glyphs back to correct Unicode text, or to reliably
>>reconstruct the proper flow of text. Constructs such as ActualText may
>>help, but are often lacking from real-world PDF documents.
>
>W.r.t. rendering glyphs, we get around the problem of missing Unicode
>mappings by taking any glyph without a Unicode mapping and assigning it a
>code point in the Private Use Area of Unicode.  This produces the correct
>visual result in the XHTML, but not a full semantic representation.  If
>someone's interested, they could get the semantics right too by
>pattern-matching the glyph against an appropriate Unicode font.
>
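
[A minimal sketch of the Private Use Area fallback described above. It is
not pdftohtml's actual code: the class name, method name, and the
(font name, glyph id) key are assumptions made for illustration. The only
fixed fact is the Basic Multilingual Plane private-use range,
U+E000 to U+F8FF.]

```python
# Hypothetical illustration only; not taken from the poppler sources.
PUA_START = 0xE000  # first code point of the BMP Private Use Area
PUA_END = 0xF8FF    # last code point of the BMP Private Use Area


class PuaAllocator:
    """Hands out stable private-use code points for unmapped glyphs."""

    def __init__(self):
        # (font_name, glyph_id) -> assigned private-use code point
        self._assigned = {}

    def codepoint_for(self, font_name, glyph_id, unicode_value=None):
        """Return the real mapping if known, else a private-use code point."""
        if unicode_value is not None:
            return unicode_value
        key = (font_name, glyph_id)
        if key not in self._assigned:
            candidate = PUA_START + len(self._assigned)
            if candidate > PUA_END:
                raise ValueError("BMP Private Use Area exhausted")
            self._assigned[key] = candidate
        return self._assigned[key]
```

[The same glyph always gets the same code point, so the output renders
correctly as long as the accompanying font maps that code point to the
glyph. The text is still not searchable, which is the semantic gap noted
above.]
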
>W.r.t. the flow of text, there have been other threads on this topic, but
>pdftohtml does make some attempt, and I believe it's possible to do this
>to a high degree of accuracy, maybe >99% -- that said, no one has done it
>yet, so either it's harder than I think, or no one has cared enough to
>really try (and I still fall into that camp).
>
>Best, --josh
>
>_______________________________________________
>poppler mailing list
>poppler at lists.freedesktop.org
>http://lists.freedesktop.org/mailman/listinfo/poppler