[poppler] poppler util pdftohtml

Josh Richardson jric at chegg.com
Thu Sep 22 18:17:16 PDT 2011

The fonts that are embedded in a PDF may come from any source, and be
completely restriction-free.  It's really up to the user of the software
to decide.  Note that there are many, many other open source programs
that extract fonts from PDFs.


On 9/22/11 6:04 PM, "Leonard Rosenthol" <lrosenth at adobe.com> wrote:

>Boy, your lawyer needs to read up on IP law :).
>Since you do NOT have a license for the font data contained in the PDF,
>your software has NO RIGHTS to use that information for anything other
>than rendering the glyphs in the PDF.  You certainly have NO rights to
>convert the format - in fact, doing so is a clear and distinct violation
>of the font licenses.
>As such, if your patches to pdftohtml extract the font data for use in the
>HTML - I STRONGLY recommend that the code NOT be accepted into the master
>On 9/22/11 6:40 PM, "Josh Richardson" <jric at chegg.com> wrote:
>>I'm not a lawyer, but I did check with one.  I don't think software can
>>violate your IP/licenses, at least as long as that software doesn't
>>contain unauthorized copyrighted material -- which pdftohtml does not
>>AFAIK -- I certainly didn't add any to it.
>>Best, --josh
>>On 9/22/11 3:08 PM, "Leonard Rosenthol" <lrosenth at adobe.com> wrote:
>>>I can't recall what you said about this in the past, but since I was
>>>dealing with it today.
>>>What do you do about embedded fonts?
>>>As my company (Adobe) sells/creates fonts, I want to make sure that
>>>pdftohtml won't be violating our IP/licenses.
>>>Thanks in advance,
>>>On 9/22/11 5:51 PM, "Josh Richardson" <jric at chegg.com> wrote:
>>>>On 9/22/11 12:20 PM, "Jonathan Kew" <jfkthame at googlemail.com> wrote:
>>>>>More generally, it is not possible to recreate useful XHTML (or
>>>>>documents from arbitrary PDF files with anything like 100%
>>>>>because many PDF files do not contain adequate information to
>>>>>map the rendered glyphs back to correct Unicode text, or to reliably
>>>>>reconstruct the proper flow of text. Constructs such as ActualText may
>>>>>help, but are often lacking from real-world PDF documents.
>>>>W.r.t. rendering glyphs, we get around the problem of missing Unicode
>>>>mappings by taking any glyph without a Unicode mapping and assigning it
>>>>an offset in the Private Use Area of Unicode.  This produces the correct
>>>>result in the XHTML, but not a full semantic representation.  If someone is
>>>>interested, they could get the semantics right too by pattern-matching
>>>>each glyph against an appropriate Unicode font.
>>>>W.r.t. the flow of text, there have been other threads on this topic;
>>>>pdftohtml does make some attempt, and I believe it's possible to do it
>>>>to a high degree of accuracy, maybe >99% -- that said, no one has done it
>>>>yet, so either it's harder than I think, or no one has cared enough to
>>>>really try (and I still fall into that camp.)
>>>>Best, --josh
>>>>poppler mailing list
>>>>poppler at lists.freedesktop.org
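
The private-use workaround described in the quoted message can be sketched roughly as follows. This is only an illustration in Python, not the pdftohtml code: the function name, the glyph ids, and the dictionary of known mappings are all hypothetical, and it assumes the Basic Multilingual Plane private-use range U+E000..U+F8FF.

```python
# Hypothetical sketch: give every glyph a code point, using the BMP
# Private Use Area for glyphs that have no known Unicode mapping.
PUA_START = 0xE000
PUA_END = 0xF8FF


def assign_code_points(glyph_ids, known):
    """Map each glyph id to a code point.

    known: dict of glyph id -> Unicode code point, for glyphs whose
    mapping is already available. Any other glyph gets the next
    private-use code point that is not already taken.
    """
    used = set(known.values())
    result = {}
    next_pua = PUA_START
    for gid in glyph_ids:
        if gid in known:
            result[gid] = known[gid]
            continue
        # Skip private-use code points the font already uses.
        while next_pua in used:
            next_pua += 1
        if next_pua > PUA_END:
            raise ValueError("BMP Private Use Area exhausted")
        result[gid] = next_pua
        used.add(next_pua)
        next_pua += 1
    return result


mapping = assign_code_points([1, 2, 3, 4], {1: 0x41, 3: 0xE000})
```

Text written with such private-use code points only displays correctly when it is rendered with the same font, which is why the result looks right but carries no semantic meaning.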
