[poppler] poppler util pdftohtml

Leonard Rosenthol lrosenth at adobe.com
Fri Sep 23 03:57:43 PDT 2011

They may, you are right.

If you wanted to maintain a list of known "restriction free" fonts and
only extract those - that would probably be OK.
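
As a rough illustration of that allowlist idea, here is a minimal Python sketch. It is not pdftohtml code: the names RESTRICTION_FREE_FAMILIES, base_font_name and may_extract are hypothetical, and the listed families are placeholders rather than a vetted list. It assumes the PDF convention that subset fonts carry a six-uppercase-letter tag followed by "+", and that style variants follow the family name after a hyphen. A font name alone proves nothing about licensing, so a real check would need more than this.

```python
import re

# Hypothetical allowlist of font families; placeholder entries only,
# not a legally vetted list.
RESTRICTION_FREE_FAMILIES = {"DejaVuSans", "LiberationSerif"}

# Subset fonts in PDF are named like "ABCDEF+FontName".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def base_font_name(pdf_font_name):
    """Strip the six-letter subset tag, if present."""
    return _SUBSET_TAG.sub("", pdf_font_name)


def may_extract(pdf_font_name, allowlist=RESTRICTION_FREE_FAMILIES):
    """Return True only if the font's family name is on the allowlist."""
    # Assumption: style variants look like "Family-Bold".
    family = base_font_name(pdf_font_name).split("-", 1)[0]
    return family in allowlist
```

An extractor built this way would skip every font by default and only write out the ones whose family is explicitly listed.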


On 9/22/11 9:17 PM, "Josh Richardson" <jric at chegg.com> wrote:

>The fonts that are embedded in a PDF may come from any source, and be
>completely restriction-free.  It's really up to the user of the software
>to decide.  Note that there are many many many other open source programs
>that extract fonts from PDFs.
>On 9/22/11 6:04 PM, "Leonard Rosenthol" <lrosenth at adobe.com> wrote:
>>Boy, your lawyer needs to read up on IP law :).
>>Since you do NOT have a license for the font data contained in the PDF,
>>your software has NO RIGHTS to use that information for anything other
>>than rendering the glyphs in the PDF.  You certainly have NO rights to
>>convert the format - in fact, doing so is a clear and distinct violation
>>of the font licenses.
>>As such, if your patches to pdf2html extract the font data for use in the
>>HTML - I STRONGLY recommend that the code NOT be accepted into the master
>>On 9/22/11 6:40 PM, "Josh Richardson" <jric at chegg.com> wrote:
>>>I'm not a lawyer, but I did check with one.  I don't think software can
>>>violate your IP/licenses, at least as long as that software doesn't
>>>contain unauthorized copyrighted material -- which pdftohtml does not
>>>AFAIK -- I certainly didn't add any to it.
>>>Best, --josh
>>>On 9/22/11 3:08 PM, "Leonard Rosenthol" <lrosenth at adobe.com> wrote:
>>>>I can't recall what you said about this in the past, but since I was
>>>>dealing with it today.
>>>>What do you do about embedded fonts?
>>>>As my company (Adobe) sells/creates fonts, I want to make sure that
>>>>pdftohtml won't be violating our IP/licenses.
>>>>Thanks in advance,
>>>>On 9/22/11 5:51 PM, "Josh Richardson" <jric at chegg.com> wrote:
>>>>>On 9/22/11 12:20 PM, "Jonathan Kew" <jfkthame at googlemail.com> wrote:
>>>>>>More generally, it is not possible to recreate useful XHTML (or
>>>>>>documents from arbitrary PDF files with anything like 100%
>>>>>>because many PDF files do not contain adequate information to
>>>>>>map the rendered glyphs back to correct Unicode text, or to reliably
>>>>>>reconstruct the proper flow of text. Constructs such as ActualText
>>>>>>help, but are often lacking from real-world PDF documents.
>>>>>W.r.t. rendering glyphs, we get around the problem of missing unicode
>>>>>mappings by taking any glyph without a unicode mapping and assigning
>>>>>offset in the private space of Unicode.  This produces the correct
>>>>>result in the XHTML, but not a full semantic representation.  If
>>>>>interested, they could get the semantics right too by pattern-matching
>>>>>glyph against an appropriate Unicode font.
>>>>>W.r.t. the flow of text, there have been other threads on this topic,
>>>>>pdftohtml does make some attempt, and I believe it's possible to do
>>>>>to a high degree of accuracy, maybe >99% -- that said, no one has done
>>>>>yet, so either it's harder than I think, or no-one has cared enough to
>>>>>really try (and I still fall into that camp.)
>>>>>Best, --josh
>>>>>poppler mailing list
>>>>>poppler at lists.freedesktop.org
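
The private-use fallback described in the quoted thread (giving a glyph with no Unicode mapping a slot in the private space of Unicode) can be sketched roughly as follows. This is an assumption-laden Python illustration, not the actual pdftohtml implementation: the GlyphMapper class, its method name and the choice to key on (font id, glyph id) are all hypothetical. It uses the Basic Multilingual Plane Private Use Area, U+E000 to U+F8FF.

```python
# Basic Multilingual Plane Private Use Area bounds.
PUA_START = 0xE000
PUA_END = 0xF8FF


class GlyphMapper:
    """Assign stable private-use code points to unmapped glyphs (hypothetical)."""

    def __init__(self):
        self._assigned = {}     # (font_id, glyph_id) -> code point
        self._next = PUA_START  # next free private-use code point

    def codepoint_for(self, font_id, glyph_id, unicode_cp=None):
        # A glyph that already has a Unicode mapping keeps it.
        if unicode_cp is not None:
            return unicode_cp
        key = (font_id, glyph_id)
        if key not in self._assigned:
            if self._next > PUA_END:
                raise OverflowError("BMP Private Use Area exhausted")
            self._assigned[key] = self._next
            self._next += 1
        # The same glyph always gets the same code point.
        return self._assigned[key]
```

Text emitted with these code points displays correctly only alongside a font that maps the same code points to the original glyphs, which is why the result renders properly without being a full semantic representation.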