[poppler] poppler util pdftohtml

suzuki toshiya mpsuzuki at hiroshima-u.ac.jp
Thu Sep 22 20:05:59 PDT 2011

Would it be acceptable to enable font extraction from a PDF only when the
embedded font includes an OS/2 table and its fsType permits permanent
installation on a remote system (fsType == 0x0000)?

To make this idea practical, it would be important to ask the developers of
PDF-generating software (cairo, ghostscript, etc.) to keep the OS/2 table
when embedding fonts. Even so, I think such a restriction would prevent the
trouble caused by conflicting interpretations of font permissions.
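
The check proposed above could look roughly like the following. This is
only a sketch, not poppler code: it assumes the embedded font program is an
sfnt container (TrueType or OpenType) held in memory as bytes, and the names
read_fs_type and extraction_allowed are hypothetical. It relies on the
OpenType layout, where fsType is the uint16 at byte offset 8 of the OS/2
table. A font without an OS/2 table is treated as "not permitted".

```python
import struct

# sfnt version tags: TrueType 1.0, OpenType/CFF, and Apple TrueType.
SFNT_VERSIONS = (b"\x00\x01\x00\x00", b"OTTO", b"true")


def read_fs_type(font_bytes):
    """Return the OS/2 fsType value, or None if it cannot be read."""
    if len(font_bytes) < 12 or font_bytes[:4] not in SFNT_VERSIONS:
        return None  # not an sfnt font (e.g. bare Type 1 or bare CFF)
    num_tables = struct.unpack_from(">H", font_bytes, 4)[0]
    for i in range(num_tables):
        record = 12 + 16 * i  # 12-byte header, 16-byte table records
        if record + 16 > len(font_bytes):
            return None
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sLLL", font_bytes, record)
        if tag == b"OS/2":
            if length < 10 or offset + 10 > len(font_bytes):
                return None  # table too short to contain fsType
            # fsType follows version, xAvgCharWidth, usWeightClass and
            # usWidthClass (2 bytes each), so it sits at offset 8.
            return struct.unpack_from(">H", font_bytes, offset + 8)[0]
    return None  # no OS/2 table present


def extraction_allowed(font_bytes):
    """Hypothetical policy: extract only when fsType == 0x0000."""
    return read_fs_type(font_bytes) == 0x0000
```

With this policy, fonts embedded without an OS/2 table are never extracted,
which is why the cooperation of the PDF generators matters.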


Josh Richardson wrote:
> The fonts that are embedded in a PDF may come from any source, and be
> completely restriction-free.  It's really up to the user of the software
> to decide.  Note that there are many many many other open source programs
> that extract fonts from PDFs.
> --josh
> On 9/22/11 6:04 PM, "Leonard Rosenthol" <lrosenth at adobe.com> wrote:
>> Boy, your lawyer needs to read up on IP law :).
>> Since you do NOT have a license for the font data contained in the PDF,
>> your software has NO RIGHTS to use that information for anything other
>> than rendering the glyphs in the PDF.  You certainly have NO rights to
>> convert the format - in fact, doing so is a clear and distinct violation
>> of the font licenses.
>> As such, if your patches to pdf2html extract the font data for use in the
>> HTML - I STRONGLY recommend that the code NOT be accepted into the master
>> repository.
>> Leonard
>> On 9/22/11 6:40 PM, "Josh Richardson" <jric at chegg.com> wrote:
>>> I'm not a lawyer, but I did check with one.  I don't think software can
>>> violate your IP/licenses, at least as long as that software doesn't
>>> contain unauthorized copyrighted material -- which pdftohtml does not
>>> AFAIK -- I certainly didn't add any to it.
>>> Best, --josh
>>> On 9/22/11 3:08 PM, "Leonard Rosenthol" <lrosenth at adobe.com> wrote:
>>>> I can't recall what you said about this in the past, but since I was
>>>> just dealing with it today.
>>>> What do you do about embedded fonts?
>>>> As my company (Adobe) sells/creates fonts, I want to make sure that
>>>> pdftohtml won't be violating our IP/licenses.
>>>> Thanks in advance,
>>>> Leonard
>>>> On 9/22/11 5:51 PM, "Josh Richardson" <jric at chegg.com> wrote:
>>>>> On 9/22/11 12:20 PM, "Jonathan Kew" <jfkthame at googlemail.com> wrote:
>>>>>> More generally, it is not possible to recreate useful XHTML (or similar)
>>>>>> documents from arbitrary PDF files with anything like 100% reliability,
>>>>>> because many PDF files do not contain adequate information to accurately
>>>>>> map the rendered glyphs back to correct Unicode text, or to reliably
>>>>>> reconstruct the proper flow of text. Constructs such as ActualText may
>>>>>> help, but are often lacking from real-world PDF documents.
>>>>> W.r.t. rendering glyphs, we get around the problem of missing Unicode
>>>>> mappings by taking any glyph without a Unicode mapping and assigning it an
>>>>> offset in the private space of Unicode.  This produces the correct visual
>>>>> result in the XHTML, but not a full semantic representation.  If someone's
>>>>> interested, they could get the semantics right too by pattern-matching the
>>>>> glyph against an appropriate Unicode font.
>>>>> W.r.t. the flow of text, there have been other threads on this topic, but
>>>>> pdftohtml does make some attempt, and I believe it's possible to do this
>>>>> to a high degree of accuracy, maybe >99% -- that said, no one has done it
>>>>> yet, so either it's harder than I think, or no one has cared enough to
>>>>> really try (and I still fall into that camp.)
>>>>> Best, --josh
>>>>> _______________________________________________
>>>>> poppler mailing list
>>>>> poppler at lists.freedesktop.org
>>>>> http://lists.freedesktop.org/mailman/listinfo/poppler
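
The private-space fallback Josh describes in the quoted message can be
sketched as follows. This is an illustration only, not the pdftohtml patch:
the function name assign_code_points, the dict-based to_unicode map, and the
choice to start at U+E000 (the first code point of the BMP Private Use Area,
U+E000 to U+F8FF) are all assumptions, since the thread does not say how the
offsets are actually chosen.

```python
PUA_START = 0xE000  # first code point of the BMP Private Use Area
PUA_END = 0xF8FF    # last code point of the BMP Private Use Area


def assign_code_points(glyph_ids, to_unicode):
    """Map each glyph id to a code point.

    Glyphs present in to_unicode keep their real Unicode value; every
    other glyph gets the next unused Private Use Area code point, so it
    still renders correctly but carries no semantic meaning.
    """
    mapping = {}
    next_pua = PUA_START
    for gid in glyph_ids:
        if gid in mapping:
            continue  # already handled this glyph
        code_point = to_unicode.get(gid)
        if code_point is None:
            if next_pua > PUA_END:
                raise ValueError("ran out of Private Use Area code points")
            code_point = next_pua
            next_pua += 1
        mapping[gid] = code_point
    return mapping


# Glyph 1 has a known mapping ('A'); glyphs 2 and 3 do not.
example = assign_code_points([1, 2, 3], {1: 0x41})
```

The visual result is preserved because the extracted font would map the same
private code points to the same glyphs; copy and search on such text would
not work, which matches the "not a full semantic representation" caveat.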
