[poppler] poppler util pdftohtml

Leonard Rosenthol lrosenth at adobe.com
Fri Sep 23 03:57:00 PDT 2011


For simple extraction BY THE FONT LICENSEE, yes.

And that is the crux of the matter.

The person running the PDF->HTML conversion software, in the vast majority
of cases, is NOT the person who created the PDF and therefore does NOT
have the rights to do ANYTHING with that font data (other than having it
viewed).

Leonard

On 9/22/11 11:05 PM, "suzuki toshiya" <mpsuzuki at hiroshima-u.ac.jp> wrote:

>Would it be acceptable to enable font extraction from a PDF when the
>embedded font includes an OS/2 table and its fsType permits permanent
>installation on a remote system (fsType == 0x0000)?
>
>To make the idea practical, it would be important to ask the developers
>of PDF-generating software (cairo, ghostscript, etc.) to embed fonts with
>their OS/2 table. Still, I think such a restriction would prevent the
>trouble caused by conflicting understandings of font permissions.
>
>Regards,
>mpsuzuki
>
>Josh Richardson wrote:
>> The fonts that are embedded in a PDF may come from any source, and be
>> completely restriction-free.  It's really up to the user of the software
>> to decide.  Note that there are many many many other open source programs
>> that extract fonts from PDFs.
>> 
>> --josh
>> 
>> On 9/22/11 6:04 PM, "Leonard Rosenthol" <lrosenth at adobe.com> wrote:
>> 
>>> Boy, your lawyer needs to read up on IP law :).
>>>
>>> Since you do NOT have a license for the font data contained in the PDF,
>>> your software has NO RIGHTS to use that information for anything other
>>> than rendering the glyphs in the PDF.  You certainly have NO rights to
>>> convert the format - in fact, doing so is a clear and distinct violation
>>> of the font licenses.
>>>
>>> As such, if your patches to pdftohtml extract the font data for use in
>>> the HTML - I STRONGLY recommend that the code NOT be accepted into the
>>> master repository.
>>>
>>> Leonard
>>>
>>>
>>> On 9/22/11 6:40 PM, "Josh Richardson" <jric at chegg.com> wrote:
>>>
>>>> I'm not a lawyer, but I did check with one.  I don't think software can
>>>> violate your IP/licenses, at least as long as that software doesn't
>>>> contain unauthorized copyrighted material -- which pdftohtml does not
>>>> AFAIK -- I certainly didn't add any to it.
>>>>
>>>> Best, --josh
>>>>
>>>> On 9/22/11 3:08 PM, "Leonard Rosenthol" <lrosenth at adobe.com> wrote:
>>>>
>>>>> I can't recall what you said about this in the past, but since I was
>>>>> just dealing with it today, I'll ask.
>>>>>
>>>>> What do you do about embedded fonts?
>>>>>
>>>>> As my company (Adobe) sells/creates fonts, I want to make sure that
>>>>> pdftohtml won't be violating our IP/licenses.
>>>>>
>>>>> Thanks in advance,
>>>>> Leonard
>>>>>
>>>>> On 9/22/11 5:51 PM, "Josh Richardson" <jric at chegg.com> wrote:
>>>>>
>>>>>> On 9/22/11 12:20 PM, "Jonathan Kew" <jfkthame at googlemail.com> wrote:
>>>>>>> More generally, it is not possible to recreate useful XHTML (or
>>>>>>> similar) documents from arbitrary PDF files with anything like 100%
>>>>>>> reliability, because many PDF files do not contain adequate
>>>>>>> information to accurately map the rendered glyphs back to correct
>>>>>>> Unicode text, or to reliably reconstruct the proper flow of text.
>>>>>>> Constructs such as ActualText may help, but are often lacking from
>>>>>>> real-world PDF documents.
>>>>>> W.r.t. rendering glyphs, we get around the problem of missing Unicode
>>>>>> mappings by taking any glyph without a Unicode mapping and assigning
>>>>>> it an offset in the Private Use Area of Unicode.  This produces the
>>>>>> correct visual result in the XHTML, but not a full semantic
>>>>>> representation.  If someone's interested, they could get the
>>>>>> semantics right too by pattern-matching the glyph against an
>>>>>> appropriate Unicode font.
>>>>>>
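The Private Use Area fallback described above might look roughly like this. It is a sketch under assumptions, not pdftohtml's actual implementation: the function name `assign_code_points`, the use of integer glyph IDs, and the choice to start at U+E000 are all illustrative.

```python
from typing import Dict, Iterable

# Basic Multilingual Plane Private Use Area: U+E000..U+F8FF.
PUA_START = 0xE000
PUA_END = 0xF8FF


def assign_code_points(glyph_ids: Iterable[int],
                       to_unicode: Dict[int, str]) -> Dict[int, str]:
    """Map each glyph to its known Unicode text, or to a fresh PUA code point.

    `to_unicode` holds the mappings recovered from the PDF (hypothetical input).
    Glyphs missing from it receive consecutive Private Use Area characters.
    """
    used = set(to_unicode.values())  # avoid colliding with mapped PUA values
    mapping: Dict[int, str] = {}
    next_pua = PUA_START
    for gid in glyph_ids:
        if gid in mapping:
            continue  # already assigned on an earlier occurrence
        if gid in to_unicode:
            mapping[gid] = to_unicode[gid]
            continue
        while next_pua <= PUA_END and chr(next_pua) in used:
            next_pua += 1
        if next_pua > PUA_END:
            raise ValueError("Private Use Area exhausted")
        mapping[gid] = chr(next_pua)
        next_pua += 1
    return mapping
```

The output renders correctly only alongside a font that maps those PUA code points to the original glyphs, which is why this gives the right visual result but no searchable or semantic text.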
>>>>>> W.r.t. the flow of text, there have been other threads on this topic,
>>>>>> but pdftohtml does make some attempt, and I believe it's possible to
>>>>>> do this to a high degree of accuracy, maybe >99% -- that said, no one
>>>>>> has done it yet, so either it's harder than I think, or no one has
>>>>>> cared enough to really try (and I still fall into that camp.)
>>>>>>
>>>>>> Best, --josh
>>>>>>
>>>>>> _______________________________________________
>>>>>> poppler mailing list
>>>>>> poppler at lists.freedesktop.org
>>>>>> http://lists.freedesktop.org/mailman/listinfo/poppler


