[poppler] poppler util pdftohtml

Thu Sep 22 20:23:48 PDT 2011

At the end of the day it's just a tool.  What if the font carries more
restrictive flags, but the user has a license for the font?  Then they
cannot use the tool?  It might be impractical to get a new version of the
font from the originator that has the bits you're looking for; asking
for one would probably just create more confusion.  See how Font
Squirrel handles this:
http://www.fontsquirrel.com/fontface/generator
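
For reference, the fsType check proposed in the quoted message below
could look roughly like the following.  This is only a sketch, not
poppler code: the function names are made up for illustration, and it
assumes the embedded font program is an sfnt (TrueType/OpenType)
container.  Fonts embedded as bare Type 1 or CFF data have no OS/2
table at all, so under the proposed rule they would simply be refused.

```python
import struct
from typing import Optional


def read_fs_type(font: bytes) -> Optional[int]:
    """Return OS/2.fsType from an sfnt font, or None if it is unavailable."""
    if len(font) < 12:
        return None
    # sfnt header: 4-byte version, then numTables as a big-endian uint16.
    num_tables = struct.unpack_from(">H", font, 4)[0]
    for i in range(num_tables):
        rec = 12 + 16 * i  # each table record is 16 bytes
        if rec + 16 > len(font):
            return None
        tag, _checksum, offset, length = struct.unpack_from(">4sIII", font, rec)
        if tag == b"OS/2":
            # fsType is the uint16 at byte offset 8 of the OS/2 table.
            if length < 10 or offset + 10 > len(font):
                return None
            return struct.unpack_from(">H", font, offset + 8)[0]
    return None


def extraction_permitted(font: bytes) -> bool:
    """Hypothetical policy: allow only fsType == 0x0000 (installable)."""
    return read_fs_type(font) == 0x0000
```

Note that a missing OS/2 table is treated the same as a restrictive
flag here, which is exactly the case I'm worried about above.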

--josh

On 9/22/11 8:05 PM, "suzuki toshiya" <mpsuzuki at hiroshima-u.ac.jp> wrote:

>Would it be acceptable to enable font extraction from PDF only when the
>embedded font includes an OS/2 table and its fsType permits permanent
>installation on a remote system (fsType == 0x0000)?
>
>Making this practical would require asking the developers of software
>that generates PDF (cairo, Ghostscript, etc.) to embed fonts together
>with their OS/2 table, but I think such a restriction would prevent the
>trouble caused by conflicting understandings of font permissions.
>
>Regards,
>mpsuzuki
>
>Josh Richardson wrote:
>> The fonts that are embedded in a PDF may come from any source, and be
>> completely restriction-free.  It's really up to the user of the software
>> to decide.  Note that there are many many many other open source
>> programs that extract fonts from PDFs.
>> 
>> --josh
>> 
>> On 9/22/11 6:04 PM, "Leonard Rosenthol" <lrosenth at adobe.com> wrote:
>> 
>>> Boy, your lawyer needs to read up on IP law :).
>>>
>>> Since you do NOT have a license for the font data contained in the PDF,
>>> your software has NO RIGHTS to use that information for anything other
>>> than rendering the glyphs in the PDF.  You certainly have NO rights to
>>> convert the format - in fact, doing so is a clear and distinct
>>> violation of the font licenses.
>>>
>>> As such, if your patches to pdftohtml extract the font data for use in
>>> the HTML - I STRONGLY recommend that the code NOT be accepted into the
>>> master repository.
>>>
>>> Leonard
>>>
>>>
>>> On 9/22/11 6:40 PM, "Josh Richardson" <jric at chegg.com> wrote:
>>>
>>>> I'm not a lawyer, but I did check with one.  I don't think software
>>>> can violate your IP/licenses, at least as long as that software
>>>> doesn't contain unauthorized copyrighted material -- which pdftohtml
>>>> does not AFAIK -- I certainly didn't add any to it.
>>>>
>>>> Best, --josh
>>>>
>>>> On 9/22/11 3:08 PM, "Leonard Rosenthol" <lrosenth at adobe.com> wrote:
>>>>
>>>>> I can't recall what you said about this in the past, but since I was
>>>>> just dealing with it today.
>>>>>
>>>>> What do you do about embedded fonts?
>>>>>
>>>>> As my company (Adobe) sells/creates fonts, I want to make sure that
>>>>> pdftohtml won't be violating our IP/licenses.
>>>>>
>>>>> Thanks in advance,
>>>>> Leonard
>>>>>
>>>>> On 9/22/11 5:51 PM, "Josh Richardson" <jric at chegg.com> wrote:
>>>>>
>>>>>> On 9/22/11 12:20 PM, "Jonathan Kew" <jfkthame at googlemail.com> wrote:
>>>>>>> More generally, it is not possible to recreate useful XHTML (or
>>>>>>> similar) documents from arbitrary PDF files with anything like
>>>>>>> 100% reliability, because many PDF files do not contain adequate
>>>>>>> information to accurately map the rendered glyphs back to correct
>>>>>>> Unicode text, or to reliably reconstruct the proper flow of text.
>>>>>>> Constructs such as ActualText may help, but are often lacking from
>>>>>>> real-world PDF documents.
>>>>>> W.r.t. rendering glyphs, we get around the problem of missing
>>>>>> Unicode mappings by taking any glyph without a Unicode mapping and
>>>>>> assigning it an offset in the Private Use Area of Unicode.  This
>>>>>> produces the correct visual result in the XHTML, but not a full
>>>>>> semantic representation.  If someone's interested, they could get
>>>>>> the semantics right too by pattern-matching the glyph against an
>>>>>> appropriate Unicode font.
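
The private-use assignment described above can be sketched as follows.
This is an illustration only, not the actual pdftohtml patch: the
function name and the glyph-id-to-code-point dictionary are assumptions
made for the example.  It hands out code points from the Basic
Multilingual Plane Private Use Area (U+E000 to U+F8FF) and skips any
that the font already uses.

```python
from typing import Dict, Optional

PUA_START, PUA_END = 0xE000, 0xF8FF  # BMP Private Use Area


def assign_private_use(glyphs: Dict[int, Optional[int]]) -> Dict[int, int]:
    """Give every glyph a code point; unmapped glyphs get PUA code points.

    `glyphs` maps a glyph id to its Unicode code point, or to None when
    the PDF provides no mapping for that glyph.
    """
    used = {cp for cp in glyphs.values() if cp is not None}
    result: Dict[int, int] = {}
    next_cp = PUA_START
    for gid in sorted(glyphs):  # sorted so the assignment is deterministic
        cp = glyphs[gid]
        if cp is None:
            # Skip code points already taken by a mapped or assigned glyph.
            while next_cp in used:
                next_cp += 1
            if next_cp > PUA_END:
                raise ValueError("Private Use Area exhausted")
            cp = next_cp
            used.add(cp)
        result[gid] = cp
    return result
```

The output text then renders correctly with the extracted font, but the
private-use code points carry no meaning for search or copy/paste.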
>>>>>>
>>>>>> W.r.t. the flow of text, there have been other threads on this
>>>>>> topic, but pdftohtml does make some attempt, and I believe it's
>>>>>> possible to do this to a high degree of accuracy, maybe >99% --
>>>>>> that said, no one has done it yet, so either it's harder than I
>>>>>> think, or no one has cared enough to really try (and I still fall
>>>>>> into that camp).
>>>>>>
>>>>>> Best, --josh
>>>>>>
>>>>>> _______________________________________________
>>>>>> poppler mailing list
>>>>>> poppler at lists.freedesktop.org
>>>>>> http://lists.freedesktop.org/mailman/listinfo/poppler