[Poppler-bugs] [Bug 39385] pdftohtml: add image and font extraction

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Tue Aug 30 11:48:29 PDT 2011


https://bugs.freedesktop.org/show_bug.cgi?id=39385

--- Comment #7 from Joshua Richardson <joshuarbox-junk1 at yahoo.com> 2011-08-30 11:48:29 PDT ---
(In reply to comment #6)
> 1) Basically, the embedded font extraction feature seems to be
> designed to save an embedded font stream as a separate file.
> If the embedded font lacks the essential tables that are required
> to stand as a self-standing font file (e.g. embedded TrueType
> may lack the "cmap" table), is it out of scope? Sorry, I am not
> requesting that such fonts should be supported;
> I just want to ask.
No, this is not out of scope for me.  If I have not already, I will soon submit
new patches which export other font data, which can be used to create a
free-standing font.

> 2) I'm not so familiar with the coverage of the font formats supported
> by HTML rendering systems, but, I guess, some font formats are
> not so widely supported by most web browsers. I'm not against the idea
> to extract all possible fonts from PDF (extract all is far more intuitive than
> selective extraction), but some warning would be expected for users
> to indicate that "font XXX is extracted but your HTML browser may
> not be able to use it".
I like the idea, but I think it's the user's responsibility to understand the
options they use.  I don't want to punish users who understand what they're
doing with an annoying warning message.  Once the fonts are extracted, they
must be converted to formats that the target browsers support before those
browsers can use them.  Thank you for reminding me to document this.  I will
put it into the utils/README.pdftohtml document in a future patch.

> * Type3: In Type3 embedded in PDF, any PDF graphic operations
> can be used. Thus, to render Type3 in PDF, yet another PDF rendering
> system is required. Considering that most web browsers don't have
> their builtin PDF renderer, Type3 won't be able to be used correctly.
> In fact, FreeType font rasterizer does not support PS or PDF Type3.
Yes, that is currently a limitation.  Luckily, Type 3 fonts are exceedingly
rare in the domain I care about.  The best solution I can think of is to render
any text drawn with Type 3 fonts as an image instead of extracting it as text,
and then use "alt" text in the HTML for that image.
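
As a rough illustration of that fallback (a sketch only, not code from the
patches; the function name and the image file name are made up for the
example), the HTML side could be generated like this in Python:

```python
from html import escape

def type3_run_as_img(image_path, text):
    """Return an <img> tag for a run of Type 3 text rendered to an image.

    image_path: where the rendered bitmap was written (hypothetical name).
    text: the Unicode text recovered for the run, used as the alt text.
    """
    # Escape both attributes so quotes or angle brackets in the extracted
    # text cannot break out of the attribute value.
    src = escape(image_path, quote=True)
    alt = escape(text, quote=True)
    return f'<img src="{src}" alt="{alt}">'
```

For example, type3_run_as_img("p1-t3-0.png", "Hello") yields an image tag whose
"alt" attribute carries the recovered text.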

> * CIDType0, CIDType2: Maybe you know that a CID-keyed font is designed
> to be used with a CMap resource to translate the character code to a CID
> number (the glyph identifier in a CID-keyed font), and the CMap may or may
> not be embedded in the PDF document. Thus, it is possible to say a CID-keyed
> font is not self-standing. Although there was once a patch for
> FreeType2 to combine a CID-keyed font & CMap and make a self-standing
> face object long ago, it is not adopted yet (not refused but considered as
> "more TODO"). I'm afraid that most web browsers assume a simple
> font loading mechanism like "giving a font file pathname, and an index
> to specify the face in TTC (or font suitcase) if required, then get a self-
> standing face object", so they cannot support CIDType0 or CIDType2.
For my application, we are converting everything to Unicode (default pdftohtml
encoding).  In later patches we have modified pdftohtml to ensure that
everything has a valid and unique Unicode mapping.  Then, we use a FontForge
script to ensure that the font contains that mapping when converting it to
browser-compatible formats.
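
To make "valid and unique" concrete, here is a small Python sketch of one way
such a mapping could be built.  It is an illustration under my own assumptions,
not the logic in the patches: the function name is hypothetical, and moving
duplicate or missing entries into the Private Use Area (U+E000 to U+F8FF) is
just one possible policy.

```python
PUA_START = 0xE000  # first code point of the BMP Private Use Area
PUA_END = 0xF8FF    # last code point of the BMP Private Use Area

def is_valid_code_point(cp):
    # Reject None, zero, out-of-range values, and UTF-16 surrogates.
    if cp is None or cp <= 0 or cp > 0x10FFFF:
        return False
    return not (0xD800 <= cp <= 0xDFFF)

def make_unique_mapping(glyph_to_cp):
    """Map each glyph id to a valid code point that no other glyph uses."""
    result = {}
    taken = set()
    pending = []
    # First pass: the lowest glyph id claiming a valid code point keeps it.
    for gid in sorted(glyph_to_cp):
        cp = glyph_to_cp[gid]
        if is_valid_code_point(cp) and cp not in taken:
            result[gid] = cp
            taken.add(cp)
        else:
            pending.append(gid)
    # Second pass: every remaining glyph gets the next free PUA slot.
    next_pua = PUA_START
    for gid in pending:
        while next_pua in taken:
            next_pua += 1
        if next_pua > PUA_END:
            raise ValueError("Private Use Area exhausted")
        result[gid] = next_pua
        taken.add(next_pua)
    return result
```

With the input {1: 0x41, 2: 0x41, 3: None}, glyph 1 keeps U+0041 and glyphs 2
and 3 are moved to U+E000 and U+E001.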

> * CFF: I think most FreeType based applications don't distinguish
> CFF from other PS Type1 fonts (PFA/PFB), but Microsoft Windows
> supports only PFB; PFA and CFF are not supported (although OpenType
> including CFF is supported!). I have not yet checked Mac OS X.
> One problem is that Adobe Acrobat (on Microsoft Windows)
> transforms PFB fonts to CFF fonts when it embeds PFB fonts in
> PDF, oops.
These patches assume that all fonts will be converted to an appropriate type
for the target browsers: WOFF for good browsers, EOT for IE, etc.
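
A minimal sketch of the stylesheet side, assuming the converted files sit next
to the HTML and share a base name (the function name and paths are
hypothetical, and pdftohtml does not emit this itself).  The "?#iefix" suffix
is the commonly used workaround for older IE versions that mis-parse a
multi-source "src" list:

```python
def font_face_css(family, basename):
    """Build an @font-face rule offering EOT (for IE) and WOFF sources."""
    return (
        '@font-face {\n'
        f'  font-family: "{family}";\n'
        # Older IE reads only this first declaration and loads the EOT file.
        f'  src: url("{basename}.eot");\n'
        # Other browsers pick the first format they support from this list.
        f'  src: url("{basename}.eot?#iefix") format("embedded-opentype"),\n'
        f'       url("{basename}.woff") format("woff");\n'
        '}\n'
    )
```

Calling font_face_css("F1", "fonts/f1") returns one rule that can be written
into the generated page's style sheet.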

> 3) About GfxFont::getFileExtension(), some correction is recommended.
> Especially, most font formats designed for PostScript language lack the
> definition of the standard suffixes. I think...
> 
> * the extension for "fontType1" may be "pfa" or "pfb". Although I
> don't have a good reference PDF generator that embeds PS Type1 as
> PS Type1, checking the header is recommended to determine the appropriate
> suffix. However, I'm not sure whether any software goes wrong when PFB
> fonts are given with the suffix PFA. FreeType does not care, and most
> systems that care about PFA or PFB are supposed to be systems supporting
> PFB only or PFA only.
Thanks, I'll keep this in mind if we run into trouble.
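
If it does become necessary, the header check could be as small as the
following Python sketch.  It assumes the usual conventions (a PFB file starts
with a segment header byte 0x80 followed by a segment type, and a PFA file is
PostScript text starting with "%!"); the function name is hypothetical and this
is not part of GfxFont::getFileExtension():

```python
def type1_extension(data):
    """Guess "pfb" or "pfa" from the first bytes of a Type 1 font stream.

    Returns None when the data matches neither convention.
    """
    # PFB wraps the font in segments; each segment header is 0x80 followed
    # by a type byte (1 = ASCII text, 2 = binary data, 3 = end of file).
    if len(data) >= 2 and data[0] == 0x80 and data[1] in (1, 2):
        return "pfb"
    # PFA is plain PostScript text, e.g. "%!PS-AdobeFont-1.0: ...".
    if data[:2] == b"%!":
        return "pfa"
    return None
```

A caller would read the first few bytes of the extracted stream and fall back
to its current default suffix when the result is None.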

> * the extension for "fontType3" is not officially standardized by the spec
> author.
>
> * the extension for "CIDType0" is not officially standardized by the spec
> author.
> 
> * the extension for "CIDType2" is not officially standardized by the spec
> author.
Good to know.  Hopefully these also don't create problems.

> * the extension ".otf" must be used for a font including a "CFF " table,
> so it should not be used for "fontTrueTypeOT" or "fontCIDType2OT", which
> use "glyf" instead of "CFF ".
Can you give me a reference here?  I thought that ".otf" stands for "OpenType
Font" and that both "fontTrueTypeOT" and "fontCIDType2OT" are variants of
OpenType.
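
Whichever suffix turns out to be right, telling the two cases apart is cheap,
because the sfnt table directory lists the table tags.  The Python sketch below
(the function name is hypothetical; it only reads the 12-byte header and the
16-byte table records, and does not validate the font) reports whether a font
carries "CFF " or "glyf" outlines:

```python
import struct

def sfnt_outline_kind(data):
    """Return "cff", "glyf", or "unknown" from an sfnt table directory."""
    if len(data) < 12:
        return "unknown"
    # Header: 4-byte sfnt version, then numTables as a big-endian uint16.
    num_tables = struct.unpack(">H", data[4:6])[0]
    tags = set()
    for i in range(num_tables):
        # Each table record is 16 bytes: tag, checksum, offset, length.
        start = 12 + 16 * i
        if start + 16 > len(data):
            break  # truncated directory; use what was read so far
        tags.add(data[start:start + 4])
    if b"CFF " in tags:
        return "cff"
    if b"glyf" in tags:
        return "glyf"
    return "unknown"
```

The result could then drive the choice of suffix once the naming question is
settled.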

Thanks for your help!!!

-- 
Configure bugmail: https://bugs.freedesktop.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

