[Poppler-bugs] [Bug 39385] pdftohtml: add image and font extraction

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Sun Aug 28 20:06:04 PDT 2011


https://bugs.freedesktop.org/show_bug.cgi?id=39385

--- Comment #6 from suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp> 2011-08-28 20:06:03 PDT ---
Dear Josh,

I am very sorry that I have been (and still am) unable to help
with your great effort to improve pdftohtml.

I have just looked at the support code for the embedded font feature,
and I would like to ask one question and make a few comments.

1) Basically, the embedded font extraction feature seems to be
designed to save an embedded font stream as a separate file.
If the embedded font lacks tables that are essential for it to work
as a standalone font file (e.g. an embedded TrueType font may lack
the "cmap" table), is that out of scope? To be clear, I am not
requesting that such fonts be supported; I just want to ask.
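
A minimal sketch of how such a font could be detected before it is
written out, assuming the font stream is already in memory. The function
name missingSfntTables() is hypothetical and not part of Poppler, and the
caller decides which tables count as required. It only reads the sfnt
table directory (a 12-byte header followed by 16-byte records).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, not a Poppler function: returns the entries of
// `required` whose 4-character tags are absent from the sfnt table
// directory of `font`.
std::vector<std::string> missingSfntTables(const std::vector<uint8_t> &font,
                                           const std::vector<std::string> &required)
{
    std::vector<std::string> missing = required;
    // Header: 4-byte version, 2-byte big-endian numTables, 6 bytes of search hints.
    if (font.size() < 12) {
        return missing; // too short to contain a table directory
    }
    const std::size_t numTables = (std::size_t(font[4]) << 8) | font[5];
    for (std::size_t i = 0; i < numTables; ++i) {
        const std::size_t rec = 12 + 16 * i; // each table record is 16 bytes
        if (rec + 16 > font.size()) {
            break; // truncated directory
        }
        const std::string tag(reinterpret_cast<const char *>(font.data() + rec), 4);
        auto it = std::find(missing.begin(), missing.end(), tag);
        if (it != missing.end()) {
            missing.erase(it); // this required table is present
        }
    }
    return missing;
}
```

For example, calling it with {"cmap"} as the required list returns a
non-empty result for an embedded TrueType font whose "cmap" table was
dropped, which could then trigger a warning or a skip.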

2) I'm not very familiar with which font formats are supported by
HTML rendering systems, but I guess that some font formats are not
widely supported by web browsers. I'm not against the idea of
extracting all possible fonts from the PDF (extracting everything is
far more intuitive than selective extraction), but users would expect
a warning such as "font XXX is extracted but your HTML browser may
not be able to use it".

* Type3: A Type3 font embedded in a PDF can use any PDF graphics
operation. Thus, rendering a Type3 font requires what is effectively
another PDF rendering system. Considering that most web browsers don't
have a built-in PDF renderer, Type3 fonts cannot be used correctly.
In fact, the FreeType font rasterizer does not support PS or PDF Type3.

* CIDType0, CIDType2: As you may know, a CID-keyed font is designed
to be used with a CMap resource that translates character codes to CID
numbers (the glyph identifiers in a CID-keyed font), and the CMap may
or may not be embedded in the PDF document. Thus, it is fair to say that
a CID-keyed font is not self-contained. Although there was once a patch
for FreeType2 to combine a CID-keyed font and a CMap into a
self-contained face object, it has not been adopted yet (not refused,
but considered as "more TODO"). I'm afraid that most web browsers assume
a simple font loading mechanism like "give a font file pathname, plus an
index to specify the face in a TTC (or font suitcase) if required, then
get a self-contained face object", so they cannot support CIDType0 or
CIDType2.

* CFF: I think most FreeType-based applications don't distinguish
CFF from other PS Type1 fonts (PFA/PFB), but Microsoft Windows
supports only PFB; PFA and CFF are not supported (although OpenType
fonts containing CFF are supported!). I have not checked Mac OS X yet.
One of the problems is that Adobe Acrobat (on Microsoft Windows)
transforms PFB fonts into CFF fonts when it embeds them in a PDF,
oops.
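
The warning suggested in point 2 could be sketched roughly as follows.
The enum and both function names are hypothetical (Poppler's own
GfxFontType has different members), and the classification encodes only
the concerns raised in this comment, not a tested browser compatibility
matrix.

```cpp
#include <string>

// Hypothetical classification for illustration only.
enum class ExtractedFontKind { Type1, BareCFF, Type3, CIDType0, CIDType2, TrueType, OpenTypeCFF };

// True for the kinds this comment suspects browsers cannot load directly.
bool mayNotWorkInBrowser(ExtractedFontKind kind)
{
    switch (kind) {
    case ExtractedFontKind::Type3:    // needs a PDF graphics interpreter
    case ExtractedFontKind::CIDType0: // not usable without a CMap
    case ExtractedFontKind::CIDType2: // not usable without a CMap
    case ExtractedFontKind::BareCFF:  // CFF outside an OpenType wrapper
        return true;
    default:
        return false;
    }
}

// Returns the warning text, or an empty string when no warning applies.
std::string extractionWarning(const std::string &fontName, ExtractedFontKind kind)
{
    if (!mayNotWorkInBrowser(kind)) {
        return "";
    }
    return "font " + fontName + " is extracted but your HTML browser may not be able to use it";
}
```

The font is still extracted in every case; the caller only decides
whether to print the returned text.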

3) About GfxFont::getFileExtension(), I recommend some corrections.
In particular, most font formats designed for the PostScript language
have no defined standard suffix. I think...

* the extension for "fontType1" may be "pfa" or "pfb". Although I
don't have a good reference PDF generator that embeds PS Type1 fonts
as PS Type1, I recommend checking the header to determine the
appropriate suffix. However, I'm not sure whether any software
misbehaves when a PFB font is given the suffix "pfa". FreeType does not
care, and most systems that do care about PFA versus PFB are presumably
systems that support only PFB or only PFA.

* the extension for "fontType3" is not officially standardized by the spec
author.

* the extension for "CIDType0" is not officially standardized by the spec
author.

* the extension for "CIDType2" is not officially standardized by the spec
author.

* the extension ".otf" must be used for fonts that include a "CFF "
table, so it should not be used for "fontTrueTypeOT" or
"fontCIDType2OT", which use "glyf" instead of "CFF ".
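
The header checks mentioned above could look roughly like this. Both
function names are hypothetical, not existing Poppler code, and the font
data is assumed to be in memory. The sketch relies on a PFB file starting
with the segment marker byte 0x80 followed by segment type 1, a PFA file
starting with "%!", and an sfnt font listing its tables in a directory of
16-byte records after a 12-byte header.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: picks "pfb" or "pfa" from the first two bytes.
// Returns "" when neither signature matches, so the caller can fall back.
std::string type1Suffix(const std::vector<uint8_t> &data)
{
    if (data.size() >= 2 && data[0] == 0x80 && data[1] == 0x01) {
        return "pfb"; // binary segment header
    }
    if (data.size() >= 2 && data[0] == '%' && data[1] == '!') {
        return "pfa"; // plain PostScript text
    }
    return "";
}

// Hypothetical helper: "otf" if the table directory lists "CFF ",
// "ttf" if it lists "glyf", "" if it lists neither.
std::string sfntSuffix(const std::vector<uint8_t> &data)
{
    if (data.size() < 12) {
        return "";
    }
    const std::size_t numTables = (std::size_t(data[4]) << 8) | data[5];
    bool hasGlyf = false;
    for (std::size_t i = 0; i < numTables; ++i) {
        const std::size_t rec = 12 + 16 * i;
        if (rec + 16 > data.size()) {
            break; // truncated directory
        }
        if (std::memcmp(&data[rec], "CFF ", 4) == 0) {
            return "otf";
        }
        if (std::memcmp(&data[rec], "glyf", 4) == 0) {
            hasGlyf = true;
        }
    }
    return hasGlyf ? "ttf" : "";
}
```

An empty result leaves the choice to the caller, which matters for
Type3 and the CID-keyed formats, where no standard suffix exists.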
