[Poppler-bugs] [Bug 60243] Evince cannot copy text from PDF documents with Computer Modern fonts

Mon Aug 5 19:15:43 PDT 2013

https://bugs.freedesktop.org/show_bug.cgi?id=60243

--- Comment #9 from Jason Crain <jason at aquaticape.us> ---
Created attachment 83687
  --> https://bugs.freedesktop.org/attachment.cgi?id=83687&action=edit
Use ZapfDingbats names to locate glyphs only

(In reply to comment #8)
> So the great question here is, how does Adobe do it right if they are
> supposedly using the same mappings as we are?
> 
> Should we not use that mapping for Type 3 fonts like the one in this bug?

I've created a few test PDFs to see how acroread uses character names.  The
short answer is they aren't using the same mappings.

I've found that acroread ignores most character names for text extraction.
This includes the ZapfDingbats names (a1-a206), but also many others.  In total,
acroread only uses about 700 of the roughly 4,000 names in
NameToUnicodeTable.h.  When it ignores a name, it uses the character code as
the text.  As stated in comment #7, this bug occurs because poppler uses
ZapfDingbats names to find the text mapping, while acroread doesn't.

But acroread *does* use character names when finding the glyph to display (in
most cases - there seems to be some special treatment if the base font is
ZapfDingbats).  I think it has less trouble finding the correct glyph because
it brings along its own fonts.

For text extraction: if we try to emulate acroread too closely, some PDFs show
regressions with text extraction.  These are mostly documents with mathematical
symbols, plus a couple that include names like f.alt, uniFB00, or g84.  Poppler
parses these names: it changes "f.alt" into "f" and looks it up in
NameToUnicodeTable.h, and it parses the other two as hexadecimal or decimal
Unicode values.  As far as I can tell, acroread ignores these names and just
uses the character code.
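
To make the name parsing described above concrete, here is a minimal sketch.
It is not poppler's actual code: the function name nameToUnicode, the tiny
stand-in table, and the convention of returning 0 for "no mapping" are all
assumptions made for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for NameToUnicodeTable.h (two entries only).
static const std::map<std::string, uint32_t> kNameTable = {
    {"f", 0x0066},
    {"omega", 0x03C9},
};

// True if s is non-empty and every character satisfies pred.
static bool allChars(const std::string &s, int (*pred)(int))
{
    return !s.empty() && std::all_of(s.begin(), s.end(), [pred](char c) {
        return pred(static_cast<unsigned char>(c)) != 0;
    });
}

// Assumed helper: maps a glyph name to a Unicode value, or 0 if none is found.
uint32_t nameToUnicode(const std::string &name)
{
    // Drop a variant suffix: "f.alt" -> "f".
    const std::string base = name.substr(0, name.find('.'));

    // 1. Direct table lookup.
    auto it = kNameTable.find(base);
    if (it != kNameTable.end())
        return it->second;

    // 2. "uniXXXX": four hexadecimal digits, e.g. "uniFB00" -> U+FB00.
    if (base.size() == 7 && base.compare(0, 3, "uni") == 0 &&
        allChars(base.substr(3), std::isxdigit))
        return static_cast<uint32_t>(std::stoul(base.substr(3), nullptr, 16));

    // 3. One letter followed by decimal digits, e.g. "g84" -> 84.
    if (base.size() >= 2 && base.size() <= 7 &&
        std::isalpha(static_cast<unsigned char>(base[0])) &&
        allChars(base.substr(1), std::isdigit))
        return static_cast<uint32_t>(std::stoul(base.substr(1), nullptr, 10));

    return 0; // no mapping; a caller could fall back to the character code
}
```

Step 3 is why a generic name such as a102 is risky in this sketch: it yields
the value 102 even though the producer may have meant nothing by it.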

The ZapfDingbats names are problematic because they are so generic. It is
unlikely that a PDF producer would choose "omega" as a name unless it really
wants U+03C9 GREEK SMALL LETTER OMEGA, but I can see a producer generating a
ZapfDingbats name like a102 and expecting a reader either to use the character
code, as acroread does, or to use the Unicode value 102, as poppler does.

From bug #13131, it looks like the ZapfDingbats mappings are useful for
locating glyphs, but this bug shows they shouldn't be used for text extraction.
The attached patch moves the ZapfDingbats names in NameToUnicodeTable.h into a
separate table and separates looking up Unicode values for text and for glyph
IDs.
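
The idea of splitting the lookup can be sketched roughly as follows.  This is
only an illustration of the approach, not the contents of the attached patch:
the table names, the function names unicodeForText and unicodeForGlyph, and
the handful of entries are assumptions, and 0 is used to mean "no mapping".

```cpp
#include <cstdint>
#include <map>
#include <string>

typedef std::map<std::string, uint32_t> NameTable;

// Hypothetical excerpt of the general names (used for text and for glyphs).
static const NameTable kGeneralNames = {
    {"f", 0x0066},
    {"omega", 0x03C9},
};

// Hypothetical excerpt of the ZapfDingbats names (used for glyphs only).
static const NameTable kZapfDingbatsNames = {
    {"a1", 0x2701},
    {"a2", 0x2702},
};

// Returns the mapped value, or 0 if the name is not in the table.
static uint32_t lookup(const NameTable &table, const std::string &name)
{
    NameTable::const_iterator it = table.find(name);
    return it == table.end() ? 0 : it->second;
}

// Text extraction: ZapfDingbats names are deliberately not consulted.
uint32_t unicodeForText(const std::string &name)
{
    return lookup(kGeneralNames, name);
}

// Glyph location: try the general names first, then the ZapfDingbats names.
uint32_t unicodeForGlyph(const std::string &name)
{
    const uint32_t u = lookup(kGeneralNames, name);
    return u != 0 ? u : lookup(kZapfDingbatsNames, name);
}
```

With this split, a name like a1 still helps to find a glyph, but it no longer
influences the extracted text.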
