[Poppler-bugs] [Bug 60243] Evince cannot copy text from PDF documents with Computer Modern fonts

Tue Oct 22 04:15:09 PDT 2013

https://bugs.freedesktop.org/show_bug.cgi?id=60243

--- Comment #10 from Carlos Garcia Campos <carlosgc at gnome.org> ---
(In reply to comment #9)
> Created attachment 83687
> Use ZapfDingbats names to locate glyphs only
> 
> (In reply to comment #8)
> > So the great question here is, how does Adobe do it right if they are
> > supposedly using the same mappings as we are?
> > 
> > Should we not use that mapping for Type 3 fonts like the one in this bug?
> 
> I've created a few test PDFs to see how acroread uses character names.  The
> short answer is they aren't using the same mappings.
> 
> I've found that acroread ignores most character names for text extraction. 
> This includes ZapfDingbats (a1-a206), but also many others.  In total,
> acroread only uses about 700 of the 4k names in NameToUnicodeTable.h.  For
> the names it ignores, it uses the character code for text.  As stated in
> comment #7, this bug occurs because poppler uses ZapfDingbats names to find
> the text mapping, while acroread doesn't.
> 
> But acroread *does* use character names when finding the glyph to display
> (in most cases - there seems to be some special treatment if the base font
> is ZapfDingbats).  I think it has less trouble finding the correct glyph
> because it brings along its own fonts.
> 
> For text extraction: if we try to emulate acroread too closely, some PDFs
> show regressions with text extraction.  Mostly documents with mathematical
> symbols and a couple which include names like f.alt, uniFB00, or g84. 
> Poppler parses these names, changes "f.alt" into "f" and looks it up through
> NameToUnicodeTable.h, and parses the other two as hex or decimal Unicode
> values.  As far as I can tell, acroread ignores these names and just uses
> the character code.
> 
> The ZapfDingbats names are problematic because they are so generic. It is
> unlikely that a PDF producer would choose "omega" as a name unless it really
> wants U+03C9 GREEK SMALL LETTER OMEGA, but I can see a producer generating a
> ZapfDingbats name like a102 and expecting a reader to use the character code
> like acroread, or Unicode value 102 like poppler.
> 
> From bug #13131, it looks like the ZapfDingbats mappings are useful for
> locating glyphs, but this bug shows they shouldn't be used for text
> extraction.  The attached patch moves the ZapfDingbats names in
> NameToUnicodeTable.h into a separate table and separates looking up Unicode
> values for text and for glyph IDs.

It makes sense to me. Albert, could you pass the tests with the patch?
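
For reference, the name handling described in the quoted comment can be sketched roughly as below. This is only an illustration in Python, not the attached patch and not poppler's actual code: the function name name_to_unicode, the for_text flag, and both tiny tables are hypothetical stand-ins, and the fallback order is an assumption based on the description above.

```python
import re

# Illustrative stand-ins only; the real NameToUnicodeTable.h has about 4k entries.
TEXT_NAMES = {"f": 0x0066, "omega": 0x03C9}
# ZapfDingbats names kept in a separate table, consulted for glyph lookup only
# (assumed layout; the single entry here is just an example).
ZAPF_DINGBATS_NAMES = {"a1": 0x2701}


def name_to_unicode(name, for_text):
    """Return a code point for a character name, or None if nothing matches."""
    base = name.split(".", 1)[0]  # drop a suffix: "f.alt" -> "f"
    if base in TEXT_NAMES:
        return TEXT_NAMES[base]
    # ZapfDingbats names are skipped when the caller wants text.
    if not for_text and base in ZAPF_DINGBATS_NAMES:
        return ZAPF_DINGBATS_NAMES[base]
    m = re.fullmatch(r"uni([0-9A-Fa-f]{4})", base)
    if m:  # "uniFB00" is read as a hex value
        return int(m.group(1), 16)
    m = re.fullmatch(r"[A-Za-z]([0-9]+)", base)
    if m:  # "g84" is read as a decimal value
        return int(m.group(1))
    return None
```

With these assumptions, name_to_unicode("a1", for_text=False) consults the ZapfDingbats table, while the same name with for_text=True never does and falls through to the numeric parsing.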
