<html> <head> <base href="https://bugs.freedesktop.org/" /> </head> <body> <div> <a class="bz_bug_link bz_status_NEW " title="NEW --- - Evince cannot copy text from PDF documents with Computer Modern fonts" href="https://bugs.freedesktop.org/show_bug.cgi?id=60243#c9">Comment # 9</a> on <a class="bz_bug_link bz_status_NEW " title="NEW --- - Evince cannot copy text from PDF documents with Computer Modern fonts" href="https://bugs.freedesktop.org/show_bug.cgi?id=60243">bug 60243</a> from <a class="email" href="mailto:jason@aquaticape.us" title="Jason Crain <jason@aquaticape.us>"> Jason Crain</a> <pre>Created <a href="attachment.cgi?id=83687" name="attach_83687" title="Use ZapfDingbats names to locate glyphs only">attachment 83687</a> <a href="attachment.cgi?id=83687&action=edit" title="Use ZapfDingbats names to locate glyphs only">[details]</a> <a href='page.cgi?id=splinter.html&bug=60243&attachment=83687'>[review]</a>
Use ZapfDingbats names to locate glyphs only

(In reply to <a href="show_bug.cgi?id=60243#c8">comment #8</a>)
> So the great question here is, how does Adobe do it right if they are
> supossedly using the same mappings as we are?
>
> Should we not use that mapping for Type 3 fonts likes the one in this bug?

I've created a few test PDFs to see how acroread uses character names. The short answer is they aren't using the same mappings.

I've found that acroread ignores most character names for text extraction. This includes ZapfDingbats (a1-a206), but also many others. In total, acroread only uses about 700 of the 4k names in NameToUnicodeTable.h. In this case, it uses the character code for text.

As stated in <a href="show_bug.cgi?id=60243#c7">comment #7</a>, this bug is because poppler uses ZapfDingbats names to find the text mapping, while acroread doesn't. But acroread *does* use character names when finding the glyph to display (in most cases - there seems to be some special treatment if the base font is ZapfDingbats).
I think it has less trouble finding the correct glyph because it brings along its own fonts.

For text extraction: if we try to emulate acroread too closely, some PDFs show regressions. These are mostly documents with mathematical symbols, plus a couple that include names like f.alt, uniFB00, or g84. Poppler parses these names, changes "f.alt" into "f" and looks it up through NameToUnicodeTable.h, and parses the other two as hex or decimal Unicode values. As far as I can tell, acroread ignores these names and just uses the character code.

The ZapfDingbats names are problematic because they are so generic. It is unlikely that a PDF producer would choose "omega" as a name unless it really wants U+03C9 GREEK SMALL LETTER OMEGA, but I can see a producer generating a ZapfDingbats name like a102 and expecting a reader to use the character code like acroread, or Unicode value 102 like poppler.

From <a class="bz_bug_link bz_status_RESOLVED bz_closed" title="RESOLVED FIXED - problem with dingbats" href="show_bug.cgi?id=13131">bug #13131</a>, it looks like the ZapfDingbats mappings are useful for locating glyphs, but this bug shows they shouldn't be used for text extraction.

The attached patch moves the ZapfDingbats names in NameToUnicodeTable.h into a separate table and separates looking up Unicode values for text and for glyph IDs.</pre> </div> <hr> You are receiving this mail because: <ul> <li>You are the assignee for the bug.</li> </ul> </body> </html>