<html> <head> <base href="https://bugs.freedesktop.org/" /> </head> <body> <div> <a class="bz_bug_link bz_status_NEW " title="NEW --- - Evince cannot copy text from PDF documents with Computer Modern fonts" href="https://bugs.freedesktop.org/show_bug.cgi?id=60243#c10">Comment # 10</a> on <a class="bz_bug_link bz_status_NEW " title="NEW --- - Evince cannot copy text from PDF documents with Computer Modern fonts" href="https://bugs.freedesktop.org/show_bug.cgi?id=60243">bug 60243</a> from <a class="email" href="mailto:carlosgc@gnome.org" title="Carlos Garcia Campos <carlosgc@gnome.org>"> Carlos Garcia Campos</a> <pre>(In reply to <a href="show_bug.cgi?id=60243#c9">comment #9</a>)
> Created <a href="attachment.cgi?id=83687" name="attach_83687" title="Use ZapfDingbats names to locate glyphs only">attachment 83687</a> <a href="attachment.cgi?id=83687&action=edit" title="Use ZapfDingbats names to locate glyphs only">[details]</a> <a href='page.cgi?id=splinter.html&bug=60243&attachment=83687'>[review]</a>
> Use ZapfDingbats names to locate glyphs only
>
> (In reply to <a href="show_bug.cgi?id=60243#c8">comment #8</a>)
> > So the great question here is, how does Adobe do it right if they are
> > supposedly using the same mappings as we are?
> >
> > Should we not use that mapping for Type 3 fonts like the one in this bug?
>
> I've created a few test PDFs to see how acroread uses character names. The
> short answer is they aren't using the same mappings.
>
> I've found that acroread ignores most character names for text extraction.
> This includes ZapfDingbats (a1-a206), but also many others. In total,
> acroread only uses about 700 of the 4k names in NameToUnicodeTable.h. For
> the names it ignores, it uses the character code for the text. As stated in
> <a href="show_bug.cgi?id=60243#c7">comment #7</a>, this bug occurs because poppler uses ZapfDingbats names to
> find the text mapping, while acroread doesn't.
>
> But acroread *does* use character names when finding the glyph to display
> (in most cases - there seems to be some special treatment if the base font
> is ZapfDingbats). I think it has less trouble finding the correct glyph
> because it brings along its own fonts.
>
> For text extraction: if we try to emulate acroread too closely, some PDFs
> show regressions, mostly documents with mathematical symbols and a couple
> that include names like f.alt, uniFB00, or g84. Poppler parses these names,
> changes "f.alt" into "f" and looks it up through NameToUnicodeTable.h, and
> parses the other two as hex or decimal Unicode values. As far as I can
> tell, acroread ignores these names and just uses the character code.
>
> The ZapfDingbats names are problematic because they are so generic. It is
> unlikely that a PDF producer would choose "omega" as a name unless it really
> wants U+03C9 GREEK SMALL LETTER OMEGA, but I can see a producer generating a
> ZapfDingbats name like a102 and expecting a reader to use the character code
> like acroread, or Unicode value 102 like poppler.
>
> From <a class="bz_bug_link bz_status_RESOLVED bz_closed" title="RESOLVED FIXED - problem with dingbats" href="show_bug.cgi?id=13131">bug #13131</a>, it looks like the ZapfDingbats mappings are useful for
> locating glyphs, but this bug shows they shouldn't be used for text
> extraction. The attached patch moves the ZapfDingbats names in
> NameToUnicodeTable.h into a separate table and separates looking up Unicode
> values for text and for glyph IDs.

It makes sense to me. Albert, could you run the tests with the patch?</pre> </div> </body> </html>