<html> <head> <base href="https://bugs.freedesktop.org/" /> </head> <body> <div> <a class="bz_bug_link bz_status_NEW " title="NEW --- - Handling of small caps typographic variants" href="https://bugs.freedesktop.org/show_bug.cgi?id=38456#c1">Comment # 1</a> on <a class="bz_bug_link bz_status_NEW " title="NEW --- - Handling of small caps typographic variants" href="https://bugs.freedesktop.org/show_bug.cgi?id=38456">bug 38456</a> from <a class="email" href="mailto:jason@aquaticape.us" title="Jason Crain <jason@aquaticape.us>"> Jason Crain</a> <pre>Created <a href="attachment.cgi?id=91907" name="attach_91907" title="Don't parse hex/decimal from character names">attachment 91907</a> <a href="attachment.cgi?id=91907&action=edit" title="Don't parse hex/decimal from character names">[details]</a> <a href='page.cgi?id=splinter.html&bug=38456&attachment=91907'>[review]</a>
Don't parse hex/decimal from character names

This document has Type 3 fonts with character names like /BD, /BC, /CD, etc. Poppler is interpreting these names as hexadecimal Unicode values.

The document in <a class="bz_bug_link bz_status_NEW " title="NEW --- - Handling of small caps typographic variants" href="show_bug.cgi?id=38456">bug #38456</a> is similar. It's using names like /c251, /c255, /c262. Poppler is using the numbers in these names as the Unicode values.

Poppler and Xpdf are the only programs I've found that use the character name this way. Others just use the charcode.

This patch removes the decimal and hex parsing and uses the charcode as a fallback. The side effects are mostly spacing differences in pdftotext output, because charcode values that were previously left out are now included.

The only document I've found that really breaks is the "Another pdf" attached to <a class="bz_bug_link bz_status_NEW " title="NEW --- - pdftotext reversed words" href="show_bug.cgi?id=16032">bug #16032</a>, file name "FAO_Nutri_goodnutrition in Crisis.pdf". It's using names /g84, /g104 and expects those names to be used as decimal Unicode values.
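
To make the difference concrete, here is a rough sketch in Python rather than Poppler's C++. The function names and the exact matching rules are illustrative assumptions, not the actual Poppler code; it only models the two behaviours described above for a name that isn't otherwise recognized.

```python
import re


def legacy_guess(name):
    """Hypothetical old-style heuristic: read digits in a glyph name as a code point."""
    # Exactly two hex digits, e.g. "BD", read as hexadecimal.
    if re.fullmatch(r"[0-9A-Fa-f]{2}", name):
        return int(name, 16)
    # One letter followed by decimal digits, e.g. "c251" or "g84", read as decimal.
    match = re.fullmatch(r"[A-Za-z]([0-9]{2,4})", name)
    if match:
        return int(match.group(1))
    # Anything else is left unmapped by this sketch.
    return None


def charcode_fallback(name, charcode):
    """Hypothetical patched behaviour: ignore the unrecognized name, use the charcode."""
    return charcode
```

Under these assumed rules, legacy_guess("BD") gives U+00BD and legacy_guess("g84") gives U+0054 ("T") no matter what charcode the glyph has, while charcode_fallback returns whatever charcode the glyph was drawn with.
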
I don't know of a way to get both sets of these files to work at the same time, but maybe that's OK because the other programs I've tried can't extract text from this FAO document either.</pre> </div> </body> </html>