<html>
<head>
<base href="https://bugs.freedesktop.org/" />
</head>
<body>
<p>
<div>
<b><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - Handling of small caps typographic variants"
href="https://bugs.freedesktop.org/show_bug.cgi?id=38456#c1">Comment # 1</a>
on <a class="bz_bug_link
bz_status_NEW "
title="NEW --- - Handling of small caps typographic variants"
href="https://bugs.freedesktop.org/show_bug.cgi?id=38456">bug 38456</a>
from <span class="vcard"><a class="email" href="mailto:jason@aquaticape.us" title="Jason Crain <jason@aquaticape.us>"> <span class="fn">Jason Crain</span></a>
</span></b>
<pre>Created <span class=""><a href="attachment.cgi?id=91907" name="attach_91907" title="Don't parse hex/decimal from character names">attachment 91907</a> <a href="attachment.cgi?id=91907&action=edit" title="Don't parse hex/decimal from character names">[details]</a></span> <a href='page.cgi?id=splinter.html&bug=38456&attachment=91907'>[review]</a>
Don't parse hex/decimal from character names
This document has Type 3 fonts with character names like /BD, /BC, /CD, etc.
Poppler interprets these names as hexadecimal Unicode values.
The document in <a class="bz_bug_link
bz_status_NEW "
title="NEW --- - Handling of small caps typographic variants"
href="show_bug.cgi?id=38456">bug #38456</a> is similar. It's using names like /c251, /c255,
/c262. Poppler interprets these numbers as decimal Unicode values.
Poppler and Xpdf are the only programs I've found that use the character name
this way. Others just use the charcode. This patch removes the decimal and
hex parsing and uses the charcode as fallback.
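To illustrate the difference, here is a rough Python sketch of the two
behaviours. It is a simplified illustration, not Poppler's actual code: the
function names and the exact name patterns matched are assumptions made for
this example.

```python
import string

HEX_DIGITS = set(string.hexdigits)
DEC_DIGITS = set(string.digits)


def guess_from_name(name):
    """Hypothetical old-style heuristic: read digits in a glyph name as Unicode."""
    # Two hex digits, e.g. "BD" is read as 0xBD.
    if len(name) == 2 and all(ch in HEX_DIGITS for ch in name):
        return int(name, 16)
    # One letter followed by decimal digits, e.g. "c251" is read as 251.
    if len(name) in (3, 4, 5) and name[0].isalpha():
        if all(ch in DEC_DIGITS for ch in name[1:]):
            return int(name[1:])
    return None


def charcode_fallback(name, charcode):
    """Hypothetical patched behaviour: ignore the name, use the charcode."""
    return charcode
```

With this sketch, guess_from_name("BD") returns 189 (0xBD) and
guess_from_name("c251") returns 251, while charcode_fallback returns the
charcode it is given, whatever the name says.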
The side effects are mostly spacing differences from pdftotext due to adding
charcode values that were previously left out. The only document I've found
that really breaks is the "Another pdf" attached to <a class="bz_bug_link
bz_status_NEW "
title="NEW --- - pdftotext reversed words"
href="show_bug.cgi?id=16032">bug #16032</a>, file name
"FAO_Nutri_goodnutrition in Crisis.pdf". It's using names /g84, /g104 and
expects those names to be used as decimal Unicode values. I don't know of a
way to get both sets of these files to work at the same time, but maybe that's
OK because the other programs I've tried can't extract text from this FAO
document either.</pre>
</div>
</p>
</body>
</html>