[poppler] Could not parse charref for nameToUnicode errors
Ed Catmur
ed at catmur.co.uk
Fri Dec 21 18:12:59 PST 2007
On Wed, 2007-12-19 at 14:11 +0000, Jonathan Kew wrote:
> On 19 Dec 2007, at 12:06 pm, Adrian Johnson wrote:
> > http://annarchy.freedesktop.org/~ajohnson/test.pdf
> >
> > The numbers "1", "2", and "3", are mapped to the text "test", "text",
> > and "the". The "Z" has the glyph name "g1" so it should be ignored
> > when extracting text.
> >
> > I have found a bug in the code. With the test file I get
> >
> > $ pdftotext test.pdf -
> > Error: Could not parse charref for nameToUnicode: g1
> > This is = test of text extr=?tion using the glyph n=mes
> >
> > The output should be:
> > This is a test of text extraction using the glyph names
> >
> > It looks like the glyph names "u00061" and "u0063" are not decoded
> > correctly.
>
> To be more specific, it looks as though the names are being
> interpreted as decimal rather than hexadecimal.
The problem is that the uXXXX names are being eaten by the legacy block:

// Not in Adobe Glyph Mapping convention: look for names of the form 'Axx',
// 'xx', 'Ann', 'ABnn', or 'nn', where 'A' and 'B' are any letters, 'xx' is
// two hex digits, and 'nn' is 2-4 decimal digits

The solution is to move that block to after the code for dealing with
uXXXX names, which are known to be Unicode-style hex names.
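
To illustrate why the ordering matters, here is a minimal sketch. It is not the poppler code and not the attached patch: all function names are made up for this example, only the 'Ann' legacy form is modelled, and the uXXXX check assumes 4 to 6 hex digits after the 'u'. A name such as "u0061" fits the legacy 'Ann' pattern (one letter plus decimal digits), so if that check runs first the name is read as decimal 61, which is '=', instead of hex 0x61, which is 'a'.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: accept "u" followed by 4 to 6 hex digits and
// return the value as a Unicode code point.
bool parseUnicodeHexName(const std::string &name, unsigned long &code) {
    if (name.size() < 5 || name.size() > 7 || name[0] != 'u')
        return false;
    for (std::size_t i = 1; i < name.size(); ++i)
        if (!std::isxdigit(static_cast<unsigned char>(name[i])))
            return false;
    code = std::strtoul(name.c_str() + 1, nullptr, 16);
    return true;
}

// Hypothetical helper: the legacy 'Ann' form only, i.e. one letter
// followed by 2-4 decimal digits, read as a decimal character code.
bool parseLegacyDecimalName(const std::string &name, unsigned long &code) {
    if (name.size() < 3 || name.size() > 5)
        return false;
    if (!std::isalpha(static_cast<unsigned char>(name[0])))
        return false;
    for (std::size_t i = 1; i < name.size(); ++i)
        if (!std::isdigit(static_cast<unsigned char>(name[i])))
            return false;
    code = std::strtoul(name.c_str() + 1, nullptr, 10);
    return true;
}

// Problematic order: the legacy check sees "u0061" first.
// Returns 0 when the name matches neither form.
unsigned long lookupLegacyFirst(const std::string &name) {
    unsigned long code = 0;
    if (parseLegacyDecimalName(name, code)) return code;
    if (parseUnicodeHexName(name, code)) return code;
    return 0;
}

// Proposed order: try the uXXXX form before the legacy fallback.
// Returns 0 when the name matches neither form.
unsigned long lookupUnicodeFirst(const std::string &name) {
    unsigned long code = 0;
    if (parseUnicodeHexName(name, code)) return code;
    if (parseLegacyDecimalName(name, code)) return code;
    return 0;
}
```

With this sketch, lookupLegacyFirst("u0061") gives 61 ('=') and lookupLegacyFirst("u0063") gives 63 ('?'), matching the garbled "extr=?tion" in the output above, while lookupUnicodeFirst gives 0x61 ('a') and 0x63 ('c'). Legacy names such as "A65" still resolve to 65 in either order because they are too short to match the uXXXX form.
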
Patch attached; it also reduces the error output.

Ed
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mapping_tables_r2.patch
Type: text/x-patch
Size: 5235 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20071222/547105d6/attachment.bin