[poppler] Could not parse charref for nameToUnicode errors

Fri Dec 21 18:12:59 PST 2007

On Wed, 2007-12-19 at 14:11 +0000, Jonathan Kew wrote:
> On 19 Dec 2007, at 12:06 pm, Adrian Johnson wrote:
> > http://annarchy.freedesktop.org/~ajohnson/test.pdf
> >
> > The numbers "1", "2", and "3", are mapped to the text "test", "text",
> > and "the". The "Z" has the glyph name "g1" so it should be ignored  
> > when extracting text.
> >
> > I have found a bug in the code. With the test file I get
> >
> >  $ pdftotext test.pdf -
> >  Error: Could not parse charref for nameToUnicode: g1
> >  This is = test of text extr=?tion using the glyph n=mes
> >
> > The output should be:
> >  This is a test of text extraction using the glyph names
> >
> > It looks like the glyph names "u00061" and "u0063" are not decoded
> > correctly.
> 
> To be more specific, it looks as though the names are being  
> interpreted as decimal rather than hexadecimal.

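The problem is that the uXXXX names are being eaten by the legacy block:

    // Not in Adobe Glyph Mapping convention: look for names of the form 'Axx',
    // 'xx', 'Ann', 'ABnn', or 'nn', where 'A' and 'B' are any letters, 'xx' is
    // two hex digits, and 'nn' is 2-4 decimal digits

A name such as u0063 fits the 'Ann' pattern (a letter plus 2-4 decimal
digits), so it is read as decimal 63 ('?') rather than hex 0x63 ('c').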

The solution is to move that block to after the code for dealing with
uXXXX names, which are known to be Unicode-style hex names.
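
As a rough illustration of why the order of the checks matters, here is a
minimal standalone sketch. It is not the code from the patch or from poppler:
the function names (parseUnicodeName, parseLegacyName, mapGlyphName) and the
legacyFirst flag are hypothetical, the uXXXX rule is simplified to "u plus
4-6 hex digits", and only the 'Ann' form of the legacy rule is modelled.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helpers for illustration only; each returns the code point,
// or 0 when the name does not match the rule.

// Simplified uXXXX rule: 'u' followed by 4 to 6 hex digits.
static unsigned parseUnicodeName(const std::string &name) {
    if (name.size() < 5 || name.size() > 7 || name[0] != 'u')
        return 0;
    for (std::size_t i = 1; i < name.size(); ++i)
        if (!std::isxdigit(static_cast<unsigned char>(name[i])))
            return 0;
    return static_cast<unsigned>(std::strtoul(name.c_str() + 1, nullptr, 16));
}

// Simplified legacy rule: only the 'Ann' form, i.e. one letter followed by
// 2 to 4 decimal digits.
static unsigned parseLegacyName(const std::string &name) {
    if (name.size() < 3 || name.size() > 5 ||
        !std::isalpha(static_cast<unsigned char>(name[0])))
        return 0;
    for (std::size_t i = 1; i < name.size(); ++i)
        if (!std::isdigit(static_cast<unsigned char>(name[i])))
            return 0;
    return static_cast<unsigned>(std::strtoul(name.c_str() + 1, nullptr, 10));
}

// legacyFirst = true models the buggy order, false the proposed order.
unsigned mapGlyphName(const std::string &name, bool legacyFirst) {
    if (legacyFirst) {
        if (unsigned c = parseLegacyName(name))
            return c;
        return parseUnicodeName(name);
    }
    if (unsigned c = parseUnicodeName(name))
        return c;
    return parseLegacyName(name);
}
```

With the legacy rule first, mapGlyphName("u0063", true) yields 63 ('?');
with the uXXXX rule first, mapGlyphName("u0063", false) yields 0x63 ('c').
A name like "g1" matches neither rule in this sketch and maps to 0.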

Patch attached; it also reduces the error output.

Ed
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mapping_tables_r2.patch
Type: text/x-patch
Size: 5235 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/poppler/attachments/20071222/547105d6/attachment.bin