[poppler] Incompatible number of glyphs from glib get_text{, layout}

Tue May 26 09:43:51 PDT 2015

I learned a bit more about PDFs today :)

I believe I've found the offending TJ:

/C0_0 1 Tf
15.9927 0 9.0157 13.2093 304.8821 331.25 Tm
[<00170016001000000037>55<000e>74<00000033>9<0057>4<00410052>-24<0054005a00560049004c004c004500000032>-4<0044>20<000e>]TJ
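
To make the glyph count concrete, here is a minimal Python sketch of my
own (not anything poppler does) that splits the hex strings in that TJ
array into 2-byte codes, assuming Identity-H means every code is two
bytes, big-endian. The names `TJ_ARRAY` and `decode_tj_codes` are made
up for illustration; the kerning numbers between the strings are
ignored.

```python
import re

# The TJ operand copied from the content stream above.
TJ_ARRAY = (
    "[<00170016001000000037>55<000e>74<00000033>9<0057>4<00410052>-24"
    "<0054005a00560049004c004c004500000032>-4<0044>20<000e>]"
)


def decode_tj_codes(tj_array):
    """Return the 2-byte codes found in a TJ array, skipping kerning numbers."""
    codes = []
    for hex_string in re.findall(r"<([0-9A-Fa-f]+)>", tj_array):
        raw = bytes.fromhex(hex_string)
        # Assumption: Identity-H, so each code is two bytes, big-endian.
        for i in range(0, len(raw), 2):
            codes.append(int.from_bytes(raw[i:i + 2], "big"))
    return codes


codes = decode_tj_codes(TJ_ARRAY)
print(len(codes), "codes,", codes.count(0), "of them zero")
```

On this array that gives 22 codes, three of which are 0x0000.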

Font:

...

/Font <<
/C0_0 18 0 R

...

%% Original object ID: 123 0
18 0 obj
<<
  /BaseFont /CDGGAZ+Myriad-Roman
  /DescendantFonts 66 0 R
  /Encoding /Identity-H
  /Subtype /Type0
  /Type /Font
>>
endobj

Notably, it's missing a /ToUnicode entry, which all of the other fonts
have. I also extracted the embedded font program, which has
`/Subtype /CIDFontType0C`, using pdftosrc. Unfortunately, file(1) does
not recognize the format and I'm struggling to find anything able to
read it. Hints appreciated.

So, is there a poppler bug here? It seems that the glib API emits the
Identity-H encoded character codes (including nulls) through
poppler_page_get_text, and the embedded nulls cut the C string short,
so its length no longer matches the number of glyphs. Should the API
instead drop those characters for which there isn't a Unicode mapping?
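
To illustrate what I think is going on, a small Python sketch. It
assumes (I have not confirmed this in the poppler source) that the raw
codes are passed through as code points, U+0000 included, and then read
back as a NUL-terminated UTF-8 string. `c_string_view` is a made-up
helper that mimics what strlen() would see; the codes are the first
five from the TJ above.

```python
def c_string_view(utf8_bytes):
    """Return the bytes a NUL-terminated C reader would see."""
    # Everything from the first 0x00 byte onwards is invisible to strlen().
    return utf8_bytes.split(b"\x00", 1)[0]


# Assumption: unmapped codes are emitted unchanged as code points.
glyph_text = "".join(chr(c) for c in [0x17, 0x16, 0x10, 0x00, 0x37])
encoded = glyph_text.encode("utf-8")
visible = c_string_view(encoded)

print(len(glyph_text), "characters emitted,",
      len(visible.decode("utf-8")), "visible before the NUL")
```

If that assumption holds, any caller using the C string sees fewer
characters than there are layout rectangles, which would explain the
mismatch.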

Thanks in advance.

On 26 May 2015 at 12:56, Peter Waller <peter at scraperwiki.com> wrote:
> I forgot to note that I transformed unprintable characters to "X" in
> my dumped representation.