[poppler] Incompatible number of glyphs from glib get_text{, layout}
Peter Waller
peter at scraperwiki.com
Tue May 26 09:43:51 PDT 2015
I learned a bit more about PDFs today :)
I believe I've found the offending TJ:
/C0_0 1 Tf
15.9927 0 9.0157 13.2093 304.8821 331.25 Tm
[<00170016001000000037>55<000e>74<00000033>9<0057>4<00410052>-24<0054005a00560049004c004c004500000032>-4<0044>20<000e>]TJ
Font:
...
/Font <<
/C0_0 18 0 R
...
%% Original object ID: 123 0
18 0 obj
<<
/BaseFont /CDGGAZ+Myriad-Roman
/DescendantFonts 66 0 R
/Encoding /Identity-H
/Subtype /Type0
/Type /Font
>>
endobj
Notably, it's missing a /ToUnicode, which all of the other fonts have.
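To make the TJ operand above easier to read, here is a minimal Python sketch that splits each hex string into character codes. It assumes the codes are two bytes wide, as /Identity-H implies; the helper name `tj_codes` is made up for illustration and is not part of poppler.

```python
import re

# TJ operand copied from the content stream quoted in this message.
TJ = ("[<00170016001000000037>55<000e>74<00000033>9<0057>4<00410052>-24"
      "<0054005a00560049004c004c004500000032>-4<0044>20<000e>]")

def tj_codes(operand, code_bytes=2):
    """Return the character codes in a TJ array, skipping the kerning numbers."""
    codes = []
    # Only the <...> hex strings carry character codes; the bare numbers
    # between them are position adjustments.
    for hex_string in re.findall(r"<([0-9A-Fa-f]+)>", operand):
        raw = bytes.fromhex(hex_string)
        for i in range(0, len(raw), code_bytes):
            codes.append(int.from_bytes(raw[i:i + code_bytes], "big"))
    return codes

codes = tj_codes(TJ)
print(len(codes), "codes,", codes.count(0), "of them 0x0000")
print(" ".join(f"{c:04x}" for c in codes))
```

Under that assumption the operand contains 22 codes, three of which are 0x0000. Whether poppler turns those into U+0000 in its text output is only my hypothesis at this point.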
I inspected the font object, which has `/Subtype CIDFontType0C`, and
extracted it using pdftosrc. Unfortunately, `file` does not recognize
the format and I'm struggling to find anything able to read it. Hints
appreciated.
So, is there a poppler bug here? It seems that the glib API is
emitting Identity-H encoded characters (including nulls) via the
poppler_page_get_text API, which is messing up the C-string length. So
should the API instead drop those characters for which there isn't a
Unicode mapping?
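To illustrate the suspected failure mode, here is a small Python sketch. It is not poppler code: the sample text and the helper names `c_string_view` and `drop_unmapped` are hypothetical, and it assumes unmapped glyphs come out as U+0000. It uses ctypes to show what a caller sees through a NUL-terminated char*.

```python
import ctypes

def c_string_view(text):
    """Return the bytes a C caller would see through a NUL-terminated char*."""
    utf8 = text.encode("utf-8")
    # create_string_buffer copies the bytes and appends a terminator;
    # .value stops at the first NUL byte, just as strlen() would.
    return ctypes.create_string_buffer(utf8).value

def drop_unmapped(text):
    """One possible policy: drop U+0000, treating it as 'no Unicode mapping'."""
    return text.replace("\x00", "")

# Hypothetical page text: five glyphs, the third without a Unicode mapping.
glyphs = "ab\x00cd"

print(len(glyphs))                                # glyph count: 5
print(len(c_string_view(glyphs)))                 # visible via char*: 2
print(len(c_string_view(drop_unmapped(glyphs))))  # after dropping NULs: 4
```

Note that dropping the characters only helps if the layout rectangles are filtered the same way; otherwise the two counts would still disagree, just by a different amount.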
Thanks in advance!
On 26 May 2015 at 12:56, Peter Waller <peter at scraperwiki.com> wrote:
> I forgot to note that I transformed unprintable characters to "X" in
> my dumped representation.