[poppler] Incompatible number of glyphs from glib get_text{, layout}

Peter Waller peter at scraperwiki.com
Tue May 26 14:17:43 PDT 2015


Hi William,

We see a large number of PDFs with this kind of breakage in them from
a diverse set of sources.

I used pdftk to extract the single page from the original document; the
same issue manifests both before and after the extraction.

Thanks,

- Peter

On 26 May 2015 at 19:26, William Bader <williambader at hotmail.com> wrote:
> Is the difference the italic text "760 W. Swartzville Rd.  Reinholds, PA
> 17569"?
>
> That is not the address of Zook Interiors, right?
>
> Is that a hidden mark added by the person who created the PDF?
>
> Maybe they intentionally used an incorrect coding.
>
> Then the question might be how the two different methods of extracting
> information respond to invalid data in the PDF.
>
> pdftotext does not handle that text correctly, and ps2ascii (from
> ghostscript 9.16) crashes on it with
>
> **** Warning: considering '0000000000 XXXXX n' as a free entry.
>
> *** Warning: composite font characters dumped without decoding.
>
> If a PDF breaks both poppler and ghostscript, the problem is probably the
> PDF.
>
> pdfinfo shows that the file was made by pdftk 1.44, so it could be a bug or
> intentional change in pdftk.
>
> William
>
> ________________________________
> From: peter at scraperwiki.com
> Date: Tue, 26 May 2015 10:53:52 +0100
> To: poppler at lists.freedesktop.org
> Subject: Re: [poppler] Incompatible number of glyphs from glib get_text{,
> layout}
>
>
> On 17 January 2014 at 10:30, Peter Waller <peter at scraperwiki.com> wrote:
>
> A screenshot from the poppler glib demo app demonstrates this, attached
> below. Poppler gets 696 characters and 1261 layout rectangles.
>
> <snip>
>
> http://pwaller.net/sw/2014-01-17-broken.pdf
>
> <snip>
>
> I've reported this on bugzilla here:
> https://bugs.freedesktop.org/show_bug.cgi?id=73885
>
>
> Link to old thread:
> http://thread.gmane.org/gmane.comp.freedesktop.poppler/8683
>
> I've investigated this briefly. An observation:
>
> http://cgit.freedesktop.org/poppler/poppler/tree/glib/poppler-page.cc?id=poppler-0.33.0#n825
>
> The sel_text->getLength() is 1283 (which doesn't match the 1261 rectangles
> from poppler_page_get_layout).
>
> If I change this to use a g_strndup() with the correct length:
>
> result = g_strndup (sel_text->getCString (), sel_text->getLength());
>
>
> and then look at result[696:], I find that the rest of the string is
> filled with 0 bytes.
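
A minimal sketch in plain C (no glib, no poppler) of the kind of mismatch
described above: a buffer whose stored length is larger than what strlen()
sees, because the tail consists of NUL bytes. The helper names
(copy_with_length, visible_length, count_nul_bytes) are made up for
illustration and are not poppler or glib API. It uses memcpy() rather than
g_strndup() because, as far as I know, g_strndup() is documented to pad the
result with NULs when the source string is shorter than n, which could by
itself produce a zero-filled tail.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy exactly `len` bytes, embedded NULs included,
 * and append one terminating NUL. Caller frees the result. */
char *copy_with_length(const char *buf, size_t len)
{
    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, buf, len);   /* unlike strncpy, does not stop at a NUL */
    out[len] = '\0';
    return out;
}

/* Hypothetical helper: number of bytes before the first NUL within the
 * first `len` bytes, i.e. what strlen() would report for this buffer. */
size_t visible_length(const char *buf, size_t len)
{
    const char *nul = memchr(buf, '\0', len);
    return nul != NULL ? (size_t)(nul - buf) : len;
}

/* Hypothetical helper: how many of the first `len` bytes are NUL. */
size_t count_nul_bytes(const char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\0')
            count++;
    }
    return count;
}
```

For a 6-byte buffer holding "abc" followed by three NUL bytes,
visible_length() and count_nul_bytes() both give 3, and strlen() on the
copy also stops at 3 even though 6 bytes were copied. Comparing the two
lengths this way would at least show whether the NULs are present in the
source buffer or are introduced by the copy.
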
>
> I'm extremely keen to get this fixed, so any pointers would be appreciated.
> The rate of encountering this bug is increasing all the time!
>
> Thanks,
>
> - Peter
>
