[poppler] Incompatible number of glyphs from glib get_text{, layout}
William Bader
williambader at hotmail.com
Tue May 26 11:26:06 PDT 2015
Is the difference the italic text "760 W. Swartzville Rd. Reinholds, PA 17569"?
That is not the address of Zook Interiors, right?
Is that a hidden mark added by the person who created the PDF?
Maybe they intentionally used an incorrect coding.
Then the question might be how the two different methods of extracting information respond to invalid data in the PDF.
pdftotext does not handle that text correctly, and ps2ascii (from ghostscript 9.16) crashes on it with:
**** Warning: considering '0000000000 XXXXX n' as a free entry.
*** Warning: composite font characters dumped without decoding.
If a PDF breaks both poppler and ghostscript, the problem is probably the PDF.
pdfinfo shows that the file was made by pdftk 1.44, so it could be a bug or an intentional change in pdftk.
William
From: peter at scraperwiki.com
Date: Tue, 26 May 2015 10:53:52 +0100
To: poppler at lists.freedesktop.org
Subject: Re: [poppler] Incompatible number of glyphs from glib get_text{, layout}
On 17 January 2014 at 10:30, Peter Waller <peter at scraperwiki.com> wrote:
A screenshot from the poppler glib demo app demonstrates this, attached below. Poppler gets 696 characters and 1261 layout rectangles. <snip>
http://pwaller.net/sw/2014-01-17-broken.pdf
<snip>
I've reported this on bugzilla here: https://bugs.freedesktop.org/show_bug.cgi?id=73885
Link to old thread: http://thread.gmane.org/gmane.comp.freedesktop.poppler/8683
I've investigated this briefly. An observation:
http://cgit.freedesktop.org/poppler/poppler/tree/glib/poppler-page.cc?id=poppler-0.33.0#n825
The sel_text->getLength() is 1283 (which doesn't match the 1261 from poppler_page_get_layout).
If I change this to use a g_strndup() with the correct length:
result = g_strndup (sel_text->getCString (), sel_text->getLength());
And then look at result[696:], then I find that the rest of the string is filled with 0 bytes.
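
One caveat on reading that result: GLib documents g_strndup() as padding the buffer with NULs when the source string is shorter than n, so a tail of zero bytes after offset 696 does not by itself show what the original buffer held past the first NUL. The plain C sketch below illustrates the distinction without needing poppler or GLib. The counted_buf type and the three helper names are hypothetical stand-ins invented for this illustration (they are not poppler or GLib API), and strncpy() is used only because it has the same stop-at-NUL-then-pad behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a length-carrying string buffer:
 * `data` holds `length` bytes and may contain embedded NUL bytes. */
typedef struct {
    const char *data;
    size_t length;
} counted_buf;

/* Number of bytes a C-string function (strlen, strdup, ...) would see:
 * everything before the first NUL, capped at the reported length. */
size_t visible_length(const counted_buf *b) {
    assert(b != NULL && b->data != NULL);
    const char *nul = memchr(b->data, '\0', b->length);
    return nul ? (size_t)(nul - b->data) : b->length;
}

/* strncpy-style copy: copying stops at the first NUL and the rest of the
 * destination stays zero, so bytes after the NUL are lost. */
char *copy_strncpy_style(const counted_buf *b) {
    assert(b != NULL && b->data != NULL);
    char *out = calloc(b->length + 1, 1);
    if (out != NULL)
        strncpy(out, b->data, b->length);
    return out;
}

/* memcpy-style copy: keeps every byte, including those after a NUL. */
char *copy_all_bytes(const counted_buf *b) {
    assert(b != NULL && b->data != NULL);
    char *out = malloc(b->length + 1);
    if (out != NULL) {
        memcpy(out, b->data, b->length);
        out[b->length] = '\0';
    }
    return out;
}
```

If this reading is right, comparing the byte at offset 696 and the bytes after it in a memcpy-style copy of sel_text would show whether the buffer really is zero-filled there, or whether there is text behind an embedded NUL that the C-string copy discards.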
I'm extremely keen to get this fixed, so any pointers would be appreciated. The rate of encountering this bug is increasing all the time!
Thanks,
- Peter