[poppler] Differing number of items returned from get_text{, layout} for glyphs over page edge

Carlos Garcia Campos carlosgc at gnome.org
Sat Nov 2 12:46:45 CET 2013


Peter Waller <peter at scraperwiki.com> writes:

> Hi All,
>
> I attach a PDF containing one phrase where the last letter overlaps the
> page bounding box. Unless I'm mistaken, poppler_page_get_text_layout is
> returning 18 glyphs and poppler_page_get_text is returning 17.
>
> Can anyone else confirm? I'm running 0.24.1.

Yes, I confirm it.

> Is this a bug? Can I safely filter out layout rectangles which are off-page?

Yes, it's a bug: poppler_page_get_text_layout should always return the
same number of glyphs as poppler_page_get_text. In this case the problem
is that TextSelectionDumper::getWordList() returns the list of words
inside the selection, but if a word is not completely selected (as in
this case, because part of the word is outside the bbox) it still returns
the whole word.
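
To make the mismatch concrete, here is a rough model of it in plain
Python. It is not poppler code: Word, make_word, selected_text and
layout_rects are hypothetical names, and the character widths and page
size are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Word:
    """Hypothetical stand-in for a word: its text plus one rectangle per character."""
    chars: str
    boxes: List[Rect]


def make_word(text: str, x0: float, char_width: float = 10.0) -> Word:
    """Lay the characters out left to right on one line, each char_width wide."""
    boxes = [(x0 + i * char_width, 0.0, x0 + (i + 1) * char_width, 12.0)
             for i in range(len(text))]
    return Word(text, boxes)


def selected_text(word: Word, begin: int, end: int) -> str:
    """Text path: honours the selected character range [begin, end)."""
    return word.chars[begin:end]


def layout_rects(word: Word) -> List[Rect]:
    """Layout path as described above: the whole word, selection range ignored."""
    return list(word.boxes)


# Assume a page 100 units wide; the last character spans x = 95..105
# and therefore crosses the page edge.
word = make_word("edge", x0=65.0)
text = selected_text(word, 0, 3)   # only the first three characters are selected
rects = layout_rects(word)         # but all four rectangles come back
print(len(text), len(rects))
```

The two lengths differ by one, which is the same shape as the 17 versus
18 reported above.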

So, we have at least two possibilities:

 - Discard characters that are off-page in
   poppler_page_get_text_layout.
 - Make TextWordSelection class public and return a list of
   TextWordSelection instead of a list of words so that we know in
   poppler_page_get_text_layout which chars of the word are selected.

The first option is probably easier, but the second one would also fix
other cases using this API in the future, and would make
poppler_page_get_text_layout simpler: we would only need to iterate the
words from begin_selection to end_selection instead of from 0 to len.
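
The second option could look roughly like this, again modelled in plain
Python rather than in poppler's C++. WordSelection here is a hypothetical
stand-in whose begin and end fields play the role of begin_selection and
end_selection; word separators are ignored to keep the sketch short.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class WordSelection:
    """Hypothetical: a word plus the selected character range [begin, end)."""
    chars: str
    boxes: List[Rect]
    begin: int
    end: int


def text_and_layout(selections: List[WordSelection]) -> Tuple[str, List[Rect]]:
    """Build text and rectangles from the same range, so their lengths always agree."""
    text_parts: List[str] = []
    rects: List[Rect] = []
    for sel in selections:
        # Iterate from begin to end instead of from 0 to len.
        text_parts.append(sel.chars[sel.begin:sel.end])
        rects.extend(sel.boxes[sel.begin:sel.end])
    return "".join(text_parts), rects


boxes = [(65.0, 0.0, 75.0, 12.0), (75.0, 0.0, 85.0, 12.0),
         (85.0, 0.0, 95.0, 12.0), (95.0, 0.0, 105.0, 12.0)]
partial = WordSelection(chars="edge", boxes=boxes, begin=0, end=3)
text, rects = text_and_layout([partial])
print(len(text) == len(rects))
```

Because both outputs are sliced with the same range, a partially selected
word cannot contribute more rectangles than characters.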

> Thanks in advance,
>
> - Peter
>  <poppler at lists.freedesktop.org>

Regards, 
-- 
Carlos Garcia Campos
PGP key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x523E6462