[Poppler-bugs] [Bug 71160] Differing number of items returned from get_text{, layout} for glyphs over page edge

bugzilla-daemon at freedesktop.org
Sat Nov 2 06:19:01 PDT 2013


https://bugs.freedesktop.org/show_bug.cgi?id=71160

--- Comment #2 from Carlos Garcia Campos <carlosgc at gnome.org> ---
(In reply to comment #0)
> Created attachment 88530
> PDF which returns the wrong number of glyphs/rectangles
> 
> As discussed on the mailing list, attached is a PDF containing one phrase
> where the last letter overlaps the page bounding box. Unless I'm mistaken,
> poppler_page_get_text_layout is returning 18 glyphs and
> poppler_page_get_text is returning 17.
> 
> > Yes, it's a bug, poppler_page_get_text_layout should always return the
> > same number of glyphs as poppler_page_get_text. In this case the problem
> > is that TextSelectionDumper::getWordList() returns the list of words
> > inside the selection, but if a word is not completely selected (like in
> > this case because part of the word is outside the bbox) it still returns
> > the whole word.
> > 
> > So, we have at least two possibilities:
> > 
> >  - Discard characters that are off-page in
> >    poppler_page_get_text_layout.
> >  - Make TextWordSelection class public and return a list of
> >    TextWordSelection instead of a list of words so that we know in
> >    poppler_page_get_text_layout which chars of the word are selected.
> > 
> > The first option is probably easier, but the second one would also fix
> > other cases using this API in the future, and would simplify
> > poppler_page_get_text_layout: we would only need to iterate the
> > words from begin_selection to end_selection instead of from 0 to len.
> 
> My own preference for my use case is to not discard information. It would be
> great if the solution could ensure that all glyphs are returned, even if
> they go over the edge of the page or are off the page.
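
For reference, the two options quoted above could be sketched roughly as
follows. This is only an illustration under stated assumptions, not poppler
code: Rect, Glyph, Word and WordSelection are hypothetical stand-ins (the
last one loosely modelled on the idea of a public TextWordSelection), a
glyph is assumed to be "on page" only when its box lies fully inside the
page, and word separators are ignored.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT poppler types.
struct Rect { double x1, y1, x2, y2; };

struct Glyph {
    char ch;    // one character of the word (ASCII only, for simplicity)
    Rect bbox;  // glyph bounding box in page coordinates
};

struct Word { std::vector<Glyph> glyphs; };

// Hypothetical analogue of a public TextWordSelection: a word plus the
// half-open range [begin, end) of its glyphs that are actually selected.
struct WordSelection {
    const Word *word;
    std::size_t begin;
    std::size_t end;
};

// Assumption: a glyph counts as "on page" only if its box lies fully inside.
static bool insidePage(const Rect &g, const Rect &page) {
    return g.x1 >= page.x1 && g.y1 >= page.y1 &&
           g.x2 <= page.x2 && g.y2 <= page.y2;
}

// Option 1: walk every glyph of every word and discard the off-page ones.
std::vector<Rect> layoutDiscardingOffPage(const std::vector<Word> &words,
                                          const Rect &page) {
    std::vector<Rect> out;
    for (const Word &w : words)
        for (const Glyph &g : w.glyphs)
            if (insidePage(g.bbox, page))
                out.push_back(g.bbox);
    return out;
}

// Option 2: iterate only from begin to end of each word selection,
// instead of from 0 to the full word length.
std::vector<Rect> layoutFromSelections(const std::vector<WordSelection> &sels) {
    std::vector<Rect> out;
    for (const WordSelection &s : sels)
        for (std::size_t i = s.begin; i < s.end; ++i)
            out.push_back(s.word->glyphs[i].bbox);
    return out;
}

// Text built from the same selections, so the character count and the
// rectangle count agree by construction.
std::string textFromSelections(const std::vector<WordSelection> &sels) {
    std::string out;
    for (const WordSelection &s : sels)
        for (std::size_t i = s.begin; i < s.end; ++i)
            out.push_back(s.word->glyphs[i].ch);
    return out;
}
```

With the second approach both the text and the layout are derived from the
same [begin, end) range, which is why the counts cannot drift apart; the
first approach depends on the page test matching whatever the text
extraction did.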

I don't think we should return characters that are not inside the page. What is
your use case exactly?
