[Poppler-bugs] [Bug 71160] New: Differing number of items returned from get_text{,layout} for glyphs over page edge

bugzilla-daemon at freedesktop.org
Sat Nov 2 05:26:57 PDT 2013


https://bugs.freedesktop.org/show_bug.cgi?id=71160

          Priority: medium
            Bug ID: 71160
          Assignee: poppler-bugs at lists.freedesktop.org
           Summary: Differing number of items returned from
                    get_text{,layout} for glyphs over page edge
          Severity: normal
    Classification: Unclassified
                OS: All
          Reporter: p at pwaller.net
          Hardware: All
            Status: NEW
           Version: unspecified
         Component: glib frontend
           Product: poppler

Created attachment 88530
  --> https://bugs.freedesktop.org/attachment.cgi?id=88530&action=edit
PDF which returns the wrong number of glyphs/rectangles

As discussed on the mailing list, attached is a PDF containing one phrase where
the last letter overlaps the page bounding box. Unless I'm mistaken,
poppler_page_get_text_layout is returning 18 glyphs and poppler_page_get_text
is returning 17.
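
A rough way to check the counts, sketched in Python through the GObject
introspection bindings rather than the C API. It assumes PyGObject and the
Poppler 0.18 typelib are installed; the file URI in the usage note is a
placeholder, not the real attachment name.

```python
def count_text_and_rects(text, rects):
    """Return (number of characters, number of rectangles)."""
    return len(text), len(rects)


def check_pdf(uri, page_index=0):
    """Compare poppler_page_get_text with poppler_page_get_text_layout.

    `uri` must be a file:// URI. The import is done lazily so that the
    helper above can be used without Poppler installed.
    """
    import gi
    gi.require_version("Poppler", "0.18")
    from gi.repository import Poppler

    doc = Poppler.Document.new_from_file(uri, None)
    page = doc.get_page(page_index)
    text = page.get_text()
    # The binding returns a (success, rectangles) pair.
    ok, rects = page.get_text_layout()
    if not ok:
        rects = []
    return count_text_and_rects(text, rects)
```

Calling `check_pdf("file:///path/to/attachment.pdf")` (hypothetical path) on
an affected file should give two different numbers if the bug is present.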

> Yes, it's a bug, poppler_page_get_text_layout should always return the
> same number of glyphs as poppler_page_get_text. In this case the problem
> is that TextSelectionDumper::getWordList() returns the list of words
> inside the selection, but if a word is not completely selected (like in
> this case because part of the word is outside the bbox) it still returns
> the whole word.
> 
> So, we have at least two possibilities:
> 
>  - Discard characters that are off-page in
>    poppler_page_get_text_layout.
>  - Make TextWordSelection class public and return a list of
>    TextWordSelection instead of a list of words so that we know in
>    poppler_page_get_text_layout which chars of the word are selected.
> 
> The first option is probably easier, but the second one would also fix
> other cases using this API in the future, and would make
> poppler_page_get_text_layout simpler: we would only need to iterate the
> words from begin_selection to end_selection instead of from 0 to len.

My own preference for my use case is to not discard information. It would be
great if the solution could ensure that all glyphs are returned, even if they
go over the edge of the page or are off the page.
