<html>
<head>
<base href="https://bugs.freedesktop.org/" />
</head>
<body>
<p>
<div>
<b><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - Differing number of items returned from get_text{,layout} for glyphs over page edge"
href="https://bugs.freedesktop.org/show_bug.cgi?id=71160#c2">Comment # 2</a>
on <a class="bz_bug_link
bz_status_NEW "
title="NEW --- - Differing number of items returned from get_text{,layout} for glyphs over page edge"
href="https://bugs.freedesktop.org/show_bug.cgi?id=71160">bug 71160</a>
from <span class="vcard"><a class="email" href="mailto:carlosgc@gnome.org" title="Carlos Garcia Campos <carlosgc@gnome.org>"> <span class="fn">Carlos Garcia Campos</span></a>
</span></b>
<pre>(In reply to <a href="show_bug.cgi?id=71160#c0">comment #0</a>)
<span class="quote">> Created <span class=""><a href="attachment.cgi?id=88530" name="attach_88530" title="PDF which returns the wrong number of glyphs/rectangles">attachment 88530</a> <a href="attachment.cgi?id=88530&action=edit" title="PDF which returns the wrong number of glyphs/rectangles">[details]</a></span>
> PDF which returns the wrong number of glyphs/rectangles
>
> As discussed on the mailing list, attached is a PDF containing one phrase
> where the last letter overlaps the page bounding box. Unless I'm mistaken,
> poppler_page_get_text_layout is returning 18 glyphs and
> poppler_page_get_text is returning 17.
>
> > Yes, it's a bug, poppler_page_get_text_layout should always return the
> > same number of glyphs as poppler_page_get_text. In this case the problem
> > is that TextSelectionDumper::getWordList() returns the list of words
> > inside the selection, but if a word is not completely selected (like in
> > this case because part of the word is outside the bbox) it still returns
> > the whole word.
> >
> > So, we have at least two possibilities:
> >
> > - Discard characters that are off-page in
> > poppler_page_get_text_layout.
> > - Make TextWordSelection class public and return a list of
> > TextWordSelection instead of a list of words so that we know in
> > poppler_page_get_text_layout which chars of the word are selected.
> >
> > The first option is probably easier, but the second one would also fix
> > other cases using this API in the future, and would make
> > poppler_page_get_text_layout simpler: we would only need to iterate the
> > words from begin_selection to end_selection instead of from 0 to len.
>
> My own preference for my use case is to not discard information. It would be
> great if the solution could ensure that all glyphs are returned, even if
> they go over the edge of the page or are off the page.</span>
I don't think we should return characters that are not inside the page. What is
your use case exactly?</pre>
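<pre>For illustration, the first option quoted above (discarding off-page
characters) can be modelled with a small self-contained sketch. This is not
poppler code: Rect, inside and discard_off_page are hypothetical names, and
treating "inside the page" as "the whole glyph box fits in the page box" is
an assumption made only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical axis-aligned box; stands in for a glyph or page rectangle."""
    x1: float
    y1: float
    x2: float
    y2: float


def inside(glyph: Rect, page: Rect) -> bool:
    # Assumption: a glyph counts as on-page only if its whole box fits.
    return (glyph.x1 >= page.x1 and page.x2 >= glyph.x2
            and glyph.y1 >= page.y1 and page.y2 >= glyph.y2)


def discard_off_page(glyphs: list[Rect], page: Rect) -> list[Rect]:
    """Keep only the glyph boxes that lie fully inside the page box."""
    return [g for g in glyphs if inside(g, page)]


page = Rect(0, 0, 100, 50)
# 18 glyph boxes, each 5 units wide, starting at x = 12.
# The last box spans x = 97..102, so it crosses the right page edge.
glyphs = [Rect(12 + 5 * i, 10, 17 + 5 * i, 20) for i in range(18)]
kept = discard_off_page(glyphs, page)
print(len(glyphs), len(kept))
```

With this made-up data the filtered list has one box fewer than the input,
which mirrors the 18 versus 17 mismatch reported in comment #0.</pre>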
</div>
</p>
</body>
</html>