[poppler] Multicolumn select

Baz brian.ewins at gmail.com
Sun Nov 15 17:04:58 PST 2009


2009/11/15 Carlos Garcia Campos <carlosgc at gnome.org>:
> Excerpts from Baz's message of vie nov 13 12:56:26 +0100 2009:
>> Hi,
>
> Hi Brian,
>
>> I uploaded a new version of my multicolumn select patches to
>> https://bugs.freedesktop.org/show_bug.cgi?id=3188 this morning, as you
>> might've seen.
>
> Yes, it's great to know you are working on this again :-) thank you
> very much.
>
>> This version uses a similar algorithm to ocropus to
>> determine reading order, and tries to make the selection follow this
>> reading order. It's looking fairly good now I think - for all but one
>> of the documents I tested with it picked a reasonable order, and
>> selection doesn't jump all over the place. Of course, I've only tested
>> on the handful of docs that were in the bug reports so I might've made
>> things worse elsewhere :(
>
> I've just tried it and I've found some issues, see self-explanatory
> screenshots:
>
> http://people.freedesktop.org/~carlosgc/poppler-multi-column-issue1.png

The reading order algorithm thinks a block A is before block B if
(rule 1) block A overlaps block B horizontally and is above it; or
(rule 2) block A is left of block B and there is no block C such that
B is before C by rule 1, and C is before A by rule 1.
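
To make the two rules concrete, here is a minimal sketch in Python. It
is not the code from the patches: the Block class, the coordinates, the
convention that y grows downward, reading "above" as a comparison of top
edges, and reading "left of" as "entirely left of" are all assumptions
made for illustration. The depth-first traversal is just one way to turn
the pairwise rule into a single order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical bounding box; y grows downward (assumption)."""
    name: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def rule1(a, b):
    """Rule 1: a overlaps b horizontally and a's top edge is above b's."""
    overlaps = a.x_min < b.x_max and b.x_min < a.x_max
    return overlaps and a.y_min < b.y_min


def rule2(a, b, blocks):
    """Rule 2: a is entirely left of b, and no block c exists with
    b before c and c before a by rule 1."""
    if a.x_max > b.x_min:
        return False
    return not any(rule1(b, c) and rule1(c, a)
                   for c in blocks if c is not a and c is not b)


def before(a, b, blocks):
    return rule1(a, b) or rule2(a, b, blocks)


def reading_order(blocks):
    """Depth-first visit: emit every predecessor of a block before it."""
    order, visited = [], set()

    def visit(b):
        if id(b) in visited:
            return
        visited.add(id(b))
        for a in blocks:
            if a is not b and before(a, b, blocks):
                visit(a)
        order.append(b)

    for b in blocks:
        visit(b)
    return order


# Made-up page: a full-width header above two columns.
header = Block("header", 0, 100, 0, 10)
left = Block("left", 0, 45, 20, 100)
right = Block("right", 55, 100, 20, 100)
names = [b.name for b in reading_order([right, left, header])]
print(names)  # ['header', 'left', 'right']
```

The header comes first by rule 1 (it overlaps and sits above both
columns), and the left column precedes the right one by rule 2.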

Here, 'Introduction' is to the left of the address and doesn't overlap
it. Hence rule 2 applies and Introduction is seen as being before the
address. In ocropus, this particular bug wouldn't happen because the
lines are expanded left & right to fit the column they belong to (ie
'Introduction' would be expanded right), though bugs of this kind are
still possible. A lot of the bugs I'm seeing are due to short
paragraphs like this.
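
A rough sketch of that expansion idea, in Python. This is not ocropus
code and not part of the patches: the Box class, the expand_to_column
helper, the list of column x-ranges (assumed to come from some earlier
column detection step) and all coordinates are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box; y grows downward (assumption)."""
    name: str
    x_min: float
    x_max: float
    y_min: float
    y_max: float


def x_overlap(a_min, a_max, b_min, b_max):
    """Length of the shared horizontal extent, 0 if disjoint."""
    return max(0.0, min(a_max, b_max) - max(a_min, b_min))


def expand_to_column(box, columns):
    """Widen box to the column (x_min, x_max) it overlaps most.
    The box is returned unchanged if it overlaps no column."""
    best = max(columns,
               key=lambda c: x_overlap(box.x_min, box.x_max, *c),
               default=None)
    if best is None or x_overlap(box.x_min, box.x_max, *best) == 0:
        return box
    return replace(box,
                   x_min=min(box.x_min, best[0]),
                   x_max=max(box.x_max, best[1]))


# Made-up layout: a short heading below and to the left of an address.
intro = Box("Introduction", 0, 25, 50, 60)
address = Box("address", 30, 70, 10, 40)
columns = [(0, 45), (55, 100)]  # assumed column x-ranges

wide_intro = expand_to_column(intro, columns)
gap_before = x_overlap(intro.x_min, intro.x_max,
                       address.x_min, address.x_max)
gap_after = x_overlap(wide_intro.x_min, wide_intro.x_max,
                      address.x_min, address.x_max)
```

Before expansion the heading and the address share no horizontal
extent, so only rule 2 can relate them. After expansion they overlap,
so rule 1 applies and the address, being higher on the page, comes
first.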

> http://people.freedesktop.org/~carlosgc/poppler-multi-column-issue2.png

This one is down to the bullet points being allocated blocks of their
own. I haven't touched the code that builds blocks yet.

>
> The line selection (triple-click) seems to be broken too.

Thanks, fixed it - it's working again in the latest round of patches.

>
>> I was wondering what I can do to get these patches into an acceptable
>> state. There's some obvious issues still to iron out, eg RTL (see
>> http://bugs.kde.org/show_bug.cgi?id=156380 ,
>> http://bugs.kde.org/show_bug.cgi?id=184399) and handling blocks with
>> non-zero rotation; also the new depth_first_visit method I added is in
>> the wrong class - should probably be in TextBlock. I'll fix this up.
>
> Current behaviour has been broken for a long time; any improvement,
> even if still a bit broken, is very much appreciated.
>
>> But beyond that, these patches might be problematic because they
>> remove the old selection behaviour. The new behaviour is much better
>> for multicolumn documents, but is likely to be worse at selecting data
>> out of tables, for example. Should the new selection mode introduce
>> new API, so as not to change the current behaviour of Evince &
>> Okular[1]?
>
> Having a new API would definitely make things easier, yes.

I'd need some hints. I'm already well beyond my comfort zone poking
around with this stuff; I don't do C++. I can manage to fill in the
blanks if I have an idea what the API you want is though.

-Baz

>
>> In older versions of acrobat, they had 'table select' and
>> 'text select' modes, covering these two uses, but more recently table
>> select has been dropped entirely. I suspect that they now just follow
>> the tags in tagged pdf, with the fallback behaviour being something
>> like what I've coded up here.
>>
>> Also, testing. At the moment, testing for me consists of opening a
>> bunch of documents in Evince and selecting stuff randomly (I don't
>> have Okular, but since they use the same API for text selection I
>> presume the bug is the same).
>
> Well, Okular doesn't use TextOutputDev for selecting, but it does for
> extracting the text, so it will be affected anyway.
>
>> I have no idea if I'm introducing
>> regressions. Is there a plan to integrate the unit test framework that
>> was discussed previously?
>> http://lists.freedesktop.org/archives/poppler/2009-March/004535.html
>> .
>
> Yes, but I didn't manage to get it working without crashing :-(
>
>> Or failing that, is there a pool somewhere of test documents for
>> poppler/evince/okular?
>
> Yes, Albert has a regression test script, so he can run it with your
> patches applied.
>
>> Particularly if someone has docs with rotated
>> blocks, and an RTL doc to test; neither the RTL selection nor search
>> bugs had docs attached; also vertical text I guess.
>>
>> Cheers,
>> Baz
>
> We are closer to fixing it, keep up the good work!
> --
> Carlos Garcia Campos
> PGP key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x523E6462
>
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/poppler
>
>

