[poppler] Multicolumn select

Baz brian.ewins at gmail.com
Sun Nov 15 17:17:32 PST 2009


2009/11/15 Albert Astals Cid <aacid at kde.org>:
> On Sunday, 15 November 2009, Carlos Garcia Campos wrote:
>> Excerpts from Baz's message of Fri Nov 13 12:56:26 +0100 2009:
>> > Hi,
>>
>> Hi Brian,
>>
>> > I uploaded a new version of my multicolumn select patches to
>> > https://bugs.freedesktop.org/show_bug.cgi?id=3188 this morning, as you
>> > might've seen.
>>
>> Yes, it's great to know you are working on this again :-) thank you
>> very much.
>>
>> > This version uses a similar algorithm to ocropus to
>> > determine reading order, and tries to make the selection follow this
>> > reading order. It's looking fairly good now I think - for all but one
>> > of the documents I tested with it picked a reasonable order, and
>> > selection doesn't jump all over the place. Of course, I've only tested
>> > on the handful of docs that were in the bug reports so I might've made
>> > things worse elsewhere :(
>>
>> I've just tried it and I've found some issues, see self-explanatory
>> screenshots:
>>
>> http://people.freedesktop.org/~carlosgc/poppler-multi-column-issue1.png
>> http://people.freedesktop.org/~carlosgc/poppler-multi-column-issue2.png
>>
>> The line selection (triple-click) seems to be broken too.
>>
>> > I was wondering what I can do to get these patches into an acceptable
>> > state. There's some obvious issues still to iron out, eg RTL (see
>> > http://bugs.kde.org/show_bug.cgi?id=156380 ,
>> > http://bugs.kde.org/show_bug.cgi?id=184399) and handling blocks with
>> > non-zero rotation; also the new depth_first_visit method I added is in
>> > the wrong class - should probably be in TextBlock. I'll fix this up.
>>
>> Current behaviour has been broken for a long time, so any improvement,
>> even if still a bit broken, is much appreciated.
>>
>> > But beyond that, these patches might be problematic because they
>> > remove the old selection behaviour. The new behaviour is much better
>> > for multicolumn documents, but is likely to be worse at selecting data
>> > out of tables, for example. Should the new selection mode introduce
>> > new API, so as not to change the current behaviour of Evince &
>> > Okular[1]?
>>
>> Having a new API would definitely make things easier, yes.
>>
>> > In older versions of acrobat, they had 'table select' and
>> > 'text select' modes, covering these two uses, but more recently table
>> > select has been dropped entirely. I suspect that they now just follow
>> > the tags in tagged pdf, with the fallback behaviour being something
>> > like what I've coded up here.
>> >
>> > Also, testing. At the moment, testing for me consists of opening a
>> > bunch of documents in Evince and selecting stuff randomly (I don't
>> > have Okular, but since they use the same API for text selection I
>> > presume the bug is the same).
>>
>> Well, Okular doesn't use TextOutputDev for selecting, but it does for
>> extracting the text, so it will be affected anyway.
>>
>> > I have no idea if I'm introducing
>> > regressions. Is there a plan to integrate the unit test framework that
>> > was discussed previously?
>> > http://lists.freedesktop.org/archives/poppler/2009-March/004535.html
>> > .
>>
>> Yes, but I didn't manage to get it working without crashing :-(
>>
>> > Or failing that, is there a pool somewhere of test documents for
>> > poppler/evince/okular?
>>
>> Yes, Albert has a regression test script, so he can run it with your
>> patches applied.
>
> Is it enough if I run pdftotext and compare its output?

Probably not. The bug was all about selecting regions from the page.
I've also not paid any attention to whether this code (visitSelection,
etc) is exercised by pdftotext or whether that just loops through the
blocks. However, I guess it's worth checking whether I've introduced a
regression there. If you're hitting the new code, it will definitely give
different output for multicolumn documents, since I don't preserve layout.
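
A rough sketch of how that pdftotext comparison could be scripted, assuming
two builds of pdftotext (patched and unpatched) are available. The function
names and binary paths are made up for illustration and are not part of
poppler or of Albert's regression script; the only pdftotext behaviour relied
on is that "-" as the output file writes the text to stdout.

```python
import difflib
import subprocess


def extract_text(pdftotext_bin, pdf_path):
    """Run one pdftotext build on a PDF and return the text it prints."""
    # "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        [pdftotext_bin, pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def diff_text(before, after, label="page"):
    """Return unified-diff lines for two text dumps (empty list if equal)."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=label + " (unpatched)",
            tofile=label + " (patched)",
            lineterm="",
        )
    )


def compare_builds(old_bin, new_bin, pdf_paths):
    """Map each PDF whose extracted text changed to its diff lines."""
    changed = {}
    for path in pdf_paths:
        delta = diff_text(
            extract_text(old_bin, path),  # hypothetical unpatched build
            extract_text(new_bin, path),  # hypothetical patched build
            path,
        )
        if delta:
            changed[path] = delta
    return changed
```

Something like compare_builds("./old/pdftotext", "./new/pdftotext", docs)
would then list the documents whose output changed. This only covers
whole-page extraction; it says nothing about selecting regions, which is
what the bug is actually about.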

>
> Should I do it now or wait for a patch that fixes the issues you've pointed
> out?

The code as it stands is fine to test with. All that's left to do is RTL.

Cheers,
Baz

>
> Albert
>
>>
>> > Particularly if someone has docs with rotated
>> > blocks, and an RTL doc to test; neither the RTL selection nor search
>> > bugs had docs attached; also vertical text I guess.
>> >
>> > Cheers,
>> > Baz
>>
>> We are closer to fixing it, keep up the good work!
>>

