[poppler] Multicolumn select

Baz brian.ewins at gmail.com
Wed Nov 18 17:02:29 PST 2009


2009/11/18 Albert Astals Cid <aacid at kde.org>:
> On Monday, 16 November 2009, Baz wrote:
>> 2009/11/16 Baz <brian.ewins at gmail.com>:
>> > 2009/11/15 Albert Astals Cid <aacid at kde.org>:
>> >> On Sunday, 15 November 2009, Carlos Garcia Campos wrote:
>> >>> Excerpts from Baz's message of Fri Nov 13 12:56:26 +0100 2009:
>> >>> > Hi,
>> >>>
>> >>> Hi Brian,
>> >>>
>> >>> > I uploaded a new version of my multicolumn select patches to
>> >>> > https://bugs.freedesktop.org/show_bug.cgi?id=3188 this morning, as
>> >>> > you might've seen.
>> >>>
>> >>> Yes, it's great to know you are working on this again :-) thank you
>> >>> very much.
>> >>>
>> >>> > This version uses a similar algorithm to ocropus to
>> >>> > determine reading order, and tries to make the selection follow this
>> >>> > reading order. It's looking fairly good now I think - for all but one
>> >>> > of the documents I tested with it picked a reasonable order, and
>> >>> > selection doesn't jump all over the place. Of course, I've only
>> >>> > tested on the handful of docs that were in the bug reports so I
>> >>> > might've made things worse elsewhere :(
>> >>>
>> >>> I've just tried it and I've found some issues, see self-explanatory
>> >>> screenshots:
>> >>>
>> >>> http://people.freedesktop.org/~carlosgc/poppler-multi-column-issue1.png
>> >>> http://people.freedesktop.org/~carlosgc/poppler-multi-column-issue2.png
>> >>>
>> >>> The line selection (triple-click) seems to be broken too.
>> >>>
>> >>> > I was wondering what I can do to get these patches into an acceptable
>> >>> > state. There's some obvious issues still to iron out, eg RTL (see
>> >>> > http://bugs.kde.org/show_bug.cgi?id=156380 ,
>> >>> > http://bugs.kde.org/show_bug.cgi?id=184399) and handling blocks with
>> >>> > non-zero rotation; also the new depth_first_visit method I added is
>> >>> > in the wrong class - should probably be in TextBlock. I'll fix this
>> >>> > up.
>> >>>
>> >>> Current behaviour has been broken for a long time; any improvement,
>> >>> even if still a bit broken, is much appreciated.
>> >>>
>> >>> > But beyond that, these patches might be problematic because they
>> >>> > remove the old selection behaviour. The new behaviour is much better
>> >>> > for multicolumn documents, but is likely to be worse at selecting
>> >>> > data out of tables, for example. Should the new selection mode
>> >>> > introduce new API, so as not to change the current behaviour of
>> >>> > Evince & Okular[1]?
>> >>>
>> >>> Having a new API would definitely make things easier, yes.
>> >>>
>> >>> > In older versions of acrobat, they had 'table select' and
>> >>> > 'text select' modes, covering these two uses, but more recently table
>> >>> > select has been dropped entirely. I suspect that they now just follow
>> >>> > the tags in tagged pdf, with the fallback behaviour being something
>> >>> > like what I've coded up here.
>> >>> >
>> >>> > Also, testing. At the moment, testing for me consists of opening a
>> >>> > bunch of documents in Evince and selecting stuff randomly (I don't
>> >>> > have Okular, but since they use the same API for text selection I
>> >>> > presume the bug is the same).
>> >>>
>> >>> Well, Okular doesn't use TextOutputDev for selecting, but it does for
>> >>> extracting the text, so it will be affected anyway.
>> >>>
>> >>> > I have no idea if I'm introducing
>> >>> > regressions. Is there a plan to integrate the unit test framework
>> >>> > that was discussed previously?
>> >>> > http://lists.freedesktop.org/archives/poppler/2009-March/004535.html
>> >>> > .
>> >>>
>> >>> Yes, but I didn't manage to get it working without crashing :-(
>> >>>
>> >>> > Or failing that, is there a pool somewhere of test documents for
>> >>> > poppler/evince/okular?
>> >>>
>> >>> Yes, Albert has a regression test script, so he can run it with your
>> >>> patches applied.
>> >>
>> >> Is it enough if I run pdftotext and compare its output?
>> >
>> > Probably not. The bug was all about selecting regions from the page.
>> > I've also not paid any attention to whether this code (visitSelection,
>> > etc) is exercised by pdftotext or whether that just loops through the
>> > blocks. However, I guess it's worth checking if I've introduced a
>> > regression there. If you're hitting the new code, it's definitely got
>> > different output for multicolumn since I don't preserve layout.
>>
>> I've checked now... yes pdftotext with no flags will hit the new
>> reading order code.
>
> And that is good or bad? :D

Well, it means you can test it without manually selecting bits, at
least. I won't have a chance to look at this again until the weekend,
but after I wrote that I dumped out the text from all the 'bad' pdf
docs I have from Acrobat 9, to use for some automated tests. I'm
thinking I can start with 'diff -b old new|wc -l' or some such
and see how it does things differently.
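
As a rough illustration of that kind of check, here is a minimal Python
sketch that counts how many lines differ between two text dumps while
ignoring whitespace differences. It is an assumption-laden stand-in for
'diff -b old new|wc -l', not the same thing: the function names are
hypothetical, the whitespace normalisation is slightly looser than
diff -b (it also strips leading whitespace), and the count excludes the
hunk markers that diff prints.

```python
import difflib


def normalise(line):
    # Collapse whitespace runs and strip both ends.
    # Assumption: this is close enough to `diff -b` for a first look.
    return " ".join(line.split())


def count_changed_lines(old_text, new_text):
    """Count lines removed from old plus lines added in new."""
    old = [normalise(line) for line in old_text.splitlines()]
    new = [normalise(line) for line in new_text.splitlines()]
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Lines dropped from the old dump plus lines new to the new dump.
            changed += (i2 - i1) + (j2 - j1)
    return changed
```

The two arguments would be, for example, the text dumped from Acrobat
and the output of pdftotext for the same document; a lower count
suggests the two reading orders agree more closely.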

Incidentally, this showed that Acrobat is not using the same
algorithm for text selection and text dump - the 'dump' version is
better. It got a very good result on a doc that its selection failed
on terribly; by comparison, patched poppler was mostly ok for both.

re the qt4 test docs... ok I presume I messed up somewhere along the
line updating my build from 0.10.2; there mustn't have been test docs
in the gnome dev kit, so when I updated it didn't check them out. My
bad.

Re: "How is the new selection behaviour worse? Because it thinks texts
in tables is columns?"

Yes, precisely. There are actually a couple of bugs reported by
people who used the current selection mode and were disappointed that
it messed up their tables - because poppler shifted text around to
avoid overlapping blocks, or whatever. Personally I think that's an
unrealistic expectation, and that selecting in columns is ok - at
least the data is recoverable.
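
To illustrate what "recoverable" can mean here, a minimal Python
sketch under strong assumptions: the table is a regular grid with no
empty or merged cells, the selected text arrives one column after
another, and the number of columns is known. The function name is
hypothetical and none of this is poppler code.

```python
def rows_from_columns(cells, n_columns):
    """Rebuild table rows from cells that were read column by column."""
    if n_columns <= 0 or len(cells) % n_columns != 0:
        raise ValueError("cells do not form a regular grid")
    n_rows = len(cells) // n_columns
    # Slice the flat list back into its columns...
    columns = [cells[i * n_rows:(i + 1) * n_rows] for i in range(n_columns)]
    # ...then transpose so each entry is one table row.
    return [list(row) for row in zip(*columns)]
```

For a two-column table selected as "a, b, c" followed by "1, 2, 3",
this pairs "a" with "1", "b" with "2" and "c" with "3". Tables with
missing cells would need more information than the text alone.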

Cheers,
Baz

>
> Albert
>
>>
>> >> Should I do it now or wait for a patch that fixes the issues you've
>> >> pointed out?
>> >
>> > The code now is fine to test with. All that's left to do is RTL.
>> >
>> > Cheers,
>> > Baz
>> >
>> >> Albert
>> >>
>> >>> > Particularly if someone has docs with rotated
>> >>> > blocks, and an RTL doc to test; neither the RTL selection nor search
>> >>> > bugs had docs attached; also vertical text I guess.
>> >>> >
>> >>> > Cheers,
>> >>> > Baz
>> >>>
>> >>> We are closer to fixing it, keep up the good work!
>> >>
>> >> _______________________________________________
>> >> poppler mailing list
>> >> poppler at lists.freedesktop.org
>> >> http://lists.freedesktop.org/mailman/listinfo/poppler