[poppler] Multicolumn select

Albert Astals Cid aacid at kde.org
Wed Nov 18 15:41:40 PST 2009


On Monday, 16 November 2009, Baz wrote:
> 2009/11/16 Baz <brian.ewins at gmail.com>:
> > 2009/11/15 Albert Astals Cid <aacid at kde.org>:
> >> On Sunday, 15 November 2009, Carlos Garcia Campos wrote:
> >>> Excerpts from Baz's message of Fri Nov 13 12:56:26 +0100 2009:
> >>> > Hi,
> >>>
> >>> Hi Brian,
> >>>
> >>> > I uploaded a new version of my multicolumn select patches to
> >>> > https://bugs.freedesktop.org/show_bug.cgi?id=3188 this morning, as
> >>> > you might've seen.
> >>>
> >>> Yes, it's great to know you are working on this again :-) thank you
> >>> very much.
> >>>
> >>> > This version uses a similar algorithm to ocropus to
> >>> > determine reading order, and tries to make the selection follow this
> >>> > reading order. It's looking fairly good now, I think - for all but one
> >>> > of the documents I tested with, it picked a reasonable order, and
> >>> > selection doesn't jump all over the place. Of course, I've only
> >>> > tested on the handful of docs that were in the bug reports so I
> >>> > might've made things worse elsewhere :(
> >>>
> >>> I've just tried it and I've found some issues, see self-explanatory
> >>> screenshots:
> >>>
> >>> http://people.freedesktop.org/~carlosgc/poppler-multi-column-issue1.png
> >>> http://people.freedesktop.org/~carlosgc/poppler-multi-column-issue2.png
> >>>
> >>> The line selection (triple-click) seems to be broken too.
> >>>
> >>> > I was wondering what I can do to get these patches into an acceptable
> >>> > state. There are some obvious issues still to iron out, e.g. RTL (see
> >>> > http://bugs.kde.org/show_bug.cgi?id=156380 ,
> >>> > http://bugs.kde.org/show_bug.cgi?id=184399) and handling blocks with
> >>> > non-zero rotation; also the new depth_first_visit method I added is
> >>> > in the wrong class - should probably be in TextBlock. I'll fix this
> >>> > up.
> >>>
> >>> Current behaviour has been broken for a long time; any improvement,
> >>> even if still a bit broken, is very much appreciated.
> >>>
> >>> > But beyond that, these patches might be problematic because they
> >>> > remove the old selection behaviour. The new behaviour is much better
> >>> > for multicolumn documents, but is likely to be worse at selecting
> >>> > data out of tables, for example. Should the new selection mode
> >>> > introduce new API, so as not to change the current behaviour of
> >>> > Evince & Okular[1]?
> >>>
> >>> Having a new API would definitely make things easier, yes.
> >>>
> >>> > In older versions of Acrobat, they had 'table select' and
> >>> > 'text select' modes, covering these two uses, but more recently table
> >>> > select has been dropped entirely. I suspect that they now just follow
> >>> > the tags in tagged pdf, with the fallback behaviour being something
> >>> > like what I've coded up here.
> >>> >
> >>> > Also, testing. At the moment, testing for me consists of opening a
> >>> > bunch of documents in Evince and selecting stuff randomly (I don't
> >>> > have Okular, but since they use the same API for text selection I
> >>> > presume the bug is the same).
> >>>
> >>> Well, Okular doesn't use TextOutputDev for selecting, but it does for
> >>> extracting the text, so it will be affected anyway.
> >>>
> >>> > I have no idea if I'm introducing
> >>> > regressions. Is there a plan to integrate the unit test framework
> >>> > that was discussed previously?
> >>> > http://lists.freedesktop.org/archives/poppler/2009-March/004535.html
> >>> > .
> >>>
> >>> Yes, but I didn't manage to get it working without crashing :-(
> >>>
> >>> > Or failing that, is there a pool somewhere of test documents for
> >>> > poppler/evince/okular?
> >>>
> >>> Yes, Albert has a regression test script, so he can run it with your
> >>> patches applied.
> >>
> >> Is it enough if I run pdftotext and compare its output?
> >
> > Probably not. The bug was all about selecting regions from the page.
> > I've also not paid any attention to whether this code (visitSelection,
> > etc) is exercised by pdftotext or whether that just loops through the
> > blocks. However, I guess it's worth checking if I've introduced a
> > regression there. If you're hitting the new code, it's definitely got
> > different output for multicolumn since I don't preserve layout.
> 
> I've checked now... yes pdftotext with no flags will hit the new
> reading order code.

And is that good or bad? :D

Albert
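P.S. For the pdftotext comparison discussed above, here is a minimal sketch of what such a check could look like. It is not the actual regression script; the binary paths and the helper names (extract_text, diff_text) are assumptions made up for illustration. It only relies on pdftotext accepting "-" as the output file to write to stdout.

```python
import difflib
import subprocess


def extract_text(pdftotext_bin, pdf_path):
    """Run one pdftotext build on a PDF and return its text output.

    pdftotext_bin is a hypothetical path to an unpatched or patched build.
    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        [pdftotext_bin, pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    # Decode leniently so one odd byte does not abort the whole run.
    return result.stdout.decode("utf-8", errors="replace")


def diff_text(before, after, label):
    """Return unified-diff lines between two extractions (empty if equal)."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=label + " (unpatched)",
            tofile=label + " (patched)",
            lineterm="",
        )
    )
```

Usage would be something like diff_text(extract_text(old_bin, pdf), extract_text(new_bin, pdf), pdf) for each test document. A non-empty diff is not automatically a regression here, since a changed reading order is expected for multicolumn pages; it only flags the documents worth a look by eye.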

> 
> >> Should I do it now or wait for a patch that fixes the issues you've
> >> pointed out?
> >
> > The code now is fine to test with. All that's left to do is RTL.
> >
> > Cheers,
> > Baz
> >
> >> Albert
> >>
> >>> > Particularly if someone has docs with rotated
> >>> > blocks, and an RTL doc to test; neither the RTL selection nor the search
> >>> > bugs had docs attached; also vertical text I guess.
> >>> >
> >>> > Cheers,
> >>> > Baz
> >>>
> >>> We are closer to fixing it, keep up the good work!
> >>
> >> _______________________________________________
> >> poppler mailing list
> >> poppler at lists.freedesktop.org
> >> http://lists.freedesktop.org/mailman/listinfo/poppler
> 
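A footnote on the reading-order idea mentioned in the thread ("a similar algorithm to ocropus" plus a depth-first visit): the sketch below is not the code from the patches. It is a rough illustration under stated assumptions: blocks are axis-aligned boxes with y growing downward, one block precedes another if they overlap horizontally and it is higher up, or if it lies entirely to the left with no block spanning both in between vertically, and the final order comes from a depth-first topological sort. The names Box, comes_before and reading_order are made up for the example.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    name: str
    x0: float
    y0: float  # top edge; y grows downward (assumption)
    x1: float
    y1: float


def x_overlap(a: Box, b: Box) -> bool:
    """True if the horizontal extents of a and b overlap."""
    return a.x0 < b.x1 and b.x0 < a.x1


def comes_before(a: Box, b: Box, boxes: List[Box]) -> bool:
    """Pairwise ordering rule, loosely following the ocropus-style idea."""
    if x_overlap(a, b):
        return a.y0 < b.y0  # same column: the higher block reads first
    if a.x1 <= b.x0:
        # a is entirely left of b: a reads first unless some block c
        # spans both columns and sits vertically between them.
        lo, hi = min(a.y0, b.y0), max(a.y0, b.y0)
        for c in boxes:
            if c is a or c is b:
                continue
            if x_overlap(c, a) and x_overlap(c, b) and lo < c.y0 < hi:
                return False
        return True
    return False


def reading_order(boxes: List[Box]) -> List[Box]:
    """Depth-first topological sort: emit a block after its predecessors."""
    order: List[Box] = []
    visited = set()

    def visit(i: int) -> None:
        if i in visited:
            return
        visited.add(i)  # also guards against cycles in the pairwise rule
        for j, other in enumerate(boxes):
            if j != i and comes_before(other, boxes[i], boxes):
                visit(j)
        order.append(boxes[i])

    for i in range(len(boxes)):
        visit(i)
    return order
```

For a page with a full-width title, two columns and a full-width footer, this yields title, left column, right column, footer regardless of the input order. Rotated blocks, RTL and vertical text are deliberately ignored in the sketch.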