[poppler] Multicolumn select

Albert Astals Cid aacid at kde.org
Wed Nov 18 13:29:30 PST 2009


On Monday, 16 November 2009, Baz wrote:
> 2009/11/15 Carlos Garcia Campos <carlosgc at gnome.org>:
> > Excerpts from Baz's message of Fri Nov 13 12:56:26 +0100 2009:
> >> Hi,
> >
> > Hi Brian,
> >
> >> I uploaded a new version of my multicolumn select patches to
> >> https://bugs.freedesktop.org/show_bug.cgi?id=3188 this morning, as you
> >> might've seen.
> >
> > Yes, it's great to know you are working on this again :-) thank you
> > very much.
> >
> >> This version uses a similar algorithm to ocropus to
> >> determine reading order, and tries to make the selection follow this
> >> reading order. It's looking fairly good now I think - for all but one
> >> of the documents I tested with it picked a reasonable order, and
> >> selection doesn't jump all over the place. Of course, I've only tested
> >> on the handful of docs that were in the bug reports so I might've made
> >> things worse elsewhere :(
> >
> > I've just tried it and I've found some issues, see self-explanatory
> > screenshots:
> >
> > http://people.freedesktop.org/~carlosgc/poppler-multi-column-issue1.png
> 
> The reading order algorithm thinks a block A is before block B if
> (rule 1) block A overlaps and is above block B; or (rule 2) block A is
> left of block B and there is no block C such that B is before C by
> rule 1, and C is before A by rule 1.
> 
> Here, 'Introduction' is to the left of the address and doesn't overlap
> it. Hence rule 2 applies and Introduction is seen as being before the
> address. In ocropus, this particular bug wouldn't happen because the
> lines are expanded left & right to fit the column they belong to (i.e.
> 'Introduction' would be expanded right), though bugs of this kind are
> still possible. A lot of the bugs I'm seeing are due to short
> paragraphs like this.
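
To make the two rules and the 'Introduction' case concrete, here is a
minimal sketch in Python. It is not the code from the patch or from
ocropus. The names Block, rule1, rule2 and reading_order are
hypothetical, "overlaps" is assumed to mean overlapping x-ranges, y is
assumed to grow downwards, and the block coordinates are invented for
illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    # Hypothetical block type: an axis-aligned box, y growing downwards.
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def x_overlap(a, b):
    # Assumption: "overlaps" means the x-ranges of the two blocks intersect.
    return a.x_min < b.x_max and b.x_min < a.x_max


def rule1(a, b):
    # Rule 1: a overlaps b horizontally and lies above it.
    return x_overlap(a, b) and a.y_max <= b.y_min


def rule2(a, b, blocks):
    # Rule 2: a is entirely left of b, and there is no block c with
    # b before c by rule 1 and c before a by rule 1.
    if a.x_max > b.x_min:
        return False
    return not any(
        rule1(b, c) and rule1(c, a)
        for c in blocks
        if c is not a and c is not b
    )


def before(a, b, blocks):
    return rule1(a, b) or rule2(a, b, blocks)


def reading_order(blocks):
    # Depth-first topological sort: emit every predecessor of a block
    # before the block itself. The visited set keeps it finite even if
    # the "before" relation happens to contain a cycle.
    visited = set()
    order = []

    def visit(b):
        if b in visited:
            return
        visited.add(b)
        for other in blocks:
            if other is not b and before(other, b, blocks):
                visit(other)
        order.append(b)

    for b in blocks:
        visit(b)
    return order


# Invented layout: address at top right, short heading below it on the
# left, full-width body text underneath both.
address = Block("address", 300, 50, 500, 100)
intro = Block("Introduction", 50, 120, 150, 140)
body = Block("body", 50, 150, 500, 400)

print([b.name for b in reading_order([address, intro, body])])
# ['Introduction', 'address', 'body']

# Same heading, widened to the column width as ocropus is described as doing.
intro_wide = Block("Introduction", 50, 120, 500, 140)
print([b.name for b in reading_order([address, intro_wide, body])])
# ['address', 'Introduction', 'body']
```

In the first layout the heading does not overlap the address, so only
rule 2 relates them and the heading is ordered first even though it sits
lower on the page. Once the heading is widened, rule 1 applies and the
address comes first.
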
> 
> > http://people.freedesktop.org/~carlosgc/poppler-multi-column-issue2.png
> 
> This one is down to the bullet points being allocated blocks of their
> own. I haven't touched the code that builds blocks yet.
> 
> > The line selection (triple-click) seems to be broken too.
> 
> Thanks, fixed it - it's working again in the latest round of patches.
> 
> >> I was wondering what I can do to get these patches into an acceptable
> >> state. There are some obvious issues still to iron out, e.g. RTL (see
> >> http://bugs.kde.org/show_bug.cgi?id=156380 ,
> >> http://bugs.kde.org/show_bug.cgi?id=184399) and handling blocks with
> >> non-zero rotation; also the new depth_first_visit method I added is in
> >> the wrong class - should probably be in TextBlock. I'll fix this up.
> >
> > Current behaviour has been broken for a long time; any improvement,
> > even if still a bit broken, is much appreciated.
> >
> >> But beyond that, these patches might be problematic because they
> >> remove the old selection behaviour. The new behaviour is much better
> >> for multicolumn documents, but is likely to be worse at selecting data
> >> out of tables, for example. Should the new selection mode introduce
> >> new API, so as not to change the current behaviour of Evince &
> >> Okular[1]?
> >
> > Having a new API would definitely make things easier, yes.
> 
> I'd need some hints. I'm already well beyond my comfort zone poking
> around with this stuff; I don't do C++. I can manage to fill in the
> blanks if I have an idea what the API you want is though.

How is the new selection behaviour worse? Is it because it treats text in
tables as columns?

Albert

> 
> -Baz
> 
> >> In older versions of acrobat, they had 'table select' and
> >> 'text select' modes, covering these two uses, but more recently table
> >> select has been dropped entirely. I suspect that they now just follow
> >> the tags in tagged pdf, with the fallback behaviour being something
> >> like what I've coded up here.
> >>
> >> Also, testing. At the moment, testing for me consists of opening a
> >> bunch of documents in Evince and selecting stuff randomly (I don't
> >> have Okular, but since they use the same API for text selection I
> >> presume the bug is the same).
> >
> > Well, Okular doesn't use TextOutputDev for selecting, but it does for
> > extracting the text, so it will be affected anyway.
> >
> >> I have no idea if I'm introducing
> >> regressions. Is there a plan to integrate the unit test framework that
> >> was discussed previously?
> >> http://lists.freedesktop.org/archives/poppler/2009-March/004535.html
> >> .
> >
> > Yes, but I didn't manage to get it working without crashing :-(
> >
> >> Or failing that, is there a pool somewhere of test documents for
> >> poppler/evince/okular?
> >
> > Yes, Albert has a regression test script, so he can run it with your
> > patches applied.
> >
> >> Particularly if someone has docs with rotated
> >> blocks, and an RTL doc to test; neither the RTL selection nor search
> >> bugs had docs attached; also vertical text I guess.
> >>
> >> Cheers,
> >> Baz
> >
> > We are closer to fixing it, keep up the good work!
> > --
> > Carlos Garcia Campos
> > PGP key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x523E6462
> >
> > _______________________________________________
> > poppler mailing list
> > poppler at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/poppler
> 


