[poppler] Multicolumn select

Albert Astals Cid aacid at kde.org
Sun Nov 15 14:38:10 PST 2009


On Sunday, 15 November 2009, Carlos Garcia Campos wrote:
> Excerpts from Baz's message of Fri Nov 13 12:56:26 +0100 2009:
> > Hi,
> 
> Hi Brian,
> 
> > I uploaded a new version of my multicolumn select patches to
> > https://bugs.freedesktop.org/show_bug.cgi?id=3188 this morning, as you
> > might've seen.
> 
> Yes, it's great to know you are working on this again :-) thank you
> very much.
> 
> > This version uses a similar algorithm to ocropus to
> > determine reading order, and tries to make the selection follow this
> > reading order. It's looking fairly good now, I think: for all but one
> > of the documents I tested with, it picked a reasonable order, and
> > selection doesn't jump all over the place. Of course, I've only tested
> > on the handful of docs that were in the bug reports so I might've made
> > things worse elsewhere :(
> 
> I've just tried it and I've found some issues, see self-explanatory
> screenshots:
> 
> http://people.freedesktop.org/~carlosgc/poppler-multi-column-issue1.png
> http://people.freedesktop.org/~carlosgc/poppler-multi-column-issue2.png
> 
> The line selection (triple-click) seems to be broken too.
> 
> > I was wondering what I can do to get these patches into an acceptable
> > state. There are some obvious issues still to iron out, e.g. RTL (see
> > http://bugs.kde.org/show_bug.cgi?id=156380 ,
> > http://bugs.kde.org/show_bug.cgi?id=184399) and handling blocks with
> > non-zero rotation; also the new depth_first_visit method I added is in
> > the wrong class - should probably be in TextBlock. I'll fix this up.
> 
> Current behaviour has been broken for a long time, so any improvement,
> even if still a bit broken, is much appreciated.
> 
> > But beyond that, these patches might be problematic because they
> > remove the old selection behaviour. The new behaviour is much better
> > for multicolumn documents, but is likely to be worse at selecting data
> > out of tables, for example. Should the new selection mode introduce
> > new API, so as not to change the current behaviour of Evince &
> > Okular[1]?
> 
> Having a new API would definitely make things easier, yes.
> 
> > In older versions of Acrobat, they had 'table select' and
> > 'text select' modes, covering these two uses, but more recently table
> > select has been dropped entirely. I suspect that they now just follow
> > the tags in tagged pdf, with the fallback behaviour being something
> > like what I've coded up here.
> >
> > Also, testing. At the moment, testing for me consists of opening a
> > bunch of documents in Evince and selecting stuff randomly (I don't
> > have Okular, but since they use the same API for text selection I
> > presume the bug is the same).
> 
> Well, Okular doesn't use TextOutputDev for selecting, but it does for
> extracting the text, so it will be affected anyway.
> 
> > I have no idea if I'm introducing
> > regressions. Is there a plan to integrate the unit test framework that
> > was discussed previously?
> > http://lists.freedesktop.org/archives/poppler/2009-March/004535.html
> > .
> 
> Yes, but I didn't manage to get it working without crashing :-(
> 
> > Or failing that, is there a pool somewhere of test documents for
> > poppler/evince/okular?
> 
> Yes, Albert has a regression test script, so he can run it with your
> patches applied.

Is it enough if I run pdftotext and compare its output?

Should I do it now or wait for a patch that fixes the issues you've pointed
out?
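
A minimal sketch of such a comparison, assuming two pdftotext builds are
available (one without and one with the patches). The function names and the
idea of passing the two binaries as paths are illustrative assumptions, not
the existing regression test script:

```python
import difflib
import subprocess


def extract_text(pdftotext_bin, pdf_path):
    """Run a pdftotext binary on one PDF and return the text it prints.

    The trailing "-" asks pdftotext to write to stdout instead of a file.
    """
    result = subprocess.run(
        [pdftotext_bin, pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def text_diff(before, after, name="document"):
    """Return unified-diff lines between two extracted texts (empty if equal)."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=name + " (baseline)",
            tofile=name + " (patched)",
            lineterm="",
        )
    )


def compare_builds(baseline_bin, patched_bin, pdf_paths):
    """Map each PDF whose extracted text changed to its diff lines."""
    changed = {}
    for path in pdf_paths:
        diff = text_diff(
            extract_text(baseline_bin, path),
            extract_text(patched_bin, path),
            name=path,
        )
        if diff:
            changed[path] = diff
    return changed
```

Calling compare_builds() with the two binary paths (hypothetical here) and a
list of test PDFs would then report only the documents whose extracted text
changed. This would only catch changes in the extracted text, not in
interactive selection such as the triple-click line selection mentioned above.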

Albert

> 
> > Particularly if someone has docs with rotated
> > blocks, and an RTL doc to test; neither the RTL selection nor search
> > bugs had docs attached; also vertical text I guess.
> >
> > Cheers,
> > Baz
> 
> We are closer to fixing it, keep up the good work!
> 