[poppler] Multicolumn select

Baz brian.ewins at gmail.com
Mon Nov 16 04:11:17 PST 2009


2009/11/16 Baz <brian.ewins at gmail.com>:
> 2009/11/15 Albert Astals Cid <aacid at kde.org>:
>> On Sunday, 15 November 2009, Carlos Garcia Campos wrote:
>>> Excerpts from Baz's message of Fri Nov 13 12:56:26 +0100 2009:
>>> > Hi,
>>>
>>> Hi Brian,
>>>
>>> > I uploaded a new version of my multicolumn select patches to
>>> > https://bugs.freedesktop.org/show_bug.cgi?id=3188 this morning, as you
>>> > might've seen.
>>>
>>> Yes, it's great to know you are working on this again :-) thank you
>>> very much.
>>>
>>> > This version uses a similar algorithm to ocropus to
>>> > determine reading order, and tries to make the selection follow this
>>> > reading order. It's looking fairly good now I think - for all but one
>>> > of the documents I tested with, it picked a reasonable order, and
>>> > selection doesn't jump all over the place. Of course, I've only tested
>>> > on the handful of docs that were in the bug reports so I might've made
>>> > things worse elsewhere :(
>>>
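
As a rough illustration of that kind of reading-order pass (a sketch of the
general technique, not the code in the patch): order the blocks pairwise,
then emit them with a depth-first visit so that every block follows the
blocks that must precede it. The names Block, comes_before and
reading_order are hypothetical, the two ordering rules are an assumption
about how an ocropus-style layout pass works, and the sketch ignores
rotation and RTL entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    """Hypothetical text block: axis-aligned box, y grows downward."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float


def x_overlap(a: Block, b: Block) -> bool:
    """True if the horizontal extents of a and b intersect."""
    return a.x0 < b.x1 and b.x0 < a.x1


def comes_before(a: Block, b: Block, blocks: List[Block]) -> bool:
    """Assumed pairwise ordering rules (LTR text only)."""
    # Rule 1: blocks in the same column are read top to bottom.
    if x_overlap(a, b):
        return a.y0 < b.y0
    # Rule 2: a block entirely to the left is read first, unless some
    # block lying vertically between the two spans both columns.
    if a.x1 <= b.x0:
        lo, hi = sorted((a.y0, b.y0))
        for c in blocks:
            if c is a or c is b:
                continue
            if lo < c.y0 < hi and x_overlap(a, c) and x_overlap(b, c):
                return False
        return True
    return False


def reading_order(blocks: List[Block]) -> List[Block]:
    """Topological sort: emit each block after all of its predecessors."""
    order: List[Block] = []
    visited = set()

    def visit(b: Block) -> None:
        if b in visited:
            return
        visited.add(b)
        for other in blocks:
            if other is not b and comes_before(other, b, blocks):
                visit(other)
        order.append(b)

    for b in blocks:
        visit(b)
    return order
```

With a full-width title above two columns, this should put the title first,
then the whole left column, then the right column, regardless of the order
the blocks arrive in.
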
>>> I've just tried it and I've found some issues, see self-explanatory
>>> screenshots:
>>>
>>> http://people.freedesktop.org/~carlosgc/poppler-multi-column-issue1.png
>>> http://people.freedesktop.org/~carlosgc/poppler-multi-column-issue2.png
>>>
>>> The line selection (triple-click) seems to be broken too.
>>>
>>> > I was wondering what I can do to get these patches into an acceptable
>>> > state. There's some obvious issues still to iron out, eg RTL (see
>>> > http://bugs.kde.org/show_bug.cgi?id=156380 ,
>>> > http://bugs.kde.org/show_bug.cgi?id=184399) and handling blocks with
>>> > non-zero rotation; also the new depth_first_visit method I added is in
>>> > the wrong class - should probably be in TextBlock. I'll fix this up.
>>>
>>> Current behaviour has been broken for a long time, so any improvement,
>>> even if still a bit broken, is very much appreciated.
>>>
>>> > But beyond that, these patches might be problematic because they
>>> > remove the old selection behaviour. The new behaviour is much better
>>> > for multicolumn documents, but is likely to be worse at selecting data
>>> > out of tables, for example. Should the new selection mode introduce
>>> > new API, so as not to change the current behaviour of Evince &
>>> > Okular[1]?
>>>
>>> Having a new API would definitely make things easier, yes.
>>>
>>> > In older versions of acrobat, they had 'table select' and
>>> > 'text select' modes, covering these two uses, but more recently table
>>> > select has been dropped entirely. I suspect that they now just follow
>>> > the tags in tagged pdf, with the fallback behaviour being something
>>> > like what I've coded up here.
>>> >
>>> > Also, testing. At the moment, testing for me consists of opening a
>>> > bunch of documents in Evince and selecting stuff randomly (I don't
>>> > have Okular, but since they use the same API for text selection I
>>> > presume the bug is the same).
>>>
>>> Well, Okular doesn't use TextOutputDev for selecting, but it does for
>>> extracting the text, so it will be affected anyway.
>>>
>>> > I have no idea if I'm introducing
>>> > regressions. Is there a plan to integrate the unit test framework that
>>> > was discussed previously?
>>> > http://lists.freedesktop.org/archives/poppler/2009-March/004535.html
>>> > .
>>>
>>> Yes, but I didn't manage to get it working without crashing :-(
>>>
>>> > Or failing that, is there a pool somewhere of test documents for
>>> > poppler/evince/okular?
>>>
>>> Yes, Albert has a regression test script, so he can run it with your
>>> patches applied.
>>
>> Is it enough if I run pdftotext and compare its output?
>
> Probably not. The bug was all about selecting regions from the page.
> I've also not paid any attention to whether this code (visitSelection,
> etc) is exercised by pdftotext or whether that just loops through the
> blocks. However, I guess it's worth checking if I've introduced a
> regression there. If you're hitting the new code, it's definitely got
> different output for multicolumn since I don't preserve layout.

I've checked now... yes, pdftotext with no flags will hit the new
reading order code.
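
A minimal sketch of how such a comparison could be scripted, assuming two
pdftotext builds (one unpatched, one patched) are available; the binary
paths passed to extract_text are hypothetical, and the function names are
made up for illustration. It relies only on "pdftotext file.pdf -" writing
the extracted text to stdout, with no other flags.

```python
import difflib
import subprocess
from typing import List


def extract_text(pdftotext: str, pdf_path: str) -> str:
    """Run the given pdftotext binary with no flags; "-" sends text to stdout."""
    result = subprocess.run(
        [pdftotext, pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def reading_order_diff(before: str, after: str) -> List[str]:
    """Return a unified diff of two extractions; an empty list means no change."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="unpatched",
            tofile="patched",
            lineterm="",
        )
    )
```

For a multicolumn document a non-empty diff is expected rather than a sign
of a regression, so the diffs would still need to be read by eye.
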

>
>>
>> Should I do it now or wait for a patch that fixes the issues you've pointed
>> out?
>
> The code now is fine to test with. All that's left to do is RTL.
>
> Cheers,
> Baz
>
>>
>> Albert
>>
>>>
>>> > Particularly if someone has docs with rotated
>>> > blocks, and an RTL doc to test; neither the RTL selection nor search
>>> > bugs had docs attached; also vertical text I guess.
>>> >
>>> > Cheers,
>>> > Baz
>>>
>>> We are closer to fixing it, keep up the good work!
>>>
>>
>

