[poppler] New selection algorithm

Daniel Garcia Moreno danigm at yaco.es
Mon Sep 6 01:31:01 PDT 2010


Poppler does not select tables in the "logical" order. Because it uses
the distance between pieces of text to decide what is a column, it
detects tables as columns, so a table is selected column by column when
the logical way is row by row.

Another selection problem caused by that heuristic shows up when a PDF
has columns that are close together, or text with spaces.

I looked at acroread to see how it selects columns and tables, and I
realized that it selects text in "order", I mean, in the order in which
the text is put into the PDF file. To check that, I created a text PDF
file with Inkscape.

So the selection logic is simple: we select the word nearest to the
first selection point and the word nearest to the last selection point,
and every word between those two words (in text order, no matter where
the words are on screen) is selected too.
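
A minimal sketch of that logic in Python, not the code from the patch.
The Word class, the function names and the sample table are hypothetical;
it assumes a non-empty word list already in text order, and that dragging
backwards selects the same words as dragging forwards.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Word:
    # Hypothetical stand-in for a word and its bounding box.
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def distance(word, x, y):
    """Distance from point (x, y) to the word's bounding box (0 if inside)."""
    dx = max(word.x0 - x, 0.0, x - word.x1)
    dy = max(word.y0 - y, 0.0, y - word.y1)
    return hypot(dx, dy)


def nearest_index(words, x, y):
    """Index of the word closest to (x, y); ties go to the earlier word."""
    return min(range(len(words)), key=lambda i: distance(words[i], x, y))


def select(words, start, end):
    """Words from the one nearest `start` to the one nearest `end`, in text order."""
    first = nearest_index(words, *start)
    last = nearest_index(words, *end)
    if first > last:  # assumption: a backwards drag selects the same range
        first, last = last, first
    return words[first:last + 1]


# A 2x2 table whose words were written to the file row by row.
TABLE = [Word("A1", 0, 0, 10, 10), Word("B1", 50, 0, 60, 10),
         Word("A2", 0, 20, 10, 30), Word("B2", 50, 20, 60, 30)]
```

With this table, select(TABLE, (55, 5), (5, 25)) returns the words B1
and A2, following the rows, even though A2 is to the left of B1 on
screen.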

I have implemented that logic [1] and it seems to work better than the
current one. I made a video showing the new logic in action [2].

To implement that I use the TextWordList in TextPage, and to get that
list well ordered I create the TextOutputDev with rawOrder for
selection. I have changed that only in the glib frontend, so other
frontends may not work correctly.

So the big implementation problem is to find the first and the last
index in the word list that define the selection, and that is an easy
algorithm. For RTL documents I reverse the word list line by line and
change the word selection indexes, so the algorithm works with RTL too.
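
One way to read the RTL step, as a rough Python sketch and not the code
from the patch. It assumes the words are already grouped into lines (a
list of lists); reverse_lines and remap_index are hypothetical names.

```python
def reverse_lines(lines):
    """Flatten `lines` into one word list, reversing the words of each line."""
    return [word for line in lines for word in reversed(line)]


def remap_index(lines, index):
    """Map an index in the original flat list to the reversed-by-line list."""
    start = 0  # flat index of the first word of the current line
    for line in lines:
        if index < start + len(line):
            offset = index - start
            # Mirror the position inside its own line.
            return start + (len(line) - 1 - offset)
        start += len(line)
    raise IndexError(index)
```

For lines [["a", "b", "c"], ["d", "e"]] the reversed list is
["c", "b", "a", "e", "d"], and index 3 (the word "d") maps to index 4,
which is where "d" sits after the reversal.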

So, what do you think about this new selection algorithm? It seems to
work better than the current one, and it's simpler, but I don't know
whether I have forgotten something about selection, or maybe
performance...

I attach the patch; it's divided into two commits, and maybe the commit
messages aren't *correct*.

[1] http://github.com/danigm/poppler/commits/selection
[2] http://www.youtube.com/watch?v=9bRH1yLCs4o

