[poppler] New selection algorithm

Leonard Rosenthol lrosenth at adobe.com
Mon Sep 6 12:18:19 PDT 2010


I won't comment on the patch itself, but I will make two comments.

1) Your assumptions about how Acrobat/Reader work are incorrect.
2) You should consider taking PDF structure/tagging into account when present.

Leonard

-----Original Message-----
From: poppler-bounces+leonardr=adobe.com at lists.freedesktop.org [mailto:poppler-bounces+leonardr=adobe.com at lists.freedesktop.org] On Behalf Of Daniel Garcia Moreno
Sent: Monday, September 06, 2010 4:31 AM
To: poppler at lists.freedesktop.org
Subject: [poppler] New selection algorithm

Poppler does not select tables in "order". It uses the distance between pieces of text to decide what is a column, so it detects tables as columns and selects them in column order, when the logical order is by rows.

Another selection problem caused by that heuristic appears when a PDF has columns that are close together or text with wide spaces.

I looked at acroread to see how it handles column and table selection, and I realized that it selects text in "order", that is, in the order in which the text was written into the PDF file. To check this I created a text PDF file with Inkscape.

So the selection logic is simple: we select the word nearest to the first selection point and the word nearest to the last selection point, and every word between those two words (in text order, no matter where the words are on screen) is selected too.
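The logic described above can be sketched roughly as follows. This is only a minimal illustration in Python, not the code from the patch: the `Word` class, its bounding-box fields, and the function names are all hypothetical, and it assumes the word list is already in the order the text appears in the PDF file.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical word record: text plus its bounding box on the page.
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def distance_to_word(word, x, y):
    # Distance from the point to the word's bounding box (0 if inside it).
    dx = max(word.x_min - x, 0.0, x - word.x_max)
    dy = max(word.y_min - y, 0.0, y - word.y_max)
    return (dx * dx + dy * dy) ** 0.5


def nearest_word_index(words, x, y):
    # Index of the word closest to the point; ties go to the earliest word.
    return min(range(len(words)),
               key=lambda i: distance_to_word(words[i], x, y))


def select_words(words, start_point, end_point):
    # Select every word between the two nearest words, in list (text) order,
    # regardless of where the words sit on screen.
    if not words:
        return []
    first = nearest_word_index(words, *start_point)
    last = nearest_word_index(words, *end_point)
    if first > last:
        first, last = last, first
    return words[first:last + 1]


# A 2x2 table whose cells were written to the file row by row.
table = [
    Word("A1", 0, 0, 10, 10), Word("B1", 50, 0, 60, 10),
    Word("A2", 0, 20, 10, 30), Word("B2", 50, 20, 60, 30),
]
selected = select_words(table, (5, 5), (5, 25))
print([w.text for w in selected])
```

Dragging from the first cell down to the cell below it picks up the whole first row on the way, because the selection follows list order rather than screen position.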

I have implemented that logic [1] and it seems to work better than the current one. I made a video to show the new logic in action [2].

To implement this I use the TextWordList in TextPage, and to get that list in the right order I create the TextOutputDev with rawOrder for selection. I have changed that only in the glib frontend, so other frontends may not work correctly.

So the main implementation problem is finding the first and the last index in the word list that define the selection, and that is an easy algorithm. For RTL documents I reverse the word list line by line and adjust the word selection indexes, so the algorithm works with RTL too.
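One possible reading of the RTL step, sketched in Python. The message does not spell out the details, so this is an assumption rather than the patch's actual code: it supposes the words are already grouped into lines, and the names `reverse_words_by_line` and `remap_index` are hypothetical.

```python
def reverse_words_by_line(lines):
    # Assumption: `lines` is a list of lines, each a list of words.
    # Reverse the word order inside every line, keep the line order,
    # and return one flat word list.
    return [word for line in lines for word in reversed(line)]


def remap_index(lines, index):
    # Map an index into the original flat list to the index of the
    # same word in the list returned by reverse_words_by_line().
    offset = 0
    for line in lines:
        if index < offset + len(line):
            position_in_line = index - offset
            return offset + (len(line) - 1 - position_in_line)
        offset += len(line)
    raise IndexError("word index out of range")


lines = [["a", "b", "c"], ["d", "e"]]
flat = reverse_words_by_line(lines)
print(flat, remap_index(lines, 0))
```

With the list reversed per line and the two selection indexes remapped, the same "everything between the first and last index" rule can be applied unchanged.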

So, what do you think about this new selection algorithm? It seems to work better than the current one, and it's simpler, but I don't know whether I have forgotten something about selection, or maybe performance...

I attach the patch. It's divided into two commits, and the commit messages may not be *correct*.

[1] http://github.com/danigm/poppler/commits/selection
[2] http://www.youtube.com/watch?v=9bRH1yLCs4o

