[poppler] New selection algorithm

Lorenzo Gil lgs at yaco.es
Mon Sep 6 12:42:56 PDT 2010


On Mon, Sep 06, 2010 at 08:30:14PM +0100, Albert Astals Cid wrote:
> On Monday, 6 September 2010, Daniel Garcia Moreno wrote:
> > Poppler does not select tables in "order". It detects tables as
> > columns, because it uses the distance between pieces of text to decide
> > what is a column, so tables are selected in column order when the
> > logical way is by rows.
> > 
> > Another selection problem caused by that heuristic appears when you
> > have a pdf with columns that are close together or text with spaces.
> > 
> > I looked at acroread to see how it handles column and table selection
> > and I realized that it selects text in "order", I mean, in the order in
> > which you put it in the pdf file. To check that I created a text pdf
> > file with inkscape.
> > 
> > So the selection logic is simple: we select the nearest word to the
> > first selection point and the nearest word to the last selection point,
> > and every word between those two words (in text order, no matter where
> > the words are on screen) is selected too.
> 
> What is "text order"?

I think Dani means raw order, that is, the order in which the PDF creation tool
put the text into the PDF file. For example, when authoring tables with
OpenOffice, it generates the text in row order. When using a vector drawing
tool like Inkscape, the order matches the order in which the user created the
objects. We believe selection should respect this order, since it does the
right thing in most cases and the algorithm is much simpler to understand and
to maintain.
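
To make the idea concrete, here is a minimal sketch in Python. It is not
Poppler code: the Word type and the nearest_index and select functions are
hypothetical names, and it assumes "nearest" means the distance from the
selection point to the centre of a word. The words list stands for the text in
raw order, as read from the PDF file.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Word:
    text: str
    x: float  # centre of the word on the page (assumed representation)
    y: float


def nearest_index(words, px, py):
    """Return the index of the word whose centre is closest to (px, py)."""
    return min(
        range(len(words)),
        key=lambda i: hypot(words[i].x - px, words[i].y - py),
    )


def select(words, start, end):
    """Return every word between the two selection points, in raw order.

    The position of the words on screen is only used to find the two
    endpoints; everything in between is taken from the raw order.
    """
    a = nearest_index(words, *start)
    b = nearest_index(words, *end)
    lo, hi = sorted((a, b))  # dragging backwards selects the same range
    return words[lo:hi + 1]


# A 2x2 table emitted in row order, as in the OpenOffice example.
table = [
    Word("A1", 0, 0), Word("B1", 100, 0),
    Word("A2", 0, 20), Word("B2", 100, 20),
]
selected = [w.text for w in select(table, (90, 2), (5, 18))]
```

Dragging from near B1 to near A2 picks the words that lie between them in the
file, regardless of which column they are drawn in.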

As Leonard mentioned, we should also take PDF structure/tagging into account,
but I don't think we should elaborate heuristics for guessing how the text
should be selected (e.g., trying to see columns where the user did some ASCII
art).

By the way, I'm Dani's coworker and I helped him with the algorithm design.

Best regards,

Lorenzo Gil

