[poppler] New selection algorithm

Albert Astals Cid aacid at kde.org
Mon Sep 6 12:55:01 PDT 2010


On Monday, 6 September 2010, you wrote:
> On Mon, Sep 06, 2010 at 08:30:14PM +0100, Albert Astals Cid wrote:
> > On Monday, 6 September 2010, Daniel Garcia Moreno wrote:
> > > Poppler does not select tables in "order". It detects tables as
> > > columns, because poppler uses the distance between pieces of text to
> > > decide what is a column, so tables are selected in column order when
> > > the logical way is by rows.
> > > 
> > > Another selection problem caused by that heuristic shows up when you
> > > have a PDF with columns that are close together or text with spaces.
> > > 
> > > I looked at acroread to see how it handles column and table selection
> > > and I realized that it selects text in "order", I mean, in the order
> > > in which it was put into the PDF file. To check that, I created a text
> > > PDF file with Inkscape.
> > > 
> > > So the selection logic is simple: we select the nearest word to the
> > > first selection point and the nearest word to the last selection point,
> > > and every word between those two words (in text order, no matter where
> > > the words are on screen) is selected too.
> > 
> > What is "text order"?
> 
> I think Dani means raw order, or the order in which the PDF creation tool
> put the text into the PDF file. For example, when authoring tables with
> OpenOffice, it generates the text in row order. When using a vector
> drawing tool like Inkscape the order matches the order the user created
> the objects. We believe selection should respect this order since it does
> the right thing in most cases and the algorithm is much simpler to
> understand and maintain.
> 
> As Leonard mentioned, we should also take PDF structure/tagging into
> account, but I don't think we should develop heuristics for guessing how
> the text should be selected (e.g., trying to see columns where the user
> did some ASCII art).
> 
> By the way, I'm Dani's coworker and I helped him with the algorithm design.

I don't think raw order is acceptable.

It might work with your files since the PDF creator put them in a nice raw
order, but raw order is raw and nothing guarantees it will be in a logical
order.
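
To make the objection concrete, here is a minimal sketch of the proposed
raw-order selection. It is not poppler's code: the Word struct and the
functions nearestWord, selectRawOrder, rowOrderTable and columnOrderTable are
hypothetical names invented for illustration, and the word list is assumed to
be stored in the order the words appear in the file. The same 2x2 table is
emitted once row by row and once column by column:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical word record for illustration; not poppler's TextWord class.
struct Word {
    std::string text;
    double x, y; // centre of the word on the page
};

// Index of the word whose centre is closest to the point (px, py).
std::size_t nearestWord(const std::vector<Word> &words, double px, double py)
{
    assert(!words.empty());
    std::size_t best = 0;
    double bestDist = std::hypot(words[0].x - px, words[0].y - py);
    for (std::size_t i = 1; i < words.size(); ++i) {
        const double d = std::hypot(words[i].x - px, words[i].y - py);
        if (d < bestDist) {
            bestDist = d;
            best = i;
        }
    }
    return best;
}

// Raw-order selection as described in the proposal: take the words nearest
// to the two selection points and everything stored between them, in the
// order the words are stored, regardless of where they sit on the page.
std::string selectRawOrder(const std::vector<Word> &words,
                           double x1, double y1, double x2, double y2)
{
    std::size_t a = nearestWord(words, x1, y1);
    std::size_t b = nearestWord(words, x2, y2);
    if (a > b)
        std::swap(a, b);
    std::string out;
    for (std::size_t i = a; i <= b; ++i) {
        if (!out.empty())
            out += ' ';
        out += words[i].text;
    }
    return out;
}

// A 2x2 table whose cells were written to the file row by row.
std::vector<Word> rowOrderTable()
{
    return { {"A1", 0, 0}, {"B1", 100, 0}, {"A2", 0, 20}, {"B2", 100, 20} };
}

// The same table, identical on screen, written column by column.
std::vector<Word> columnOrderTable()
{
    return { {"A1", 0, 0}, {"A2", 0, 20}, {"B1", 100, 0}, {"B2", 100, 20} };
}
```

Dragging across the first row, from (0, 0) to (100, 0), gives "A1 B1" for the
row-ordered file but "A1 A2 B1" for the column-ordered one, although both
files look the same on screen.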

Albert

> 
> Best regards,
> 
> Lorenzo Gil

