[poppler] New selection algorithm

Leonard Rosenthol lrosenth at adobe.com
Tue Sep 7 11:05:05 PDT 2010


I can tell you with 100% certainty that Acrobat/Reader do NOT use raw order - they use "reading order". The algorithms haven't changed between versions 8 and 9.

I also looked at your PDFs and in both cases, OO is writing the content streams in exactly the same way & order - top to bottom, left to right.  It doesn't write the first column and then the second in the "real-column" example.  Open up the PDF's content stream and look.  You'll see almost identical streams.
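
To make the raw-order versus reading-order distinction concrete, here is a minimal Python sketch. It is not Acrobat's or Poppler's actual code: the `Run` type, the coordinates, and the `line_tolerance` value are all hypothetical, and y is assumed to grow downward from the top of the page. It only illustrates plain geometric sorting (top to bottom, then left to right).

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A hypothetical text run: its string plus the position of its left baseline."""
    text: str
    x: float  # left edge
    y: float  # baseline, measured downward from the top of the page (assumption)


def raw_order(runs):
    """Return text in the order the runs appear in the content stream."""
    return [r.text for r in runs]


def reading_order(runs, line_tolerance=2.0):
    """Return text sorted geometrically: lines top to bottom, runs left to right.

    Runs whose baselines lie within line_tolerance of the first run of the
    current line are treated as belonging to that line.
    """
    lines = []
    for run in sorted(runs, key=lambda r: (r.y, r.x)):
        if lines and abs(lines[-1][0].y - run.y) <= line_tolerance:
            lines[-1].append(run)
        else:
            lines.append([run])
    return [r.text for line in lines for r in sorted(line, key=lambda r: r.x)]


# A stream whose first two runs are written right-to-left on the same line.
runs = [
    Run("B1", 300, 100.0),
    Run("A1", 50, 100.5),
    Run("A2", 50, 120.0),
    Run("B2", 300, 120.0),
]

print(raw_order(runs))      # stream order
print(reading_order(runs))  # geometric order
```

Note that a sort like this reads straight across the page, line by line. Selecting column by column would need further heuristics on top of it.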

And Adobe Reader 9.3.4 is the current version for Linux - I just checked on Adobe.com.


Leonard Rosenthol
PDF Standards Architect
Adobe Systems

-----Original Message-----
From: Lorenzo Gil [mailto:lgs at yaco.es] 
Sent: Tuesday, September 07, 2010 2:00 PM
To: Leonard Rosenthol
Cc: 'Albert Astals Cid'; poppler at lists.freedesktop.org
Subject: Re: [poppler] New selection algorithm

On Mon, Sep 06, 2010 at 01:45:55PM -0700, Leonard Rosenthol wrote:
> >I don't think raw order is acceptable.
> >
> Agreed - never use raw order since it means nothing.
> 
> You should use "reading order" (top->bottom, left->right (or RTL, depending)) as computed through geometric sorting - which is what the current code does, at least to some extent.
> 
> The difference with Acrobat/Reader is that we use additional heuristics to offer smarter selection semantics for columnar data, vertical text, and other such things.

I've created two PDF files (attached to this mail) with OpenOffice that look pretty much the same in terms of layout and structure. Acrobat/Reader behaves completely differently in terms of selection: in real-columns.pdf it selects the text by columns, but in fake-columns.pdf it selects the text by lines. In both cases Adobe Reader selects the text in the order that OpenOffice put it in the document stream (i.e. raw order). The fake-columns.pdf document was created using tabs and spaces to simulate a two-column layout instead of the columns feature of OpenOffice.

I'm using Adobe Reader 8.1.7 for Linux. Maybe the heuristics that you mention were added in Adobe Reader 9, but unfortunately that version is not available for Linux.

Sorry to focus on Adobe Reader when this is the Poppler list, but I think we should see Adobe Reader as the reference implementation for a PDF viewer.

Best regards,

Lorenzo

