[poppler] New selection algorithm

Lorenzo Gil lgs at yaco.es
Tue Sep 7 14:18:46 PDT 2010


On Tue, Sep 07, 2010 at 11:05:05AM -0700, Leonard Rosenthol wrote:
> I can tell you with 100% certainty that Acrobat/Reader do NOT use raw order - they use "reading order".   The algorithms haven't changed between 8 & 9.  
> 

Ok

> I also looked at your PDFs and in both cases, OO is writing the content streams in exactly the same way & order - top to bottom, left to right.  It doesn't write the first column and then the second in the "real-column" example.  Open up the PDF's content stream and look.  You'll see almost identical streams.

I don't know how to view the PDF's content stream. I tried the "Save as text" option in Acrobat/Reader and the output matches what you describe. Still, that doesn't explain why Acrobat/Reader does not select the right thing in the fake-columns example.

> 
> And Adobe Reader 9.3.4 is the current version for Linux - I just checked on Adobe.com.

I double-checked, and the problem was my browser's language preference. The latest Spanish version of Acrobat/Reader is 8.1.7. If you choose the English version, then you are right: the latest one is 9.3.4.

Best regards,

Lorenzo

> 
> 
> Leonard Rosenthol
> PDF Standards Architect
> Adobe Systems
> 
> -----Original Message-----
> From: Lorenzo Gil [mailto:lgs at yaco.es] 
> Sent: Tuesday, September 07, 2010 2:00 PM
> To: Leonard Rosenthol
> Cc: 'Albert Astals Cid'; poppler at lists.freedesktop.org
> Subject: Re: [poppler] New selection algorithm
> 
> On Mon, Sep 06, 2010 at 01:45:55PM -0700, Leonard Rosenthol wrote:
> > >I don't think raw order is acceptable.
> > >
> > Agreed - never use raw order since it means nothing.
> > 
> > You should use "reading order" (top->bottom, left->right (or RTL, depending)) as computed through geometric sorting - which is what the current code does, at least to some extent.
> > 
> > The difference with Acrobat/Reader is that we use additional heuristics to offer smarter selection semantics for columnar data, vertical text, and other such things.
> 
> I've created two PDF files (attached to this mail) with OpenOffice that look pretty much the same in terms of layout and structure. Acrobat/Reader behaves completely differently in terms of selection: in real-columns.pdf it selects the text by columns, but in fake-columns.pdf it selects the text by lines. In both cases Adobe Reader selects the text in the order that OpenOffice put it in the document stream (i.e. raw order). The fake-columns.pdf document was created using tabs and spaces to simulate a two-column layout instead of the columns feature of OpenOffice.
> 
> I'm using Adobe Reader 8.1.7 for Linux. Maybe the heuristics that you mention were added in Adobe Reader 9, but unfortunately that's not available for Linux.
> 
> Sorry to focus on Adobe Reader when this is the Poppler list, but I think we should treat Adobe Reader as the reference implementation for a PDF viewer.
> 
> Best regards,
> 
> Lorenzo
> 
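
For reference, the "reading order computed through geometric sorting" mentioned in the quoted thread can be sketched roughly as below. This is only an illustration under stated assumptions, not Poppler's or Acrobat's actual code: the Word type, the reading_order function and the line_tolerance parameter are hypothetical names, and y is assumed to grow downward from the top of the page.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical record for one piece of text on the page.
    text: str
    x: float  # left edge
    y: float  # baseline, assumed to grow downward from the page top


def reading_order(words, line_tolerance=2.0):
    """Group words into lines by baseline, then sort each line left to right."""
    lines = []
    for w in sorted(words, key=lambda w: w.y):
        # A word joins the current line if its baseline is close enough
        # to the baseline of the first word in that line.
        if lines and abs(lines[-1][0].y - w.y) <= line_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [w for line in lines for w in sorted(line, key=lambda w: w.x)]


# Raw order as a producer might write it: first column, then second column.
raw = [
    Word("A1", 10, 100), Word("A2", 10, 115),
    Word("B1", 200, 100), Word("B2", 200, 115),
]
ordered = [w.text for w in reading_order(raw)]
```

With this plain top-to-bottom, left-to-right sort, the two columns end up interleaved line by line, which corresponds to the "by lines" behaviour described for fake-columns.pdf. Selecting by columns would need additional heuristics on top of the sort, which this sketch does not attempt.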

