[poppler] New selection algorithm

Lorenzo Gil lgs at yaco.es
Tue Sep 7 11:00:26 PDT 2010


On Mon, Sep 06, 2010 at 01:45:55PM -0700, Leonard Rosenthol wrote:
> >I don't think raw order is acceptable.
> >
> Agreed - never use raw order since it means nothing.
> 
> You should either use "reading order" (top->bottom, left->right (or RTL, depending)) as computed through geometric sorting - which is what the current code does, at least to some extent.
> 
> The difference with Acrobat/Reader is that we use additional heuristics to offer smarter selection semantics for columnar data, vertical text, and other such things.

I've created two PDF files (attached to this mail) with OpenOffice that look pretty much the same in terms of layout and structure. Acrobat/Reader behaves completely differently when selecting text in them: in real-columns.pdf it selects the text by columns, but in fake-columns.pdf it selects the text by lines. In both cases Adobe Reader selects the text in the order in which OpenOffice put it in the document stream (i.e. raw order). The fake-columns.pdf document was created using tabs and spaces to simulate a two-column layout instead of the columns feature of OpenOffice.
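
To make the difference concrete, here is a minimal Python sketch. It is not Poppler or Adobe code: the Word type, the raw_order and geometric_order functions, the line_tolerance parameter, and the coordinates are all made up for illustration. It assumes a page whose words were written to the stream column by column, as I believe happens in real-columns.pdf, and compares raw order with a simple top-to-bottom, left-to-right geometric sort.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # left edge of the word
    y: float  # baseline, measured downward from the top of the page


def raw_order(words):
    """Return the text in the order the words appear in the stream."""
    return [w.text for w in words]


def geometric_order(words, line_tolerance=2.0):
    """Return the text top-to-bottom, then left-to-right.

    Words whose baseline is within line_tolerance of the first word
    of the current line are treated as part of that line.
    """
    lines = []
    for w in sorted(words, key=lambda w: w.y):
        if lines and abs(w.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [w.text for line in lines for w in sorted(line, key=lambda w: w.x)]


# Hypothetical two-column page, emitted column by column.
words = [
    Word("A1", 50, 100), Word("A2", 50, 120),    # left column
    Word("B1", 300, 100), Word("B2", 300, 120),  # right column
]

print(raw_order(words))        # follows the columns
print(geometric_order(words))  # follows the visual lines
```

With this input, raw order keeps each column together, while the naive geometric sort interleaves the two columns line by line. A real implementation would need column detection on top of the sort to avoid that.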

I'm using Adobe Reader 8.1.7 for Linux. Maybe the heuristics that you mention were added in Adobe Reader 9, but unfortunately that version is not available for Linux.

Sorry to focus on Adobe Reader when this is the Poppler list, but I think we should treat Adobe Reader as the reference implementation for a PDF viewer.

Best regards,

Lorenzo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: real-columns.pdf
Type: application/pdf
Size: 28627 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20100907/90cdc574/attachment-0002.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fake-columns.pdf
Type: application/pdf
Size: 28869 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20100907/90cdc574/attachment-0003.pdf>

