[poppler] improved ebook pdf handling

Mon Oct 5 00:51:53 PDT 2009

On Monday 05 October 2009 16:05:54 Randall Puljek-Shank wrote:
> I'd like to improve the pdftohtml handling of ebooks.  Here are the goals
> that I have:
> 1. Recognize table of contents and convert to links
> 2. Remove running headers and page numbers from the resulting text
> 3. Recognize columns

Page layout analysis is on my list of things I'd like to get to one day. Given 
the length of the list, it won't be in the next year.

I did do a bit of poking around with algorithms to do this, and OCRopus 
appeared to have some reasonably good code that we might be able to reuse 
given some process separation (since it is incompatibly licensed). If reuse 
doesn't work, there are still a lot of good algorithms linked from the 
OCRopus site that might be suitable. I suggest you start with 
http://code.google.com/p/ocropus/
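
To make the process-separation idea a bit more concrete, here is a rough
sketch in Python (not poppler code, and not a real OCRopus interface). It
assumes the analyzer is a standalone program that reads a page image on
stdin and writes a JSON list of regions on stdout. The function name, the
JSON shape and the stand-in analyzer command are all made up for
illustration.

```python
import json
import subprocess
import sys


def analyze_layout(command, page_image_bytes, timeout=60):
    """Run an external layout analyzer as a separate process.

    `command` is the argv list for the analyzer (hypothetical; no real
    OCRopus command line is assumed). The page image goes in on stdin and
    a JSON list of regions is expected back on stdout.
    """
    result = subprocess.run(
        command,
        input=page_image_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout.decode("utf-8"))


# Stand-in analyzer so the sketch runs without any external tool installed:
# it reports one region whose "bytes" field is the size of the input it got.
FAKE_ANALYZER = [
    sys.executable,
    "-c",
    "import sys, json; data = sys.stdin.buffer.read(); "
    "print(json.dumps([{'type': 'column', 'bytes': len(data)}]))",
]

if __name__ == "__main__":
    print(analyze_layout(FAKE_ANALYZER, b"\x00" * 16))
```

The only coupling between the two programs here is the data passed over
the pipe, which is the kind of separation I had in mind. Whether that is
sufficient for a given license is a separate question.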

Okular, at least, has a page layout wishlist item (to allow smarter selection 
when copying and pasting a region), so maybe there is a way to provide this 
as a general facility to other users.

Brad