[poppler] improved ebook pdf handling
Brad Hards
bradh at frogmouth.net
Mon Oct 5 00:51:53 PDT 2009
On Monday 05 October 2009 16:05:54 Randall Puljek-Shank wrote:
> I'd like to improve the pdftohtml handling of ebooks. Here are the goals
> that I have:
> 1. Recognize table of contents and convert to links
> 2. Remove running headers and page numbers from the resulting text
> 3. Recognize columns
Page layout analysis is on my list of things I'd like to get to one day. Given
the length of the list, it won't be in the next year.
I did a bit of poking around with algorithms to do this, and OCRopus
appeared to have some reasonably good code that we might be able to reuse
given some process separation (since it is incompatibly licensed). If reuse
doesn't work, there are still a lot of good algorithms that might be suitable
that are linked off the OCRopus site. I suggest you start with
http://code.google.com/p/ocropus/
At least Okular has a page layout wishlist item (to allow smarter selection
during cut-and-paste of a region), so maybe there is a way to provide this as
a general facility to other users.
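
As a rough illustration of goal 2 in the quoted list (removing running
headers and page numbers), here is a minimal sketch in Python. It is not
poppler or pdftohtml code, and it is not taken from OCRopus. It assumes the
text has already been extracted as a list of lines per page, and it treats
any line that recurs near the top or bottom of most pages (ignoring digits)
as a running header, footer, or page number. The function names and the
`edge` and `min_fraction` parameters are hypothetical.

```python
import re
from collections import Counter


def normalize(line):
    """Collapse digit runs so 'Page 3' and 'Page 4' compare as equal."""
    return re.sub(r"\d+", "#", line.strip())


def is_edge(index, page_length, edge):
    """True if the line is within `edge` lines of the top or bottom."""
    return index < edge or index >= page_length - edge


def strip_running_lines(pages, edge=2, min_fraction=0.6):
    """Drop edge lines that repeat across most pages.

    pages: list of pages, each a list of text lines (assumed input shape).
    edge: how many lines at the top and bottom of a page to inspect.
    min_fraction: share of pages a line must appear on to be dropped.
    """
    counts = Counter()
    for page in pages:
        # Use a set so each candidate is counted at most once per page.
        candidates = {
            normalize(line)
            for i, line in enumerate(page)
            if is_edge(i, len(page), edge)
        }
        counts.update(c for c in candidates if c)

    # Require at least two occurrences so a single page is never stripped.
    threshold = max(2, min_fraction * len(pages))
    repeated = {text for text, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [
            line
            for i, line in enumerate(page)
            if not (is_edge(i, len(page), edge) and normalize(line) in repeated)
        ]
        cleaned.append(kept)
    return cleaned
```

A purely textual heuristic like this will miss headers that change per
chapter and may drop genuine body lines that happen to repeat at page edges,
which is part of why proper layout analysis is the better long-term answer.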
Brad