[poppler] Combine bounding box data and tiff to create pdf?

Ed Porras ed at moto-research.com
Thu May 8 06:23:15 PDT 2014


In one message or another, Mark Ehle said something like this:
> I am using pdftotext to extract text from PDF files in a digital newspaper archive I am creating for a local public library. So far, it's working great. But I am using up a fair amount of disk space and would like to figure out a way to create an OCR'd PDF from an image and the bounding box data. That way I would not have to store the PDF files as well as the images. Is there a way to do that?


Seems like you would want to store the PDF instead of the images. Anyway, you should look at Tesseract:

https://code.google.com/p/tesseract-ocr/

I haven't used it myself, but my understanding is that it'll embed the OCR'd data into the PDF itself, allowing searching, text selection, etc. from a PDF viewer.
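
For what it's worth, here is a rough, untested sketch of how that might be scripted. It assumes a Tesseract build that ships the "pdf" output config (3.03 or later, as far as I know) and that the tesseract binary is on the PATH. The function names and the example file name are made up for illustration. Note that this re-runs OCR on the image rather than reusing your existing bounding box data.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_pdf_command(image_path, output_base):
    """Return the argv list for: tesseract <image> <output_base> pdf

    Tesseract appends ".pdf" to output_base on its own, so the base
    must be given without an extension.
    """
    return ["tesseract", str(image_path), str(output_base), "pdf"]


def make_searchable_pdf(image_path):
    """Run Tesseract on one image and return the path of the PDF it writes.

    Hypothetical helper: assumes the installed Tesseract supports the
    "pdf" config and writes the result next to the input image.
    """
    image_path = Path(image_path)
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    # Strip only the final extension, e.g. "page-0001.tif" -> "page-0001".
    output_base = image_path.with_suffix("")
    subprocess.run(
        build_tesseract_pdf_command(image_path, output_base),
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return Path(str(output_base) + ".pdf")


# Example (hypothetical file name):
#   make_searchable_pdf("page-0001.tif")
```

If the resulting PDF keeps the page image plus an invisible text layer, as I understand it does, you could then check whether pdftotext still extracts usable text from it before deleting anything.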

-e
--
Ed Porras
ed at moto-research.com
