<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div><div>In one message or another, Mark Ehle &lt;markehle@gmail.com&gt; said something like this:</div><blockquote type="cite"><div dir="ltr">I am using pdftotext to extract text from PDF files in a digital newspaper archive I am creating for a local public library. So far, it's working great. But - I am using up a fair amount of disk space and would like to figure out a way to create an OCR'd PDF from an image and the bounding box data. That way I would not have to store the PDF files as well as the images. Is there a way to do that?<br></div></blockquote></div><div><br></div><div>Seems like you would want to store the PDF instead of the images. Anyway, you should look at Tesseract:</div><div><br></div><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;"><div><a href="https://code.google.com/p/tesseract-ocr/">https://code.google.com/p/tesseract-ocr/</a></div></blockquote><div><br></div><div>I haven't used it myself, but my understanding is that it will embed the OCR'd text into the PDF itself, allowing searching, text selection, etc. from a PDF viewer.</div><div><br></div><div>-e</div><div apple-content-edited="true">
--<br>Ed Porras<br><a href="mailto:ed@moto-research.com">ed@moto-research.com</a>
</div>
<br></body></html>