[poppler] getting the text from PDF files

Fri Oct 14 18:54:34 UTC 2022

Using libpoppler-cpp-dev 0.86.1 on Ubuntu to read PDF files.  Works well.

I call doc->create_page(idx) to get the page, then page->text_list() to get
all the text boxes.  A PDF seems to either contain text or, if it was a scan,
contain only an image with no text, in which case I fall back to other
techniques to read what I need.

But...!  Some fax machines and business scanners try to do OCR and embed
the text results in the PDF.  The quality of the OCR is poor, but when I
attempt to extract the text, I do get back text boxes as expected, which
leads me down the wrong path.

Is there anything in the way the text was added to the PDF that I can use
as a hint that it was added after the fact, and not as part of the original
PDF creation process?  Something I can use to determine whether the text can
be trusted?  I am reading up on things like xref tables to understand the
internals of PDF files, so that I can try to find a pattern that separates
my "good" and "problematic" PDF files.  I wondered if there is a way to see
whether the text is part of the page itself or was tacked on afterwards.

Thanks,

Stéphane

-- 
Stéphane Charette
about.me/stephane.charette