[poppler] getting the text from PDF files
Stéphane Charette
stephanecharette at gmail.com
Fri Oct 14 18:54:34 UTC 2022
I'm using libpoppler-cpp-dev 0.86.1 on Ubuntu to read PDF files, and it
works well. I call doc->create_page(idx) to get the page, then
page->text_list() to get all the text boxes. A PDF seems to either contain
real text or, if it was scanned, contain only an image with no text, in
which case I fall back to other techniques to read what I need.
But...! Some fax machines and business scanners try to do OCR and embed
the text results in the PDF. The quality of the OCR is poor, but when I
attempt to extract the text, I do get back text boxes as if it were a
normal text PDF, which leads me down the wrong path.
Is there anything in the way the text was added to the PDF that I can use
as a hint that it was added after the fact, and not as part of the
original PDF creation process? Something I can use to determine whether
the text can be trusted? I've been reading up on things like xref tables
to understand the internals of PDF files, so I can try to find a pattern
that separates my "good" and "problematic" PDF files. I wondered if there
is a way to see whether the text is part of the page itself or was tacked
on afterwards.
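
One pattern that may be worth checking: OCR layers are commonly drawn with
text rendering mode 3 (the "3 Tr" operator, meaning the glyphs are neither
filled nor stroked, so invisible) on top of a full-page scan image. The
sketch below is only a rough heuristic under several assumptions. It does
not use the poppler-cpp API at all; it assumes the page content stream has
already been decompressed into a string by some other means (for example a
file rewritten with "qpdf --qdf"), it ignores inline images (BI/ID/EI), and
the names TextOpCounts, count_text_ops and invisible_fraction are made up
for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: text-showing operators drawn visibly versus
// with text rendering mode 3 (invisible).
struct TextOpCounts {
    std::size_t visible = 0;
    std::size_t invisible = 0;
};

static bool is_pdf_space(char c) {
    return c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f' || c == '\0';
}

static bool is_pdf_delim(char c) {
    return c == '(' || c == ')' || c == '<' || c == '>' || c == '[' ||
           c == ']' || c == '{' || c == '}' || c == '/' || c == '%';
}

// Scan an already-decompressed content stream and count text-showing
// operators (Tj, TJ, ', ") by the rendering mode active when they run.
TextOpCounts count_text_ops(const std::string& s) {
    TextOpCounts counts;
    int mode = 0;            // current text rendering mode (default 0 = fill)
    std::vector<int> saved;  // modes saved by q, restored by Q
    std::string prev;        // previous plain token (the operand of Tr)
    std::size_t i = 0;
    while (i < s.size()) {
        const char c = s[i];
        if (is_pdf_space(c)) { ++i; continue; }
        if (c == '%') {  // comment: skip to end of line
            while (i < s.size() && s[i] != '\n' && s[i] != '\r') ++i;
            continue;
        }
        if (c == '(') {  // literal string: honour nesting and backslash escapes
            int depth = 0;
            for (; i < s.size(); ++i) {
                if (s[i] == '\\') { ++i; continue; }
                if (s[i] == '(') ++depth;
                else if (s[i] == ')' && --depth == 0) { ++i; break; }
            }
            continue;
        }
        if (c == '<') {
            if (i + 1 < s.size() && s[i + 1] == '<') { i += 2; continue; }  // dictionary start
            while (i < s.size() && s[i] != '>') ++i;                        // hex string
            if (i < s.size()) ++i;
            continue;
        }
        if (c == '/') {  // name object such as /F1: skip it
            ++i;
            while (i < s.size() && !is_pdf_space(s[i]) && !is_pdf_delim(s[i])) ++i;
            prev.clear();
            continue;
        }
        if (is_pdf_delim(c)) { ++i; continue; }  // stray ) > [ ] { }

        // Plain token: a number or an operator.
        const std::size_t start = i;
        while (i < s.size() && !is_pdf_space(s[i]) && !is_pdf_delim(s[i])) ++i;
        const std::string tok = s.substr(start, i - start);

        if (tok == "q") {
            saved.push_back(mode);
        } else if (tok == "Q") {
            if (!saved.empty()) { mode = saved.back(); saved.pop_back(); }
        } else if (tok == "Tr") {
            // Only single-digit integer operands (modes 0..7) are recognised.
            if (prev.size() == 1 && prev[0] >= '0' && prev[0] <= '7') mode = prev[0] - '0';
        } else if (tok == "Tj" || tok == "TJ" || tok == "'" || tok == "\"") {
            if (mode == 3) ++counts.invisible;
            else ++counts.visible;
        }
        prev = tok;
    }
    return counts;
}

// Share of text-showing operators that were invisible (0.0 if there were none).
double invisible_fraction(const TextOpCounts& c) {
    const std::size_t total = c.visible + c.invisible;
    return total == 0 ? 0.0 : static_cast<double>(c.invisible) / total;
}
```

A page where invisible_fraction(count_text_ops(stream)) is close to 1.0 and
which also paints a page-sized image is plausibly a scan with an OCR layer.
This is a hint rather than proof: some producers draw OCR text visibly, and
invisible text can have other uses.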
Thanks,
Stéphane
--
Stéphane Charette
about.me/stephane.charette