[poppler] getting the text from PDF files

Fri Oct 14 21:00:03 UTC 2022

There are many different ways to add OCR’d text to a PDF, though one of the most common is the use of “hidden text”, where the text is drawn using Text Render Mode 3 (neither filled nor stroked, so it is invisible on the page).  I don’t recall if Poppler exposes this information in the public APIs, but it certainly has it in the graphics state internally.
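As a rough illustration of the hidden-text idea, here is a minimal sketch that scans an already-decoded page content stream and counts text-showing operators drawn under render mode 3 versus any other mode.  It makes several assumptions: the stream has already been decompressed into a string by some other means (this sketch does not use Poppler at all), the names RenderModeStats and scan_render_modes are made up for the example, and inline image data (BI ... ID ... EI) and form XObjects are not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical summary type: how many text-showing operators were seen.
struct RenderModeStats {
    std::size_t hidden_shows = 0;   // issued while the render mode was 3
    std::size_t visible_shows = 0;  // issued under any other render mode
};

// PDF whitespace and delimiter characters end a regular token.
static bool is_delim(char c) {
    return std::isspace(static_cast<unsigned char>(c)) ||
           std::string("()<>[]{}/%").find(c) != std::string::npos;
}

// Hypothetical helper: walks a decoded content stream and tracks the "Tr" operator.
RenderModeStats scan_render_modes(const std::string& s) {
    RenderModeStats stats;
    std::vector<int> saved;  // render modes saved by "q", restored by "Q"
    int mode = 0;            // the default text render mode is 0 (fill)
    std::string prev;        // previous regular token (operand candidate)
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const char c = s[i];
        if (c == '(') {  // literal string: skip it, honoring escapes and nesting
            int depth = 1;
            ++i;
            while (i < n && depth > 0) {
                if (s[i] == '\\') ++i;  // also skip the escaped character
                else if (s[i] == '(') ++depth;
                else if (s[i] == ')') --depth;
                ++i;
            }
            prev.clear();
        } else if (c == '<' && (i + 1 >= n || s[i + 1] != '<')) {  // hex string
            while (i < n && s[i] != '>') ++i;
            ++i;
            prev.clear();
        } else if (c == '%') {  // comment runs to the end of the line
            while (i < n && s[i] != '\n' && s[i] != '\r') ++i;
        } else if (c == '/') {  // name object, never a numeric operand
            ++i;
            while (i < n && !is_delim(s[i])) ++i;
            prev.clear();
        } else if (is_delim(c)) {  // whitespace, brackets, dictionary markers
            ++i;
        } else {  // regular token: a number or an operator
            const std::size_t start = i;
            while (i < n && !is_delim(s[i])) ++i;
            const std::string tok = s.substr(start, i - start);
            if (tok == "Tr") {
                char* end = nullptr;
                const long v = std::strtol(prev.c_str(), &end, 10);
                if (!prev.empty() && *end == '\0') mode = static_cast<int>(v);
            } else if (tok == "q") {
                saved.push_back(mode);
            } else if (tok == "Q") {
                if (!saved.empty()) { mode = saved.back(); saved.pop_back(); }
            } else if (tok == "Tj" || tok == "TJ" || tok == "'" || tok == "\"") {
                if (mode == 3) ++stats.hidden_shows; else ++stats.visible_shows;
            }
            prev = tok;
        }
    }
    return stats;
}
```

A page whose stream yields only hidden_shows, typically alongside a full-page image, would be a hint that the text is an OCR layer.  It is only a hint: other ways of adding OCR’d text will not show up in this count.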

Leonard

From: poppler <poppler-bounces at lists.freedesktop.org> on behalf of Stéphane Charette <stephanecharette at gmail.com>
Date: Friday, October 14, 2022 at 2:54 PM
To: poppler at lists.freedesktop.org <poppler at lists.freedesktop.org>
Subject: [poppler] getting the text from PDF files


Using libpoppler-cpp-dev 0.86.1 on Ubuntu to read PDF files.  Works well.

I call doc->create_page(idx) to get the page, then page->text_list() to get all the text boxes.  PDFs seem to either have text or, if the PDF was a scan, I get an image with no text and fall back to other techniques to read what I need.

But...!  Some fax machines and business scanners try to do OCR and embed the text results into the PDF.  The quality of the OCR is poor, but when I attempt to extract the text, I do get back the expected text boxes, which leads me down the wrong path.

Is there anything in the way the text was added to the PDF that I can use as a hint that the text was added after the fact, and not as part of the original PDF creation process?  Something I can use to determine if the text can be trusted?  I have been reading up on things like Xref tables to get an understanding of the internals of PDF files, so I can attempt to find a pattern between my "good" and "problematic" PDF files.  I wondered if there was a way to see if the text is part of the page itself, or if it was tacked on afterwards.

Thanks,

Stéphane

--
Stéphane Charette
about.me/stephane.charette
