[poppler] getting the text from PDF files

Sat Oct 15 19:53:23 UTC 2022

Indeed, the problematic PDF files do use render mode 3.  At first I thought
I might use the number of fonts a PDF uses to determine which ones had this
hidden OCR, but some documents have quite a large number of fonts in them
considering the whole thing is images and hidden text.

I don't see a way with the Poppler C++ API to determine if text is using
render mode 3.  The only thing provided is the text box rectangle and the
text itself.

At the moment, I've uncompressed the PDF using "podofouncompress" and in
the results I see stuff like this:

stream
BT
3 Tr
0.00 Tc

From what I can tell, the Poppler tools and API don't offer any public
means to uncompress a PDF file.  I'm looking into how that works, hoping
there is a way to do it programmatically without having to use system()
calls to a 3rd party tool.
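
For reference, here is a rough sketch of how the check itself might look
once a content stream is available in uncompressed form (for example, the
output of podofouncompress).  It does not deal with decompression at all.
The names count_text_shows and TextShowCounts are made up for this
example and are not part of Poppler.  It tokenizes the stream rather than
searching for the substring "3 Tr", so string operands are skipped, and it
tracks q/Q because the render mode is part of the graphics state.  Inline
image data (BI ... EI) is not handled, so treat the result as a hint only.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Counts of text-showing operators, split by the render mode in effect.
struct TextShowCounts {
    int hidden = 0;   // Tj, TJ, ' or " seen while render mode 3 was active
    int visible = 0;  // the same operators seen in any other render mode
};

// Hypothetical helper: scan one *uncompressed* PDF content stream.
TextShowCounts count_text_shows(const std::string& s) {
    TextShowCounts counts;
    int mode = 0;             // the default text render mode is 0 (fill)
    std::vector<int> saved;   // render modes pushed by "q", popped by "Q"
    std::string last;         // most recent operand token (not an operator)
    const std::string delims = "()<>[]{}/%";
    const std::size_t n = s.size();
    std::size_t i = 0;

    while (i < n) {
        const char c = s[i];
        if (std::isspace(static_cast<unsigned char>(c))) { ++i; continue; }
        if (c == '%') {  // comment runs to the end of the line
            while (i < n && s[i] != '\n' && s[i] != '\r') ++i;
            continue;
        }
        if (c == '(') {  // literal string: balanced parens, backslash escapes
            int depth = 0;
            do {
                if (s[i] == '\\') ++i;          // skip the escaped character
                else if (s[i] == '(') ++depth;
                else if (s[i] == ')') --depth;
                ++i;
            } while (i < n && depth > 0);
            continue;
        }
        if (c == '<' && !(i + 1 < n && s[i + 1] == '<')) {  // hex string
            while (i < n && s[i] != '>') ++i;
            if (i < n) ++i;
            continue;
        }
        if (c == '[' || c == ']' || c == '<' || c == '>' ||
            c == '{' || c == '}') { ++i; continue; }

        // Anything else is a number, a /Name, or an operator keyword.
        const std::size_t start = i++;
        while (i < n && !std::isspace(static_cast<unsigned char>(s[i])) &&
               delims.find(s[i]) == std::string::npos) ++i;
        const std::string tok = s.substr(start, i - start);

        if (tok == "Tr") {
            // Render modes are the single digits 0 through 7.
            if (last.size() == 1 && last[0] >= '0' && last[0] <= '7')
                mode = last[0] - '0';
        } else if (tok == "q") {
            saved.push_back(mode);
        } else if (tok == "Q") {
            if (!saved.empty()) { mode = saved.back(); saved.pop_back(); }
        } else if (tok == "Tj" || tok == "TJ" || tok == "'" || tok == "\"") {
            if (mode == 3) ++counts.hidden; else ++counts.visible;
        } else {
            last = tok;
        }
    }
    return counts;
}
```

A page where hidden is greater than zero and visible is zero would be a
candidate for "scanned image plus OCR layer", though that is a heuristic
and not something the PDF format guarantees.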

Thanks for the hint about render mode 3.

Stéphane

On Fri, Oct 14, 2022 at 2:00 PM Leonard Rosenthol <lrosenth at adobe.com>
wrote:

> There are many different ways to add OCR’d text to a PDF, though one of
> the most common is use of “hidden text”, where the text is drawn using Text
> Render Mode 3.  I don’t recall if Poppler exposes this information in the
> public APIs, but it certainly has it in the graphic state internally.
>
>
>
> Leonard
>
>
>
> *From: *poppler <poppler-bounces at lists.freedesktop.org> on behalf of
> Stéphane Charette <stephanecharette at gmail.com>
> *Date: *Friday, October 14, 2022 at 2:54 PM
> *To: *poppler at lists.freedesktop.org <poppler at lists.freedesktop.org>
> *Subject: *[poppler] getting the text from PDF files
>
> Using libpoppler-cpp-dev 0.86.1 on Ubuntu to read PDF files.  Works well.
>
>
>
> doc->create_page(idx) to get the page, then page->text_list() to get all
> the boxes.  PDFs seem to either have text, or if it was a scan then I have
> an image with no text, and I fall back to other techniques to read what I
> need.
>
>
>
> But...!  Some fax machines and business scanners try to do OCR, and embed
> the text results into the PDF.  The quality of the OCR is poor, but when I
> attempt to extract the text, I do get back the expected text boxes which
> leads me down the wrong path.
>
>
>
> Is there anything in the way the text was added to the PDF that I can use
> as a hint that the text was added to the PDF after-the-fact, and not as
> part of the original PDF creation process?  Something I can use to
> determine if the text can be trusted?  Reading up on things like Xref
> tables to get an understanding of the internals of PDF files so I can
> attempt to find a pattern between my "good" and "problematic" PDF files.
> Wondered if there was a way to see if the text is part of the page itself,
> or if it was tacked on afterwards.
>
>
>
> Thanks,
>
>
>
> Stéphane

-- 
Stéphane Charette
about.me/stephane.charette