<div dir="ltr">Indeed, the problematic PDF files do use render mode 3. At first I thought I could use the number of fonts in a PDF to detect which files had this hidden OCR text, but some documents contain a surprisingly large number of fonts considering that they consist entirely of images and hidden text.<div><br></div><div>I don't see a way in the Poppler C++ API to determine whether text uses render mode 3. The only things provided are the text box rectangle and the text itself.</div><div><br></div><div>For now, I've uncompressed the PDF using "podofouncompress", and in the output I see content like this:</div><div><br></div><div>stream<br>BT<br>3 Tr<br>0.00 Tc<br></div><div><br></div><div>From what I can tell, the Poppler tools and API don't offer any public means to uncompress a PDF file. I'm looking into how that works, hoping there is a way to do it programmatically without resorting to system() calls to a third-party tool.</div><div><br></div><div>Thanks for the hint about render mode 3.</div><div><br></div><div>Stéphane</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Oct 14, 2022 at 2:00 PM Leonard Rosenthol <<a href="mailto:lrosenth@adobe.com">lrosenth@adobe.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg-7892599389229368467">
<div lang="EN-US" style="overflow-wrap: break-word;">
<div class="m_2377736670543538142WordSection1">
<p class="MsoNormal"><span style="font-size:11pt">There are many different ways to add OCR’d text to a PDF, though one of the most common is use of “hidden text”, where the text is drawn using Text Render Mode 3. I don’t recall if Poppler exposes this information
in the public APIs, but it certainly has it in the graphic state internally.<u></u><u></u></span></p>
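<p class="MsoNormal"><span style="font-size:11pt">As a rough illustration of the render mode check, here is a minimal C++ sketch that scans an already-uncompressed page content stream (for example, the output of podofouncompress) and counts how many text-showing operators run while the render mode set by the Tr operator is 3. The names count_text_shows and TextShowCounts are hypothetical helpers, not part of Poppler or PoDoFo. The tokenizer is deliberately naive: it assumes the stream is plain text and does not handle inline images (BI/ID/EI) or text drawn inside form XObjects.</span></p>

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Counts of text-showing operators (Tj, TJ, ', ") in one content stream.
struct TextShowCounts {
    int hidden = 0;   // shown while the text render mode was 3 (invisible)
    int visible = 0;  // shown under any other render mode
};

// PDF delimiter characters that end a regular token.
static bool is_delim(char c) {
    return std::string("()<>[]{}/%").find(c) != std::string::npos;
}

// Hypothetical helper: naive scan of a decompressed page content stream.
TextShowCounts count_text_shows(const std::string& content) {
    TextShowCounts counts;
    int mode = 0;              // the default text render mode is 0 (fill)
    std::vector<int> saved;    // modes saved by "q" and restored by "Q"
    std::string prev;          // previous token, used as the operand of "Tr"
    const std::size_t n = content.size();
    std::size_t i = 0;
    while (i < n) {
        const char c = content[i];
        if (std::isspace(static_cast<unsigned char>(c))) { ++i; continue; }
        if (c == '%') {        // comment: skip to the end of the line
            while (i < n && content[i] != '\n' && content[i] != '\r') ++i;
            continue;
        }
        if (c == '(') {        // literal string: skip to the matching ')'
            int depth = 0;
            for (; i < n; ++i) {
                if (content[i] == '\\') { ++i; continue; }  // escaped char
                if (content[i] == '(') ++depth;
                else if (content[i] == ')' && --depth == 0) { ++i; break; }
            }
            prev.clear();
            continue;
        }
        if (c == '<') {
            if (i + 1 < n && content[i + 1] == '<') {
                i += 2;        // "<<" opens a dictionary
            } else {           // hex string: skip to the closing '>'
                while (i < n && content[i] != '>') ++i;
                if (i < n) ++i;
            }
            prev.clear();
            continue;
        }
        if (c != '/' && is_delim(c)) {  // stray ) > [ ] { }
            ++i;
            prev.clear();
            continue;
        }
        // Regular token (number, operator, or /Name).
        const std::size_t start = i++;
        while (i < n && !std::isspace(static_cast<unsigned char>(content[i])) &&
               !is_delim(content[i])) {
            ++i;
        }
        const std::string tok = content.substr(start, i - start);
        if (tok == "Tr") {
            mode = std::atoi(prev.c_str());
        } else if (tok == "q") {
            saved.push_back(mode);
        } else if (tok == "Q") {
            if (!saved.empty()) { mode = saved.back(); saved.pop_back(); }
        } else if (tok == "Tj" || tok == "TJ" || tok == "'" || tok == "\"") {
            if (mode == 3) ++counts.hidden; else ++counts.visible;
        }
        prev = tok;
    }
    return counts;
}
```

<p class="MsoNormal"><span style="font-size:11pt">A page where hidden is non-zero and visible is zero would be a candidate for "scanned image plus hidden OCR text", though that heuristic is an assumption and would need checking against real files.</span></p>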
<p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11pt">Leonard<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p>
<div style="border-right:none;border-bottom:none;border-left:none;border-top:1pt solid rgb(181,196,223);padding:3pt 0in 0in">
<p class="MsoNormal" style="margin-bottom:12pt"><b><span style="font-size:12pt;color:black">From:
</span></b><span style="font-size:12pt;color:black">poppler <<a href="mailto:poppler-bounces@lists.freedesktop.org" target="_blank">poppler-bounces@lists.freedesktop.org</a>> on behalf of Stéphane Charette <<a href="mailto:stephanecharette@gmail.com" target="_blank">stephanecharette@gmail.com</a>><br>
<b>Date: </b>Friday, October 14, 2022 at 2:54 PM<br>
<b>To: </b><a href="mailto:poppler@lists.freedesktop.org" target="_blank">poppler@lists.freedesktop.org</a> <<a href="mailto:poppler@lists.freedesktop.org" target="_blank">poppler@lists.freedesktop.org</a>><br>
<b>Subject: </b>[poppler] getting the text from PDF files<u></u><u></u></span></p>
</div>
<div>
<div>
<p class="MsoNormal"><span style="font-size:11pt">Using libpoppler-cpp-dev 0.86.1 on Ubuntu to read PDF files. Works well.<br clear="all">
<u></u><u></u></span></p>
<div>
<p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:11pt">I call doc->create_page(idx) to get the page, then page->text_list() to get all the text boxes. PDFs seem to either contain text or, if they were scanned, contain an image with no text, in which case I fall back to other techniques
to read what I need.<u></u><u></u></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:11pt">But...! Some fax machines and business scanners try to do OCR and embed the text results into the PDF. The quality of the OCR is poor, but when I attempt to extract the text, I do get back the expected
text boxes, which leads me down the wrong path.<u></u><u></u></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:11pt">Is there anything in the way the text was added to the PDF that I can use as a hint that it was added after the fact, and not as part of the original PDF creation process? Something I can
use to determine whether the text can be trusted? I'm reading up on things like xref tables to understand the internals of PDF files so I can try to find a pattern that distinguishes my "good" PDF files from the "problematic" ones. I wondered whether there is a way to see
if the text is part of the page itself, or if it was tacked on afterwards.<u></u><u></u></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:11pt">Thanks,<u></u><u></u></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:11pt">Stéphane<u></u><u></u></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:11pt"><u></u> <u></u></span></p>
</div>
</div>
</div>
</div>
</div>
</div></blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td align="left" valign="bottom" width="107" style="line-height:0;vertical-align:bottom;padding-right:10px;padding-top:20px;padding-bottom:20px"><a href="https://about.me/stephane.charette?promo=email_sig&utm_source=product&utm_medium=email_sig&utm_campaign=edit_panel&utm_content=thumb" target="_blank">
<img src="https://thumbs.about.me/thumbnail/users/s/t/e/stephane.charette_emailsig.jpg?_1613974105_595" alt="" width="105" height="70" style="margin: 0px; padding: 0px; display: block; border: 1px solid rgb(238, 238, 238);"></a>
</td><td align="left" valign="bottom" style="line-height:1.1;vertical-align:bottom;padding-top:20px;padding-bottom:20px"><img src="https://about.me/t/sig?u=stephane.charette" width="1" height="1" style="border: 0px; margin: 0px; padding: 0px; overflow: hidden;">
<div style="font-size:18px;font-weight:bold;color:rgb(51,51,51);font-family:"Proxima Nova",Helvetica,Arial,sans-serif">Stéphane Charette</div><a href="https://about.me/stephane.charette?promo=email_sig&utm_source=product&utm_medium=email_sig&utm_campaign=edit_panel&utm_content=thumb" style="font-size:12px;color:rgb(43,130,173);font-family:"Proxima Nova",Helvetica,Arial,sans-serif" target="_blank">about.me/stephane.charette</a></td></tr></tbody></table></div>
</div></div>