<div dir="ltr">I'm using libpoppler-cpp-dev 0.86.1 on Ubuntu to read PDF files, and it works well.<br clear="all"><div><br></div>
<div>I call doc->create_page(idx) to get the page, then page->text_list() to get all the text boxes. PDFs seem to either contain text or, if they came from a scan, contain an image with no text, in which case I fall back to other techniques to read what I need.</div><div><br></div>
<div>But...! Some fax machines and business scanners run OCR and embed the resulting text in the PDF. The quality of the OCR is poor, but when I extract the text I still get back text boxes, just as I would from a PDF with real text, and that leads me down the wrong path.</div><div><br></div>
<div>Is there anything in the way the text was added to the PDF that I can use as a hint that it was added after the fact, and not as part of the original PDF creation process? Something I can use to determine whether the text can be trusted? I'm reading up on things like xref tables to understand the internals of PDF files, so that I can try to find a pattern that separates my "good" PDF files from my "problematic" ones. I'm wondering whether there is a way to see if the text is part of the page itself, or if it was tacked on afterwards.</div><div><br></div>
<div>Thanks,</div><div><br></div>
<div>Stéphane</div><div><br></div>-- <br>
<div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">Stéphane Charette<br><a href="https://about.me/stephane.charette" target="_blank">about.me/stephane.charette</a></div>
</div>
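<div>For reference, the decision described in the message (use the embedded text if any text boxes come back, otherwise fall back to image-based techniques) can be sketched roughly as below. This is only an illustration: the names Strategy, has_visible_text, and choose_strategy are hypothetical and not part of poppler, and the sketch assumes the caller has already copied each box's text into a UTF-8 std::string (for example from the boxes returned by page->text_list()). It also shows why OCR'd scans are misrouted: they return non-empty boxes, so a check like this cannot tell them apart from PDFs with real text.</div>

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical names for illustration only; none of these come from poppler.
enum class Strategy { UseEmbeddedText, FallBackToImage };

// True if the string contains at least one non-whitespace character.
bool has_visible_text(const std::string& s) {
    return std::any_of(s.begin(), s.end(),
                       [](unsigned char c) { return !std::isspace(c); });
}

// box_texts is assumed to hold one UTF-8 string per text box on the page.
// Any box with visible text selects the embedded-text path; a page with no
// boxes, or only whitespace boxes, is treated as an image-only scan.
Strategy choose_strategy(const std::vector<std::string>& box_texts) {
    for (const auto& text : box_texts) {
        if (has_visible_text(text)) {
            return Strategy::UseEmbeddedText;
        }
    }
    return Strategy::FallBackToImage;
}
```

<div>A page whose boxes contain poor OCR output (say, "lnv0ice" where the scan reads "Invoice") still selects UseEmbeddedText here, which is exactly the wrong path described above. Telling the two cases apart needs some signal beyond whether text boxes exist.</div>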