<html> <head> <base href="https://bugs.freedesktop.org/"> </head> <body> <div> <a class="bz_bug_link bz_status_RESOLVED bz_closed" title="RESOLVED FIXED - Can't extract text/html from PDF" href="https://bugs.freedesktop.org/show_bug.cgi?id=97276#c6">Comment # 6</a> on <a class="bz_bug_link bz_status_RESOLVED bz_closed" title="RESOLVED FIXED - Can't extract text/html from PDF" href="https://bugs.freedesktop.org/show_bug.cgi?id=97276">bug 97276</a> from <a class="email" href="mailto:jason@aquaticape.us" title="Jason Crain <jason@aquaticape.us>"> Jason Crain</a> <pre>(In reply to clark from <a href="show_bug.cgi?id=97276#c5">comment #5</a>)
&gt; Is it possible to detect/check if a PDF is broken and return something
&gt; unreadable like this?
&gt;
&gt; I just need to check PDF files and mark broken files where the extracted text
&gt; is garbage

My usual way of checking whether a PDF is broken is to open it in a few
different viewers and inspect the results by hand. I don't have an automated
way of doing this, and nothing in a PDF reliably predicts that the extracted
text is going to be garbage.

Maybe you could put something together using aspell and say that the file is
bad if more than half of the words are misspelled or malformed (not tested):

# number of words aspell reports as misspelled
bad_count=$(aspell list &lt; file.txt | wc -l)
# number of lines left after aspell cleans the input as a word list
clean_count=$(aspell clean &lt; file.txt | wc -l)
# expr prints 1 if bad_count exceeds half of clean_count, otherwise 0
is_bad=$(expr "$bad_count" \&gt; \( "$clean_count" / 2 \))</pre> </div> </body> </html>