[Poppler-bugs] [Bug 97276] Can't extract text/html from PDF

Thu Aug 18 08:59:20 UTC 2016

https://bugs.freedesktop.org/show_bug.cgi?id=97276

--- Comment #6 from Jason Crain <jason at aquaticape.us> ---
(In reply to clark from comment #5)
> Is it possible to detect/check if a PDF is broken and return something
> unreadable like this?
> 
> I just need to check PDF files and mark broken files where the extracted text
> is garbage

My usual way of checking if a PDF is broken is to try it in a few different
viewers and manually inspect the results.  I don't have an automated way of
doing this and there's not anything in a PDF that will let us predict that the
output is going to be garbage, at least not reliably.

Maybe you could put something together using aspell and say that it's bad if
more than half of the words are misspelled or malformed (not tested):

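# misspelled words, one per line
bad_count=$(aspell list < file.txt | wc -l)
# total number of words in the extracted text
total_count=$(wc -w < file.txt)
# prints 1 if more than half of the words are misspelled, otherwise 0
is_bad=$(expr $bad_count \> \( $total_count / 2 \))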
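
If it helps, the same idea could be wrapped into a function that runs straight on a PDF. This is only a sketch and just as untested against real files as the lines above: the function names more_than_half and pdf_text_looks_bad are made up for illustration, and it assumes pdftotext and aspell are installed with a dictionary matching the document's language.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when more than half of
# the words are flagged.  $1 = flagged word count, $2 = total word count.
more_than_half() {
    [ $(( $1 * 2 > $2 )) -eq 1 ]
}

# Hypothetical wrapper: extract the text of the PDF given as $1 and
# apply the "more than half misspelled" heuristic to it.
# Returns 0 if the text looks bad, 1 if it looks fine, 2 if pdftotext
# itself failed.
pdf_text_looks_bad() {
    txt=$(pdftotext "$1" - 2>/dev/null) || return 2
    bad=$(printf '%s\n' "$txt" | aspell list | wc -l)
    total=$(printf '%s\n' "$txt" | wc -w)
    more_than_half "$bad" "$total"
}
```

Something like `for f in *.pdf; do pdf_text_looks_bad "$f" && echo "$f"; done` would then print the suspicious files. A PDF with no extractable text at all (a scan, for example) gives zero words and is not flagged by this check, so that case would need separate handling.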
