[poppler] pdftotext and pdftohtml and extracting text

Alex mysqlstudent at gmail.com
Fri Aug 25 17:17:42 UTC 2017


Hi,
I'm attempting to use pdftohtml and pdftotext on Fedora 25
(poppler-utils-0.45.0-5.fc25.x86_64), but I'm unable to get either
tool to extract the text from a particular PDF I need.

I'm trying to use poppler-utils with a SpamAssassin plugin to extract
text from PDFs that may be malicious. Here is one such example:

https://www.dropbox.com/s/b97pcvl1wm1oocq/pdf-phish.pdf?dl=0

Both tools appear to extract the header information (author, date,
etc.) but no text from within the PDF.
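
For reference, the check described above can be sketched as a small
Python wrapper. It assumes pdftotext is on the PATH and relies on its
documented behaviour of writing to stdout when the output file is "-".
The helper names and the URL regex are illustrative only, not part of
poppler or of any SpamAssassin plugin.

```python
import re
import subprocess
import sys

# Deliberately loose URL pattern (an assumption, not a full RFC 3986 parser).
URL_RE = re.compile(r"https?://[^\s<>\"')]+")


def extract_urls(text):
    """Return every http(s) URL found in a block of extracted text."""
    return URL_RE.findall(text)


def pdftotext_stdout(pdf_path):
    """Run pdftotext and return the extracted text; "-" sends output to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    text = pdftotext_stdout(sys.argv[1])
    if not text.strip():
        # Empty output suggests the page content is not ordinary text,
        # for example an image or a clickable link annotation.
        print("no text layer extracted")
    for url in extract_urls(text):
        print(url)
```

If this prints nothing for a PDF that visibly contains a link, the URL
is probably not stored in the text layer at all.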

Would someone be willing to try extracting the URL from within this
PDF for me? Is there a big difference between version 0.45 and the
latest release that might affect this? Testing a newer version would
mean compiling it locally here.

podofopdfinfo is able to identify the URL within the PDF, but I'm not
sure whether that's helpful.
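
One possible explanation, which is only a guess without inspecting the
file, is that the URL lives in a link annotation (a /URI action)
rather than in the page text. A minimal sketch for checking that is to
scan the raw file bytes for /URI (...) entries. It only finds
annotations stored uncompressed with literal (parenthesised) strings;
anything inside a compressed object stream, or written as a hex
string, is missed. The function name is hypothetical.

```python
import re
import sys

# Matches "/URI (literal string)", allowing backslash-escaped characters.
# "/S /URI" (the action subtype) is skipped because no "(" follows it.
URI_RE = re.compile(rb"/URI\s*\(((?:\\.|[^\\)])*)\)")


def find_uri_actions(data):
    """Return the targets of uncompressed /URI actions found in raw PDF bytes."""
    urls = []
    for raw in URI_RE.findall(data):
        # Undo only the escapes likely to occur in a URL: \( \) and \\.
        unescaped = re.sub(rb"\\([()\\])", rb"\1", raw)
        urls.append(unescaped.decode("latin-1"))
    return urls


if __name__ == "__main__":
    with open(sys.argv[1], "rb") as handle:
        for url in find_uri_actions(handle.read()):
            print(url)
```

This reads the file as plain bytes and never renders it, which may be
preferable when the PDF is suspected to be malicious.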

Any ideas greatly appreciated.

