[poppler] pdftotext and pdftohtml and extracting text

Valerio Messina efa at iol.it
Fri Aug 25 20:55:32 UTC 2017


On 25/08/2017 19:17, Alex wrote:
> I'm trying to use the poppler-utils to work with a spamassassin plugin
> to extract text from PDFs that may be malicious.

Anti-phishing tools: admirable work.


> Would someone be interested in trying to extract the URL from within
> this PDF for me?

cat pdf-phish.pdf | tr '\0 ' '\n' | grep http | sed 's/.*[(<]\(http.*\)[>)].*/\1/'

If needed, you can add other delimiter characters inside the square
brackets.
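
As a rough sketch only, the one-liner above could be wrapped in a small
shell function. The name extract_urls is hypothetical and not part of
poppler-utils. Two things differ from the command above and are my own
assumptions: grep gets -a so that leftover binary bytes do not make it
print "Binary file matches", and the sed pattern stops at the first
closing delimiter instead of the last one on the line. It only sees URLs
that are stored uncompressed in the file.

```shell
# Hypothetical helper (not a poppler tool): print http(s) URLs that sit
# between ( ) or < > delimiters in the raw bytes of a PDF file.
extract_urls() {
    # Turn NUL bytes and spaces into newlines so each token is on its own line.
    tr '\0 ' '\n\n' < "$1" |
        # Keep only lines mentioning http; -a forces text mode on binary input.
        grep -a http |
        # Strip everything up to the opening delimiter and from the closing one.
        sed 's/.*[(<]\(http[^)>]*\)[>)].*/\1/'
}
```

It would be called as: extract_urls pdf-phish.pdf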


> Is there a big difference between version 0.45 and
> the latest that may affect this?

0.57.0 does not find the link in that PDF; I don't know why.

Valerio

