[poppler] pdftotext and pdftohtml and extracting text
Valerio Messina
efa at iol.it
Fri Aug 25 20:55:32 UTC 2017
On 25/08/2017 19:17, Alex wrote:
> I'm trying to use the poppler-utils to work with a spamassassin plugin
> to extract text from PDFs that may be malicious.
Anti-phishing tools: you have my admiration.
> Would someone be interested in trying to extract the URL from within
> this PDF for me?
cat pdf-phish.pdf | tr '\0 ' '\n' | grep http |
  sed 's/.*[(<]\(http.*\)[>)].*/\1/'
If needed, you can add other delimiter characters inside the square
brackets.
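
As a rough sketch of what adding a delimiter could look like, here is the
same pipeline wrapped in a function, with the double quote added to both
bracket sets. The name extract_urls is only a placeholder of mine, and I am
assuming a grep that supports -a (GNU and BSD grep do) so that binary input
is treated as text. Unlike the one-liner, this version stops the match at
the first closing delimiter instead of the last one; that is my change, not
something the original command does.

```shell
# Hypothetical helper: reads raw PDF bytes on stdin, prints candidate URLs.
extract_urls() {
  # Turn NUL bytes and spaces into newlines so each token is on its own line.
  tr '\0 ' '\n\n' |
    # Keep only lines mentioning http; -a forces text mode on binary input.
    grep -a http |
    # Strip everything around a URL enclosed in ( ), < > or " ".
    # Lines without such delimiters pass through unchanged.
    sed 's/.*[(<"]\(http[^>)"]*\)[>)"].*/\1/'
}
```

Usage would be something like: extract_urls < pdf-phish.pdf
This only sees URLs stored as plain text in the file; anything inside a
compressed stream will not show up.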
> Is there a big difference between version 0.45 and
> the latest that may affect this?
Version 0.57.0 does not find the link in that PDF; I don't know why.
Valerio