[poppler] pdftotext and pdftohtml and extracting text
Adrian Johnson
ajohnson at redneon.com
Fri Aug 25 21:58:46 UTC 2017
On 26/08/17 02:47, Alex wrote:
> Hi,
> I'm attempting to use pdftohtml and pdftotext on fedora25
> (poppler-utils-0.45.0-5.fc25.x86_64) and I'm unable to get it to
> extract the text from a particular PDF I need.
>
> I'm trying to use the poppler-utils to work with a spamassassin plugin
> to extract text from PDFs that may be malicious. Here is one such
> example:
>
> https://www.dropbox.com/s/b97pcvl1wm1oocq/pdf-phish.pdf?dl=0
>
> It appears to extract the header information (author, date, etc) but
> no text from within the PDF.

That's because there is no text in the PDF. Here's the content stream:

stream
/P <</MCID 0>> BDC BT
/F1 11.04 Tf
1 0 0 1 72.024 63.48 Tm
/GS7 gs
0 g
/GS8 gs
0 G
[( )] TJ
ET
EMC /Span <</MCID 1>> BDC q
572.04 0 0 698.52 15.96 73.92 cm
/Image10 Do Q
EMC
endstream
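
A stream like the one above can also be checked mechanically. The following is only a rough sketch in Python, not how pdftotext works: it assumes the content stream has already been decompressed, it only looks at literal strings in parentheses inside BT ... ET text objects, and it ignores hex strings and font encodings. The names shown_strings and STREAM are made up for this illustration.

```python
import re

# Literal PDF string: "(" ... ")" with backslash escapes. Nested unescaped
# parentheses and hex strings (<...>) are not handled in this sketch.
LITERAL = re.compile(rb"\((?:\\.|[^\\()])*\)", re.DOTALL)
# Text objects are delimited by the BT and ET operators.
TEXT_OBJECT = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)


def shown_strings(stream):
    """Return the raw literal strings found inside BT ... ET text objects."""
    found = []
    for block in TEXT_OBJECT.finditer(stream):
        for lit in LITERAL.finditer(block.group(1)):
            found.append(lit.group(0)[1:-1])  # strip the outer parentheses
    return found


# The content stream quoted in this message, without the stream/endstream keywords.
STREAM = b"""/P <</MCID 0>> BDC BT
/F1 11.04 Tf
1 0 0 1 72.024 63.48 Tm
/GS7 gs
0 g
/GS8 gs
0 G
[( )] TJ
ET
EMC /Span <</MCID 1>> BDC q
572.04 0 0 698.52 15.96 73.92 cm
/Image10 Do Q
EMC
"""

print(shown_strings(STREAM))
```

The bytes are returned undecoded, because mapping them to characters depends on the font (/F1 here), which this sketch does not look at.
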
The only text is a single space character. The rest of the page is an
image (/Image10), so there is nothing for pdftotext to extract. There is
also a link annotation. Maybe we could add an option to pdfinfo to list
the annotations in the file and, for link annotations, show the URL.
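
Until something like that exists, a crude workaround is to scan the raw file for /URI entries. This is a minimal sketch in Python and not a poppler feature: it assumes the annotation dictionaries are stored uncompressed (not inside an object stream) and that the URI is written as a literal string in parentheses. The function name find_link_uris and the sample dictionary are invented for this illustration.

```python
import re

# Matches "/URI (...)" where the value is a literal string with backslash
# escapes. The "/S /URI" action-type entry is skipped because it is not
# followed by an opening parenthesis.
URI_ENTRY = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def find_link_uris(pdf_bytes):
    """Return the URI strings found in uncompressed link actions."""
    uris = []
    for match in URI_ENTRY.finditer(pdf_bytes):
        raw = match.group(1)
        # Undo the backslash escaping of parentheses and backslashes.
        raw = re.sub(rb"\\([()\\])", rb"\1", raw)
        uris.append(raw.decode("latin-1"))
    return uris


# Hypothetical link annotation, written the way it might appear in a file.
SAMPLE = (
    b"<< /Type /Annot /Subtype /Link /Rect [0 0 10 10] "
    b"/A << /S /URI /URI (http://example.com/login) >> >>"
)

print(find_link_uris(SAMPLE))
```

To try it on a real file, read the file in binary mode and pass the bytes to find_link_uris. If the annotations live in a compressed object stream, this finds nothing and a real PDF parser is needed.
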
>
> Would someone be interested in trying to extract the URL from within
> this PDF for me? Is there a big difference between version 0.45 and
> the latest that may affect this? It would require compiling it here
> locally.
>
> podofopdfinfo is able to identify the URL within the PDF, but I'm not
> sure if that's helpful.
>
> Any ideas greatly appreciated.
> _______________________________________________
> poppler mailing list
> poppler at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/poppler
>