[poppler] pdftotext and pdftohtml and extracting text

Fri Aug 25 21:58:46 UTC 2017

On 26/08/17 02:47, Alex wrote:
> Hi,
> I'm attempting to use pdftohtml and pdftotext on Fedora 25
> (poppler-utils-0.45.0-5.fc25.x86_64) and I'm unable to get it to
> extract the text from a particular PDF I need.
> 
> I'm trying to use poppler-utils together with a SpamAssassin plugin
> to extract text from PDFs that may be malicious. Here is one such
> example:
> 
> https://www.dropbox.com/s/b97pcvl1wm1oocq/pdf-phish.pdf?dl=0
> 
> It appears to extract the header information (author, date, etc) but
> no text from within the PDF.

That's because there is no text in the PDF.

Here's the content stream:

  stream
   /P <</MCID 0>> BDC BT
  /F1 11.04 Tf
  1 0 0 1 72.024 63.48 Tm
  /GS7 gs
  0 g
  /GS8 gs
  0 G
  [( )] TJ
  ET
   EMC  /Span <</MCID 1>> BDC q
  572.04 0 0 698.52 15.96 73.92 cm
  /Image10 Do Q
   EMC
  endstream

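The only text is a single space character, shown by the [( )] TJ
operator. The rest of the page is an image, painted by /Image10 Do.
There is also a link annotation, which is presumably where the URL is
stored. Maybe we could add an option to pdfinfo to list the annotations
in a file and, for link annotations, show the URL.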
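
Until something like that exists, a rough workaround is to scan the file
for /URI entries directly. What follows is only a sketch in Python, not
part of poppler and not a real PDF parser. The function name find_uris is
made up for illustration, and the code assumes that the URI is written as
a literal string, that any stream holding it is plain Flate-compressed
(no predictor, no other filter), and that the file is not encrypted.

```python
import re
import zlib

# A URI action entry written as a literal string, e.g. /URI (http://example.com/)
# Backslash escapes inside the string are allowed; unescaped nested
# parentheses are not handled.
URI_RE = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")

# Raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def find_uris(pdf_bytes):
    """Return URI strings found in raw PDF bytes and in Flate streams.

    Hypothetical helper for illustration only: it pattern-matches bytes
    and does not follow the PDF object structure.
    """
    chunks = [pdf_bytes]
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes that sit
            # between the compressed data and "endstream".
            chunks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass  # not Flate data, e.g. a DCT image or an uncompressed stream

    found = []
    for chunk in chunks:
        for m in URI_RE.finditer(chunk):
            # Undo simple escapes such as \( and \); octal escapes are
            # not decoded.
            uri = re.sub(rb"\\(.)", rb"\1", m.group(1)).decode("latin-1")
            if uri not in found:
                found.append(uri)
    return found
```

To try it, read the file in binary mode and pass the bytes in, for
example find_uris(open("pdf-phish.pdf", "rb").read()). An empty result
does not prove there is no link, since the URI may be stored in a form
this sketch does not understand.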

> 
> Would someone be interested in trying to extract the URL from within
> this PDF for me? Is there a big difference between version 0.45 and
> the latest that may affect this? It would require compiling it here
> locally.
> 
> podofopdfinfo is able to identify the URL within the PDF, but I'm not
> sure if that's helpful.
> 
> Any ideas greatly appreciated.