[poppler] pdftotext and pdftohtml and extracting text
ajohnson at redneon.com
Fri Aug 25 21:58:46 UTC 2017
On 26/08/17 02:47, Alex wrote:
> I'm attempting to use pdftohtml and pdftotext on Fedora 25
> (poppler-utils-0.45.0-5.fc25.x86_64) and I'm unable to get it to
> extract the text from a particular PDF I need.
> I'm trying to use the poppler-utils to work with a spamassassin plugin
> to extract text from PDFs that may be malicious. Here is one such
> It appears to extract the header information (author, date, etc) but
> no text from within the PDF.
That's because there is no text in the PDF.
Here's the content stream:
/P <</MCID 0>> BDC BT
/F1 11.04 Tf
1 0 0 1 72.024 63.48 Tm
[( )] TJ
EMC /Span <</MCID 1>> BDC q
572.04 0 0 698.52 15.96 73.92 cm
/Image10 Do Q
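
A rough way to check a stream like the one above for visible text is to
collect the operands of the text-showing operators (Tj and TJ) and the
names painted with Do. The following is only a minimal Python sketch, not
anything poppler provides; the helper names shown_text and
painted_xobjects are made up for illustration. It assumes the stream is
already decompressed, that strings are literal "(...)" strings without
nested unescaped parentheses, octal escapes or hex strings, and it returns
raw character codes, which only equal readable text for simple font
encodings.

```python
import re

# Literal PDF string "(...)" with backslash escapes. Nested unescaped
# parentheses, octal escapes and hex strings "<...>" are NOT handled.
LITERAL = r"\((?:\\.|[^\\()])*\)"

# Either "(string) Tj" or "[(string) -120 (string)] TJ".
SHOW_TEXT = re.compile(
    r"(?P<single>" + LITERAL + r")\s*Tj\b"
    r"|\[(?P<array>(?:\s*(?:" + LITERAL + r"|[-+.\d]+))*\s*)\]\s*TJ\b"
)

ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f"}


def unescape(body):
    """Resolve single-character backslash escapes inside a literal string."""
    return re.sub(r"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(1)), body)


def shown_text(content_stream):
    """Concatenate every string operand passed to Tj or TJ."""
    pieces = []
    for match in SHOW_TEXT.finditer(content_stream):
        operand = match.group("single") or match.group("array") or ""
        for literal in re.findall(LITERAL, operand):
            pieces.append(unescape(literal[1:-1]))  # drop the outer parentheses
    return "".join(pieces)


def painted_xobjects(content_stream):
    """Names of XObjects (images or forms) painted with the Do operator."""
    return re.findall(r"/(\S+)\s+Do\b", content_stream)


# The content stream quoted in this message.
STREAM = """/P <</MCID 0>> BDC BT
/F1 11.04 Tf
1 0 0 1 72.024 63.48 Tm
[( )] TJ
EMC /Span <</MCID 1>> BDC q
572.04 0 0 698.52 15.96 73.92 cm
/Image10 Do Q
"""

print(repr(shown_text(STREAM)))   # text operands found in the stream
print(painted_xobjects(STREAM))   # XObject names painted by the stream
```

If shown_text(...).strip() comes back empty while painted_xobjects(...)
does not, the page is most likely a scanned or rendered image, and
pdftotext has nothing to extract from it.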
The only text is a single space character; the rest of the page is an
image. There is also a link annotation. Maybe we could add an option to
pdfinfo that lists the annotations in the file and, for link annotations,
shows the URL.
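
Until something like that exists, the URL of a link annotation can often
be pulled out of the raw file, because a URI action is stored as a
dictionary entry of the form "/URI (http://...)". The following is only a
minimal Python sketch under stated assumptions, not a poppler feature: the
helper name link_uris is made up, and it only finds URIs written as
literal strings in uncompressed parts of the file. Annotations stored
inside compressed object streams, hex-encoded strings, and encrypted files
will be missed.

```python
import re

# "/URI (...)" with backslash escapes inside the literal string.
# "/S /URI" (the action type) is not matched because no "(" follows it.
URI_ACTION = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")


def link_uris(pdf_bytes):
    """Return the URI strings found in uncompressed URI action dictionaries."""
    uris = []
    for match in URI_ACTION.finditer(pdf_bytes):
        # Drop the backslash from escaped characters such as \( and \).
        raw = re.sub(rb"\\(.)", rb"\1", match.group(1))
        uris.append(raw.decode("latin-1"))
    return uris


# Hand-written annotation dictionary used purely as an illustration.
SAMPLE = (
    b"<< /Type /Annot /Subtype /Link "
    b"/A << /S /URI /URI (http://example.com/a\\(1\\)) >> >>"
)

print(link_uris(SAMPLE))
```

For a file on disk, the bytes can be read with open(path, "rb").read() and
passed to link_uris, where path is whatever file is being inspected.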
> Would someone be interested in trying to extract the URL from within
> this PDF for me? Is there a big difference between version 0.45 and
> the latest that may affect this? It would require compiling it here
> podofopdfinfo is able to identify the URL within the PDF, but I'm not
> sure if that's helpful.
> Any ideas greatly appreciated.