[poppler] pdftotext and pdftohtml and extracting text

Sat Aug 26 01:50:22 UTC 2017

Hi,

On Fri, Aug 25, 2017 at 5:58 PM, Adrian Johnson <ajohnson at redneon.com> wrote:
> On 26/08/17 02:47, Alex wrote:
>> Hi,
>> I'm attempting to use pdftohtml and pdftotext on fedora25
>> (poppler-utils-0.45.0-5.fc25.x86_64) and I'm unable to get it to
>> extract the text from a particular PDF I need.
>>
>> I'm trying to use the poppler-utils to work with a spamassassin plugin
>> to extract text from PDFs that may be malicious. Here is one such
>> example:
>>
>> https://www.dropbox.com/s/b97pcvl1wm1oocq/pdf-phish.pdf?dl=0
>>
>> It appears to extract the header information (author, date, etc) but
>> no text from within the PDF.
>
> That's because there is no text in the PDF.
>
> Here's the content stream
...
> The only text is a single space character. The rest is an image. There
> is also a link annotation. Maybe we could add an option to pdfinfo to
> list the annotations in the file and for link annotations show the URL.

Yes, I suspected that was the problem, but the URL is specifically
what I was after.

That signature alone (a single space character plus an image) might be
very helpful for identifying these malicious PDFs. Does a single space
character accompanied by an image sound like a unique pattern? If so,
how can I turn it into something I can trigger on? Even an exit code,
or some other indication that the image is the only real content,
would be helpful.
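
To make the question concrete, here is a rough sketch of the kind of
check I have in mind. It is only an illustration under assumptions: it
shells out to "pdftotext FILE -" and "pdfimages -list FILE", assumes
the -list output starts with two header lines followed by one row per
image, and the function names and the exit code convention (1 for a
match, 0 otherwise) are made up for this example, not anything
poppler provides.

```python
import subprocess


def count_listed_images(listing: str) -> int:
    """Count image rows in `pdfimages -list` output.

    Assumption: the listing begins with two header lines (column names
    and a dashed rule), followed by one line per image.
    """
    lines = [ln for ln in listing.splitlines() if ln.strip()]
    return max(len(lines) - 2, 0)


def is_image_only(text: str, image_count: int) -> bool:
    """True when the extracted text is empty or whitespace only
    (spaces, newlines, form feeds) and at least one image is present."""
    return text.strip() == "" and image_count > 0


def check_pdf(path: str) -> bool:
    """Run the poppler utilities on `path` and apply the heuristic."""
    text = subprocess.run(
        ["pdftotext", path, "-"],  # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    ).stdout
    listing = subprocess.run(
        ["pdfimages", "-list", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return is_image_only(text, count_listed_images(listing))


def main(argv: list) -> int:
    """Hypothetical convention: 1 = looks image-only, 0 = has text,
    2 = usage error."""
    if len(argv) != 2:
        print("usage: check_pdf.py FILE.pdf")
        return 2
    return 1 if check_pdf(argv[1]) else 0


# To use as a script, add at the bottom:
#     import sys; sys.exit(main(sys.argv))
```

A check like this would also match ordinary scanned documents, which
are images with no text layer, so on its own it is probably a weak
signal rather than a unique one. It also does not look at the link
annotation at all.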

Thanks,
Alex