[poppler] pdftotext and pdftohtml and extracting text

Alex mysqlstudent at gmail.com
Sat Aug 26 01:50:22 UTC 2017


Hi,

On Fri, Aug 25, 2017 at 5:58 PM, Adrian Johnson <ajohnson at redneon.com> wrote:
> On 26/08/17 02:47, Alex wrote:
>> Hi,
>> I'm attempting to use pdftohtml and pdftotext on fedora25
>> (poppler-utils-0.45.0-5.fc25.x86_64) and I'm unable to get it to
>> extract the text from a particular PDF I need.
>>
>> I'm trying to use the poppler-utils to work with a spamassassin plugin
>> to extract text from PDFs that may be malicious. Here is one such
>> example:
>>
>> https://www.dropbox.com/s/b97pcvl1wm1oocq/pdf-phish.pdf?dl=0
>>
>> It appears to extract the header information (author, date, etc) but
>> no text from within the PDF.
>
> That's because there is no text in the PDF.
>
> Here's the content stream
...
> The only text is a single space character. The rest is an image. There
> is also a link annotation. Maybe we could add an option to pdfinfo to
> list the annotations in the file and for link annotations show the URL.

Yes, I suspected that was the problem, but the URL in the link
annotation is what I was specifically asking about.

That signature alone might be very helpful for identifying these
malicious PDFs. Does a single space character combined with an image
sound like a distinctive pattern, and if so, how can I turn it into
something I can trigger on? Even an exit code, or some other
indication that the space and the image are the only real content,
would be helpful.
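
To make the question concrete, here is a rough sketch of the kind of
check I have in mind. It is only an illustration under assumptions: it
relies on "pdftotext FILE -" writing the extracted text to stdout and on
"pdfimages -list FILE" printing one row per image that begins with the
page number. The function names (count_images, is_image_only, check_pdf)
are made up for this example and are not part of poppler.

```python
import subprocess


def count_images(listing: str) -> int:
    """Count image rows in `pdfimages -list` output.

    Assumption: every image row starts with a numeric page number, while
    the header line and the dashed separator line do not.
    """
    count = 0
    for line in listing.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            count += 1
    return count


def is_image_only(text: str, listing: str) -> bool:
    """True when the extracted text is only whitespace (spaces, newlines,
    form feeds) and the listing reports at least one image."""
    return text.strip() == "" and count_images(listing) > 0


def run(cmd: list) -> str:
    """Run a command and return its stdout; raises if the tool fails."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def check_pdf(path: str) -> bool:
    """Apply the heuristic to a PDF on disk using the poppler utilities."""
    text = run(["pdftotext", path, "-"])
    listing = run(["pdfimages", "-list", path])
    return is_image_only(text, listing)
```

A wrapper script could then end with something like
sys.exit(1 if check_pdf(path) else 0) to give the plugin an exit code to
trigger on. This says nothing about the link annotation itself, and a
scanned but legitimate document would match too, so it could only be one
signal among several.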

Thanks,
Alex

