[poppler] pdftotext and pdftohtml and extracting text
lrosenth at adobe.com
Sun Aug 27 15:38:36 UTC 2017
Why would an image-only PDF (or an image plus a single space) be a bad thing?
Checking the links in a PDF – regardless of the content – certainly seems like a reasonable thing to do, however.
On 8/25/17, 9:50 PM, "poppler on behalf of Alex" <poppler-bounces at lists.freedesktop.org on behalf of mysqlstudent at gmail.com> wrote:
On Fri, Aug 25, 2017 at 5:58 PM, Adrian Johnson <ajohnson at redneon.com> wrote:
> On 26/08/17 02:47, Alex wrote:
>> I'm attempting to use pdftohtml and pdftotext on Fedora 25
>> (poppler-utils-0.45.0-5.fc25.x86_64) and I'm unable to get them to
>> extract the text from a particular PDF I need.
>> I'm trying to use the poppler-utils to work with a spamassassin plugin
>> to extract text from PDFs that may be malicious. Here is one such
>> It appears to extract the header information (author, date, etc) but
>> no text from within the PDF.
> That's because there is no text in the PDF.
> Here's the content stream
> The only text is a single space character. The rest is an image. There
> is also a link annotation. Maybe we could add an option to pdfinfo to
> list the annotations in the file and for link annotations show the URL.
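Until such an option exists, a rough stopgap is to scan the raw file bytes for URI action entries. The sketch below is only a heuristic and not how poppler would do it: it assumes the link is stored as a literal string, `/URI (...)`, in an uncompressed object, so it will miss URIs inside object streams, hex strings, or encrypted files. The function name `find_link_uris` is made up for illustration.

```python
import re
from typing import List

# Matches "/URI (...)" where the value is a PDF literal string.
# Escaped characters such as "\)" are allowed inside the parentheses.
URI_PATTERN = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")


def find_link_uris(pdf_bytes: bytes) -> List[str]:
    """Return URI strings found in uncompressed parts of a PDF (heuristic)."""
    return [match.decode("latin-1") for match in URI_PATTERN.findall(pdf_bytes)]


# Usage (the path is a placeholder):
# with open("suspect.pdf", "rb") as handle:
#     for uri in find_link_uris(handle.read()):
#         print(uri)
```

Note that the action type `/S /URI` is skipped on purpose, because the pattern only matches when `/URI` is followed by an opening parenthesis.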
Yes, I thought that might have been the problem, but the URL is what I
was specifically talking about.
That signature alone might be very helpful for identifying these
malicious PDFs. Does the presence of a single space character with an
image sound like a unique pattern, and if so, how can I encapsulate
that into something I can trigger on? In other words, even something
like an exit code or other indication that it's the only real content
would be helpful.
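One way to approximate such a trigger with the existing tools is a small wrapper script. The sketch below assumes `pdftotext` and `pdfimages` are on the PATH, that `pdftotext FILE -` writes the text to stdout, and that `pdfimages -list` prints two header lines followed by one row per image. The helper names are hypothetical, and the "whitespace-only text plus at least one image" rule is just the pattern described above, not a tested spam signature.

```python
import subprocess


def has_only_whitespace(text: str) -> bool:
    """True when the extracted text is empty or only whitespace (spaces, newlines, form feeds)."""
    return text.strip() == ""


def count_listed_images(listing: str) -> int:
    """Count image rows in `pdfimages -list` output, skipping the two header lines."""
    rows = [line for line in listing.splitlines() if line.strip()]
    return max(len(rows) - 2, 0)


def run_tool(args):
    """Run a poppler-utils command and return its stdout as text."""
    result = subprocess.run(args, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def looks_image_only(pdf_path: str) -> bool:
    """Heuristic: no real text, but at least one image."""
    text = run_tool(["pdftotext", pdf_path, "-"])
    listing = run_tool(["pdfimages", "-list", pdf_path])
    return has_only_whitespace(text) and count_listed_images(listing) >= 1
```

A calling script could turn the result into an exit code, for example `sys.exit(0 if looks_image_only(path) else 1)`, so that the plugin can test it directly.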