[poppler] pdftotext and pdftohtml and extracting text
Leonard Rosenthol
lrosenth at adobe.com
Mon Aug 28 14:27:32 UTC 2017
I’ve seen a *lot* of malicious PDFs, and the one you posted is the first one that I have ever seen use that image technique. On the other hand, there are billions of image-only PDFs in existence today from all the paper->PDF scanning…
Same with counting the number of URLs – how many thousands or millions of PDFs would you like to see from the public web that only have a single URL?
It’s your software – design and implement as you see fit – but I hope that you would choose to use a more methodical and less “guesswork” technique…
Leonard
On 8/27/17, 1:36 PM, "Alex" <mysqlstudent at gmail.com> wrote:
Hi Leonard,
On Sun, Aug 27, 2017 at 11:38 AM, Leonard Rosenthol <lrosenth at adobe.com> wrote:
> Why would an image only PDF (or an Image + a space) be a bad thing?
That's a good point. I guess it wouldn't in and of itself, but
virtually every malicious PDF is created in this way.
> Checking the links in a PDF – regardless of the content – certainly seems like a reasonable thing to do, however.
Malicious PDFs also typically only have one URL.
There's no reason not to check every URL, but I'd also like to find a
unique pattern, if possible, to identify possible zero-day or unique
URLs as part of a spear-phishing campaign and give us a little bit of
an advantage.
Alex
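
For reference, the two signals discussed in this thread (no extractable text, and exactly one URL) could be computed roughly as in the sketch below. It assumes the text and the URL list have already been extracted by some other step; the function name `heuristic_signals` and its return keys are hypothetical and not part of poppler. As noted above, both signals also match huge numbers of benign scanned PDFs, so they are weak hints rather than a detection method.

```python
from typing import Dict, List


def heuristic_signals(text: str, urls: List[str]) -> Dict[str, bool]:
    """Compute the two weak signals discussed in the thread.

    `text` is whatever text was extracted from the PDF and `urls` is the
    list of link targets found in it. Both are assumed to be produced by
    a separate extraction step that is not shown here.
    """
    # str.strip() also removes form feeds, which text extractors
    # commonly emit as page separators.
    no_text = text.strip() == ""
    # Deduplicate so a link repeated on every page counts once.
    single_url = len(set(urls)) == 1
    return {"no_text": no_text, "single_url": single_url}


if __name__ == "__main__":
    # Hypothetical inputs: a blank page with a form feed, one link target.
    print(heuristic_signals(" \n\f", ["http://example.test/a"]))
```

A scanned invoice with one link to the vendor's site would set both flags, which is why each URL would still need to be checked on its own merits.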