[poppler] pdftotext and pdftohtml and extracting text

Leonard Rosenthol lrosenth at adobe.com
Mon Aug 28 14:27:32 UTC 2017

I’ve seen a *lot* of malicious PDFs, and the one you posted is the first one that I have ever seen use that image technique. On the other hand, there are billions of image-only PDFs in existence today from all the paper->PDF scanning…

The same goes for counting the number of URLs: how many thousands or millions of PDFs from the public web that have only a single URL would you like to see?

It’s your software, so design and implement it as you see fit, but I hope that you would choose a more methodical technique that relies less on “guesswork”…


On 8/27/17, 1:36 PM, "Alex" <mysqlstudent at gmail.com> wrote:

    Hi Leonard,
    On Sun, Aug 27, 2017 at 11:38 AM, Leonard Rosenthol <lrosenth at adobe.com> wrote:
    > Why would an image only PDF (or an Image + a space) be a bad thing?
    That's a good point. I guess it wouldn't in and of itself, but
    virtually every malicious PDF is created in this way.
    > Checking the links in a PDF – regardless of the content – certainly seems like a reasonable thing to do, however.
    Malicious PDFs also typically only have one URL.
    There's no reason not to check every URL, but I'd also like to find a
    unique pattern, if possible, to identify possible zero-day or unique
    URLs as part of a spear-phishing campaign and give us a little bit of
    an advantage.
