[poppler] pdftotext and pdftohtml and extracting text

Tue Aug 29 16:28:07 UTC 2017

Hi,

On Mon, Aug 28, 2017 at 10:27 AM, Leonard Rosenthol <lrosenth at adobe.com> wrote:
> I’ve seen a *lot* of malicious PDFs, and the one you posted is the first one that I have ever seen use that image technique. On the other hand, there are billions of image-only PDFs in existence today from all the paper->PDF scanning…
>
> Same with counting number of URLs – how many thousands or millions of PDFs would you like to see from the public web that only have a single URL?
>
> It’s your software – design and implement as you see fit – but I hope that you would choose to use a more methodical and less “guesswork” technique…

Thanks very much. I really don't know. Do you have any suggestions on
how to uniquely tag the malicious PDFs you've seen?

pdftotext and similar utilities do not output the URLs, which makes
this more difficult.
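
As a rough illustration of getting at the URLs that the extracted text
leaves out, here is a minimal sketch that scans the raw PDF bytes for
link actions of the form "/URI (...)". It assumes the link annotations
are stored uncompressed and as literal strings. PDFs that keep them in
compressed object streams, or that use hex strings, would need a real
PDF parser instead. The names URI_RE and extract_uris are made up for
this example.

```python
import re
from typing import List

# Matches a link action such as: /URI (http://example.com/path)
# Assumption: the URI is an uncompressed literal string. Backslash-escaped
# characters inside the string (for example "\)" ) are allowed.
URI_RE = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")


def extract_uris(pdf_bytes: bytes) -> List[str]:
    """Return the /URI targets found in the raw bytes of a PDF."""
    uris = []
    for match in URI_RE.finditer(pdf_bytes):
        raw = match.group(1)
        # Undo the PDF escapes for parentheses and backslashes.
        raw = re.sub(rb"\\([()\\])", rb"\1", raw)
        uris.append(raw.decode("latin-1"))
    return uris
```

Reading a file with open(path, "rb").read() and passing the bytes to
extract_uris would give a list whose length could feed a "how many URLs
does this PDF have" check, and every entry could still be looked up
individually.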

Thanks,
Alex

>
> Leonard
>
> On 8/27/17, 1:36 PM, "Alex" <mysqlstudent at gmail.com> wrote:
>
>     Hi Leonard,
>
>     On Sun, Aug 27, 2017 at 11:38 AM, Leonard Rosenthol <lrosenth at adobe.com> wrote:
>     > Why would an image only PDF (or an Image + a space) be a bad thing?
>
>     That's a good point. I guess it wouldn't in and of itself, but
>     virtually every malicious PDF is created in this way.
>
>     > Checking the links in a PDF – regardless of the content – certainly seems like a reasonable thing to do, however.
>
>     Malicious PDFs also typically only have one URL.
>
>     There's no reason not to check every URL, but I'd also like to find a
>     unique pattern, if possible, to identify possible zero-day or unique
>     URLs as part of a spear-phishing campaign and give us a little bit of
>     an advantage.
>
>     Alex