[poppler] pdftotext and pdftohtml and extracting text

Leonard Rosenthol lrosenth at adobe.com
Sun Aug 27 15:38:36 UTC 2017


Why would an image-only PDF (or an image plus a space) be a bad thing?

Checking the links in a PDF – regardless of the content – certainly seems like a reasonable thing to do, however.
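
As a rough illustration of the link check, here is a minimal Python sketch that scans the raw bytes of a PDF for URI actions. It is not a poppler feature: the function name find_link_uris is made up for this example, and the approach is a crude heuristic that only sees /URI entries stored as uncompressed literal strings. Links held in compressed object streams, or in hex strings, would be missed, so a real check should use a proper PDF parser.

```python
import re

# A URI action in a link annotation looks like: /S /URI /URI (http://example.com)
# This pattern captures the literal string after /URI, allowing backslash
# escapes but not nested parentheses (a simplification for this sketch).
URI_RE = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")


def find_link_uris(pdf_bytes: bytes) -> list[str]:
    """Return the URI strings found in uncompressed parts of a PDF (hypothetical helper)."""
    return [m.group(1).decode("latin-1") for m in URI_RE.finditer(pdf_bytes)]


def find_link_uris_in_file(path: str) -> list[str]:
    """Read a PDF from disk and scan it with find_link_uris."""
    with open(path, "rb") as fh:
        return find_link_uris(fh.read())
```

The returned URLs could then be handed to whatever reputation or blocklist check the mail filter already uses.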

Leonard

On 8/25/17, 9:50 PM, "poppler on behalf of Alex" <poppler-bounces at lists.freedesktop.org on behalf of mysqlstudent at gmail.com> wrote:

    Hi,
    
    On Fri, Aug 25, 2017 at 5:58 PM, Adrian Johnson <ajohnson at redneon.com> wrote:
    > On 26/08/17 02:47, Alex wrote:
    >> Hi,
    >> I'm attempting to use pdftohtml and pdftotext on Fedora 25
    >> (poppler-utils-0.45.0-5.fc25.x86_64) and I'm unable to get it to
    >> extract the text from a particular PDF I need.
    >>
    >> I'm trying to use the poppler-utils to work with a spamassassin plugin
    >> to extract text from PDFs that may be malicious. Here is one such
    >> example:
    >>
    >> https://www.dropbox.com/s/b97pcvl1wm1oocq/pdf-phish.pdf?dl=0
    >>
    >> It appears to extract the header information (author, date, etc) but
    >> no text from within the PDF.
    >
    > That's because there is no text in the PDF.
    >
    > Here's the content stream
    ...
    > The only text is a single space character. The rest is an image. There
    > is also a link annotation. Maybe we could add an option to pdfinfo to
    > list the annotations in the file and for link annotations show the URL.
    
    Yes, I thought that might have been the problem, but the URL is what I
    was specifically talking about.
    
    That signature alone might be very helpful for identifying these
    malicious PDFs. Does the presence of a single space character with an
    image sound like a unique pattern, and if so, how can I encapsulate
    that into something I can trigger on? In other words, even something
    like an exit code or other indication that it's the only real content
    would be helpful.
    
    Thanks,
    Alex
    _______________________________________________
    poppler mailing list
    poppler at lists.freedesktop.org
    https://lists.freedesktop.org/mailman/listinfo/poppler
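
The pattern Alex describes (no real text, at least one image) can be approximated outside poppler by combining the output of pdftotext and pdfimages -list. The following is a sketch under assumptions: the helper names are hypothetical, it assumes both tools are on PATH, and it assumes every image row printed by pdfimages -list starts with a page number, as in current poppler-utils output.

```python
import subprocess


def count_images(listing: str) -> int:
    """Count image rows in `pdfimages -list` output.

    Header and separator lines do not start with a number, so only
    rows whose first field is a page number are counted.
    """
    count = 0
    for line in listing.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            count += 1
    return count


def looks_image_only(text: str, image_count: int) -> bool:
    """True if the extracted text is only whitespace and there is an image."""
    return image_count > 0 and text.strip() == ""


def run_tool(args: list[str]) -> str:
    """Run a poppler utility and return its standard output as text."""
    result = subprocess.run(args, capture_output=True, encoding="utf-8",
                            errors="replace", check=True)
    return result.stdout


def check_pdf(path: str) -> bool:
    """Apply the heuristic to a file (assumes pdftotext and pdfimages are installed)."""
    text = run_tool(["pdftotext", path, "-"])        # "-" writes text to stdout
    listing = run_tool(["pdfimages", "-list", path])
    return looks_image_only(text, count_images(listing))
```

A wrapper script could turn the result of check_pdf into an exit code for the spamassassin plugin. As Leonard points out, ordinary scanned documents match this pattern too, so it is better treated as one weak signal alongside the link check than as proof of a malicious file.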
    


