[poppler] pdftotext raw

Leonard Rosenthol lrosenth at adobe.com
Fri May 17 11:00:22 UTC 2019


> what are the downsides/upsides of following the content stream order?
>
Depends on whether you know something about the PDF producers that you are getting content from.

If all the PDFs that you are trying to process come from modern, well-written products, then you are probably fine.  However, poorly made PDF creators can write text into the content stream in an order that does not match the reading order, and following that order will give you garbage from your extraction process.

Leonard
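
A minimal sketch for checking this against your own PDFs, not an official recipe. It assumes the pdftotext binary is on your PATH. The helper names (build_command, extract, join_hyphenated) and the file name sample.pdf are made up for illustration. The hyphen joining is a naive heuristic that will also merge genuine compounds that happen to break at a line end.

```python
import re
import subprocess


def build_command(pdf_path, raw=False):
    """Return the pdftotext argument list; '-' sends the text to stdout."""
    cmd = ["pdftotext"]
    if raw:
        cmd.append("-raw")  # keep the text in content stream order
    cmd += [pdf_path, "-"]
    return cmd


def extract(pdf_path, raw=False):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_command(pdf_path, raw),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def join_hyphenated(text):
    """Join words split by a hyphen at a line end (naive heuristic)."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)
```

Calling extract("sample.pdf") and join_hyphenated(extract("sample.pdf", raw=True)) on a few files from each producer you expect to handle, and reading the two results side by side, is one way to see whether content stream order holds up for your inputs.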


On 5/17/19, 3:14 AM, "poppler on behalf of Massimo Redaelli" <poppler-bounces at lists.freedesktop.org on behalf of mredaelli at lari.digital> wrote:

    On Thu, May 16, 2019, 8:08 PM Albert Astals Cid <aacid at kde.org> wrote:
    
    > > Are there reasons not to use it?
    >
    > The man page explains the reason not to use it.
    
    
    Yes, I should have asked: what are the downsides/upsides of following
    the content stream order?
    
    But I guess I'm mainly asking:
    
    > Is the option going to be deprecated, or can we count on it being
    > there for the foreseeable future?
    
    
    M.
    
    On Thu, May 16, 2019 at 6:08 PM Albert Astals Cid <aacid at kde.org> wrote:
    >
    > On Thursday, 16 May 2019, at 17:00:27 CEST, Massimo Redaelli wrote:
    > > Hey all.
    > >
    > > Question regarding pdftotext.
    > >
    > > The help says that `raw` is not recommended anymore, but for all PDFs
    > > I tried it actually gives better results than the default mode, by
    > > which I mean that paragraphs are not interrupted by extraneous text,
    > > like headers or boxes.
    > > (I do have to handle hyphenated words, but that looks easy.)
    > >
    > > Is the option going to be deprecated, or can we count on it being
    > > there for the foreseeable future?
    > > Are there reasons not to use it?
    >
    > The man page explains the reason not to use it.
    >
    > Cheers,
    >   Albert
    >
    > >
    > > Thanks!
    > >
    > >
    >
    >
    >
    >
    > _______________________________________________
    > poppler mailing list
    > poppler at lists.freedesktop.org
    > https://lists.freedesktop.org/mailman/listinfo/poppler
    
    
    
    -- 
    M.


