[poppler] Page::display function performance

Reece Dunn msclrhd at googlemail.com
Mon Mar 9 06:38:40 PDT 2009


2009/3/9 Ilya Gorenbein <igorenbein at finjan.com>:
>
> I need to extract the text out of the document/page.
>
> I tried a void Page::display(OutputDev *out, double hDPI, double vDPI,
>                   int rotate, GBool useMediaBox, GBool crop,
>                    GBool printing, Catalog *catalog,
>                    GBool (*abortCheckCbk)(void *data),
>                    void *abortCheckCbkData,
>                    GBool (*annotDisplayDecideCbk)(Annot *annot, void
> *user_data),
>                    void *annotDisplayDecideCbkData) ;
>
> function (poppler version 0.10.4). When I measured performance of this
> function, I’ve got ~1.5 Mb/sec on dual core 2.33GHz CPU, 2 Gb of RAM, with
> kernel 2.6.24-17, Debian lenny distro.
>
> Please advise me on how the performance of this function could be improved.
> Is there another (cheaper) way to extract text out of the document/page?

Have you run the code through a profiler to see where the hotspots are,
i.e. which functions most of the time is being spent in?
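
If a real profiler (gprof, or valgrind's callgrind tool) is not at hand,
coarse manual instrumentation can at least show which phase dominates. The
following is a minimal sketch, not anything from Poppler: ScopedTimer and
reportTimings are hypothetical helper names, and the labels are whatever
you choose to pass in.

```cpp
#include <chrono>
#include <iostream>
#include <map>
#include <string>

// Hypothetical helper (not part of Poppler): adds the wall-clock time
// between construction and destruction to a running total for its label.
class ScopedTimer {
public:
    using Clock = std::chrono::steady_clock;

    explicit ScopedTimer(const std::string &label)
        : label_(label), start_(Clock::now()) {}

    ~ScopedTimer() {
        std::chrono::duration<double> elapsed = Clock::now() - start_;
        totals()[label_] += elapsed.count();
    }

    // Seconds accumulated so far, keyed by label.
    static std::map<std::string, double> &totals() {
        static std::map<std::string, double> t;
        return t;
    }

private:
    std::string label_;
    Clock::time_point start_;
};

// Writes one "label: seconds" line per label to the given stream.
void reportTimings(std::ostream &os) {
    for (const auto &entry : ScopedTimer::totals())
        os << entry.first << ": " << entry.second << " s\n";
}
```

To use it, wrap each phase in its own block, for example
{ ScopedTimer t("display"); page->display(...); }, and call
reportTimings(std::cerr) once at exit. This only tells you which phase is
slow, not which function inside it, so it complements a profiler rather
than replacing one.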

Note that PDF text extraction is complex: the page data is spread over
several streams that can each be compressed in different formats, and the
text needs to be assembled and potentially rearranged.

Also, what options are you using? Things like keeping the layout or
collapsing the text into blocks could also affect performance.

Which libraries are you using as external libraries, and for which are
you using the Poppler/xpdf versions (zlib, jpeg, JPEG 2000, ...)?

I haven't done any performance testing of the Poppler/xpdf code, so
you'll need to do some experimentation. It will also vary with
different PDF documents and whether you are using native libraries or
Poppler variants. Compiler options and system library builds will also
affect performance (e.g. build options for glibc and the targeted
kernel version).
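
For that experimentation it helps to measure each document the same way.
A minimal sketch, assuming you wrap your own extraction call in a
callable: bestOfSeconds and throughputMBps are hypothetical names, and
1 MB is taken as 10^6 bytes here.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>

// Runs `work` `runs` times and returns the fastest wall-clock time in
// seconds. Taking the best run reduces noise from caching and scheduling.
// With runs <= 0 nothing is executed and the result is +infinity.
double bestOfSeconds(const std::function<void()> &work, int runs) {
    using Clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < runs; ++i) {
        Clock::time_point start = Clock::now();
        work();
        std::chrono::duration<double> elapsed = Clock::now() - start;
        if (elapsed.count() < best)
            best = elapsed.count();
    }
    return best;
}

// Converts bytes processed and elapsed seconds to MB/s (1 MB = 10^6 bytes).
// Returns 0 for a non-positive duration rather than dividing by zero.
double throughputMBps(std::size_t bytes, double seconds) {
    if (seconds <= 0.0)
        return 0.0;
    return static_cast<double>(bytes) / 1e6 / seconds;
}
```

Run it over a handful of different PDFs (text-heavy, image-heavy,
scanned) with the file size as the byte count, and repeat after changing
one build or library option at a time, so that any difference can be
attributed to that change.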

What kind of performance are you expecting? Anyway, start with the
profiler and work from there.

- Reece

