[poppler] Speed improvements - chapter eleven
Krzysztof Kowalczyk
kkowalczyk at gmail.com
Wed Sep 6 00:48:16 PDT 2006
Hello,
This is another post in the series describing my obsession with
improving poppler performance.
https://bugs.freedesktop.org/show_bug.cgi?id=8143
Using the virtual Stream::getChar() is very expensive and results in
Lexer::getChar() showing up in the profile as the second most
expensive function during PDF loading (and that excludes the cost of
its children).
The reason is that getting the next character from a stream costs at
least one virtual function call plus the stream's eof logic, and in
most cases more than that (e.g. often a stream is an embedded stream
on top of a flate stream on top of a file stream, which brings the
cost to 3 virtual function calls plus eof logic for each stream).
getChar() is one of the most frequently called functions, so an
operation that should be essentially free ends up being one of the
most expensive.
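Schematically, the layering looks something like this (a simplified
sketch, not poppler's actual classes):

  // Simplified illustration: each layer in the chain adds one
  // virtual dispatch per character read.
  class Stream {
  public:
      virtual ~Stream() {}
      virtual int getChar() = 0;
  };

  class FileStream: public Stream {
  public:
      virtual int getChar() {
          // eof logic + return one byte from the file's buffer
          return -1; // EOF placeholder for the sketch
      }
  };

  // A filter stream (flate, embedded, ...) wraps another stream and
  // forwards to it, so the dispatches add up.
  class FilterStream: public Stream {
  public:
      FilterStream(Stream *strA): str(strA) {}
      virtual int getChar() {
          // own eof logic, then ask the stream below for data
          return str->getChar();
      }
  private:
      Stream *str;
  };

  // embedded stream on flate stream on file stream:
  //   Stream *s = new FilterStream(new FilterStream(new FileStream()));
  // every s->getChar() costs three virtual calls plus eof checks.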
I attempt to fix this by adding a way to get direct access to a
Stream's underlying buffer. That way a client (e.g. a Lexer) can
request a buffer, and the getChar() logic becomes a very fast "if the
buffer isn't empty, get a char from the buffer, otherwise ask for
another buffer".
Not every stream supports this, so clients also need fallback logic
that uses the current, slow Stream::getChar().
The additional interface to Stream is as follows:
* GBool hasGetBuf() - returns gTrue if a given stream supports the
getBuf()/ungetBuf() calls.
* GBool getBuf(char **bufOut, int *bufSizeOut, int maxSize) - the
stream returns a direct pointer to its underlying buffer in <bufOut>
and its size in <bufSizeOut>. It returns gFalse if it has reached
EOF; if it returns gTrue, *bufSizeOut must be greater than zero. Note
that this is not a copy of the data. The stream, not the client,
controls the size of the buffer; the client can only limit it via
<maxSize> (which defaults to Stream::NO_SIZE_LIMIT, meaning the
client doesn't care). Access is sequential, i.e. subsequent calls to
getBuf() return the next portion of the stream's data.
* ungetBuf(int sizeToGoBack) - a client might not need all the data
it got from the stream, which would leave the stream in the wrong
position for other clients. That's why ungetBuf() is needed: a client
uses it to "give back" unprocessed data to the stream, so that when
another client calls getBuf() or getChar(), the right data is
returned. It's the equivalent of seeking back in a file.
It might not be the prettiest interface, but it allows going fast.
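In sketch form, the additions look like this (GBool/gTrue/gFalse are
poppler's types from goo/gtypes.h; ungetBuf()'s return type and the
exact NO_SIZE_LIMIT value are illustrative, not necessarily what the
patch uses):

  class Stream {
  public:
      enum { NO_SIZE_LIMIT = -1 }; // illustrative "don't care" value

      // gFalse by default; streams that implement getBuf()/ungetBuf()
      // override this to return gTrue.
      virtual GBool hasGetBuf() { return gFalse; }

      // On success, points *bufOut into the stream's own buffer (no
      // copy) with *bufSizeOut > 0 bytes available; returns gFalse
      // at EOF. Each call returns the next portion of the data.
      virtual GBool getBuf(char **bufOut, int *bufSizeOut,
                           int maxSize = NO_SIZE_LIMIT)
          { return gFalse; }

      // Gives the last sizeToGoBack bytes obtained via getBuf() back
      // to the stream, like seeking back in a file. Assumed to
      // return void here.
      virtual void ungetBuf(int sizeToGoBack) {}

      // ... the existing Stream interface (getChar() etc.) ...
  };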
I've also converted Lexer to use this interface when available. It
gives me around a 5% speedup on loading a PDF (the exact number
really depends on the type of PDF and the types of streams inside
it).
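The fast path in Lexer looks roughly like this (the member names are
made up for the sketch; the real code differs):

  // Rough sketch; EOF comes from <cstdio>.
  class Lexer {
  public:
      int getChar() {
          if (strHasGetBuf) {
              if (bufSize == 0 &&
                  !str->getBuf(&buf, &bufSize, Stream::NO_SIZE_LIMIT))
                  return EOF;       // stream exhausted
              bufSize--;
              return *buf++ & 0xff; // no virtual call per character
          }
          return str->getChar();    // fallback: the old slow path
      }
      // When the Lexer is done, unconsumed bytes go back to the
      // stream:  if (bufSize > 0) str->ungetBuf(bufSize);
  private:
      Stream *str;
      char   *buf;          // current window obtained via getBuf()
      int     bufSize;      // bytes left in that window
      GBool   strHasGetBuf; // cached result of str->hasGetBuf()
  };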
Frankly, I was disappointed that it's only ~5%; I was expecting much
more. It turns out that the culprit is the current implementation of
the flate stream, which is frequently used to compress streams inside
PDFs. It decompresses data in very small chunks (e.g. 8 bytes on
average per getBuf() call in my test), so we don't save nearly as
much as we would if we were getting, say, 256 bytes at a time. I'm
working on improving that as well, but this change lays the necessary
foundation.
Other filter streams (e.g. CCITT/ASCII/DCT) could also be improved
by taking advantage of getBuf() on their underlying stream and
providing the getBuf() interface to their own clients. So far I've
only implemented getBuf() for FileStream, EmbedStream and (not so
efficiently) FlateStream.
This turned out to be tricky to implement, so how do I know it
works? I ran my stress test, which runs a test program over a random
collection of 1500+ PDFs: it renders every page, records timing
information and all errors reported by poppler via the error()
function, and shows a visual preview so I can check that pages are
rendered correctly. That way I can tell if it crashed on any PDF and
compare the results with previous runs. The app survived the stress
test.
-kjk
Sumatra - PDF viewer for Windows
http://blog.kowalczyk.info/software/sumatrapdf/