[Poppler-bugs] [Bug 8143] New: ~4% speedup on loading PDF through Lexer/Stream improvements

Wed Sep 6 00:45:20 PDT 2006

Please do not reply to this email: if you want to comment on the bug, go to
the URL shown below and enter your comments there.

https://bugs.freedesktop.org/show_bug.cgi?id=8143          

           Summary: ~4% speedup on loading PDF through Lexer/Stream
                    improvements
           Product: poppler
           Version: unspecified
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: general
        AssignedTo: poppler-bugs at lists.freedesktop.org
        ReportedBy: kkowalczyk at gmail.com

Using the virtual Stream::getChar() is very expensive and results in
Lexer::getChar() showing up in the profile as the second most expensive function
during PDF loading (and that excludes the cost of its children).

The reason is that, in order to get the next character from the stream, poppler
needs at least one virtual function call plus the stream's EOF logic, and in
most cases it's more than that (e.g. a stream is often an embedded stream on top
of a flate stream on top of a file stream, which brings the cost to 3 virtual
function calls plus EOF logic for each stream). Since getChar() is one of the
most frequently called functions, the result is that an operation that should be
essentially free ends up being very expensive.

The attached patch fixes this by adding a way to get direct access to a Stream's
underlying buffer. That way a client (e.g. a Lexer) can request a buffer, and
the getChar() logic becomes very fast: "if the buffer is not empty, get a char
from the buffer, otherwise ask for another buffer".

Not every stream supports that, so clients must also have fallback logic that
uses the current slow way of calling Stream::getChar().

The additional interface to Stream is as follows:
* GBool hasGetBuf() - returns gTrue if a given stream supports
getBuf()/ungetBuf() calls.
* GBool getBuf(char **bufOut, int *bufSizeOut, int maxSize) - the stream returns
a direct pointer to its underlying buffer in <bufOut> and the number of bytes
available there in <bufSizeOut>. It returns gFalse if EOF has been reached. If
it returns gTrue, *bufSizeOut must be greater than zero. Note that this is not a
copy of the data. The Stream, not the client, controls the size of the buffer;
the client can only limit the size via <maxSize> (defaults to
Stream::NO_SIZE_LIMIT, which means that the client doesn't care). Access is
sequential, i.e. subsequent calls to getBuf() return the next portion of the
stream's data.
* ungetBuf(int sizeToGoBack) - a client might not need all the data it got from
the Stream, which might ruin the stream for other clients. That's why ungetBuf()
is needed. The client uses it to "give back" unprocessed data to the Stream so
that when another client calls getBuf() or getChar(), the right data is
returned. It's the equivalent of seeking back in a file.
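
To make the contract above concrete, here is a minimal sketch of the three
calls on a toy in-memory stream. It is not the code from the attached patch:
the GBool typedef, the value of NO_SIZE_LIMIT, the cut-down Stream base class
and the MemStream class are all stand-ins assumed for illustration.

```cpp
#include <cassert>
#include <string>

// Assumed stand-ins for poppler's GBool / gTrue / gFalse.
typedef bool GBool;
const GBool gTrue = true;
const GBool gFalse = false;

// Cut-down, hypothetical Stream base class with the additional interface.
class Stream {
public:
  enum { NO_SIZE_LIMIT = -1 };  // assumed sentinel value
  virtual ~Stream() {}
  virtual int getChar() = 0;    // slow path: one virtual call per character
  // Default: the stream does not support direct buffer access.
  virtual GBool hasGetBuf() { return gFalse; }
  virtual GBool getBuf(char ** /*bufOut*/, int * /*bufSizeOut*/,
                       int /*maxSize*/ = NO_SIZE_LIMIT) { return gFalse; }
  virtual void ungetBuf(int /*sizeToGoBack*/) {}
};

// Hypothetical in-memory stream that supports getBuf()/ungetBuf().
class MemStream : public Stream {
public:
  explicit MemStream(const std::string &data) : data_(data), pos_(0) {}

  int getChar() override {
    if (pos_ >= (int)data_.size()) return -1;  // EOF
    return (unsigned char)data_[pos_++];
  }

  GBool hasGetBuf() override { return gTrue; }

  GBool getBuf(char **bufOut, int *bufSizeOut,
               int maxSize = NO_SIZE_LIMIT) override {
    assert(maxSize == NO_SIZE_LIMIT || maxSize > 0);
    int avail = (int)data_.size() - pos_;
    if (avail <= 0) return gFalse;             // EOF reached
    if (maxSize != NO_SIZE_LIMIT && avail > maxSize) avail = maxSize;
    *bufOut = &data_[pos_];                    // direct pointer, not a copy
    *bufSizeOut = avail;                       // always > 0 when returning gTrue
    pos_ += avail;                             // sequential access
    return gTrue;
  }

  void ungetBuf(int sizeToGoBack) override {
    // Equivalent of seeking back: the client returns unprocessed bytes.
    assert(sizeToGoBack >= 0 && sizeToGoBack <= pos_);
    pos_ -= sizeToGoBack;
  }

private:
  std::string data_;
  int pos_;
};
```

In this sketch a client that asked for up to 5 bytes but used only 3 would call
ungetBuf(2), after which getChar() or getBuf() on the same stream continues
from the first unused byte.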

It might not be the prettiest interface, but it allows going fast.

I've also converted Lexer to use this interface if available. It gives me around
a 4% speedup on loading a PDF (the exact figure depends on the type of PDF and
the types of streams inside that PDF).
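
The client side of this can be sketched roughly as follows. This is not the
actual Lexer change from the patch: the simplified Stream struct (bool instead
of GBool, no maxSize argument) and the SlowStream, ChunkedStream,
BufferedReader and readAll names are hypothetical and exist only to show the
fast path, the fallback, and handing unread bytes back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Simplified stand-in for the Stream interface (not poppler's real class).
struct Stream {
  virtual ~Stream() {}
  virtual int getChar() = 0;  // slow path, returns EOF at end of data
  virtual bool hasGetBuf() { return false; }
  virtual bool getBuf(char **, int *) { return false; }
  virtual void ungetBuf(int) {}
};

// Hypothetical stream without getBuf() support.
class SlowStream : public Stream {
public:
  explicit SlowStream(const std::string &d) : data_(d), pos_(0) {}
  int getChar() override {
    return pos_ < data_.size() ? (unsigned char)data_[pos_++] : EOF;
  }
protected:
  std::string data_;
  std::size_t pos_;
};

// Hypothetical stream that hands out at most `chunk` (> 0) bytes per call.
class ChunkedStream : public SlowStream {
public:
  ChunkedStream(const std::string &d, std::size_t chunk)
      : SlowStream(d), chunk_(chunk) {}
  bool hasGetBuf() override { return true; }
  bool getBuf(char **bufOut, int *bufSizeOut) override {
    std::size_t avail = data_.size() - pos_;
    if (avail == 0) return false;
    if (avail > chunk_) avail = chunk_;
    *bufOut = &data_[pos_];
    *bufSizeOut = (int)avail;
    pos_ += avail;
    return true;
  }
  void ungetBuf(int n) override {
    assert(n >= 0 && (std::size_t)n <= pos_);
    pos_ -= n;
  }
private:
  std::size_t chunk_;
};

// Lexer-like client: uses the buffer when available, getChar() otherwise.
class BufferedReader {
public:
  explicit BufferedReader(Stream *s)
      : str_(s), fast_(s->hasGetBuf()), cur_(nullptr), end_(nullptr) {}

  int getChar() {
    if (!fast_) return str_->getChar();  // fallback: one virtual call per char
    if (cur_ == end_) {                  // buffer empty: ask for another one
      char *buf = nullptr;
      int size = 0;
      if (!str_->getBuf(&buf, &size)) return EOF;
      cur_ = buf;
      end_ = buf + size;
    }
    return (unsigned char)*cur_++;       // common case: no virtual call
  }

  // Give unread bytes back so other clients of the stream see the right data.
  void release() {
    if (fast_ && cur_ != end_) str_->ungetBuf((int)(end_ - cur_));
    cur_ = end_ = nullptr;
  }

private:
  Stream *str_;
  bool fast_;
  const char *cur_;
  const char *end_;
};

// Reads the whole stream through the reader, whichever path it takes.
std::string readAll(Stream *s) {
  BufferedReader r(s);
  std::string out;
  for (int c = r.getChar(); c != EOF; c = r.getChar()) out.push_back((char)c);
  return out;
}
```

With this shape, the number of virtual calls on the fast path is one getBuf()
per chunk rather than one getChar() per character, so the benefit depends
directly on how many bytes each getBuf() call returns.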

Frankly, I was disappointed that it's only ~4%. I was expecting much more. It
turns out that the culprit is the current implementation of the flate stream,
which is frequently used to compress streams inside PDFs. It decompresses data
in very small chunks (e.g. 8 bytes on average per getBuf() call in my test), so
we don't save nearly as much as if we were getting, say, 256 bytes at a time.
I'm working on improving that as well, but this change lays the necessary
foundation.

Other filter streams (e.g. CCITT/Ascii/DCT) could also be improved by taking
advantage of getBuf from their underlying stream and providing getBuf()
interface to their clients. So far I've only implemented getBuf() for
FileStream, EmbedStream and (not so efficiently) FlateStream.

This turned out to be tricky to implement, so how do I know it works?

I ran my stress test, which runs a test program over a random collection of
1500+ PDFs. It renders every page, records timing information and all errors
reported by poppler via the error() function, and also shows a visual preview so
that I can see that pages are rendered correctly. That way I can tell if it
crashed on any PDF and compare the results with previous runs. The app survived
the stress test.
