[poppler] Some ideas for even more performance improvements

Sun Sep 17 18:13:33 PDT 2006

Hello.

This time I don't have a patch, just want to brain dump ideas on
possible performance improvements in poppler since the more I work
with poppler the more I understand how it works and how it can be
improved.

The PDF parsing stage is nothing more than parsing PDF format and
turning it into a bunch of instances of Object class, which closely
map to PDF spec objects.

Currently a big performance issue in poppler is a high rate of
creation and destructions of Object instances, which causes the
mechanics of memory allocation/deallocation take significant amounts
of time. Some of my existing fixes try to improve that either by
reducing number of allocation or making the cost of allocation smaller
via custom allocators.

Memory allocation per se isn't that expensive but if it's done
frequently, the cost is gonna accumulate.

Those objects originate in the Lexer and get processed by the Parser.
Parser discards some of them, creates more complex objects from basic
objects (e.g. a dictionary or array of objects) and passes them to the
caller. Caller does something with those objects (e.g. page
descriptions are kept in memory during lifetime of PDFDoc but page
drawing data is discarded after page is rendered).

So the general idea is to do less.

One concrete idea is based on the observation that creating objects in
Lexer isn't necessary, given that some of them are immediately
discarded by the Parser (Lexer's only client). Lexer could be changed
into a tokenizer, that only returns a token type and optional token
primitive value (boolean, integer, double or string). That way Lexer
wouldn't have to create the whole Object with a string to represent
command '[' (which is a PDF syntax for starting an array) that is
immediately discarded by the Parser. Instead, it could just return
TOK_ARRAY_START.

Another, more radical idea, is to redo Object a little bit. Currently
poppler mixes up deep copying with shallow copying when passing Object
instances around. Most functions get a pointer to Object and fill it
out. That creates a rather tight coupling between layers that makes
changes difficult (I tried).

My radical idea starts with converting the code so that functions
consistently return a newly allocated copies of objects and callers
are required to free them. Doing just that would probably be a perf
hit but it would allow a range of optimizations that are not currently
possible. Those optimization, I believe, would result in much faster
code and less memory used.

First optimization would be a custom allocator for objects, like the
one I did for strings or like generic fixed allocator (but specialized
for Object hence even faster).

Another would be refcounting, so that e.g. two independent string
Objects "foo" would share only one copy of actual data. There would
have to be only one instance of object null, error, eof, boolean true
and false. Many integers show up frequently and they would also all
share just one instance of Object representing that integer.

When you look at PDF stream, there are a lot of objects with values
that show up over and over again.

With refcounting a value of a unmutable Object (bool, integer, double,
string) would map 1:1 to it's address so comparison for equality could
be done via simple compare of addresses e.g. instead of doing
obj.isCmd("[") we could write if (gPreallocatedArrayStart == obj)
where gPreallocatedArrayStart is an object istance created at startup
that is of type command and with value being atomic string "[".

Additional tweaks could include pre-allocating common objects (like
null, error, eof, true/false, common integers) at startup, from their
own chunk of memory.

It would be possible to do an analysis of most common strings in PDF
files and pre-allocate string Objects for them.

That should dramatically reduce the amount of objects that have to be
created at runtime.

I don't think those would be difficult changes to make. Making the
foundation change of always passing object as copies is rather
straightforward conceptually although requires a lot of perseverance
to iron out all bugs, since it would touch a lot of code. I know - I
tried 3 times and every time I gave up after a couple of hours before
fixing all the resulting bugs. Unfortunately it's a change that can't
be done incrementally.

-- kjk