[packagekit] Caching for APT backend (or how to get queries down to 0.5s)
Tom Parker
palfrey at tevp.net
Wed Sep 26 11:27:07 PDT 2007
Haven't had much time to play with PackageKit recently (little things
like a PhD thesis to complete keep getting in my way), but one of the
things I've now finally managed to get a chance to look at is the issues
of slow queries in apt. I'd love to eventually build a package manager
that does things like "find as you type", and if queries are taking too
long then this starts to be less usable.
Specifically, this dates back to the threading changes, and my
discovering that I can't really keep around a pointer to the apt
in-memory stuff, because that will screw with the dynamic module loading
stuff. The problem is that the libapt stuff is pretty slow to do its
standard setup, mainly 'cause it wants to do things like work out chunks
of dependancy stuff and read in lots (10s to 100s of MB) of text files
and manually parse them. This would be up there under things that newer
package management systems do better.
I've been playing around with some stuff locally, and I've got 3 options:
1) Use libapt. ~3-5 second delay before we even get to really start
doing queries, and really ugly C++ interfaces. OTOH, may have to use
this for installation if only for being certain about sanity issues, but
for queries this is too slow. Also, it stays slow even with a warm
file-cache. It *always* takes 3-5 seconds even if you've just made
another query. Hence why I wanted to cache this before, but I can't
figure out an easy way to cache this without actually keeping the
pointers around in memory.
2) Do my own limited parsing of the files, and avoid full db reading for
just simple searches. Attempts at this hit about 7-8 seconds/query,
which can probably be optimised, but the sheer quantity of plaintext to
be parsed limits what can be done here. Gets worse with a cold
file-cache (I've seen 10-15s).
3) Do step 2 (with more complete parsing), but then dump the results
somewhere and do queries off of that. This is what I've got locally now.
The idea is that every time apt does a refresh-cache (or at least one
that loads new data) build an Sqlite db of the packages. Also do this if
the cache has been wiped or if someone else has updated the primary
system cache behind our back. We then do queries off of this.
Current numbers for queries with type 3 are down to about 0.5s (i.e.
fast enough for good "find as you type") with a warm file-cache, hitting
about 3-ish with a cold-cache. Rebuild times are in the 20-30s region
for my system (3Ghz P4, random single HDD, with *massive* collection of
sources). If this is folded into the refresh-cache task, and the user
doesn't refresh the cache outside of PackageKit too often, then the
amount of times this actually affects the user should be minimal. Db on
my system is 27mb currently, and I'm for the moment storing it in the
same directory as the transactions db but as "apt.db". I went with
Sqlite mainly because of the existing PackageKit dependancy on it
(probably would have used it anyways)
Haven't pushed this yet, but can post the patch if any direct comments
on the code as opposed to comments on the idea come up. Haven't
re-written all the apt methods for this yet (only SearchName). At the
moment this is all apt internal, but might be an idea to split some of
it out eventually in case any other backend writer wants to do similar
and doesn't want to have to duplicate the work too much.
So, thoughts? Cracktastic points out of 10? Votes for me getting flamed
like crap by suggesting this sort of idea to the apt developers?
Tom
More information about the PackageKit
mailing list