[packagekit] Caching for APT backend (or how to get queries down to 0.5s)

Tom Parker palfrey at tevp.net
Wed Sep 26 11:27:07 PDT 2007


Haven't had much time to play with PackageKit recently (little things 
like a PhD thesis to complete keep getting in my way), but one of the 
things I've now finally managed to get a chance to look at is the issues 
of slow queries in apt. I'd love to eventually build a package manager 
that does things like "find as you type", and if queries are taking too 
long then this starts to be less usable.

Specifically, this dates back to the threading changes, and my 
discovering that I can't really keep around a pointer to the apt 
in-memory stuff, because that will screw with the dynamic module loading 
stuff. The problem is that the libapt stuff is pretty slow to do its 
standard setup, mainly 'cause it wants to do things like work out chunks 
of dependancy stuff and read in lots (10s to 100s of MB) of text files 
and manually parse them. This would be up there under things that newer 
package management systems do better.

I've been playing around with some stuff locally, and I've got 3 options:
1) Use libapt.  ~3-5 second delay before we even get to really start 
doing queries, and really ugly C++ interfaces. OTOH, may have to use 
this for installation if only for being certain about sanity issues, but 
for queries this is too slow. Also, it stays slow even with a warm 
file-cache. It *always* takes 3-5 seconds even if you've just made 
another query. Hence why I wanted to cache this before, but I can't 
figure out an easy way to cache this without actually keeping the 
pointers around in memory.

2) Do my own limited parsing of the files, and avoid full db reading for 
just simple searches. Attempts at this hit about 7-8 seconds/query, 
which can probably be optimised, but the sheer quantity of plaintext to 
be parsed limits what can be done here. Gets worse with a cold 
file-cache (I've seen 10-15s).

3) Do step 2 (with more complete parsing), but then dump the results 
somewhere and do queries off of that. This is what I've got locally now. 
The idea is that every time apt does a refresh-cache (or at least one 
that loads new data) build an Sqlite db of the packages. Also do this if 
the cache has been wiped or if someone else has updated the primary 
system cache behind our back. We then do queries off of this.

Current numbers for queries with type 3 are down to about 0.5s (i.e. 
fast enough for good "find as you type") with a warm file-cache, hitting 
about 3-ish with a cold-cache. Rebuild times are in the 20-30s region 
for my system (3Ghz P4, random single HDD, with *massive* collection of 
sources). If this is folded into the refresh-cache task, and the user 
doesn't refresh the cache outside of PackageKit too often, then the 
amount of times this actually affects the user should be minimal.  Db on 
my system is 27mb currently, and I'm for the moment storing it in the 
same directory as the transactions db but as "apt.db". I went with 
Sqlite mainly because of the existing PackageKit dependancy on it 
(probably would have used it anyways)

Haven't pushed this yet, but can post the patch if any direct comments 
on the code as opposed to comments on the idea come up. Haven't 
re-written all the apt methods for this yet (only SearchName). At the 
moment this is all apt internal, but might be an idea to split some of 
it out eventually in case any other backend writer wants to do similar 
and doesn't want to have to duplicate the work too much.

So, thoughts? Cracktastic points out of 10? Votes for me getting flamed 
like crap by suggesting this sort of idea to the apt developers?

Tom


More information about the PackageKit mailing list