[systemd-devel] Journald Scalability

Lennart Poettering lennart at poettering.net
Mon Sep 10 13:24:28 PDT 2012

On Mon, 10.09.12 15:56, Roland Schatz (roland.schatz at students.jku.at) wrote:

> On 10.09.2012 09:57, Lennart Poettering wrote:
> > Well, I am not aware of anybody having done measurements recently
> > about this. But I am not aware of anybody running into scalability
> > issues so far. 
> I'm able to share a data point here, see attachment.
> TLDR: Outputting the last ten journal entries takes 30 seconds on my server.
> I have not reported this so far, because I'm not really sure whom to
> blame. Current suspects include the journal, btrfs and hyper-v ;)
> Some details about my setup: I'm running Arch Linux on a virtual server,
> running on hyper-v on some windows host (outside of my control). I'm
> currently using systemd 189, but the journal files are much older.
> journald.conf is empty (everything commented, i.e., the default). The
> journal logging was activated in February (note the strange first output
> line that says the journal ends on 28 Feb, while still containing
> entries up to right now). Since then, I have not removed/archived any
> journal files from the system.
> After issuing journalctl a few times, time goes down significantly, even
> for larger values of -n (e.g. first try -n10 takes 30 secs, second -n10
> takes 18 secs, third -n10 takes 0.2 secs; after that, even -n100 takes
> 0.2 secs, -n500 takes 0.8 secs and so on). Rebooting or simply waiting a
> day or so makes it slow again.
> Btrfs and fragmentation may be an issue: Defragmenting the journal files
> seems to make things better. But it's hard to be sure whether this is
> really the problem, because defrag could just be pulling the journal
> files into the fs cache, thereby having a similar effect to the
> repeated journalctl...
> I'm not able to reproduce the problem by copying the whole journal to
> another system and running journalctl there. I'm also seeing the getting
> faster effect, but it starts at 2 secs and then goes down to 0.2 secs.
> Also, I'm not seeing any difference between btrfs and ext4, so maybe
> really fragmentation is the issue, although I don't really understand
> how a day of logging could fragment the log files that badly, even on a
> COW filesystem. Yes, there still is the indirection of the virtual
> drive, but I have a fixed-size disk file residing on an ntfs drive, so
> there shouldn't be any noticable additional fragmentation coming from
> that setup.
> I'm not sure what I can do to investigate this further. For me this is a
> low-priority problem, since the system is running stably and the problem
> goes away after a few journalctl runs. But if you have anything you'd
> like me to try, I'd be happy to assist.

Hmm, these are definitely weird results. A few notes:

Appending things to the journal is done by adding a few data objects to
the end of the file and then updating a couple of pointers in the
front. This is not an ideal access pattern on rotating media but my
educated guess would be that this is not your problem (not your only one
at least...), as the respective tables are few and should be in memory
quickly (that said, I didn't do any precise IO pattern measurements for
this). If the access pattern turns out to be too bad we could certainly
improve it (for example by delay-writing the updated pointers). 

Now, I am not sure what such an access pattern means to COW file systems
such as btrfs. It might be worth playing around with COW for that, for
example by removing the COW flag from the generated files (via
FS_NOCOW_FL in FS_IOC_SETFLAGS, which you need to set before the first
write to the file, i.e. you need to patch journald for that).

So much for the write access pattern. Most likely that's not impacting
you much anyway, but the read access pattern is. And that's usually much
more chaotic (and hence slow on rotating disks) than the write access
pattern, as we write journal fields only once and then reference them
from all entries using them. That might mean we end up jumping around
on disk for each entry we try to read as we iterate through its
fields. But this could easily be improved too: for example, as the order
of fields is undefined, we could simply sort them by offset, so that we
read them in their order on disk.

It would be really good to get some hard data about access patterns
before we optimize things, though...


Lennart Poettering - Red Hat, Inc.
