[systemd-devel] Journald Scalability
matt at 0x01b.net
Mon Sep 10 14:30:34 PDT 2012
On 09/10/2012 02:24 PM, Lennart Poettering wrote:
> On Mon, 10.09.12 15:56, Roland Schatz (roland.schatz at students.jku.at) wrote:
>> On 10.09.2012 09:57, Lennart Poettering wrote:
>>> Well, I am not aware of anybody having done measurements recently
>>> about this. But I am not aware of anybody running into scalability
>>> issues so far.
>> I'm able to share a data point here, see attachment.
>> TLDR: Outputting the last ten journal entries takes 30 seconds on my server.
>> I have not reported this so far, because I'm not really sure whom to
>> blame. Current suspects include the journal, btrfs and hyper-v ;)
>> Some details about my setup: I'm running Arch Linux on a virtual server,
>> running on hyper-v on some windows host (outside of my control). I'm
>> currently using systemd 189, but the journal files are much older.
>> journald.conf is empty (everything commented, i.e., the default). The
>> journal logging was activated in February (note the strange first output
>> line that says the journal ends on 28 Feb, while still containing
>> entries up to right now). Since then, I have not removed/archived any
>> journal files from the system.
>> After issuing journalctl a few times, the time goes down significantly, even
>> for larger values of -n (e.g. the first -n10 takes 30 secs, the second -n10
>> takes 18 secs, the third -n10 takes 0.2 secs; after that, even -n100 takes
>> 0.2 secs, -n500 takes 0.8 secs, and so on). Rebooting or simply waiting a
>> day or so makes it slow again.
>> Btrfs and fragmentation may be an issue: defragmenting the journal files
>> seems to make things better. But it's hard to be sure whether this is
>> really the problem, because defrag could just be pulling the journal
>> files into the fs cache, which would have a similar effect to the
>> repeated journalctl runs...
>> I'm not able to reproduce the problem by copying the whole journal to
>> another system and running journalctl there. I see the same speed-up
>> effect there, but it starts at 2 secs and then goes down to 0.2 secs.
>> Also, I'm not seeing any difference between btrfs and ext4, so maybe
>> really fragmentation is the issue, although I don't really understand
>> how a day of logging could fragment the log files that badly, even on a
>> COW filesystem. Yes, there still is the indirection of the virtual
>> drive, but I have a fixed-size disk file residing on an ntfs drive, so
>> there shouldn't be any noticeable additional fragmentation coming from
>> that setup.
>> I'm not sure what I can do to investigate this further. For me this is a
>> low-priority problem, since the system is running stably and the problem
>> goes away after a few journalctl runs. But if you have anything you'd
>> like me to try, I'd be happy to assist.
> Hmm, these are definitely weird results. A few notes:
> Appending things to the journal is done by adding a few data objects to
> the end of the file and then updating a couple of pointers in the
> front. This is not an ideal access pattern on rotating media, but my
> educated guess would be that this is not your problem (not your only one,
> at least...), as the respective tables are few and should be in memory
> quickly (that said, I didn't do any precise IO pattern measurements for
> this). If the access pattern turns out to be too bad, we could certainly
> improve it (for example by delay-writing the updated pointers).
> Now, I am not sure what such an access pattern means to COW file systems
> such as btrfs. Might be worth experimenting with COW there, for
> example by removing the COW flag from the generated files (via
> FS_NOCOW_FL in FS_IOC_SETFLAGS, which you need to set before the first
> write to the file, i.e. you need to patch journald for that).
> So much for the write access pattern. Most likely that's not impacting
> you much anyway, but the read access pattern is. And that is usually much
> more chaotic (and hence slow on rotating disks) than the write access
> pattern, since we write each journal field only once and then reference
> it from all entries using it. Which might mean that we end up jumping
> around on disk for each entry we try to read as we iterate through its
> fields. But this could easily be improved too: for example, since the
> order of fields is undefined, we could simply order them by offset, so
> that we read them in their on-disk order.
> It would be really good to get some hard data about access patterns
> before we optimize things, though...
How can users provide meaningful data?
I've been experiencing the same problem: systemd 189, btrfs, Arch Linux, very
slow response when trying to view the journal (which includes systemctl status
<service>). I wish I were sure about this, but the slowness might have
started when I enabled FSS (Forward Secure Sealing).