[systemd-devel] Slow startup of systemd-journal on BTRFS

Mon Jun 16 12:52:07 PDT 2014

On 16/06/14 17:05, Josef Bacik wrote:
> 
> On 06/16/2014 03:14 AM, Lennart Poettering wrote:
>> On Mon, 16.06.14 10:17, Russell Coker (russell at coker.com.au) wrote:
>>
>>>> I am not really following though why this trips up btrfs though. I am
>>>> not sure I understand why this breaks btrfs COW behaviour. I mean,

>>> I don't believe that fallocate() makes any difference to
>>> fragmentation on
>>> BTRFS.  Blocks will be allocated when writes occur so regardless of an
>>> fallocate() call the usage pattern in systemd-journald will cause
>>> fragmentation.
>>
>> journald's write pattern looks something like this: append something to
>> the end, make sure it is written, then update a few offsets stored at
>> the beginning of the file to point to the newly appended data. This is
>> of course not easy to handle for COW file systems. But then again, it's
>> probably not too different from access patterns of other database or
>> database-like engines...
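
To make that pattern concrete for anyone wanting a small reproducer,
here is a minimal sketch. It is not journald's code: the header layout
(a tail offset plus an entry count) and the function names are made up
for illustration, and it uses plain write()+fsync() where journald uses
mmap. It only mimics "append, sync, then rewrite offsets at the start".

```python
import os
import struct
import tempfile

# Hypothetical header layout: (tail_offset, n_entries), both 64-bit.
HEADER = struct.Struct("<QQ")


def create(path):
    """Create an empty log whose header points just past itself."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(HEADER.size, 0))
        f.flush()
        os.fsync(f.fileno())


def append_entry(path, payload):
    """Append one record, sync it, then update the header in place."""
    with open(path, "r+b") as f:
        tail, count = HEADER.unpack(f.read(HEADER.size))
        # 1. Append the new record (length prefix + data) at the tail.
        f.seek(tail)
        f.write(struct.pack("<I", len(payload)) + payload)
        f.flush()
        os.fsync(f.fileno())  # 2. Make sure it is written.
        # 3. Rewrite the offsets stored at the beginning of the file.
        f.seek(0)
        f.write(HEADER.pack(tail + 4 + len(payload), count + 1))
        f.flush()
        os.fsync(f.fileno())


def read_header(path):
    with open(path, "rb") as f:
        return HEADER.unpack(f.read(HEADER.size))
```

Each append_entry() call rewrites the first block of the file, which
on a COW filesystem means that block is relocated every time.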

Even though this appears to be a problem case for btrfs/COW, is there a
more favourable write/access sequence that is easy to implement and
works well for both ext4-like filesystems /and/ COW filesystems?

Database-like writing is known to be 'difficult' for filesystems: can a
data log be a simpler case?

> Was waiting for you to show up before I said anything since most systemd
> related emails always devolve into how evil you are rather than what is
> actually happening.

Ouch! Hope you two know each other!! :-P :-)

[...]
> since we shouldn't be fragmenting this badly.
> 
> Like I said what you guys are doing is fine, if btrfs falls on its face
> then it's not your fault.  I'd just like an exact idea of when you guys
> are fsync'ing so I can replicate in a smaller way.  Thanks,

Good if COW can be so resilient. I have about 2 GBytes of data logging
files and I must defrag those as part of my backups to stop the system
fragmenting to a halt (I use "cp -a" to defrag the files to a new area
and restart the data logging software on that).

Random thoughts:

Would using a second small file just for the mmap-ed pointers help avoid
repeated rewriting of random offsets in the log file causing excessive
fragmentation?
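
A rough sketch of what I mean, purely as an assumption about how it
could look: the log file is only ever appended to, and the one
frequently rewritten value (here just the tail offset) lives in a tiny
sidecar file. The file names and the single-offset layout are
hypothetical, and I have not measured whether this actually helps on
btrfs. The sidecar still gets rewritten in place, but it is a single
small block rather than the head of a large log.

```python
import os
import struct
import tempfile

TAIL = struct.Struct("<Q")  # hypothetical sidecar content: one offset


def append_with_sidecar(log_path, ptr_path, payload):
    """Append to the log, then record the new tail in a separate file."""
    # The log itself is append-only: no in-place rewrites.
    with open(log_path, "ab") as log:
        log.write(payload)
        log.flush()
        os.fsync(log.fileno())
        tail = os.fstat(log.fileno()).st_size

    # Only the small pointer file is overwritten in place.
    fd = os.open(ptr_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, TAIL.pack(tail), 0)
        os.fsync(fd)
    finally:
        os.close(fd)
    return tail


def read_tail(ptr_path):
    with open(ptr_path, "rb") as f:
        return TAIL.unpack(f.read(TAIL.size))[0]
```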

Align the data writes to 16kByte or 64kByte boundaries/chunks?
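
The arithmetic for that would be something like the following sketch.
The 64 kByte default and the zero padding are my assumptions, not
anything journald does today, and padding obviously trades disk space
for alignment.

```python
CHUNK = 64 * 1024  # assumed chunk size; 16 * 1024 is the other candidate


def aligned_size(n, chunk=CHUNK):
    """Round n up to the next multiple of chunk."""
    return (n + chunk - 1) // chunk * chunk


def pad_to_chunk(payload, chunk=CHUNK):
    """Zero-pad payload so that each write ends on a chunk boundary."""
    return payload + b"\0" * (aligned_size(len(payload), chunk) - len(payload))
```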

Are mmap-ed files a similar problem to using a swap file and so should
the same "btrfs file swap" code be used for both?

I have not looked over the code, so these are all random guesses...

Regards,
Martin