[systemd-devel] consider dropping defrag of journals on btrfs
Dave Howorth
systemd at howorth.org.uk
Fri Feb 5 20:43:30 UTC 2021
On Fri, 5 Feb 2021 17:44:14 +0100
Lennart Poettering <lennart at poettering.net> wrote:
> On Fri, 05.02.21 16:06, Dave Howorth (systemd at howorth.org.uk) wrote:
>
> > On Fri, 5 Feb 2021 16:23:02 +0100
> > Lennart Poettering <lennart at poettering.net> wrote:
> > > I don't think that makes much sense: we rotate and start new
> > > files for a multitude of reasons, such as size overrun, time
> > > jumps, abnormal shutdown and so on. If we'd always leave a fully
> > > allocated file around people would hate us...
> >
> > I'm not sure about that. The file is eventually going to grow to
> > 128 MB, so if there isn't space for it, I might as well know now
> > rather than later. And it's not like the space will be available
> > for anything else; it's left free for exactly this log file.
>
> Let's say you assign 500M of space to journald. If you allocate 128M
> at a time, this means the effective unused space is anything between
> 1M and 255M, leaving just 256M of logs around. It's probably
> surprising that you only end up with 256M of logs when you asked for
> 500M. I'd claim that's really shitty behaviour.
If you assign 500 MB for something that accommodates multiples of 128
MB then you're not very bright :) 512 MB, by contrast, can accommodate
four 128 MB files, and I might allocate an extra MB or two for
overhead, I don't know. So when it first starts there'll be 128 MB
allocated and 384 MB free. In the steady state there'll be 512 MB
allocated and nothing free: one 128 MB file allocated and slowly being
filled, and 384 MB of archived files. You always have between 384 MB
and 512 MB of logs stored. I don't understand where you're getting
your numbers from.
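To spell the arithmetic out, here's a minimal sketch (illustrative
numbers only, not journald code):

    #include <stdio.h>

    /* Illustration of the steady state described above: a 512 MB
     * budget holding whole, preallocated 128 MB journal files.
     * Not journald code, just the arithmetic. */
    int main(void) {
        const unsigned budget_mb = 512;
        const unsigned file_mb = 128;

        unsigned archived = budget_mb / file_mb - 1; /* 3 full archives */
        unsigned active = 1;                         /* 1 file being filled */

        printf("archived: %u MB\n", archived * file_mb);            /* 384 */
        printf("active (preallocated): %u MB\n", active * file_mb); /* 128 */
        printf("logs retained: %u to %u MB\n",
               archived * file_mb, (archived + active) * file_mb);  /* 384..512 */
        return 0;
    }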
BTW, I expect my Linux systems to stay up from when they're booted
until I tell them to stop, and that's usually quite a while.
> > Or are you talking about leftover files after some exceptional
> > event that are only partly full? If so, then just deallocate the
> > unwanted empty space from them after you've recovered from the
> > exceptional event.
>
> Nah, it doesn't work like this: if a journal file isn't marked clean,
> i.e. was left in some half-written state, we won't touch it, but just
> archive it and start a new one. We don't know how much was correctly
> written and how much was not, hence we can't sensibly truncate it.
> The kernel, after all, is entirely free to decide in which order it
> syncs written blocks to disk, and hence it quite often happens that
> stuff at the end got synced while stuff in the middle didn't.
If you can't figure out which parts of an archived file are useful and
which aren't, then why are you keeping them? Why not just delete them?
And if you can figure it out, then why not do so and compact the
useful information into the minimum storage?
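For what it's worth, on Linux the unused tail of a file can be
released with ftruncate(), or deallocated in place with
fallocate(FALLOC_FL_PUNCH_HOLE). A minimal sketch, assuming the offset
of the last valid byte were known - which is exactly the information
Lennart says is missing for unclean files:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Sketch: release the unused space past 'last_valid' in an
     * archived file. Assumes we know where valid data ends, which is
     * the very thing that is unknown for an uncleanly closed journal. */
    int release_tail(int fd, off_t last_valid, off_t allocated_size) {
        /* Option 1: shrink the file outright. */
        if (ftruncate(fd, last_valid) == 0)
            return 0;

        /* Option 2: keep the apparent size, give the blocks back. */
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         last_valid, allocated_size - last_valid);
    }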
> > > Also, we vacuum old journals when allocating and the size
> > > constraints are hit, i.e. if we detect that adding 8M to journal
> > > file X would mean the space used by all journals together would
> > > be above the configured disk usage limits, we'll delete the
> > > oldest journal files we can, until we can allocate 8M again. And
> > > we do this each time. If we'd allocate the full file all the
> > > time, this means we'll likely remove ~256M of logs whenever we
> > > start a new file. And that's just shitty behaviour.
> >
> > No it's not; it's exactly what happens most of the time, because
> > all the old log files are exactly the same size; that's why they
> > were rolled over. So freeing just one of those gives exactly the
> > right size space for the new log file. I don't understand why you
> > would want to free two?
>
> Because fs metadata, and because we don't always write files in
> full. I mean, we often do not, because we start a new file *before*
> the file would grow beyond the threshold. This typically means that
> it's not enough to delete a single file to get the space we need for
> a full new one; we usually need to delete two.
Why would you start a new file before the old one is full? Modulo
truly exceptional events. It's a genuine question - I don't think I've
ever seen it. And sure, fs metadata - that just means allocating a bit
extra beyond the round number.
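For clarity, here's my reading of the vacuuming loop Lennart describes
(a sketch only; the helper names are made up, not journald's):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helpers standing in for journald internals. */
    extern uint64_t total_journal_usage(void);
    extern bool delete_oldest_archived_file(void);

    /* Before growing the active file by 'needed' bytes, delete the
     * oldest archives until the total fits under the configured limit. */
    bool make_room(uint64_t needed, uint64_t max_use) {
        while (total_journal_usage() + needed > max_use)
            if (!delete_oldest_archived_file())
                return false; /* nothing left to delete */
        return true;
    }

With whole-file preallocation, 'needed' would jump from 8M to 128M per
rotation, which is presumably why two archives usually have to go when
the last one wasn't written in full.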
> Actually it's even worse: btrfs lies in "df": it only updates
> counters with uncontrolled latency, hence we might actually delete
> more than necessary.
Sorry, I dunno much about btrfs. I'm planning to get rid of it here
soon.
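For anyone else following along: my understanding is that free space
is typically queried with statvfs(), and on btrfs the counters it
reports can lag behind recent deletions - that's the latency being
described. A sketch, not journald's actual code:

    #include <stdio.h>
    #include <sys/statvfs.h>

    /* Sketch: query free space the way a daemon typically would. On
     * btrfs the returned counters may not yet reflect recent
     * deletions, so decisions based on them can overshoot. */
    int main(void) {
        struct statvfs sv;
        if (statvfs("/var/log/journal", &sv) < 0) {
            perror("statvfs");
            return 1;
        }
        unsigned long long avail =
            (unsigned long long) sv.f_bavail * sv.f_frsize;
        printf("available: %llu bytes\n", avail);
        return 0;
    }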
> Lennart
PS I'm subscribed to the list. I don't need a copy.