[systemd-devel] Journal file is synced each minute into disk

Tue May 10 14:38:53 UTC 2016

On Tue, 10.05.16 11:43, Ricardo Fraile (rfraile at idealista.com) wrote:

> No, there aren't any new messages in that time. This example was
> reproduced in a testing server without load, but in other server with
> nearly 10Mb of logs each minute the behaviour is the same.
> 
> Following the way to the root cause I fall into a high kworker I/O usage
> each minute, some related with ext4 journal said the following command:
> 
> while true ; do clear ; grep -n ^ /proc/$kworker_PID/stack | sort -rn |
> cut -d: -f 2- ; sleep 1 ; done
> 
> A search in google reported some similar issues with this filesystem,
> but a kernel update from 3.16.7 to 4.5.0 didn't fix anything.
> 
> Only removing the journal support in the ext4 log partition solved it,
> but I still don't understand why if SyncIntervalSec sync each 5 minutes,
> for example, the journal is so aggressive each minute in a server
> without disk writes, apart of logs.
> 
> Anyway, the problem is close to the filesystem side, I don't know if any
> other options defined in journald.conf can mitigate it.

syncing and the file change time have little to do with each
other.

The journal uses memory mapped file access to the journal
files. It will make normal memory writes to append new entries to the
journal files. On linux, writing to mmap()ed files will not result in
inotify/mtime updates. Since we use inotify for "journactl --follow"
and similar logic we actually manually schedule ftruncate() calls as a
way to trigger inotify/mtime updates. Now, as it turns out doing this
frequently actually is bad for the IO load, hence we'll actually don't
this all the time, but schedule it with a bit of lag, doing only few
delayed calls instead of instant, frequent calls after appending log
entries.

Now, that's one side of the story: the mtime updates. The other side
of the story is fsync()s: in order to keep loss of data minimal when
the system is abruptly shut down we'll actually sync the files fully
to disk in regular intervals. These intervals may be influenced via
SyncIntervalSec=, but note that this timer actually is defined as
"sync to disk at the latest after a log entry write". It hence gives
the guarantee that whenever something is logged it will not take any
longer than the specified time to actually safelly hit the disk. Syncs
are also done in other conditions, for example immediately after any
log message written with EMERGENCY/CRITIAL/ALERT log level.

Long story short: mtime has little to do with SyncIntervalSec=,
there's no direct connection and it#s really not the only reason why
mtimes are bumped, and mtime updates will happen much later than the
actual log event.

If you want to figure out what the journal is doing, my recommendation
would be to profile it with something like "prof", or alternatively
just "strace"ing it.

Lennart

-- 
Lennart Poettering, Red Hat