[systemd-devel] High CPU usage of journald

Mon Feb 18 18:08:13 PST 2013

On Mon, 18.02.13 19:52, Holger Hans Peter Freyther (holger at freyther.de) wrote:

> 
> On Mon, Feb 18, 2013 at 08:47:44AM -0800, David Strauss wrote:
> 
> Dear David, all,
> 
> 
> > It's possible that a lot of it could also allocate on the stack or use
> > stack-style management of a pool in heap. This is pretty
> > performance-critical code, and we're seeing similar CPU overhead.
>
> I think upstream doesn't care about the CPU overhead of the
> journald.

I can assure you that that's definitely not true. We do care about
performance.

That said, the journal is relatively new and obviously not optimized in
all details for all machines, and we knowingly and frequently pick the
more readable codepath rather than the faster one because readable and
manageable code is important to us.

You need to understand that on x86 journald (and systemd) are
virtually "free" CPU-wise, these days. We develop on x86. And hence
optimizing things was never the most important issue for us, because the
codepaths never showed up badly anywhere anyway. That we develop on x86
doesn't mean we wouldn't care for embedded CPUs. We do, and we have
optimized some things in this area already, but nobody yet gave this
love to the journald component of systemd. I mean, udev's entire history
is more or less one of optimizations. And in systemd you find quite a
few changes that are about optimization and nothing else. 

So, if you found something that isn't as fast as it should be and shows
up in profiling, then we really should optimize that, and am happy to
look into this. However, as I don't even possess any machine with a CPU
where these issues show up I am not sure I will be able to do this all
by myself...

For many string operations that might end up in inner loop we try to use
alloca() these days, however for some that's really hard to get right,
for example for sprintf() related loops...

> I browsed through journald-server.c:dispatch_message_real and there are
> several things one could do... there are certainly a lot more items.

We should really base this all on profiling data. Without that premature
optimizations will just make the code unreadable.

> 1.) Several things are done over and over again... e.g. calls to the
> have_effective_cap.. it is unlikely that it will change after the journald
> has started.

Does this really show up negatively in profiling runs?

> 2.) Avoid allocation part one...
> 
>                 r = get_process_cmdline(ucred->pid, 0, false, &t);
>                 if (r >= 0) {
>                         cmdline = strappend("_CMDLINE=", t);
>                         free(t);
> 
>                         if (cmdline)
>                                 IOVEC_SET_STRING(iovec[n++], cmdline);
>                 }
> 
> So instead of the strappend (which will do strdup..) one could make sure
> that 't' has enough space to prepend the _CMDLINE= or the _EXE, or the
> other strings.

If this is indeed something that shows up negative in profiling we could
just extend get_process_cmdline() with a prefix char* arg, that if
non-NULL is prepended implicitly to the returned string.

> 3.) Looking at code like... get_process_cmdline
> 
>   asprintf(&p, "/proc/%lu/cmdline", (unsigned long) pid)
> 
> So first of all /proc/%lu/ is created over and over again for exe, comm,
> cmdline, sessionid, cgroup... With a different API one could easily avoid
> this.

It would probably be best to just use strings on the stack for this anyway...

> 4.) Still at the same method..
> 
>         if (max_length == 0) {
>                 size_t len = 1;
>                 while ((c = getc(f)) != EOF) {
>                         k = realloc(r, len+1);
>                         if (k == NULL) {
> 
> is most likely not the most effective way to handle the allocation. Or
> maybe call get_process_cmdline with a max length?

Actually, if you look in the glibc sources you will find that it does
some exponentional size increasing anyway internally of realloc(). After
realizing this we recently droppped a couple of our own exponential
realloc() loops in favour of the simpler to read linear realloc(foo,
len+1) loops. I'd be really surprised if this would show up in any
profiling. If it did glibc's own exponential increas code would not
work...

> 5.) I wonder if some of the information could be sent from the systemd
> to avoid the work in the journald...

So it's a bit of a security thing too: the journal distuingishes
"trusted" fields from "untrusted" ones. The former are collected by
journald and read directly from the kernel and can thus not be
faked. The latter are passed in from the clients, and can thus be faked
data. The "trusted" fields are prefixed with an "_", and clients can
never send fields prefixed with "_".

> I will end up commenting out most of dispatch_message_real and check if
> the CPU down drops to a low single digit. I would assume it will.

This would certainly be interesting.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.