[systemd-devel] Looking for known memory leaks triggered by stress testing add/remove/up/down interfaces

Mon Feb 22 11:22:44 UTC 2021

On Thu, 18 Feb 2021, Lennart Poettering wrote:

> On Do, 18.02.21 11:48, Robert P. J. Day (rpjday at crashcourse.ca) wrote:
>
> >   A colleague has reported the following apparent issue in a fairly
> > old (v230) version of systemd -- this is in a Yocto Project Wind River
> > Linux 9 build, hence the age of the package.
> >
> >   As reported to me (and I'm gathering more info), the system was
> > being put through some "longevity testing" by repeatedly adding,
> > removing, activating and de-activating network interfaces. According
> > to the report, the result was heap space slowly but inexorably being
> > consumed.
> >
> >   While waiting for more info, I'm going to examine the commit log for
> > systemd from v230 moving forward to collect any commits that address
> > memory leaks, then peruse them more carefully to see if they might
> > resolve the problem.
> >
> >   I realize it's asking a bit for folks here to remember that far
> > back, but does this issue sound at all familiar? Any pointers that
> > might save me some time? Thanks.
>
> Note that our hash tables operate with an allocation cache: when
> adding entries to them and then removing them again, the memory
> required for that is not returned to the OS but added to a local
> cache. When the next entry is then added again, we recycle the
> cached entry instead of asking for new memory again. This allocation
> cache is a bit quicker than going to malloc() all the time, but
> means that if you just watch the heap you'll assume there's a leak
> even though there isn't really: the memory is not lost after all,
> and will be reused eventually if we need it.
>
> You may use the env var SYSTEMD_MEMPOOL=0 to turn this logic off,
> but I'm not sure v230 already knew that env var.
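
  The allocation-cache behaviour described above can be illustrated
with a minimal sketch in C. This is not systemd's actual mempool code:
the names (struct pool, pool_alloc, pool_release) and the use_cache
flag are hypothetical, and use_cache merely stands in for a switch
such as the SYSTEMD_MEMPOOL variable mentioned above. It shows only
why a heap watcher sees memory that is never handed back even though
nothing is lost.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size item pool; illustrative only, not systemd code. */
struct free_node {
    struct free_node *next;
};

struct pool {
    size_t item_size;            /* size of every item handed out */
    int use_cache;               /* 0 = always free(), 1 = keep released items */
    struct free_node *free_list; /* released items waiting to be recycled */
    size_t n_cached;             /* number of items currently on free_list */
    size_t n_malloc;             /* number of times malloc() was really called */
};

static void pool_init(struct pool *p, size_t item_size, int use_cache) {
    /* Each item must be able to hold the free-list link while cached. */
    assert(item_size >= sizeof(struct free_node));
    p->item_size = item_size;
    p->use_cache = use_cache;
    p->free_list = NULL;
    p->n_cached = 0;
    p->n_malloc = 0;
}

static void *pool_alloc(struct pool *p) {
    /* Recycle a cached item if there is one. */
    if (p->free_list) {
        struct free_node *n = p->free_list;
        p->free_list = n->next;
        p->n_cached--;
        return n;
    }
    /* Otherwise fall back to the system allocator. */
    p->n_malloc++;
    return malloc(p->item_size);
}

static void pool_release(struct pool *p, void *item) {
    if (!p->use_cache) {
        free(item); /* cache disabled: memory goes back to the allocator */
        return;
    }
    /* Cache enabled: keep the memory. The heap does not shrink, but the
     * item is reused by the next pool_alloc(), so it is not leaked. */
    struct free_node *n = item;
    n->next = p->free_list;
    p->free_list = n;
    p->n_cached++;
}
```

  With use_cache set to 1, an add/remove loop calls malloc() once and
then keeps recycling the same item, so heap usage plateaus rather
than growing. Growth that continues without bound under a steady
workload, as in the report below, is not explained by a cache of this
kind.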

  well, we seem to have isolated the issue; here it is in a nutshell,
based on a condensed note i got from someone who tracked it down this
weekend. the memory leak is triggered by:

  $ ssh root@<target> -p 830 -s netconf   [830 = netconf over SSH]

long story short, according to jemalloc profiling, there is a massive
memory leak in the D-Bus code, to the tune of about 500M/day on a running
system. i'm perusing the profiling output now, but does any of this
sound even remotely familiar to anyone? i realize that's just a
summary, but does anyone remember seeing something related to this
once upon a time? [heavily-patched systemd_230 from wind river linux
9].

rday