[systemd-devel] systemd debug out of memory

Sun Mar 5 14:59:29 UTC 2017

Peace,

On 28/02/2017 16:00, Lennart Poettering wrote:
> On Tue, 28.02.17 13:26, Pascal Kolijn (p.kolijn at vu.nl) wrote:
> 
>> Hi List,
>>
>> I've subscribed to this list to ask for help in debugging a problem we
>> seem to have with the socket activated telnetd on a rhel7 system.
>>
>> A default install of telnetd collects data from some small boxes
>> deployed in the field. It works for a long time and then suddenly:
>>
>> Feb 26 17:46:53 bibr systemd: Created slice user-6006.slice.
>> Feb 26 17:46:53 bibr systemd: Starting user-6006.slice.
>> Feb 26 17:46:53 bibr systemd: Started Session 2223341 of user <USER>.
>> Feb 26 17:46:53 bibr systemd-logind: New session 2223341 of user <USER>.
>> Feb 26 17:46:53 bibr systemd: Starting Session 2223341 of user <USER>.
>> Feb 26 17:46:53 bibr systemd: Started Telnet Server (<IP>:28830).
>> Feb 26 17:46:53 pbibr001 systemd: Starting Telnet Server (<IP>:28830)...
>> Feb 26 17:46:57 bibr systemd: Failed to fork: Cannot allocate memory
> 
> Hmm, Linux fork() returns ENOMEM if the maximum number of tasks on the
> system is hit (yes, this is a bit misleading, but that's how it is).
> That max number of tasks is limited, for example, by the max number of
> assignable pids as configured in /proc/sys/kernel/pid_max. Maybe you
> hit that limit? Maybe something is leaking pids on your system, not
> reaping zombies properly?

As far as I can determine, running out of pids is not the issue: I can
see pids being reused within a day. That does not rule out that some
still go missing over time, but how do I determine whether that is the
case?
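
One way to check this might be to count tasks and zombies periodically
and compare them with the kernel limits. The following is only a
minimal sketch, assuming Python 3 is available on the box. The function
names are made up for illustration, and looking at
/proc/sys/kernel/threads-max next to pid_max is my own addition (every
thread consumes a pid as well), not something suggested in this thread.

```python
#!/usr/bin/env python3
import os


def parse_stat(stat_line):
    """Return (state, num_threads) from the text of /proc/<pid>/stat."""
    # comm (field 2) is parenthesised and may contain spaces or ')',
    # so only split what follows the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    return rest[0], int(rest[17])  # fields 3 (state) and 20 (num_threads)


def read_int(path):
    """Read a single integer from a sysctl-style file."""
    with open(path) as f:
        return int(f.read().split()[0])


def survey(proc="/proc"):
    """Count processes, zombies and threads currently visible in proc."""
    counts = {"processes": 0, "zombies": 0, "threads": 0}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a pid directory
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                state, num_threads = parse_stat(f.read())
        except (OSError, IndexError, ValueError):
            continue  # process went away while we were reading it
        counts["processes"] += 1
        counts["threads"] += num_threads
        if state == "Z":
            counts["zombies"] += 1
    return counts


if __name__ == "__main__":
    counts = survey()
    counts["pid_max"] = read_int("/proc/sys/kernel/pid_max")
    counts["threads_max"] = read_int("/proc/sys/kernel/threads-max")
    for key in sorted(counts):
        print("%-12s %d" % (key, counts[key]))
```

Run from cron every few minutes and appended to a log, a zombie or
thread count that keeps climbing towards either limit would point at a
leak, whereas flat numbers would suggest looking elsewhere.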

What I do see is that the RSS of the systemd process grows slowly over
time in the production environment. Unfortunately, I have not yet been
able to reproduce the situation in a test environment. I think I can
simulate the telnet connects more accurately after I speak with the
developer of those boxes, and then see whether I can create a
reproducible situation.
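
To put numbers on that growth, something like the following could log
the RSS of PID 1 at a fixed interval. Again this is only a sketch under
assumptions: Python 3 on the box, a ten-minute interval picked
arbitrarily, and helper names that are hypothetical. It reads the VmRSS
line from /proc/1/status.

```python
#!/usr/bin/env python3
import time


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads have no VmRSS line


def sample_rss_kb(pid=1):
    """Read the current resident set size of the given pid in kB."""
    with open("/proc/%d/status" % pid) as f:
        return parse_vmrss_kb(f.read())


if __name__ == "__main__":
    INTERVAL = 600  # seconds between samples; an arbitrary choice
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, sample_rss_kb(1), "kB", flush=True)
        time.sleep(INTERVAL)
```

Redirecting the output to a file and comparing it with the number of
telnet sessions over the same period should show whether the growth
tracks the connection count.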

>> Feb 26 17:46:57 bibr systemd: Assertion 'pid >= 1' failed at
>> src/core/unit.c:1996, function unit_watch_pid(). Aborting.
>> Feb 26 17:46:57 bibr001 systemd: Caught <ABRT>, cannot fork for core
>> dump: Cannot allocate memory
>> Feb 26 17:46:57 bibr systemd: Freezing execution.
> 
> So this is definitely a bug. If the limit is hit, we should certainly
> not hit an assert. I tried to figure out how this could ever happen,
> but afaics this should not be possible on current git at least. Any
> chance you can try to reproduce this issue with something more recent
> than a rhel7 box?

Hmmm, the version we currently use in production is:

# rpm -qa | grep systemd
systemd-libs-219-19.el7_2.13.x86_64
systemd-219-19.el7_2.13.x86_64
systemd-sysv-219-19.el7_2.13.x86_64

I think I can update the production machine to the current state in
7.3, but I would be reluctant to go for anything more recent than that.

Maybe in the test env, if I can reproduce it there.

> Either way it appears that there's both a bug on your setup and in
> systemd: something leaks processes (which is bug #1, in your setup)
> and then systemd doesn't deal properly with that (which is bug #2, in
> systemd upstream)...
> 
> Lennart
> 

Pascal Kolijn
Vrije Universiteit Amsterdam