[systemd-devel] systemd debug out of memory

Fri Mar 10 10:40:37 UTC 2017

On Sun, Mar 5, 2017 at 3:59 PM, Pascal Kolijn <p.kolijn at vu.nl> wrote:
> Peace,
>
> On 28/02/2017 16:00, Lennart Poettering wrote:
>> On Tue, 28.02.17 13:26, Pascal kolijn (p.kolijn at vu.nl) wrote:
>>
>>> Hi List,
>>>
>>> I've subscribed to this list to ask for help in debugging a problem we
>>> seem to have with the socket activated telnetd on a rhel7 system.
>>>
>>> A default install of telnetd collects data from some small boxes
>>> deployed in the field. It works for a long time and then suddenly:
>>>
>>> Feb 26 17:46:53 bibr systemd: Created slice user-6006.slice.
>>> Feb 26 17:46:53 bibr systemd: Starting user-6006.slice.
>>> Feb 26 17:46:53 bibr systemd: Started Session 2223341 of user <USER>.
>>> Feb 26 17:46:53 bibr systemd-logind: New session 2223341 of user <USER>.
>>> Feb 26 17:46:53 bibr systemd: Starting Session 2223341 of user <USER>.
>>> Feb 26 17:46:53 bibr systemd: Started Telnet Server (<IP>:28830).
>>> Feb 26 17:46:53 pbibr001 systemd: Starting Telnet Server (<IP>:28830)...
>>> Feb 26 17:46:57 bibr systemd: Failed to fork: Cannot allocate memory
>>
>> Hmm, Linux fork() returns ENOMEM if the maximum number of tasks on the
>> system is hit (yes this is a bit misleading, but that's how it is).
>> That max number of tasks is limited for example by the max number of
>> assignable pids as configured in /proc/sys/kernel/pid_max? Maybe you
>> hit that limit? Maybe something is leaking pids on your system? not
>> reaping zombies properly?
>
> As far as I can determine running out of pids is not the issue, as I can
> see pids being reused in a day, which will not say that some may still
> go missing over time, but how do I determine if that is the case...?
>
> What I do see is that the rss of the systemd process is slowly growing
> over time in the production environment. I've not been able (yet) to
> reproduce the situation in a test environment, which is a pity. I think
> I can simulate the telnet connects more accurately after I speak with
> the developer of the said boxes, and see if I can create a reproducible
> situation.
>
>>> Feb 26 17:46:57 bibr systemd: Assertion 'pid >= 1' failed at
>>> src/core/unit.c:1996, function unit_watch_pid(). Aborting.
>>> Feb 26 17:46:57 bibr001 systemd: Caught <ABRT>, cannot fork for core
>>> dump: Cannot allocate memory
>>> Feb 26 17:46:57 bibr systemd: Freezing execution.
>>
>> So this is definitely a bug. If the limit is hit, we hould certainly
>> not hit an assert. I tried to figure out how this could ever happen,
>> but afaics this should not be possible on current git at least. Any
>> chance you can try to reproduce this isue with something more recent
>> than a rhel7 box?
>
> Hmmm, the version we currently use in production is:
>
> # rpm -qa | grep systemd
> systemd-libs-219-19.el7_2.13.x86_64
> systemd-219-19.el7_2.13.x86_64
> systemd-sysv-219-19.el7_2.13.x86_64

I've backported bunch of fixes for memory leaks to
systemd-219-19.el7_2.14. From changelog,

* Mon Aug 22 2016 Lukas Nykryn <lnykryn at redhat.com> - 219-19.14
- core: fix memory leak on set-default, enable, disable etc (#1331667)
- nspawn: fix minor memory leak (#1331667)
- basic: fix error/memleak in socket-util (#1331667)
- core: fix memory leak in manager_run_generators() (#1331667)
- modules-load: fix memory leak (#1331667)
- core: fix memory leak on failed preset-all (#1331667)
- sd-bus: fix memory leak in test-bus-chat (#1331667)
- core: fix memory leak in transient units (#1331667)

Fix is in the code path that is hit everytime you log onto the box,
because every session has its own scope unit.

- bus: fix leak in error path (#1331667)
- shared/logs-show: fix memleak in add_matches_for_unit (#1331667)

>
> I think I can update it to the current state in 7.3 for the production
> machine, but will be reluctant to go for a more recent version...

Those fixes are of course included in 7.3 as well.

Michal

>
> Maybe in the test env, if I can reproduce it there.
>
>> Either way it appears that there's both a bug on your setup and in
>> systemd: something leaks processes (which is bug #1, in your setup)
>> and then systemd doesn't deal properly with that (which is bug #2, in
>> systemd upstream)...
>>
>> Lennart
>>
>
> Pascal Kolijn
> Vrije Universiteit Amsterdam
> _______________________________________________
> systemd-devel mailing list
> systemd-devel at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/systemd-devel