[systemd-devel] Antw: [EXT] Re: Odd status after core dump

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Aug 3 06:32:57 UTC 2020


>>> Lennart Poettering <lennart at poettering.net> wrote on 31.07.2020 at 21:57
in message <20200731195737.GA213997 at gardel-login>:
> On Fr, 31.07.20 11:52, Ulrich Windl (Ulrich.Windl at rz.uni-regensburg.de) 
> wrote:
> 
>> Hi!
>>
>> For systemd-228-157.12.5.x86_64 (SLES12 SP5) I had an odd status:
>> Three services were re-started in parallel, and one of those failed (its stop
>> routine killed the process with signal 6 due to a timeout):
>>
>> Now the status systemctl displays looks like this (some details omitted):
>> iotwatch at NFS1.service - iotwatch I/O performance monitor instance "NFS1"
>>    Loaded: loaded (/etc/iotwatch.conf; enabled; vendor preset: disabled)
>>    Active: active (running) since Fri 2020-07-31 11:31:57 CEST; 6min ago
>> ...
>>  Main PID: 12246 (iotwatch-NFS1)
>>     Tasks: 4 (limit: 512)
>>    CGroup: /system.slice/system-iotwatch.slice/iotwatch at NFS1.service 
>>            └─12246 iotwatch-NFS1 -l /var/log/iotwatch/NFS1/iotwatch-NFS1.log
>> ...
>>
>> Jul 31 11:31:57 v04 systemd[1]: Starting iotwatch I/O performance moni.....
>> Jul 31 11:31:57 v04 systemd[1]: Started iotwatch I/O performance monit...".
>> Jul 31 11:31:59 v04 systemd-coredump[12190]: Process 7123 (iotwatch-NFS1...
>>
>>                                                   Stack trace of thread 7123:
>> Hint: Some lines were ellipsized, use -l to show in full.
>>
>> Now the odd thing is the sequence of journal entries:
>> The service started and then dumped core, but is still active?
>>
>> Actually what happened was: a client told the server process to stop, but
>> when it didn't stop in time, the client killed it with signal 6. So the old
>> process stopped. THEN the new process was started (PID 12246).
>>
>> Is the ordering due to the fact that the coredump completed after the new
>> service started? Or is it just systemd's archiving of the core dump file? I
>> thought the PID is visible until the core dump has completed...
> 
> When a coredump happens we'll collect it and then process it,
> i.e. compress it, generate a stack trace of it, and finally, when we are
> done with all that, report it to the journal. Hence it might be that
> the coredump processing finishes long after the service entered
> another cycle. In fact, we nice the coredump processing to nice level
> 9, since doing the stack trace analysis, the compression and writing
> everything to disk might be quite CPU and IO intensive, hence we
> prioritize it down in order not to affect scheduling of everything else
> too negatively. This however means that if CPU/IO is scarce the
> coredump processing might be delayed quite a bit.

Hi!

OK, some additional notes (maybe these issues are already fixed in a newer
version):
Looking at my /var/log/messages, I see higher-resolution timestamps, and
specifically I see the original core-dump message:

2020-07-31T11:31:57.404198+02:00 v04 systemd[1]: iotwatch at NFS1.service: Main
process exited, code=dumped, status=6/ABRT
2020-07-31T11:31:57.421639+02:00 v04 systemd[1]: Stopped iotwatch I/O
performance monitor instance "NFS1".

(Using "journalctl I see lower-resolution timestamps as output)
Jul 31 11:31:57 v04 systemd[1]: iotwatch at NFS1.service: Main process exited,
code=dumped, status=6/ABRT
Jul 31 11:31:57 v04 systemd[1]: Stopped iotwatch I/O performance monitor
instance "NFS1".
Jul 31 11:31:57 v04 systemd[1]: iotwatch at NFS1.service: Unit entered failed
state.
Jul 31 11:31:57 v04 systemd[1]: iotwatch at NFS1.service: Failed with result
'core-dump'.
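
(As an aside, the lower resolution seems to be only a matter of journalctl's
default output format; unless I'm mistaken, this systemd version can also print
the stored microsecond timestamps, e.g.:

    journalctl -u iotwatch@NFS1.service -o short-precise
    journalctl -u iotwatch@NFS1.service -o verbose
)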

So why is the "core dump complete" message included in the status output of the
newly started service instance when the core dump happened while stopping the
old one, and why isn't the original core-dump message included as well?
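
In case it helps to narrow this down: as far as I understand it, the
systemd-coredump journal entry carries the crashed PID, the owning unit and the
time of the crash in its own fields (COREDUMP_PID, COREDUMP_UNIT,
COREDUMP_TIMESTAMP), so one should at least be able to confirm that the late
message refers to the old process 7123 and not to the new instance, roughly
like this:

    journalctl -u iotwatch@NFS1.service -o verbose \
        | grep -E 'COREDUMP_(PID|UNIT|TIMESTAMP|SIGNAL)='
    coredumpctl info 7123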

Regards,
Ulrich
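
P.S.: Regarding the nice level 9: next time this happens I'll try to watch the
coredump helper while it is processing a large core, roughly like this (just a
sketch from my side, not verified here yet):

    # nice value and arguments of a running systemd-coredump helper
    ps -e -o pid,ni,args | grep '[s]ystemd-coredump'
    # afterwards, what was archived and when
    coredumpctl list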

> 
> Lennart
> 
> --
> Lennart Poettering, Berlin




