[systemd-devel] daemon-reload seems racy

Mon Jan 20 08:00:31 PST 2014

2014/1/20 Colin Guthrie <gmane at colin.guthr.ie>:
> CC'ing Zbigniew as he's working on the Fedora bug AFAIK.
>
> 'Twas brillig, and Lennart Poettering at 20/01/14 12:37 did gyre and gimble:
>> On Thu, 16.01.14 12:28, Colin Guthrie (gmane at colin.guthr.ie) wrote:
>>
>>>
>>> 'Twas brillig, and Colin Guthrie at 14/01/14 13:28 did gyre and gimble:
>>>> 3. Some sort of kernel trigger for me today led it to run two reexecs
>>>> quite quickly and triggered this problem randomly during runtime. This
>>>> *might* have come in via "telinit u" instead. It doesn't appear that the
>>>> kernel actually execs telinit directly but perhaps userspace can react
>>>> on it in some way?
>>>
>>> OK, this, it turns out is a result of running prelink via cron.
>>>
>>> The prelink package we (Mageia) have is basically the same as the Fedora
>>> one. It has a cronjob which calls "telinit u" but the prelink binary
>>> itself calls "/sbin/init U" which does the same thing, thus two
>>> daemon-reexecs in rapid succession which triggers this bug.
>>>
>>> For now I've disabled the "telinit u" call in prelink, but the real
>>> trick would be fixing the bug/race in serialisation :)
>>
>> Hmm, so, normally PID 1 should not accept new requests after the
>> deserialization of the first reexec is complete.
>>
>> Let me summarize this a bit:
>>
>> Is this about reexec or reload? Or both?
>
> I was confused at first, but it seems "both" in the end. See here for a
> reproduction case involving either (tho' reload requires a --no-block
> param to trigger):
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1043212#c20
>
>> This is supposed to trigger the issue? "systemctl daemon-reexec ;
>> systemctl daemon-reexec"? What precisely goes bad afterwards? Does this
>> always trigger the issue or only sometimes?
>
> On my system it's pretty reliable and will trigger it every time. It
> might need a setup where loading the serialised state triggers a few
> jobs to make it take longer. e.g. on my setup the Type=oneshot units
> were all rerun when reloading the state (which actually seems wrong to
> me - e.g. my alsa-restore.service job kicked in again which made an
> in-progress VoIP call weird by suddenly changing my Headphones port back
> to Speakers!! - I've since started using the alsa-state daemon instead
> which mitigates things, but re-running oneshots seems wrong, no?)
>
>> What version are you using? Can you reproduce the issue on git?
>
> It's almost identical to the fedora 20 version - 208 + lots of patches.
>
> Not tried latest git yet I'm afraid, but as it also apparently affects
> fedora 20 (see above bug) I'm guessing you'll need something backported
> anyway and I'm not sure if there is any specific fix (although the sdbus
> port might have fixed it indirectly if it doesn't occur any more)

fwiw: I just tested with a (quite recent) git version and I can't reproduce it.
Note that this is on Arch Linux, without any sysv compat stuff.
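
In case anyone else wants to retest on their own setup, here is a rough sketch of the two cases discussed above: two back-to-back reexecs, and two reloads with --no-block. The commands themselves are the ones quoted in this thread; the function names and the SYSTEMCTL override variable are just my own helpers (hypothetical, not part of systemd), and this is not necessarily identical to the reproducer in the Red Hat bug.

```shell
#!/bin/sh
# Sketch only. SYSTEMCTL is a hypothetical override so the commands can be
# dry-run (e.g. SYSTEMCTL=echo) instead of actually poking PID 1.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Two daemon-reexec requests in rapid succession, as quoted in the thread.
repro_reexec() {
    "$SYSTEMCTL" daemon-reexec
    "$SYSTEMCTL" daemon-reexec
}

# Same idea with daemon-reload; per the thread, --no-block is needed
# for the reload variant to trigger the problem.
repro_reload() {
    "$SYSTEMCTL" --no-block daemon-reload
    "$SYSTEMCTL" --no-block daemon-reload
}
```

Source the file and call repro_reexec or repro_reload as root. Running with SYSTEMCTL=echo first only prints the arguments that would be passed, which is a cheap sanity check before touching a live system.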