[systemd-devel] daemon-reload seems racy

Colin Guthrie gmane at colin.guthr.ie
Mon Jan 20 06:14:55 PST 2014


CC'ing Zbigniew as he's working on the Fedora bug AFAIK.

'Twas brillig, and Lennart Poettering at 20/01/14 12:37 did gyre and gimble:
> On Thu, 16.01.14 12:28, Colin Guthrie (gmane at colin.guthr.ie) wrote:
> 
>>
>> 'Twas brillig, and Colin Guthrie at 14/01/14 13:28 did gyre and gimble:
>>> 3. Some sort of kernel trigger for me today led it to run two reexecs
>>> quite quickly and triggered this problem randomly during runtime. This
>>> *might* have come in via "telinit u" instead. It doesn't appear that the
>>> kernel actually execs telinit directly but perhaps userspace can react
>>> on it in some way?
>>
>> OK, this, it turns out, is a result of running prelink via cron.
>>
>> The prelink package we (Mageia) have is basically the same as the Fedora
>> one. It has a cronjob which calls "telinit u" but the prelink binary
>> itself calls "/sbin/init U" which does the same thing, thus two
>> daemon-reexecs in rapid succession which triggers this bug.
>>
>> For now I've disabled the "telinit u" call in prelink, but the real
>> trick would be fixing the bug/race in serialisation :)
> 
> Hmm, so, normally PID 1 should not accept new requests after the
> deserialization of the first reexec is complete.
> 
> Let me summarize this a bit:
> 
> Is this about reexec or reload? Or both?

I was confused at first, but it seems "both" in the end. See here for a
reproduction case involving either (tho' reload requires a --no-block
param to trigger):

https://bugzilla.redhat.com/show_bug.cgi?id=1043212#c20

> This is supposed to trigger the issue? "systemctl daemon-reexec ;
> systemctl daemon-reexec"? What precisely goes bad afterwards? Does this
> always trigger the issue or only sometimes?

On my system it's pretty reliable and triggers every time. It might
need a setup where loading the serialised state triggers a few jobs, so
that the load takes longer. E.g. on my setup the Type=oneshot units were
all rerun when reloading the state, which actually seems wrong to me: my
alsa-restore.service job kicked in again, which made an in-progress VoIP
call weird by suddenly changing my Headphones port back to Speakers!!
I've since started using the alsa-state daemon instead, which mitigates
things, but re-running oneshots seems wrong, no?
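
For reference, the back-to-back invocations discussed above can be
sketched roughly as below. This is only a sketch under assumptions: the
"repro" function name and the RUN variable are made up for illustration,
and the exact steps in the bug report may differ. By default it only
prints the commands; it runs them (as root) only when RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical helper: issue the same manager request twice in a row.
# "repro" and RUN are illustrative names, not part of systemd.
repro() {
    case "$1" in
        # Two reexecs in rapid succession.
        reexec) cmd="systemctl daemon-reexec" ;;
        # For reload, --no-block is reportedly needed to hit the race.
        reload) cmd="systemctl --no-block daemon-reload" ;;
        *) echo "usage: repro reexec|reload" >&2; return 2 ;;
    esac

    if [ "${RUN:-0}" = 1 ]; then
        # Actually send both requests (needs root and a running systemd).
        $cmd ; $cmd
    else
        # Dry run: just show what would be executed.
        echo "$cmd ; $cmd"
    fi
}
```

E.g. "repro reexec" prints the command pair, and "RUN=1 repro reexec"
executes it.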

> What version are you using? Can you reproduce the issue on git?

It's almost identical to the Fedora 20 version: 208 + lots of patches.

I've not tried latest git yet, I'm afraid, but as it also apparently
affects Fedora 20 (see above bug) I'm guessing you'll need something
backported anyway. I'm not sure if there is any specific fix (although
the sdbus port might have fixed it indirectly if it doesn't occur any
more).

Cheers

Col

-- 

Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/

Day Job:
  Tribalogic Limited http://www.tribalogic.net/
Open Source:
  Mageia Contributor http://www.mageia.org/
  PulseAudio Hacker http://www.pulseaudio.org/
  Trac Hacker http://trac.edgewall.org/

