[systemd-devel] Jobs dropped too readily (prefdm/start dropped as a dep while deleting plymouth-quit/stop)
Colin Guthrie
gmane at colin.guthr.ie
Tue Apr 10 01:47:06 PDT 2012
'Twas brillig, and Colin Guthrie at 09/04/12 01:56 did gyre and gimble:
> 'Twas brillig, and Colin Guthrie at 09/04/12 00:29 did gyre and gimble:
>> Here we can see why prefdm doesn't get started. It was dropped as a dep
>> to break an ordering cycle. However, it's actually part of the cycle
>> itself, and thus it likely should be excluded from the dependant jobs
>> when they are deleted.
>>
>> i.e. a job may be a dependency of the job being dropped, but it might
>> also exist in its own right as a dep elsewhere. In such circumstances,
>> shouldn't it be allowed to continue?
>>
>> Or perhaps dependant jobs should not be cleared in the first loop. i.e.
>> try continuing without deleting dependant jobs, but keep a list of those
>> that would be deleted. If the first loop did not solve the problem, then
>> delete the deps.
>>
>> Or perhaps when deleting a stop job, we should not delete any dependant
>> start jobs? Or even somehow process conflicts first before verifying the
>> order? To explain some of the rules here are:
>
> Just as a random thought, I tried simply not deleting any dependant jobs
> as per the attached patch.
>
> This resulted in the following results:
>
> Before:
> [ 4.165800] systemd[1]: Activating default unit: default.target
> [ 4.165825] systemd[1]: Trying to enqueue job
> graphical.target/start/replace
> [ 4.166048] systemd[1]: Found ordering cycle on basic.target/start
> [ 4.166048] systemd[1]: Walked on cycle path to sockets.target/start
> [ 4.166048] systemd[1]: Walked on cycle path to syslog.socket/start
> [ 4.166048] systemd[1]: Walked on cycle path to basic.target/start
> [ 4.166165] systemd[1]: Breaking ordering cycle by deleting job
> syslog.socket/start
> [ 4.166165] systemd[1]: Found ordering cycle on prefdm.service/start
> [ 4.166165] systemd[1]: Walked on cycle path to
> plymouth-quit.service/stop
> [ 4.166165] systemd[1]: Walked on cycle path to rc-local.service/start
> [ 4.166165] systemd[1]: Walked on cycle path to rinetd.service/start
> [ 4.166165] systemd[1]: Walked on cycle path to atieventsd.service/start
> [ 4.166165] systemd[1]: Walked on cycle path to prefdm.service/start
> [ 4.166165] systemd[1]: Breaking ordering cycle by deleting job
> plymouth-quit.service/stop
> [ 4.166165] systemd[1]: Deleting job prefdm.service/start as
> dependency of job plymouth-quit.service/stop
> [ 4.166165] systemd[1]: Found ordering cycle on prefdm.service/stop
> [ 4.166171] systemd[1]: Walked on cycle path to getty@tty1.service/start
> [ 4.166179] systemd[1]: Walked on cycle path to
> plymouth-quit-wait.service/start
> [ 4.166195] systemd[1]: Walked on cycle path to rc-local.service/start
> [ 4.166198] systemd[1]: Walked on cycle path to rinetd.service/start
> [ 4.166201] systemd[1]: Walked on cycle path to atieventsd.service/start
> [ 4.166204] systemd[1]: Walked on cycle path to prefdm.service/stop
> [ 4.166207] systemd[1]: Breaking ordering cycle by deleting job
> getty@tty1.service/start
> [ 4.166311] systemd[1]: Installed new job graphical.target/start as 1
>
>
> After:
> [ 4.396671] systemd[1]: Activating default unit: default.target
> [ 4.396697] systemd[1]: Trying to enqueue job
> graphical.target/start/replace
> [ 4.397007] systemd[1]: Found ordering cycle on basic.target/start
> [ 4.397011] systemd[1]: Walked on cycle path to sockets.target/start
> [ 4.397014] systemd[1]: Walked on cycle path to syslog.socket/start
> [ 4.397017] systemd[1]: Walked on cycle path to basic.target/start
> [ 4.397020] systemd[1]: Breaking ordering cycle by deleting job
> syslog.socket/start
> [ 4.397026] systemd[1]: Found ordering cycle on prefdm.service/start
> [ 4.397029] systemd[1]: Walked on cycle path to
> plymouth-quit.service/stop
> [ 4.397030] systemd[1]: Walked on cycle path to rc-local.service/start
> [ 4.397030] systemd[1]: Walked on cycle path to rinetd.service/start
> [ 4.397030] systemd[1]: Walked on cycle path to atieventsd.service/start
> [ 4.397030] systemd[1]: Walked on cycle path to prefdm.service/start
> [ 4.397030] systemd[1]: Breaking ordering cycle by deleting job
> plymouth-quit.service/stop
> [ 4.397030] systemd[1]: Found ordering cycle on prefdm.service/start
> [ 4.397030] systemd[1]: Walked on cycle path to getty@tty1.service/stop
> [ 4.397030] systemd[1]: Walked on cycle path to
> plymouth-quit-wait.service/start
> [ 4.397030] systemd[1]: Walked on cycle path to rc-local.service/start
> [ 4.397030] systemd[1]: Walked on cycle path to rinetd.service/start
> [ 4.397030] systemd[1]: Walked on cycle path to atieventsd.service/start
> [ 4.397030] systemd[1]: Walked on cycle path to prefdm.service/start
> [ 4.397030] systemd[1]: Breaking ordering cycle by deleting job
> getty@tty1.service/stop
> [ 4.397030] systemd[1]: Looking at job prefdm.service/start
> conflicted_by=no
> [ 4.397030] systemd[1]: Looking at job prefdm.service/stop
> conflicted_by=no
> [ 4.397132] systemd[1]: Fixing conflicting jobs by deleting job
> prefdm.service/stop
> [ 4.397132] systemd[1]: Installed new job graphical.target/start as 1
>
>
> This is obviously good in this case, but I'm not sure what the knock-on
> effects will be.
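To make the difference between the two logs concrete, here is a toy Python model of the two behaviours. It is only a sketch under stated assumptions, not systemd's transaction code: delete_job, the dependents map and the cascade flag are made-up names, and the single dependent link is taken from the "Deleting job ... as dependency of job ..." line in the "Before" log.

```python
def delete_job(jobs, dependents, victim, cascade):
    """Return the jobs left after deleting `victim` from a transaction.

    jobs:       set of "unit/type" strings in the transaction
    dependents: hypothetical map, job -> jobs deleted along with it
    cascade:    True mimics the "Before" log, False the patched "After" log
    """
    remaining = set(jobs)
    pending = [victim]
    while pending:
        job = pending.pop()
        if job not in remaining:
            continue  # already deleted, or never part of the transaction
        remaining.discard(job)
        if cascade:
            # Also queue every job recorded as a dependent of this one.
            pending.extend(dependents.get(job, ()))
    return remaining


jobs = {
    "graphical.target/start",
    "prefdm.service/start",
    "plymouth-quit.service/stop",
}
dependents = {"plymouth-quit.service/stop": ["prefdm.service/start"]}

before = delete_job(jobs, dependents, "plymouth-quit.service/stop", cascade=True)
after = delete_job(jobs, dependents, "plymouth-quit.service/stop", cascade=False)
print("prefdm.service/start" in before)  # False
print("prefdm.service/start" in after)   # True
```

With cascade enabled the display manager's start job disappears along with the stop job that was chosen to break the cycle; without it, the start job survives to be checked again later.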
I gave this a bit more thought over the weekend, and a few things
strike me:
1. It would be very nice to be able to include a message somewhere that
a given unit's job was dropped such that sysadmins can see this easily
without looking specifically at systemd logs.
Keep in mind that sysadmins will be very unit-centric when this problem
occurs, so "systemctl status foo.unit" will be the first thing they run.
Can we put something into the status output that relates to the job
history? Perhaps just log it in the journal, but tag it somehow so that
it can be contextually extracted for a given unit? That might work, but
maybe a history of jobs should just be kept somehow in the structure.
That way we could show all the jobs run as one big list and also, for a
given unit, show something like this in the status output:
prefdm.service - Display Manager
    Loaded: loaded (/lib/systemd/system/prefdm.service; static)
    Active: active (running) since Tue, 10 Apr 2012 01:17:28 +0100; 8h ago
  Main PID: 1878 (gdm-binary)
      Jobs: start(pending); stop(succeeded); start(cancelled); 3 more
    CGroup: name=systemd:/system/prefdm.service
            ├ 1878 /usr/sbin/gdm-binary -nodaemon
            ├ 1947 /usr/lib64/gdm-simple-slave --display-id /org/gnom...
            └ 1950 /etc/X11/X :0 -br -verbose -auth /var/run/gdm/auth...
That way it would be much clearer as to what went wrong. If a job was
cancelled or failed that would be a good indicator as to why something
might not be working rather than just to have the job deleted and a
little bit of log go into systemd's log file. I think most sysadmins
would very much appreciate a unit-centric approach to jobs here.
A second command, "list-unit-jobs" or "unit-list-jobs", would show all the
jobs for a given unit (in date order; most recent first is probably most
useful).
"list-jobs" could also get a --historic or --past argument to show
previous jobs.
So that would be very nice.
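A rough sketch of the per-unit job history idea, in Python purely for illustration. Nothing like this exists in systemd as far as this mail is concerned: JobHistory, JobRecord, record() and status_line() are hypothetical names, and the bounded history length is my own assumption.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRecord:
    job_type: str  # e.g. "start", "stop"
    result: str    # e.g. "pending", "succeeded", "cancelled"


class JobHistory:
    """Keeps a bounded, newest-first list of jobs for each unit."""

    def __init__(self, max_per_unit=32):
        self._by_unit = defaultdict(lambda: deque(maxlen=max_per_unit))

    def record(self, unit, job_type, result):
        # appendleft keeps the most recent job at index 0; once the deque
        # is full, the oldest entry falls off the right-hand end.
        self._by_unit[unit].appendleft(JobRecord(job_type, result))

    def jobs_for(self, unit):
        return list(self._by_unit.get(unit, ()))

    def status_line(self, unit, shown=3):
        records = self.jobs_for(unit)
        if not records:
            return "Jobs: none"
        parts = [f"{r.job_type}({r.result})" for r in records[:shown]]
        extra = len(records) - shown
        if extra > 0:
            parts.append(f"{extra} more")
        return "Jobs: " + "; ".join(parts)


history = JobHistory()
for job_type, result in [
    ("start", "succeeded"),
    ("stop", "succeeded"),
    ("start", "succeeded"),
    ("start", "cancelled"),
    ("stop", "succeeded"),
    ("start", "pending"),
]:
    history.record("prefdm.service", job_type, result)

print(history.status_line("prefdm.service"))
# Jobs: start(pending); stop(succeeded); start(cancelled); 3 more
```

The same store would serve both views: jobs_for() backs a per-unit listing, and iterating over all units would back a historic "list-jobs".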
2. When resolving ordering cycles, is it really right to delete the
whole job? That's quite a drastic action! While I can see the logic in NOT
doing this, would it be better to simply drop a given ordering
dependency, not the job itself? What I mean is, still carry on with the
job requested but accept that we'll have done it at the wrong time.
I'm not really sure if it's better to do a job at the wrong time or to
simply not do it at all. I think the latter actually seems more correct
(i.e. no change). Just thought I'd mention it.
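To illustrate the two options side by side, here is a small Python sketch on a toy graph. It is not how systemd picks its victim: find_cycle and the two break_by_* helpers are hypothetical, the job names are placeholders, and "after" simply maps each job to the jobs it is ordered after.

```python
def find_cycle(jobs, after):
    """Return one ordering cycle as a list of jobs, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {job: WHITE for job in jobs}
    path = []

    def visit(job):
        colour[job] = GREY
        path.append(job)
        for dep in sorted(after.get(job, ())):
            if dep not in colour:
                continue  # edge points at a job that is not in the transaction
            if colour[dep] == GREY:
                return path[path.index(dep):]  # back edge: cycle found
            if colour[dep] == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        colour[job] = BLACK
        return None

    for job in sorted(jobs):
        if colour[job] == WHITE:
            found = visit(job)
            if found:
                return found
    return None


def break_by_deleting_job(jobs, after):
    """Option 1: delete a whole job from each cycle until none remain."""
    jobs = set(jobs)
    while True:
        cycle = find_cycle(jobs, after)
        if cycle is None:
            return jobs
        jobs.discard(cycle[-1])


def break_by_dropping_edge(jobs, after):
    """Option 2: keep every job, drop one ordering edge from each cycle."""
    after = {job: set(deps) for job, deps in after.items()}
    dropped = []
    while True:
        cycle = find_cycle(jobs, after)
        if cycle is None:
            return after, dropped
        # The last job on the cycle path is ordered after the first one.
        after[cycle[-1]].discard(cycle[0])
        dropped.append((cycle[-1], cycle[0]))


jobs = {"a/start", "b/start", "c/start"}
after = {"a/start": {"b/start"}, "b/start": {"c/start"}, "c/start": {"a/start"}}

print(sorted(break_by_deleting_job(jobs, after)))  # ['a/start', 'b/start']
print(break_by_dropping_edge(jobs, after)[1])      # [('c/start', 'a/start')]
```

Option 1 loses c/start entirely; option 2 still runs all three jobs, but c/start no longer waits for a/start, which is exactly the "done at the wrong time" trade-off.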
Cheers
Col
--
Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/
Day Job:
Tribalogic Limited http://www.tribalogic.net/
Open Source:
Mageia Contributor http://www.mageia.org/
PulseAudio Hacker http://www.pulseaudio.org/
Trac Hacker http://trac.edgewall.org/