[systemd-devel] systemd unexpectedly dropping into rescue mode - how do I best debug this?

Thu Oct 4 03:54:11 PDT 2012

* Kay Sievers <kay at vrfy.org> wrote:

> On Thu, Oct 4, 2012 at 12:05 PM, Ingo Molnar <mingo at kernel.org> wrote:
> >
> > * Tom Gundersen <teg at jklm.no> wrote:
> >
> >> Hi Ingo,
> >>
> >> On Thu, Oct 4, 2012 at 11:12 AM, Ingo Molnar <mingo at kernel.org> wrote:
> >> > I'm wondering how to debug the following systemd problem:
> >>
> >> [...]
> >>
> >> > Here are the units that are showing some sort of error:
> >> >
> >> > lyra:~> systemctl --all | grep -i err
> >> > exim.service              error  inactive dead          exim.service
> >> > iscsi.service             error  inactive dead          iscsi.service
> >> > iscsid.service            error  inactive dead          iscsid.service
> >> > livesys-late.service      error  inactive dead          livesys-late.service
> >> > livesys.service           error  inactive dead          livesys.service
> >> > named.service             error  inactive dead          named.service
> >> > postfix.service           error  inactive dead          postfix.service
> >> > remount-rootfs.service    error  inactive dead          remount-rootfs.service
> >> > ypserv.service            error  inactive dead          ypserv.service
> >>
> >> You might get useful information from:
> >>
> >> # systemctl status remount-rootfs.service
> >>
> >> (and similarly for the other ones).
> >
> > Querying those gives me the following uninformative output:
> 
> > Trying to start it again gives:
> >
> > [root@lyra ~]# systemctl start remount-rootfs.service
> > Failed to issue method call: Unit remount-rootfs.service failed
> > to load: No such file or directory. See system logs and
> > 'systemctl status remount-rootfs.service' for details.
> 
> These are just units that other units try to order against, but which
> are not available. Nothing is wrong here, besides the misleading
> output; we should think about what to report instead.
> 
> 
> Could you add:
>   systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M
> to the kernel command line? It will print all userspace log to the
> kernel buffer, which is the most reliable way to store the logs when
> stuff goes wrong that early during bootup.
> 
> It should tell us more what's going on, and why you end up in the rescue shell.

Sure, will try this straight away.
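
Before rebooting I'll double-check that all three parameters really 
end up on the command line, roughly like this. It's only a sketch: 
check_cmdline is a helper name I made up for illustration, and it 
assumes /proc/cmdline is readable on the booted kernel.

```shell
# check_cmdline (hypothetical helper): print the suggested debug
# parameters that are NOT present in the command line passed as $1.
check_cmdline() {
    cmdline="$1"
    missing=""
    for p in systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M; do
        case " $cmdline " in
            *" $p "*) ;;                      # parameter is present
            *) missing="$missing $p" ;;       # remember what is absent
        esac
    done
    echo "${missing# }"                       # strip the leading space
}

# Check the running kernel; prints nothing when all three are set.
result=$(check_cmdline "$(cat /proc/cmdline 2>/dev/null)")
[ -z "$result" ] || echo "missing: $result"
```

Once it boots with those set, 'dmesg | grep -i systemd' should show 
what systemd logged into the kernel buffer.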

> This is the wiki page about debugging:
>   http://www.freedesktop.org/wiki/Software/systemd/Debugging

Yeah, I've seen that - but I kind of expected that the bootup 
not going well is one of the most frequent failure cases for 
systemd - and I expected that once such a common failure mode 
happens all the info is there to recover and find the reason for 
the failure.

Having to reboot the system really destroys failure state, and 
it's a lucky circumstance that I can reproduce the failure and 
can give you debug output.

So I'm kind of surprised that I have to reboot the system to 
debug it further and that all the available error output is so 
unhelpful. With such a scheme how do you debug spurious 
dependency/bootup bugs, or non-deterministic parallel bootup 
bugs/races, if it's not possible to get in situ data from 
production systems?

So for example, do we know from the info I've provided why 
systemd went into rescue mode? I suspect some target failed - 
can I list which targets failed for the default.target 
(multi-user.target) and re-try to load them?
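
The kind of in-situ query I have in mind would look something like 
the sketch below. It assumes the 'systemctl --all' column layout from 
the listing quoted above (unit name first, load state second), and 
that 'systemctl show -p' prints the named properties of a unit; 
units_with_load_error is a name I made up for illustration.

```shell
# units_with_load_error (hypothetical helper): read `systemctl --all`
# style rows on stdin and print the unit name (column 1) of every row
# whose load state (column 2) is "error".
units_with_load_error() {
    awk '$2 == "error" { print $1 }'
}

# Only query systemd when systemctl is actually available.
if command -v systemctl >/dev/null 2>&1; then
    # Units that could not be loaded at all.
    systemctl --all --no-pager 2>/dev/null | units_with_load_error
    # What the default target pulls in, to compare against the above.
    systemctl show -p Requires -p Wants default.target 2>/dev/null || true
fi
```

That would at least tell me whether anything default.target actually 
requires is among the units that failed to load.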

I'm just trying to figure out how to analyze this best without 
having to reboot the system.

My best guess is that something is missing from the kernel I've 
booted - but it's strange that I'm unable to analyze/debug it 
from the failed, rescue state itself.

If I Ctrl-D out of the rescue shell, it reaches the multi-user 
target just fine. Is that expected?

Thanks,

	Ingo