[systemd-devel] [PATCH 1/4] Adding halt binary to shutdown the system

Fri Oct 1 07:37:25 PDT 2010

I'm replying to this as couple of the bad ideas were my fault :-)

On Fri, Oct 1, 2010 at 10:52 AM, Lennart Poettering
<lennart at poettering.net> wrote:
> On Fri, 01.10.10 02:28, Gustavo Sverzut Barbieri (barbieri at profusion.mobi) wrote:
>
>> From: Fabiano Fidencio <fidencio at profusion.mobi>
>>
>> This functions are working as follows:
>>     - Send a SIGTERM to all process
>>     - Send a SIGKILL to all process
>>     - Try to umount all mount points
>>     - Try to remount read-only all mount points that can't
>>     be umounted
>>     - Call shutdown
>>
>>     If one step fail, shutdown will be aborted
>
> This shouldn't fail, ever. We need to make sure that we don't ever hang in
> the middle of nowhere.

yeah, actually I've noticed that getppid() was failing (of course) and
I was getting a kernel oops "trying to kill init" that I've spent
hours to debug as I thought the behavior of kill(-1, X) was not like
the documented. at the end it was just returning and thus it was not
defined. But we need to figure out what to do in case of errors.

>> +     systemd-halt \
>
> Could we name this "systemd-shutdown" please? I kinda see "shutdown" as
> a higher level term for all of "halt", "reboot", "poweroff" and "kexec".

this was not my idea, but I guess Fabiano did it due sysvinit main
tool and others being links.

>>       systemd-modules-load \
>>       systemd-remount-api-vfs \
>>       systemd-kmsg-syslogd \
>> @@ -349,7 +350,7 @@ libsystemd_core_la_SOURCES = \
>>       src/service.c \
>>       src/automount.c \
>>       src/mount.c \
>> -     src/umount.c \
>> +     src/halt.c \
>
> Hmm, this doesn't belong in libsystemd_core, and neither does umount.c.
>
>> +static void sig_alarm(int sig) {
>> +        run = 0;
>> +}
>
> Grmbl. Using signals like this is racy, since they might get delivered
> between the check for 'run' in the loop below, and the next call to
> wait(). The effect of that is the timout might get eaten up and the
> system freezes.
>
> Or in other words: never ever use SIGLARM or alarm().

well, my idea as well, but I like your race-free more ;-)

>> +int main(int argc, char *argv[]) {
>> +        int c, cmd;
>> +
>> +        if (getpid() != 1)
>> +                return -1;
>
> Returning -1 in main() is not really defined.
>
>> +
>> +        if (argc != 2)
>> +                return -1;
>> +
>> +        switch ((c = getopt(argc, argv, "hrp"))) {
>> +                case 'h':
>> +                        cmd = LINUX_REBOOT_CMD_HALT;
>
> There's a break missing here.

my stupid fault, i added but not tested it.

>> +                case 'p':
>> +                        cmd = RB_POWER_OFF;
>> +                        break;
>> +                case 'r':
>> +                        cmd = RB_AUTOBOOT;
>> +                        break;
>> +                default:
>> +                        return -1;
>> +        }
>
> I'd kinda prefer a proper verb on the command line here. I.e. invoking
> "systemd-shutdown halt" should halt the system, "systemd-shutdown
> reboot" should reboot it, and so on.

makes sense, and as we're just taking few pre-determined options we
can easily strcmp instead of going getopt().

>> +
>> +        /* lock us into memory */
>> +        mlockall(MCL_CURRENT|MCL_FUTURE);
>> +
>> +        signal(SIGALRM, sig_alarm);
>> +        kill(-1, SIGTERM);
>
> Unfortunately this will probably not be sufficient. Some PIDs probably
> must be spared (for example: mdmon and stuff). While I am not entirely
> sure how we best should allow apps to mark themselves to be spared here
> (my ideal solution would be to patch the kernel to allow user xattrs on
> /proc/$PID...), I think we must iterate through the processes and kill
> things individually, potentially omitting a few while doing that.
>
> If you do this then it is probably wise to make sure that the SIGTERM is
> delivered for all processes at the same time and not one after the
> other, hence it is recommended to use kill(-1, SIGSTOP) before and
> kill(-1, SIGCONT) afterwards. This is btw what the sysv implementation
> of killall5 does.

we saw the STOP->sig->CONT, but I asked him to remove as we could do
it better being pid1. His first implementation was very much like
sysv.

can you confirm we need this for mdmon and all? If they just ignore
SIGTERM, all we have to do is try to umount devicemapper before
SIGKILL.

>> +        alarm(10);
>
> Do not use alram.
>
>> +        while (run) {
>> +                pid_t r = wait(NULL);
>> +                if (r == (pid_t)-1 && errno == ECHILD)
>> +                        run = 0;
>> +        }
>
> While I like the idea of using wait() like this it won't work anymore if
> we need to spare some processes.
>
> You can use sigtimedwait() + waitpid(WNOHANG) to wait for SIGCHLD and
> apply a timeout.
>
> Make sure that if you iterate through the processes you count the
> processes you killed so that we know how many processes to wait for.
>
>> +        alarm(0);
>> +        kill(-1, SIGKILL);
>
> A similar loop is needed here.

why? things will just die... but yeah, the loop should exit quickly.

>> +
>> +        if (umount_init() < 0)
>> +                return -1;
>> +
>> +        sync();
>
> This is redundant, since the kernel does that anyway when you call
> reboot(). We probably should remove all invocations of sync()
> everywhere.

I've added it during debug, I read the man page that reboot() did not
ensure sync, and in the process I've lost some of my data... but at
the end it seems that the reason was something else and this ended
dangling... but it does not hurt either ;-)

>> +        return reboot(cmd);
>> +}
>
> Here are a couple of things I'd still like to see added, but which are
> not necessary to get the basic version merged:
>
> - use the devicemapper libraries to remove all dm disks after each
> unmounting run that can be removed (in particular useful for crypto
> disks). Also remove all loopback devices (i.e. /dev/loop).

shouldn't this be done BEFORE going to this step? Like in the regular
systemd units? When we run this binary we should pretty much done,
we're just giving things a last chance, but in a fully systemd
compliant system we should have no more left processes when their
units are stopped.

if you worry about an unity not being unmounted due processes keeping
it busy, maybe we can also introduce a non-pid1 killall that SIGTERM
all processes and that should help dm.

and do we really need to remove the devices themselves? isn't
unmounting their filesystems enough? In my not-so-experienced point of
view the data is already on media and removing the loop/crypto/dm
devices will just free system memory, but we're exiting linux anyway
(it's pretty much like calling free() before ending a process).

> - we need code that disables all swaps, similar to the code that umounts
> everything. This should probably be integrated into the umount loop, to
> properly deal with swap files.
>
> - As a minor optimization we might want an umount run for all
> non-API tmp disks before disabling all swaps. Why? Because when removing
> the swaps everything that is swapped out on it will be swapped back into
> memory. If this is tmpfs data that is not necessary, hence let's unmount
> the tmpfs stuff first and then remove the swaps. (the fedora reboot
> script does this). For the API tmpfs file systems (i.e. /sys/fs/cgroup
> and /dev this should not be necessary since it does not store any big
> files).

so we just do a pre-loop to umount tmpfs, then swap, then the regular one?

> - having support for kexec would be cool. As last step simply invoke
> /sbin/kexec -e -x -f, and if that fails fall back to a normal reboot.

do we really have to exec /sbin/kexec or is it possible to do some syscall?

> The umount/unswap loop should probably look like this:
>
> do {
>        p = umount_all();
>        q = unswap_all();
>        r = undm_all();
>        s = unloop_all();
> } while (p > 0 || q > 0 || r > 0 || s > 0);
>
> where the various xxx_all() functions return the number of
> unmounts/unspaws/undms/unloops they executed.

seems simple.

-- 
Gustavo Sverzut Barbieri
http://profusion.mobi embedded systems
--------------------------------------
MSN: barbieri at gmail.com
Skype: gsbarbieri
Mobile: +55 (19) 9225-2202