[systemd-devel] writing a systemd unit for nbd devices

Fri Mar 21 06:18:01 PDT 2014

On Fri, 21.03.14 13:05, Wouter Verhelst (wouter at debian.org) wrote:

Heya,

> The client side of the NBD protocol is implemented partially in
> user space, and partially in kernel space. The user space part handles
> connecting and the initial protocol negotiation; but once that has been
> done, nbd-client calls the NBD_DO_IT ioctl() on an open /dev/nbdX file,
> which hands the socket file descriptor to the kernel and which does not
> return until the device is disconnected (with "nbd-client -d", or
> because the link to the server died). As such, the nbd-client process
> needs to continue running while the device is connected.

The client should probably live in a service file nbd-client@.service,
which would be instantiated once for each configured nbd device.

> In addition, nbd-client needs to fork() and open() the /dev/nbdX device
> to support partitioned NBD devices (due to a deadlock issue, that can't
> be done from the initial NBD_DO_IT ioctl handling, so it is done in the
> first open() instead).

I don't follow here?

> For supporting root-on-NBD in conjunction with systemd, I've already
> added a -systemd-mark option to nbd-client so it will make argv[0][0]
> read as '@' (I think that method is slightly ugly, but that's a
> discussion for another time). In Debian, I've already supported
> root-on-NBD for quite a while with an initramfs script and some code in
> the init script of nbd-client which adds the PID for the root NBD device
> to the list of PIDs that shouldn't be killed; I understand that dracut
> (and hence Fedora as well) have similar support (though I'm not sure how
> well it all works).

Sounds good!

> Currently, in Debian, the situation is that there is a configuration
> file, /etc/nbd-client, which is sourced in the init script, and which
> contains bash arrays with configuration. The init script then loops over
> those bash arrays and runs the appropriate nbd-client command to connect
> the device. Any actual mounting (etc) of the device, then, is left to
> other init scripts. It expects that filesystems on NBD devices have the
> "_netdev" option listed in their fstab entries, so that they will be
> mounted by the "mountnfs" rather than the "mountall" init script.

This sounds as if you want to convert this configuration file into
systemd units from a "generator" dynamically, so that it becomes nicely
integrated into the systemd dependency tree. That's how we handle
/etc/fstab and /etc/crypttab for example, where fstab-generator and
cryptsetup-generator create individual *.mount and cryptsetup@*.service
instances from the data in those files.

See the cryptsetup-generator and fstab-generator sources for inspiration. Also:

http://www.freedesktop.org/wiki/Software/systemd/Generators/
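To make the generator idea a bit more concrete, here is a rough sketch
in Python. Everything specific in it is an assumption for illustration
only: the /etc/nbdtab file and its three-column format (device, host,
export name) are hypothetical, the unit contents are a guess at a
reasonable minimum, and the nbd-client command line should be checked
against nbd-client(8) for the installed version. It only shows the
general shape: parse the table, then write one nbd-client@<device>.service
file per entry into the directory systemd passes to the generator.

```python
#!/usr/bin/env python3
"""Sketch of a systemd generator for a hypothetical /etc/nbdtab."""
import os
import sys

NBDTAB = "/etc/nbdtab"  # hypothetical file; one "<device> <host> <export>" per line

UNIT_TEMPLATE = """\
[Unit]
Description=NBD client for /dev/{dev}
Wants=network-online.target
After=network-online.target

[Service]
# Assumed nbd-client invocation; verify the flags against nbd-client(8).
ExecStart=/sbin/nbd-client {host} /dev/{dev} -N {export} -nofork
"""


def parse_nbdtab(text):
    """Return (device, host, export) tuples, skipping blanks and comments."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        dev, host, export = line.split()
        entries.append((dev, host, export))
    return entries


def write_units(entries, unit_dir):
    """Write one nbd-client@<device>.service per entry; return the paths."""
    paths = []
    for dev, host, export in entries:
        path = os.path.join(unit_dir, "nbd-client@{}.service".format(dev))
        with open(path, "w") as f:
            f.write(UNIT_TEMPLATE.format(dev=dev, host=host, export=export))
        paths.append(path)
    return paths


def main(argv):
    # Generators are invoked with three output directories; the first one
    # is the normal-priority directory. Do nothing if called differently
    # or if the (hypothetical) table does not exist.
    if len(argv) != 4 or not os.path.exists(NBDTAB):
        return
    with open(NBDTAB) as f:
        write_units(parse_nbdtab(f.read()), argv[1])


if __name__ == "__main__":
    main(sys.argv)
```

A real generator would also have to hook the generated units into the
boot transaction (for example via a .wants/ symlink), which is left out
here.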

> - I will need to create dev-nbd@.device unit files. These unit files would
>   connect the device when needed.

.device units are mostly configured in udev rules. Also, there are no
instantiated device units.

> - It may be a good idea to move the configuration from a sourced shell
>   script snippet to "something else". I do want to retain some backwards
>   compatibility, but it's okay if that's just a program interpreting the
>   shell script snippet and outputting something more modern.

Yeah, it sounds much better to maybe have /etc/nbdtab or so, which takes
inspiration from /etc/fstab and /etc/crypttab, and then is converted
dynamically into systemd units with a generator (as suggested above)?

> NBD device nodes are a bit special in that due to the way NBD devices
> are connected, the device must exist at all times, even before it is
> connected; I suspect (though have not actually tried) that systemd will
> only try to "start" a .device unit file if the device node itself is not
> there yet. For NBD, the difference between a connected device and a
> not-connected one can be spotted in the apparent size of the block
> device (the BLKGETSIZE64 ioctl will return 0 for a not-connected device)
> and in the presence (or lack thereof) of a file /sys/block/nbdX/pid (if
> it exists, it contains the PID of the nbd-client process handling the
> connection; if it does not, the device is not connected), not by the
> presence (or lack thereof) of the device node itself.

OK, this looks like it is similar to the situation with loop devices,
which also exist unattached first, and only after some setup become
attached to an actual file. 

systemd only exposes udev devices as .device units if they have the
"systemd" tag in udev (all block devices actually get this tag set, so
there's nothing to do here), and if they have the SYSTEMD_READY=0
property not set (or in other words, have either SYSTEMD_READY unset, or
SYSTEMD_READY=1).
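Spelled out as a tiny Python predicate, the condition reads as follows.
This is only a model of the rule as described in this mail, not systemd's
actual code; the function name and the way tags and properties are passed
in are made up for the illustration.

```python
def exposed_as_device_unit(tags, properties):
    """Model of the rule above: the udev device carries the "systemd" tag
    and does not have SYSTEMD_READY set to "0"."""
    return "systemd" in tags and properties.get("SYSTEMD_READY") != "0"
```

So a block device with SYSTEMD_READY unset or set to "1" shows up as a
.device unit, while the same device with SYSTEMD_READY=0 does not.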

Now, currently 99-systemd.rules actually contains this rule:

    # Ignore nbd devices in the "add" event, with "change" the nbd is ready
    ACTION=="add", SUBSYSTEM=="block", KERNEL=="nbd*", ENV{SYSTEMD_READY}="0"

This sets SYSTEMD_READY=0 for the initial "add" event, and then expects
a second udev "change" event as soon as the device has been
configured. The loop device uses this rule instead:

    # Ignore loop devices that don't have any file attached
    SUBSYSTEM=="block", KERNEL=="loop[0-9]*", TEST!="loop/backing_file", ENV{SYSTEMD_READY}="0"

Given your hint with the "pid" sysfs attribute, I think we should change
the nbd rule to be more like the loop rule, and actually bind
SYSTEMD_READY to the state in sysfs, rather than to the type of the
last event received.
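To illustrate what "binding to the state in sysfs" means, here is a small
Python sketch of the check, based on the /sys/block/nbdX/pid file
described in the quoted mail. The function names and the sysfs_root
parameter (there only so the logic can be tried against a scratch
directory) are assumptions of the sketch; the real thing would be a udev
rule testing for the attribute, analogous to the loop rule.

```python
import os


def nbd_client_pid(name, sysfs_root="/sys"):
    """Return the PID of the nbd-client serving the device, or None if the
    pid attribute is absent (i.e. the device is not connected)."""
    path = os.path.join(sysfs_root, "block", name, "pid")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return None


def systemd_ready(name, sysfs_root="/sys"):
    """Derive SYSTEMD_READY from the current sysfs state, not from the
    type of the udev event that happened to arrive last."""
    return "1" if nbd_client_pid(name, sysfs_root) is not None else "0"
```

The point of the design is that the answer is the same no matter which
events were seen or missed, since it is recomputed from sysfs each time.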

Now, if SYSTEMD_READY= is done properly (which it should already be,
though this could be improved, as described above), then this should be
enough to make sure that the nbd devices are mounted/waited-for like any
other block device, at boot. fstab-generator already honours "_netdev",
and thus should order such mounts before remote-fs.target rather than
local-fs.target...
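As a simplified model of that ordering decision (not fstab-generator's
actual code, which takes more into account than this one option), in
Python:

```python
def ordering_target(mount_options):
    """Pick the target a mount is ordered before, looking only at the
    "_netdev" option of a comma-separated fstab options field."""
    options = mount_options.split(",")
    return "remote-fs.target" if "_netdev" in options else "local-fs.target"
```

So an fstab entry for a file system on /dev/nbd0 with options such as
"defaults,_netdev" would end up ordered before remote-fs.target.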

> This is not the case for partitions of NBD devices, however; these will
> only show up after the first open(), as explained above. As such, I
> might need two templates: one which connects the NBD device (for a
> /dev/nbdX device), and one for the partition (/dev/nbdXpY) which simply
> depends on the regular NBD device. However, if I understand correctly,
> it would not seem to be possible to create an nbdX template that does
> not also match nbdXpY.

Hmm, I don't follow here... To make sure the partitions work correctly we
just need to make sure that SYSTEMD_READY=0 stays on the device nodes as
long as they don't work yet. Later, when they are set up properly and
SYSTEMD_READY=0 is dropped, systemd will pick them up just fine.

Regarding SYSTEMD_READY=0 also see systemd.device(5).

Lennart

-- 
Lennart Poettering, Red Hat