[systemd-devel] timed out waiting for device dev-disk-by\x2duuid

Kay Sievers kay at vrfy.org
Thu May 15 14:20:40 PDT 2014


On Thu, May 15, 2014 at 10:57 PM, Chris Murphy <lists at colorremedies.com> wrote:
> On May 15, 2014, at 12:16 PM, Lennart Poettering <lennart at poettering.net> wrote:
>> On Thu, 15.05.14 19:29, Lennart Poettering (lennart at poettering.net) wrote:
>>> On Mon, 12.05.14 20:48, Chris Murphy (lists at colorremedies.com) wrote:
>>>
>>>> A two-device Btrfs volume with one device missing (simulated) will not
>>>> boot, even with rootflags=degraded set, which is currently required to
>>>> enable degraded Btrfs mounts. Upon reaching a dracut shell after
>>>> basic.target fails with a timeout, I can mount -o subvol=root,degraded,
>>>> then exit and continue booting normally with just the single device.
>>>>
>>>> The problem seems to be that systemd (udev?) is not finding the volume
>>>> by uuid for some reason, and therefore not attempting to mount it. But
>>>> I don't know why it can't find it, or even how the find by uuid
>>>> mechanism works this early in boot. So I'm not sure if this is a
>>>> systemd or udev bug, a dracut bug, or a kernel bug.
>>>>
>>>> The problem happens with systemd 208-9.fc20 with kernel
>>>> 3.11.10-301.fc20, and systemd 212-4.fc21 and kernel
>>>> 3.15.0-0.rc5.git0.1.fc21.
>>>
>>> As soon as btrfs reports that a file system is ready, systemd will pick
>>> it up. This is handled with the "btrfs" udev built-in, and invoked via
>>> /usr/lib/udev/rules.d/64-btrfs.rules. rootflags has no influence on
>>> that, as at that point it is not clear whether the block device will be
>>> the one that carries the root file system or some other file system.
>>>
>>> Not sure what we should be doing about this. Maybe introduce a new
>>> btrfs=degraded switch that acts globally, and influences the udev built-in?
>>>
>>> Kay?
>>
>> So, as it turns out there's no kernel API available to check whether a
>> btrfs raid array is now complete enough for degraded mode to
>> succeed. There's only a way to check whether it is fully complete.
>>
>> And even if we had an API for this, how would this even work at all? I
>> mean, just having a catch-all switch to boot in degraded mode is really
>> dangerous: if people have more than one array, we might end up
>> mounting an fs in degraded mode that would actually be fully available
>> if we just waited 50 ms longer...
>>
>> I mean, this is a problem even with just one array: if you have
>> redundancy of 3 disks, when do you start mounting the thing when
>> degraded mode is requested on the kernel command line? As soon as
>> degraded mounting is possible (thus possibly fucking up all 3 disks
>> that happened to show up last), or later?
>>
>> I have no idea how all of this should really work; it's a giant mess. There
>> probably needs to be some btrfs userspace daemon thing that watches
>> btrfs arrays and does some magic if they time out.
>>
>> But for now I am pretty sure we should just leave everything in fully
>> manual mode, that's the safest thing to do…
>
> Is it that the existing udev rule either doesn't know, or doesn't have a way of knowing, that rootflags=degraded should enable only the root=UUID device to bypass the "ready" rule?
>
> Does udev expect a different readiness state before attempting a mount than a manual mount from the dracut shell does? I'm confused why the Btrfs volume is "not ready" for systemd, which then doesn't even attempt to mount it, and yet at the dracut shell it is ready when I do the mount manually. That seems like two readiness states.
>

The btrfs kernel has only one readiness state, and that is what udev reacts to.
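
For reference, the rule Lennart mentioned, /usr/lib/udev/rules.d/64-btrfs.rules, looks approximately like this in the systemd versions in question (paraphrased; check the installed copy for the exact text):

```
SUBSYSTEM!="block", GOTO="btrfs_end"
ACTION=="remove", GOTO="btrfs_end"
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

# let the kernel know about this btrfs filesystem, and check if it is complete
IMPORT{builtin}="btrfs ready $devnode"

# mark the device as not ready to be used by the system
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

LABEL="btrfs_end"
```

The built-in asks the kernel only the all-members-present question; there is no "degraded would work" answer it could get.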

> I'd say it's not udev's responsibility, but rather the Btrfs kernel code's, to make sure things don't get worse for the file system, regardless of what devices it's presented with. At mount time it has its own logic, for both normal and degraded mounts, to check whether the minimum number of devices is present; if not, the mount fails. The degraded mount option is also per volume, not global.
>
> For example if I remove a device, and boot degraded and work for a few hours making lots of changes (even doing a system update, which is probably insane to do), I can later reboot with the "stale" device attached and Btrfs figures it out, passively. That means it figures out if there's a newer copy when a file is read, and forwards the newest copy to user space, and "fixes" the stale copy on the previously missing device. A manual balance ensures all new files also have redundancy. I think it's intended eventually to have a smarter balance "catch up" filter that can also run automatically in such a case. In any case the file system isn't trashed.
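
(The manual catch-up described above would typically be done with the btrfs user-space tools. The mount point here is illustrative, and both commands need a real, mounted Btrfs filesystem:)

```shell
# After reattaching the previously missing device and mounting the volume
# (/mnt is a placeholder for the actual mount point):
btrfs scrub start -B /mnt    # rewrite stale copies from the up-to-date device
btrfs balance start /mnt     # restore redundancy for chunks written while degraded
```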

The problem is deciding when to actively force things into degraded mode
because devices have not shown up in time. That is nothing the kernel can
know; it would have to be userspace making that decision. But udev does
not really have that information at its level; it would have to keep
retrying until the kernel is satisfied mounting the volume degraded.
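
As a rough sketch of what such userspace retry logic could look like (the helper name and timeout are illustrative, not an existing tool; "btrfs device ready" is the kernel's all-members-present check, and the probe command is passed in so the policy is separable):

```shell
# Sketch only: poll a readiness probe once per second until it succeeds
# or a timeout expires; the caller decides what to do on timeout.
wait_for_ready() {
    probe=$1 timeout=$2
    end=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -lt "$end" ]; do
        if $probe; then
            return 0    # array complete, a normal mount should work
        fi
        sleep 1
    done
    return 1            # timed out waiting for all member devices
}

# Intended use (needs a real btrfs member device, shown for context):
#   if wait_for_ready "btrfs device ready /dev/sdb" 30; then
#       mount /dev/sdb /mnt
#   else
#       mount -o degraded /dev/sdb /mnt   # last resort after the timeout
#   fi
```

The policy question Lennart raises remains: any fixed timeout risks degrading an array that would have been complete moments later.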

All of this is probably not a job for udev or systemd, but for a
specialized storage daemon which has explicit configuration/policy for
how far it may mess around with the user's data.

This is not an area where we should try to be smart; falling back to
manual intervention on udev's side sounds like the right approach,
given the tools we (don't) have at hand at the moment.

Kay

