<html> <head> <base href="https://bugs.freedesktop.org/" /> </head> <body> <div> <a class="bz_bug_link bz_status_NEW " title="NEW --- - udevd flock() failure on disk partitions due to fsck holding WRITE FLOCK" href="https://bugs.freedesktop.org/show_bug.cgi?id=79576#c2">Comment # 2</a> on <a class="bz_bug_link bz_status_NEW " title="NEW --- - udevd flock() failure on disk partitions due to fsck holding WRITE FLOCK" href="https://bugs.freedesktop.org/show_bug.cgi?id=79576">bug 79576</a> from <a class="email" href="mailto:jpsinthemix@verizon.net" title="jpsinthemix@verizon.net">jpsinthemix@verizon.net</a> <pre>(In reply to <a href="show_bug.cgi?id=79576#c1">comment #1</a>)
> (In reply to <a href="show_bug.cgi?id=79576#c0">comment #0</a>)
> > udevd skips all /dev/sda{.X} processing (and I speculate that fsck is then
> > unable to complete its work as well). Looks like a file lock deadlock
> > between fsck and udevd ?
>
> Udev's locking scheme is maybe incompatible with the fsck -l feature
> that systemd-fsck requests. (systemd-)fsck should stay away from
> write-locking the entire disk device it has no business with; we need to
> sort that out.
>
> > In passing also note that it appears that the file close of fd_lock is not
> > done if the 'goto skip' path is taken in the udevd.c code above.
>
> Fixed.
>
> Thanks,
> Kay

Unfortunately, the patch

From e918a1b5a94f270186dca59156354acd2a596494 Mon Sep 17 00:00:00 2001
From: Kay Sievers <<a href="mailto:kay@vrfy.org">kay@vrfy.org</a>>
Date: Tue, 03 Jun 2014 14:49:38 +0000
Subject: udev: exclude device-mapper from block device ownership event locking

will have no effect on systems which do not use LVM, which includes mine.

Some thoughts/more info:

I know that simple/default units YYY, with After=XXX, start after XXX has
started. Does this hold if XXX is a oneshot as well? I thought not, i.e., I
thought that when XXX is a oneshot, YYY would start only after XXX
completes/exits, but now I'm not sure.
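
For reference, the lock conflict discussed above can be reproduced outside
udev. This is only a minimal sketch, assuming Linux flock(2) semantics: a
temporary file stands in for /dev/sda, one descriptor plays the holder of an
exclusive lock (the kind of whole-disk lock fsck -l is reported to take), and
a second, independently opened descriptor plays the udevd worker. The helper
name flock_conflict_demo is made up for illustration and is not part of udev
or fsck.

```c
#define _DEFAULT_SOURCE 1   /* expose flock() and mkstemp() under strict -std modes */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper (not udev or fsck code).
 *
 * Stores in *while_held the errno of a non-blocking shared flock() attempted
 * while another open file description holds LOCK_EX (0 if it succeeded), and
 * in *after_release the same after the exclusive lock has been dropped.
 * Returns 0 if both attempts were made, -1 on setup failure.
 */
int flock_conflict_demo(int *while_held, int *after_release)
{
    char path[] = "/tmp/flock-demo-XXXXXX";   /* stand-in for /dev/sda */

    int holder = mkstemp(path);               /* plays the exclusive-lock holder */
    if (holder < 0)
        return -1;

    /* A separate open() gives a separate open file description, so its
     * flock() state is independent of 'holder', as with two processes. */
    int contender = open(path, O_RDONLY);     /* plays the udevd worker */
    unlink(path);
    if (contender < 0) {
        close(holder);
        return -1;
    }

    int rc = -1;
    if (flock(holder, LOCK_EX) == 0) {
        /* Same flags as the worker_run() call: shared, non-blocking. */
        *while_held = flock(contender, LOCK_SH | LOCK_NB) < 0 ? errno : 0;

        if (flock(holder, LOCK_UN) == 0) {
            *after_release = flock(contender, LOCK_SH | LOCK_NB) < 0 ? errno : 0;
            rc = 0;
        }
    }

    close(contender);
    close(holder);
    return rc;
}
```

Called as flock_conflict_demo(&held, &released), I would expect held to come
back as EWOULDBLOCK and released as 0, i.e. the shared request fails only
while the exclusive lock is in place.
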
From additional debugging: on boots where the fscks fail, it looks like
systemd-fsck-root.service starts, invoking fsck -l, which locks the whole
disk, and then systemd-fsck@.service is activated for the various non-root
partitions before systemd-fsck-root.service has completed. On boots where all
fscks succeed, systemd-fsck-root.service appears to have completed before the
various systemd-fsck@.service instances are started.

This is nasty, but to further test this idea, I inserted a crude wait loop in
worker_run(), like so:

        int got_lock = 0;
        if (fd_lock >= 0) {
                int cc = 4;
                while (cc-- > 0 && (got_lock=flock(fd_lock, LOCK_SH|LOCK_NB)) < 0) {
                        sleep(1);
                        log_error("trying flock(%s; count: %d)",
                                  udev_device_get_devnode((struct udev_device *)dev), cc);
                }
        }
        if (fd_lock >= 0 && got_lock < 0) {
                log_debug("Unable to flock(%s), skipping event handling: %m",
                          udev_device_get_devnode((struct udev_device *)dev));
                err = -EWOULDBLOCK;
                fd_lock = safe_close(fd_lock);
                goto skip;
        }

and sure enough boots are fine every time, with roughly one boot in every two
or three entering the wait loop and counting down to 3 or 2 before the lock is
obtained. Note that the wait loop is entered only for
udev_device_get_devtype((struct udev_device *)dev) = "sda".

By the way, if in worker_run() you place the lock on the udev_device itself
(not always on the parent if it's a partition), then I have yet to see the
collision. That is, fsck has a lock on /dev/sda only, not on any of the
partitions, so the worker_run() flock always succeeds on partitions.

Unfortunately, debugging like this is frustrating because this is one of
those annoying Heisenbugs.

In udevd.c, when walking the event queue, are events ever retried? If not,
would it make sense to do so, e.g., for flock() -EWOULDBLOCK cases?</pre> </div> <hr> You are receiving this mail because: <ul> <li>You are the QA Contact for the bug.</li> <li>You are the assignee for the bug.</li> </ul> </body> </html>