[systemd-devel] [EXT] [PATCH] libblkid: fix spurious ext superblock checksum mismatches

Krister Johansen kjlx at templeofstupid.com
Tue Nov 19 23:59:53 UTC 2024


On Tue, Nov 19, 2024 at 09:49:57AM -0800, Theodore Ts'o wrote:
 
> Yes, this can happen if the file system is mounted.  The reason for
> this is that the kernel updates metadata blocks via the block buffer
> cache, with the jbd2 (journaled block layer v2) subsystem managing the
> atomic updates.  The jbd2 layer will block buffer cache writebacks
> until the changes are committed in a jbd2 transaction.  So the version
> on disk is guaranteed to be consistent.
> 
> However, a buffer cache read does not have any consistency guarantees,
> and if the file system is being actively modified, it is possible that
> you could read a superblock where the checksum hasn't yet been updated.
> 
> The O_DIRECT read isn't a magic bullet.  For example, if you have a
> scratch file system which is guaranteed not to survive a Kubernetes or
> Borg container getting aborted, you might decide to format the file
> system without a jbd2 journal, since that would be more efficient, and
> by definition you don't care about the contents of the file system
> after a crash.  So there are millions of ext4 file systems in
> hyperscale computing environments that are created without a journal;
> and in that case, O_DIRECT will not be sufficient for guaranteeing a
> consistent read of the superblock.

Thanks for the additional detail on jbd2's involvement.  When I
originally encountered this, it was on a 5.15 kernel where
ext4_commit_super() was still using mark_buffer_dirty() prior to
submitting the IO for the superblock write. I had managed to convince
myself that ext4_commit_super() holding the BH_lock combined with
O_DIRECT waiting for the dirty buffers associated with the superblock to
get written was sufficient to get a consistent read of the superblock.
I missed that this was changed as part of another bugfix[1].
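
For anyone following along who hasn't looked at the on-disk format, here
is a rough sketch of what "the checksum doesn't match" means for a reader
like libblkid, plus a simple re-read loop.  This is illustrative Python,
not the code from the patch.  The offsets (0x38 for s_magic, 0x64 for
s_feature_ro_compat, 0x3FC for s_checksum) and the use of CRC32C seeded
with ~0 and no final inversion are my understanding of the ext4 layout;
the helper names (crc32c_raw, superblock_csum_ok,
read_consistent_superblock) and the read_sb callable are made up for the
example.

```python
import struct

EXT4_SUPER_MAGIC = 0xEF53
RO_COMPAT_METADATA_CSUM = 0x0400
CSUM_OFFSET = 0x3FC  # s_checksum: last 4 bytes of the 1024-byte superblock


def _make_crc32c_table():
    poly = 0x82F63B78  # reflected Castagnoli polynomial
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_crc32c_table()


def crc32c_raw(data, crc=0xFFFFFFFF):
    """CRC32C register update, seeded with ~0, with no final inversion."""
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc


def superblock_csum_ok(sb):
    """Return True if a 1024-byte ext superblock image passes its checksum."""
    if len(sb) != 1024:
        raise ValueError("expected a 1024-byte superblock image")
    (magic,) = struct.unpack_from("<H", sb, 0x38)
    if magic != EXT4_SUPER_MAGIC:
        return False
    (ro_compat,) = struct.unpack_from("<I", sb, 0x64)
    if not ro_compat & RO_COMPAT_METADATA_CSUM:
        return True  # feature not enabled, so there is nothing to verify
    (stored,) = struct.unpack_from("<I", sb, CSUM_OFFSET)
    return crc32c_raw(sb[:CSUM_OFFSET]) == stored


def read_consistent_superblock(read_sb, attempts=3):
    """Call read_sb() until it returns an image whose checksum verifies.

    read_sb is a hypothetical callable returning 1024 bytes; how it reads
    (buffered, O_DIRECT, ...) is deliberately left to the caller.
    """
    for _ in range(attempts):
        sb = read_sb()
        if superblock_csum_ok(sb):
            return sb
    return None
```

A re-read loop like this only narrows the window.  As noted above, it
cannot by itself guarantee a consistent read on a mounted file system,
particularly one created without a journal.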

The version of this fix that you applied for resize2fs has resulted in
no recurrence of the problem in the environments where we had
previously been encountering it.

With libblkid, the same problem has resulted in systemd-udevd removing
the /dev/disk/by-label and /dev/disk/by-uuid links for devices when the
superblock checksum fails to verify.  This in turn has resulted in /boot
failing to mount (when it's on a separate filesystem), update-grub calls
failing because /boot isn't mounted, and, recently, a mkinitramfs
failure because the /dev/disk/by-uuid links were missing for the root
device.
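
If anyone wants to check whether they are hitting the same symptom, a
small watcher along these lines might help.  It is only a sketch: the
function name missing_links and the idea of passing in a list of
expected UUIDs (for example, the ones referenced from /etc/fstab) are
assumptions for illustration, not something our tooling or systemd
provides.

```python
import os


def missing_links(expected, link_dir="/dev/disk/by-uuid"):
    """Return the names in `expected` that are not symlinks under link_dir."""
    return [
        name
        for name in expected
        if not os.path.islink(os.path.join(link_dir, name))
    ]
```

Running it periodically with the UUIDs a host is supposed to have and
logging any non-empty result should show whether links are disappearing
while the file systems are mounted and busy.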

The patch I sent has resolved the problems in our production
environments, and was also run through a battery of synthetic boot
tests.  We've seen no recurrence with it applied.  I've also run the
change against the util-linux unit tests and observed no regressions.

I included systemd-devel on this in case other users were observing
disappearing /dev/disk/ links.  I hoped I might save somebody else from
having to debug this a second time.

-K


[1] https://lore.kernel.org/all/20220520023216.3065073-1-yi.zhang@huawei.com/


