[systemd-devel] [EXT] [PATCH] libblkid: fix spurious ext superblock checksum mismatches
Krister Johansen
kjlx at templeofstupid.com
Tue Nov 19 23:59:53 UTC 2024
On Tue, Nov 19, 2024 at 09:49:57AM -0800, Theodore Ts'o wrote:
> Yes, this can happen if the file system is mounted. The reason for
> this is that the kernel updates metadata blocks via the block buffer
> cache, with the jbd2 (journaled block layer v2) subsystem managing the
> atomic updates. The jbd2 layer will block buffer cache writebacks
> until the changes are committed in a jbd2 transaction. So the version
> on disk is guaranteed to be consistent.
>
> However, a buffer cache read does not have any consistency guarantees,
> and if the file system is being actively modified, it is possible that
> you could a superblock where the checksum hasn't yet been updated.
>
> The O_DIRECT read isn't a magic bullet. For example, if you have a
> scratch file system which is guaranteed not to survive a Kubernetes or
> Borg container getting aborted, you might decide to format the file
> system without a jbd2 journal, since that would be more efficient, and
> by definition you don't care about the contents of the file system
> after a crash. So there are millions of ext4 file systems in
> hyperscale computing environments that are created without a journal;
> and in that case, O_DIRECT will not be sufficient for guaranteeing a
> consistent read of the superblock.
Thanks for the additional detail on jbd2's involvement. When I
originally encountered this, it was on a 5.15 kernel where
ext4_commit_super() was still using mark_buffer_dirty() prior to
submitting the IO for the superblock write. I had managed to convince
myself that ext4_commit_super() holding the BH_lock combined with
O_DIRECT waiting for the dirty buffers associated with the superblock to
get written was sufficient to get a consistent read of the superblock.
I missed that this was changed as part of another bugfix[1].
The version of this fix that you applied for resize2fs has resulted in
no re-occurence of the problem in the environments where we had been
previously encountering the problem.
With libblkid, it's resulted in systemd-udevd removing
/dev/disk/by-label and /dev/disk/by-uuid links for devices when the
superblock checksum can't be read. This in turn has resulted in /boot
failing to mount (when it's on a separate filesystem), update-grub calls
failing because /boot isn't mounted, and we recently had a mkinitramfs
fail because the /dev/disk/by-uuid links were missing for the root
device.
The patch I sent has resolved the problems in our production
environments, and was also run through a battery of synthetic boot
tests. We've seen no re-occurence with it applied. I've also run the
change against the util-linux unit tests and observed no regressions.
I included systemd-devel on this in case other users were observing
disappearing /dev/disk/ links. I hoped I might save somebody else from
having to debug this a second time.
-K
[1] https://lore.kernel.org/all/20220520023216.3065073-1-yi.zhang@huawei.com/
More information about the systemd-devel
mailing list