[systemd-devel] Errorneous detection of degraded array

Luke Pyzowski Luke at sunrisefutures.com
Fri Jan 27 19:44:22 UTC 2017


I've modified a number of settings to try to resolve this, so far no success.
I've created an explicit mount file for the RAID array: /etc/systemd/system/share.mount
Inside there I've experimented with TimeoutSec=

In /etc/systemd/system/mdadm-last-resort at .timer I've worked with
OnActiveSec=

I've also tried (without an explicit mount file) to add x-systemd.device-timeout to /etc/fstab for the mount. 

Here's a few more system logs showing perhaps more detail. I've edited them to show only relevant details, full pastebin of logs: http://pastebin.com/sL8nKt7j
These logs were generated with TimeoutSec=120 in /etc/systemd/system/share.mount the description of the mount in the logs is: "Mount /share RAID partition explicitly"
And OnActiveSec=30 in /etc/systemd/system/mdadm-last-resort at .timer
From blkid:
/dev/md0: UUID="2b9114be-3d5a-41d7-8d4b-e5047d223129" TYPE="ext4"
/dev/md0 is the /share partition.

From /etc/mdadm.conf:
ARRAY /dev/md/0  metadata=1.2 UUID=97566d2f:ae7a169b:966f5840:3e8267f9 name=lnxnfs01:0

Boot begins at Jan 27 11:33:10
+4 seconds from boot:
Jan 27 11:33:14 lnxnfs01 systemd[1]: Found device /dev/disk/by-uuid/283669e9-f32c-498d-b848-c6f91738c959.
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdc operational as raid disk 2
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdx operational as raid disk 23
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdu operational as raid disk 20
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdt operational as raid disk 19
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdo operational as raid disk 14
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdn operational as raid disk 13
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdd operational as raid disk 3
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdv operational as raid disk 21
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sda operational as raid disk 0
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdf operational as raid disk 5
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdm operational as raid disk 12
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sde operational as raid disk 4
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdp operational as raid disk 15
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdi operational as raid disk 8
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdl operational as raid disk 11
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdk operational as raid disk 10
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sds operational as raid disk 18
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdb operational as raid disk 1
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdj operational as raid disk 9
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdg operational as raid disk 6
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdr operational as raid disk 17
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdh operational as raid disk 7
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdq operational as raid disk 16
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdw operational as raid disk 22
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: allocated 25534kB
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: raid level 6 active with 24 out of 24 devices, algorithm 2
Jan 27 11:33:14 lnxnfs01 kernel: RAID conf printout:
Jan 27 11:33:14 lnxnfs01 kernel:  --- level:6 rd:24 wd:24
Jan 27 11:33:14 lnxnfs01 kernel:  disk 0, o:1, dev:sda
Jan 27 11:33:14 lnxnfs01 kernel:  disk 1, o:1, dev:sdb
Jan 27 11:33:14 lnxnfs01 kernel:  disk 2, o:1, dev:sdc
Jan 27 11:33:14 lnxnfs01 kernel:  disk 3, o:1, dev:sdd
Jan 27 11:33:14 lnxnfs01 kernel:  disk 4, o:1, dev:sde
Jan 27 11:33:14 lnxnfs01 kernel:  disk 5, o:1, dev:sdf
Jan 27 11:33:14 lnxnfs01 kernel:  disk 6, o:1, dev:sdg
Jan 27 11:33:14 lnxnfs01 kernel:  disk 7, o:1, dev:sdh
Jan 27 11:33:14 lnxnfs01 kernel:  disk 8, o:1, dev:sdi
Jan 27 11:33:14 lnxnfs01 kernel:  disk 9, o:1, dev:sdj
Jan 27 11:33:14 lnxnfs01 kernel:  disk 10, o:1, dev:sdk
Jan 27 11:33:14 lnxnfs01 kernel:  disk 11, o:1, dev:sdl
Jan 27 11:33:14 lnxnfs01 kernel:  disk 12, o:1, dev:sdm
Jan 27 11:33:14 lnxnfs01 kernel:  disk 13, o:1, dev:sdn
Jan 27 11:33:14 lnxnfs01 kernel:  disk 14, o:1, dev:sdo
Jan 27 11:33:14 lnxnfs01 kernel:  disk 15, o:1, dev:sdp
Jan 27 11:33:14 lnxnfs01 kernel:  disk 16, o:1, dev:sdq
Jan 27 11:33:14 lnxnfs01 kernel:  disk 17, o:1, dev:sdr
Jan 27 11:33:14 lnxnfs01 kernel:  disk 18, o:1, dev:sds
Jan 27 11:33:14 lnxnfs01 kernel:  disk 19, o:1, dev:sdt
Jan 27 11:33:14 lnxnfs01 kernel:  disk 20, o:1, dev:sdu
Jan 27 11:33:14 lnxnfs01 kernel:  disk 21, o:1, dev:sdv
Jan 27 11:33:14 lnxnfs01 kernel:  disk 22, o:1, dev:sdw
Jan 27 11:33:14 lnxnfs01 kernel:  disk 23, o:1, dev:sdx
Jan 27 11:33:14 lnxnfs01 kernel: md0: detected capacity change from 0 to 45062020923392
Jan 27 11:33:14 lnxnfs01 systemd[1]: Found device /dev/disk/by-uuid/2b9114be-3d5a-41d7-8d4b-e5047d223129.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Started udev Wait for Complete Device Initialization.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Started Timer to wait for more drives before activating degraded array..
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting Timer to wait for more drives before activating degraded array..
Jan 27 11:33:14 lnxnfs01 systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting system-lvm2\x2dpvscan.slice.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting LVM2 PV scan on device 9:127...
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting Activation of DM RAID sets...
Jan 27 11:33:14 lnxnfs01 systemd[1]: Mounting Mount /share RAID partition explicitly...
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting File System Check on /dev/disk/by-uuid/283669e9-f32c-498d-b848-c6f91738c959...
Jan 27 11:33:14 lnxnfs01 systemd[1]: Activating swap /dev/mapper/vg_root-swap...
Jan 27 11:33:14 lnxnfs01 lvm[1494]: 2 logical volume(s) in volume group "vg_root" now active
Jan 27 11:33:14 lnxnfs01 systemd[1]: Started LVM2 PV scan on device 9:127.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Activated swap /dev/mapper/vg_root-swap.
Jan 27 11:33:14 lnxnfs01 kernel: Adding 32763900k swap on /dev/mapper/vg_root-swap.  Priority:-1 extents:1 across:32763900k FS
Jan 27 11:33:14 lnxnfs01 systemd-fsck[1499]: /dev/md126: clean, 345/64128 files, 47930/256240 blocks
Jan 27 11:33:14 lnxnfs01 systemd[1]: Reached target Swap.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting Swap.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Started File System Check on /dev/disk/by-uuid/283669e9-f32c-498d-b848-c6f91738c959.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Mounting /boot...
Jan 27 11:33:14 lnxnfs01 systemd[1]: Mounted /boot.
Jan 27 11:33:14 lnxnfs01 kernel: EXT4-fs (md126): mounted filesystem with ordered data mode. Opts: (null)
Jan 27 11:33:21 lnxnfs01 systemd[1]: Started Activation of DM RAID sets.
Jan 27 11:33:21 lnxnfs01 systemd[1]: Reached target Encrypted Volumes.
Jan 27 11:33:21 lnxnfs01 systemd[1]: Starting Encrypted Volumes.
Jan 27 11:33:21 lnxnfs01 systemd[1]: Reached target Local File Systems.
Jan 27 11:33:21 lnxnfs01 systemd[1]: Starting Local File Systems.
...
Jan 27 11:33:21 lnxnfs01 kernel: EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
Jan 27 11:33:21 lnxnfs01 systemd[1]: Mounted Mount /share RAID partition explicitly.

... + 31 seconds from disk initialization, expiration of 30 second timer from mdadm-last-resort at .timer

Jan 27 11:33:45 lnxnfs01 systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Starting system-mdadm\x2dlast\x2dresort.slice.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Stopped target Local File Systems.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Stopping Local File Systems.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Unmounting Mount /share RAID partition explicitly...
Jan 27 11:33:45 lnxnfs01 systemd[1]: Starting Activate md array even though degraded...
Jan 27 11:33:45 lnxnfs01 systemd[1]: Stopped (with error) /dev/md0.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Started Activate md array even though degraded.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Unmounted Mount /share RAID partition explicitly.

... + 121 seconds from disk initialization, expiration of 120 second timer from TimeoutSec=120 in /etc/systemd/system/share.mount

Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-md-0.device/stop timed out.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Timed out stoppping /dev/md/0.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-md-0.device/stop failed with result 'timeout'.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-disk-by\x2did-md\x2dname\x2dlnxnfs01:0.device/stop timed out.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Timed out stoppping /dev/disk/by-id/md-name-lnxnfs01:0.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-disk-by\x2did-md\x2dname\x2dlnxnfs01:0.device/stop failed with result 'timeout'.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job sys-devices-virtual-block-md0.device/stop timed out.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Timed out stoppping /sys/devices/virtual/block/md0.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job sys-devices-virtual-block-md0.device/stop failed with result 'timeout'.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-disk-by\x2did-md\x2duuid\x2d97566d2f:ae7a169b:966f5840:3e8267f9.device/stop timed out.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Timed out stoppping /dev/disk/by-id/md-uuid-97566d2f:ae7a169b:966f5840:3e8267f9.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-disk-by\x2did-md\x2duuid\x2d97566d2f:ae7a169b:966f5840:3e8267f9.device/stop failed with result 'timeout'.


In the logs above, the timer is started to wait for devices, it does not stop immediately because it sees a degraded array. It continues for 30 seconds and then kicks out my mount believing the raid is problematic. 

As an experiment, I've increased the values for the mdadm-last-resort at .timer to OnActiveSec=300 and TimeoutSec=320. This give me enough time to log in to the system. During that time, I can view the RAID and everything appears proper, yet 300 seconds later the mdadm-last-resort at .timer expires with an error on /dev/md0?

Perhaps systemd is working normally, but then the question is why is the raid not being assembled properly - which triggers /usr/lib/systemd/system/mdadm-last-resort at .service?

From your suggestion I need to next move to full udev debug logs? Can I delay the assembly of the RAID by the kernel if this is a race condition, as it appears that might be the case? Should that delay be early in the systemd startup process or elsewhere?

Thanks again,
Luke Pyzowski


More information about the systemd-devel mailing list