View Issue Details

ID: 0000227
Project: AlmaLinux-8
Category: systemd
View Status: public
Last Update: 2022-05-05 01:45

Reporter: m4
Assigned To:
Status: new
Resolution: open
Platform: x86_64
OS: AlmaLinux
OS Version: 8.5
Summary: 0000227: System will not boot if a raid goes into degraded mode while the system is down, and "root" is not physically on those drives.
Description: If a raid1 (or other level) that is not on the physical disks holding 'root' goes degraded while the system is powered down, and it is required for the system to run (listed in /etc/fstab), the system drops into dracut emergency mode and "mdadm -R /dev/mdXYZ" must be run manually.

If both the root disk(s) and a raid on a non-root physical disk go degraded at the same time, then "perhaps" both will come up in degraded mode.

See Additional information.
Steps To Reproduce: System testing setup:
1) root raid1 on two SAS disks (sda/sdb) and other raids on those too (home, var, etc.).
2) two other disks with a required (/etc/fstab) raid1 -- say /media/private_data, on /dev/nvme0n1p1 and /dev/nvme1n1p1.
Test cases to see if system runs with disks degraded:
A) system running:
    a) unplug one SAS. cat /proc/mdstat, etc. Plug it back in. Fix if needed via mdadm. Repeat with other.
    b) unplug one NVMe. Etc. per "a)".
B) Everything running nicely, then power down:
   c) unplug one SAS. Power up, etc.
   d) unplug one NVMe. Power up, etc.

With B)d), the system will go into dracut emergency recovery mode every time. Remember to have the system running nicely before powering down. :)
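For reference, a sketch of the manual recovery needed in the dracut emergency shell for case B)d). The device name /dev/md127 is illustrative, and the MDADM variable exists only so the sketch can be exercised without real hardware:

```shell
# Sketch of the manual recovery in the dracut emergency shell.
# MDADM is parameterized only for testing without real hardware;
# on a real system it is simply /sbin/mdadm.
MDADM="${MDADM:-/sbin/mdadm}"

start_degraded() {
    # mdadm -R (--run) activates the array even with members missing
    "$MDADM" -R "$1"
}

# In the emergency shell one would do roughly:
#   cat /proc/mdstat            # the stuck array shows as "inactive"
#   start_degraded /dev/md127   # illustrative device name
#   exit                        # resume the boot
```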
Additional Information: When the initramfs root is running and looking for the 'real root', the various raids are discovered, and systemd/udev rules cause "mdadm -I /dev/ABCDWYXZ --export" to be run the first (and ONLY first) time a member of a software raid appears. That sets up a last-resort timer in case not all members of the raid show up, while udev continues processing devices to find disks/partitions. As soon as "root" is found and active (it could have been degraded when the system went down, which is no different than having all disks present), the real root is set up for pivot root, systemd is taken down, and the pivot root occurs -- at which point all outstanding last-resort timers are lost. The real systemd starts and redoes the udev device discovery, but for any raid already known to the kernel (root is the important one!), no last-resort timers are set.

Any raid on the same physical devices as the root raid "may" (most likely will) already have all its devices found, and its last-resort timers will have gone off if one went off for root. Given the long time it takes a raid to become active, those raids are most likely also active.

Continuing the thought process (ignoring the side track in the paragraph above): when systemd is busy processing /etc/fstab to mount file systems (asynchronously), everything is fine -- except for the mandatory raid that went degraded while powered off. It has no last-resort timer left to go off, so it sits there inactive.
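The stuck state is easy to detect: the array's sysfs array_state stays "inactive". A minimal sketch of that check (the sysfs root is a parameter purely so it can be tried against a scratch directory):

```shell
# Report md arrays whose array_state is "inactive" (the state the
# degraded, timer-less raid is stuck in). Default path is the real
# sysfs; a different root can be passed in for testing.
check_inactive() {
    sysfs="${1:-/sys/devices/virtual/block}"
    for dir in "$sysfs"/md*; do
        [ -e "$dir/md/array_state" ] || continue
        read -r state < "$dir/md/array_state"
        [ "$state" = "inactive" ] && echo "${dir##*/}"
    done
    return 0
}
```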

I am attaching a work-around. Possible fixes: 1) Clean out all md raids from the kernel's data structures (except root -- nested raids, etc. add complexity) when systemd goes down to restart on the real root file system. 2) Copy the outstanding last-resort timers into the real root before shutting systemd down, and make sure they still run when the real systemd comes up. 3) Have "mdadm --export" always return the environment-style variables, not just when a /dev/mdXYZ is created for the first device found. 4) Eliminate the use of initramfs, taking systemd down, and replaying the device discovery (udev) process.
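Regarding fix 3: udev's IMPORT{program} consumes KEY=VALUE lines from the program's stdout (the MD_STARTED/MD_FOREIGN variables matched in the attached rules arrive this way), and mdadm only emits them for the first member device of a new array. A rough shell simulation of that import step; the MD_* filter and the sample values in the usage comment are illustrative:

```shell
# Mimic udev's IMPORT{program}: read KEY=VALUE lines from stdin,
# import the MD_* ones into the environment, and echo what was
# imported. The MD_* filter is illustrative.
import_md_env() {
    while IFS='=' read -r key val; do
        case "$key" in
            MD_*) eval "$key=\"\$val\""; echo "$key=$val" ;;
        esac
    done
}

# e.g.:  printf 'MD_STARTED=unsafe\nMD_FOREIGN=no\n' | import_md_env
```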

I don't know, all of this is convoluted ... and I expect other similar situations exist with all the messiness involved.
Tags: No tags attached.





65-md-incremental.rules (3,840 bytes)   
# This file causes block devices with Linux RAID (mdadm) signatures to
# automatically cause mdadm to be run.
# See udev(8) for syntax

# Don't process any events if anaconda is running as anaconda brings up
# raid devices manually
ENV{ANACONDA}=="?*", GOTO="md_end"

# Also don't process disks that are slated to be a multipath device

# We process add events on block devices (since they are ready as soon as
# they are added to the system), but we must process change events as well
# on any dm devices (like LUKS partitions or LVM logical volumes) and on
# md devices because both of these first get added, then get brought live
# and trigger a change event.  The reason we don't process change events
# on bare hard disks is because if you stop all arrays on a disk, then
# run fdisk on the disk to change the partitions, when fdisk exits it
# triggers a change event, and we want to wait until all the fdisks on
# all member disks are done before we do anything.  Unfortunately, we have
# no way of knowing that, so we just have to let those arrays be brought
# up manually after fdisk has been run on all of the disks.

# First, process all add events (md and dm devices will not really do
# anything here, just regular disks, and this also won't get any imsm
# array members either)
SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
	IMPORT{program}="/sbin/mdadm -I $env{DEVNAME} --export $devnode --offroot $env{DEVLINKS}"
SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
	ENV{MD_STARTED}=="*unsafe*", ENV{MD_FOREIGN}=="no", ENV{SYSTEMD_WANTS}+="[email protected]$env{MD_DEVICE}.timer"
# Start a background job that sleeps, then does a "mdadm -R" like timeout would if those rules worked. :)
SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", PROGRAM="/usr/local/sbin/md-run-timer"
SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
	ENV{ID_FS_TYPE}=="linux_raid_member", \
	RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}!="?*", \
	ENV{ID_FS_TYPE}=="linux_raid_member", \
	RUN+="/sbin/mdadm -If $name"

# Next, check to make sure the BIOS raid stuff wasn't turned off via cmdline
ENV{noiswmd}=="?*", GOTO="md_imsm_inc_end"
ENV{nodmraid}=="?*", GOTO="md_imsm_inc_end"
SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="isw_raid_member", \
	RUN+="/sbin/mdadm -I $env{DEVNAME}"
SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="ddf_raid_member", \
	RUN+="/sbin/mdadm -I $env{DEVNAME}"
SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}=="?*", \
	ENV{ID_FS_TYPE}=="isw_raid_member", \
	RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
SUBSYSTEM=="block", ACTION=="remove", ENV{ID_PATH}!="?*", \
	ENV{ID_FS_TYPE}=="isw_raid_member", \
	RUN+="/sbin/mdadm -If $name"

# Next make sure that this isn't a dm device we should skip for some reason
ENV{DM_UDEV_RULES_VSN}!="?*", GOTO="dm_change_end"
ENV{DM_SUSPENDED}=="1", GOTO="dm_change_end"
KERNEL=="dm-*", SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="linux_raid_member", \
	ACTION=="change", RUN+="/sbin/mdadm -I $env{DEVNAME}"

# Finally catch any nested md raid arrays.  If we brought up an md raid
# array that's part of another md raid array, it won't be ready to be used
# until the change event that occurs when it becomes live
KERNEL=="md*", SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="linux_raid_member", \
	ACTION=="change", RUN+="/sbin/mdadm -I $env{DEVNAME}"

dracut.conf (207 bytes)   
# PUT YOUR CONFIG IN separate files
# in /etc/dracut.conf.d named "<name>.conf"
# SEE man dracut.conf(5) for options

install_items="/usr/local/sbin/md-run-timer /etc/udev/rules.d/65-md-incremental.rules"

md-run-timer (728 bytes)   
#!/bin/bash
# Work-around: after a delay, force-start (mdadm -R) any md array that
# is still inactive, standing in for the lost last-resort timers.
# (The attachment was truncated here; the doit() wrapper and loop
# closings are reconstructed from the call below.)
doit() {
    F=`builtin cd /sys/devices/virtual/block/; echo md*`
    for RAID in $F; do
	if [ -e "/sys/devices/virtual/block/$RAID/md/array_state" ]; then
	    read v < /sys/devices/virtual/block/$RAID/md/array_state
	    if [ "$v" == "inactive" ]; then
		echo "md-run-timer $RAID is inactive, attempting to start" > /dev/kmsg
		/usr/sbin/mdadm -R /dev/$RAID > /dev/kmsg
		sleep 1
	    fi
	fi
    done
    unset F
    unset RAID
    unset v
}
(sleep 15; doit || true ) &
exit 0

Issue History

Date Modified Username Field Change
2022-05-05 01:44 m4 New Issue
2022-05-05 01:44 m4 File Added: 65-md-incremental.rules
2022-05-05 01:44 m4 File Added: dracut.conf
2022-05-05 01:44 m4 File Added: md-run-timer