Software RAID
From UF HPC Wiki
Software RAID fsck
We ran into a situation where the root filesystem of a node with a RAID 1 configuration had just enough corruption that the node would not boot. To fsck the filesystem, we had to boot the system from a rescue disk and run fsck from there. The procedure went something like this:
- Boot system with repair/rescue disk
- Allow rescue system to detect the installation
- NOTE: Be very careful if there are other disks in the system! In this case, we had twelve Lustre-based LUNs that the rescue system did not recognize. By default the rescue system considers drives that do not have a partition table to be uninitialized, and its default response is to initialize these disks with a partition table. This is probably not something you want to do, so be VERY careful at this step.
- Once the system has recognized your disk, you will be dumped to a prompt.
- You then need to unmount the filesystems it has detected and mounted. This can be a tricky business, because the rescue system mounts a large number of items from the system disk that you may not realize are mounted, and so the unmount fails. You will probably have the following mounted:
- /mnt/system/dev
- /mnt/system/proc
- /mnt/system/sys
- /mnt/system/boot
- /mnt/system
- For the above, you want to unmount everything that is mounted under /mnt/system before unmounting /mnt/system itself; otherwise you will be presented with a somewhat cryptic error message that doesn't tell you anything about WHY it won't unmount. The currently mounted filesystems are listed in /proc/mounts.
- Once everything is unmounted, you should be able to run fsck on the md devices. In our case we had to run fsck.ext3 -f on the filesystem in order to detect what was really wrong with it.
- Once we were done with this, we rebooted and the system booted fine, except that we had inexplicably lost the tr command, which caused a number of services to fail their start. This command was recovered by reinstalling the associated RPM.
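The unmount ordering described above can be sketched as a small shell helper. This is only a sketch under stated assumptions, not necessarily the exact commands we typed at the time: the function name mounts_under is hypothetical, /dev/md0 is an assumed device name (check /proc/mdstat for the real one), and it assumes awk and sort are available in the rescue environment.

```shell
#!/bin/sh
# mounts_under PREFIX [TABLE]
# Hypothetical helper: print every mount point at or below PREFIX,
# children first, so each can be unmounted before its parent.
# TABLE defaults to /proc/mounts; field 2 of each line is the mount point.
mounts_under() {
    prefix=$1
    table=${2:-/proc/mounts}
    # Keep PREFIX itself and anything under PREFIX/ (not e.g. PREFIX2).
    # A parent path always sorts before its children in the C locale,
    # so a reverse sort lists the children ahead of the parent.
    awk -v p="$prefix" '$2 == p || index($2, p "/") == 1 { print $2 }' "$table" |
        LC_ALL=C sort -r
}

# Usage from the rescue prompt (left commented out; it changes system state):
#   for m in $(mounts_under /mnt/system); do umount "$m"; done
#   fsck.ext3 -f /dev/md0    # /dev/md0 is an assumed md device name
```

Listing the mount points first, before running any umount, makes it easier to spot an entry you did not expect. Mount points containing spaces are escaped in /proc/mounts and are not handled by this sketch.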
