Replacing a Disk in RAID

Mounting a File System

After booting into Rescue mode, you need to manually mount the file system (if this did not happen automatically). To do this, run the following:

infiltrate-root

If infiltrate-root fails, one possible reason is that the RAID has not been assembled.

Check the available disks and partitions by running the following:

fdisk -l

If there are no md devices listed, but sda, sdb, and so on have partitions of the “Linux RAID” type, you need to assemble the RAID manually.
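
For reference, on an MBR disk the relevant part of the fdisk -l output might look like this (values are illustrative; the Id fd marks “Linux raid autodetect” partitions):

Device     Boot   Start        End    Sectors   Size Id Type
/dev/sda1          2048       4095       2048     1M 83 Linux
/dev/sda2          4096    2004991    2000896   977M fd Linux raid autodetect
/dev/sda3       2004992 1953523711 1951518720 930.6G fd Linux raid autodetect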

To access the file system:

  1. Boot into Rescue.
  2. Run the following:

    mdadm --assemble /dev/md0 /dev/sda2 /dev/sdb2
    mdadm --assemble /dev/md1 /dev/sda3 /dev/sdb3

    Please note that you need to adapt the device names to match your disks. See the mdadm documentation for details. Alternatively, mdadm can often assemble the arrays automatically; see the sketch after this list.

  3. When mounting, specify the direct path to the vg-root partition as an argument by running the following:

    infiltrate-root /dev/mapper/vg0-root
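
If you are unsure which partitions belong to which array, mdadm can usually detect and assemble everything on its own from the superblock metadata (a convenient alternative, assuming the superblocks are intact):

mdadm --assemble --scan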

Example of Disk Replacement

The server has two disks, /dev/sda and /dev/sdb, assembled into a software RAID1 using mdadm.

Let’s say one of the disks failed, for example, /dev/sdb.

Removing a Disk From the Array

Please note that before replacing a disk, it is advisable to remove it from the array.

View the array state by running the following:

cat /proc/mdstat 

Personalities : [raid1] 
md1 : active raid1 sda3[0] sdb3[1]
      975628288 blocks super 1.2 [2/2] [UU]
      bitmap: 3/8 pages [12KB], 65536KB chunk

md0 : active raid1 sda2[2] sdb2[1]
      999872 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>

In this case, the array is assembled so that md0 consists of sda2 and sdb2, and md1 consists of sda3 and sdb3.

On this server, md0 is /boot, while md1 holds the LVM volumes for swap and root.

lsblk
NAME             MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop0              7:0    0   985M  1 loop  
sda                8:0    0 931.5G  0 disk  
├─sda1             8:1    0     1M  0 part  
├─sda2             8:2    0   977M  0 part  
│ └─md0            9:0    0 976.4M  0 raid1 
└─sda3             8:3    0 930.6G  0 part  
  └─md1            9:1    0 930.4G  0 raid1 
    ├─vg0-swap_1 253:0    0   4.8G  0 lvm   
    └─vg0-root   253:1    0 925.7G  0 lvm   /
sdb                8:16   0 931.5G  0 disk  
├─sdb1             8:17   0     1M  0 part  
├─sdb2             8:18   0   977M  0 part  
│ └─md0            9:0    0 976.4M  0 raid1 
└─sdb3             8:19   0 930.6G  0 part  
  └─md1            9:1    0 930.4G  0 raid1 
    ├─vg0-swap_1 253:0    0   4.8G  0 lvm   
    └─vg0-root   253:1    0 925.7G  0 lvm   /

Remove the sdb partitions from both arrays:

mdadm /dev/md0 --remove /dev/sdb2
mdadm /dev/md1 --remove /dev/sdb3

If the partitions have not been marked as failed first (as in this case), mdadm still considers the disk healthy and keeps using it, so the removal fails with an error saying that the device is in use.

In this case, mark the disk as failed before removing it:

mdadm /dev/md0 -f /dev/sdb2
mdadm /dev/md1 -f /dev/sdb3

Run the removal commands again; this time they should succeed.
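
Both steps can also be combined into a single mdadm call per array:

mdadm /dev/md0 --fail /dev/sdb2 --remove /dev/sdb2
mdadm /dev/md1 --fail /dev/sdb3 --remove /dev/sdb3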

After removing the failed disk from the array, request a replacement by creating a ticket that specifies the serial number (s/n) of the failed disk. Whether downtime is required depends on the server configuration.
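
To look up the serial number, smartctl from the smartmontools package can help (assuming the disk still responds to queries):

smartctl -i /dev/sdb | grep -i serial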

Determining the Partition Table Type (GPT or MBR) and Copying It to the New Disk

After the failed disk has been replaced, you need to add the new disk to the array. To do this, first determine the partition table type, GPT or MBR, using the gdisk utility.

Install gdisk:

apt-get install gdisk -y

Run the following:

gdisk -l /dev/sda

where /dev/sda is the healthy disk in the RAID.

The output looks something like this for MBR:

Partition table scan:
MBR: MBR only
BSD: not present
APM: not present
GPT: not present

And something like this for GPT:

Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present
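
With a reasonably recent util-linux you can also query the partition table type directly; this prints gpt for GPT and dos for MBR:

lsblk -dno PTTYPE /dev/sda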

Before adding the disk to the array, you need to create the same partitions on it as on /dev/sda. The procedure depends on the partition table type.

Please note that the disk the layout is copied to is specified first, and the disk the layout is copied from is specified second. If you swap them, the layout on the healthy disk will be destroyed.

Copying the Partition Layout for GPT:

sgdisk -R /dev/sdb /dev/sda

Then assign new random GUIDs to the disk and its partitions, so they do not clash with the originals:

sgdisk -G /dev/sdb

Copying the Partition Layout for MBR:

sfdisk -d /dev/sda | sfdisk /dev/sdb

Please note that here the order is the opposite: the disk the layout is copied from comes first, and the disk it is copied to comes second.

If the new partitions do not show up in the system, re-read the partition table by running the following:

sfdisk -R /dev/sdb
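
On newer systems the rewritten sfdisk may no longer accept -R; in that case partprobe from the parted package does the same job, and lsblk gives a quick sanity check that both disks now show identical layouts (optional):

partprobe /dev/sdb
lsblk /dev/sda /dev/sdb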

Adding a Disk to the Array

Once the partitions on /dev/sdb have been created, you can add the disk to the array:

mdadm /dev/md0 -a /dev/sdb2
mdadm /dev/md1 -a /dev/sdb3

After adding the disk to the array, synchronization starts. Its speed depends on the disk size and type (SSD or HDD).

cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 sda3[1] sdb3[0]
      975628288 blocks super 1.2 [2/1] [U_]
      [============>........]  recovery = 64.7% (632091968/975628288) finish=41.1min speed=139092K/sec
      bitmap: 3/8 pages [12KB], 65536KB chunk

md0 : active raid1 sda2[2] sdb2[1]
      999872 blocks super 1.2 [2/2] [UU]
      
unused devices: <none>
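
To follow the rebuild without re-running the command, and optionally to raise the kernel's rebuild speed limits, you can use the following (the sysctl values are illustrative; tune them to your hardware):

watch -n 5 cat /proc/mdstat
sysctl -w dev.raid.speed_limit_min=50000
sysctl -w dev.raid.speed_limit_max=500000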

Installing a Boot Loader

After adding the disk to the array, you need to install a boot loader on it.

If the server is booted in normal mode or inside infiltrate-root (which we entered earlier), this can be done by running the following:

grub-install /dev/sdb
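
On Debian-family systems it may also be worth regenerating the grub configuration afterwards (optional):

update-grub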

If the server is booted into recovery or rescue mode, i.e., from a live CD, the boot loader installation looks like this:

  1. Activate LVM and mount the root file system to /mnt (on this server, root is the vg0-root LVM volume, not a separate md device):

    vgchange -ay
    mount /dev/mapper/vg0-root /mnt
  2. Mount boot:

    mount /dev/md0 /mnt/boot
  3. Mount /dev, /proc, and /sys:

    mount --bind /dev /mnt/dev
    mount --bind /proc /mnt/proc
    mount --bind /sys  /mnt/sys
  4. Chroot into the mounted file system:

    chroot /mnt
  5. Install grub on sdb:

    grub-install /dev/sdb
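
When grub-install finishes, leave the chroot and unmount everything in reverse order (a sketch of the cleanup):

exit
umount /mnt/sys /mnt/proc /mnt/dev /mnt/boot /mnt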

Now you can try to boot into normal mode.

Replacing a Failed Disk

You can manually mark a disk in the array as failed using --fail (-f):

mdadm /dev/md0 --fail /dev/sda1
mdadm /dev/md0 -f     /dev/sda1

You can remove the failed disk using --remove (-r):

mdadm /dev/md0 --remove /dev/sda1
mdadm /dev/md0 -r       /dev/sda1

You can add a new disk to the array using --add (-a), or re-add a previously removed member using --re-add:

mdadm /dev/md0 --add /dev/sda1
mdadm /dev/md0 -a    /dev/sda1
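
After any of these operations, you can inspect the array state in detail:

mdadm --detail /dev/md0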

Error While Restoring the Boot Loader After Replacing the Disk in RAID1

If the following error appears while installing grub:

grub-install --root-directory=/boot /dev/sda
Could not find device for /boot/boot: not found or not a block device

Run the following:

grep -v rootfs /proc/mounts > /etc/mtab
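
This repopulates /etc/mtab, which grub-install reads. On modern systems /etc/mtab is usually a symlink to /proc/self/mounts, so restoring that symlink is an equivalent fix (assuming your distribution uses the symlink):

ln -sf /proc/self/mounts /etc/mtab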