Replace a damaged local disk for an EMR cluster - E-MapReduce

If a local disk in an E-MapReduce (EMR) cluster is damaged, you will receive a notification about the damage when you use the EMR cluster. Local disks are used in d-series and i-series instance families. This topic describes how to replace a damaged local disk for an EMR cluster.

Precautions

We recommend that you use the method of removing abnormal nodes and adding normal nodes to prevent your business from being affected for a long period of time.
After a disk is replaced, data on the disk is lost. You must make sure that the data has replicas and is backed up before you replace the disk.
To replace a damaged local disk, you need to stop services, unmount the damaged disk, mount a new disk, and restart services. In most cases, the entire replacement process can be complete within five business days. Before you perform the operations described in this topic, you must evaluate the disk usage of the related services and the cluster load and determine whether the cluster can handle the business traffic after the services are stopped.

Procedure

You can log on to the Elastic Compute Service (ECS) console to view the event details, including the instance ID, instance status, damaged disk ID, event progress, and information about related operations.

Step 1: Obtain information about the damaged disk

Log on to the node where the damaged disk is deployed in SSH mode. For more information, see Log on to a cluster.

Run the following command to view the block device information:

lsblk

Information similar to the following output is returned:

NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vdd    254:48   0  5.4T  0 disk /mnt/disk3
vdb    254:16   0  5.4T  0 disk /mnt/disk1
vde    254:64   0  5.4T  0 disk /mnt/disk4
vdc    254:32   0  5.4T  0 disk /mnt/disk2
vda    254:0    0  120G  0 disk
└─vda1 254:1    0  120G  0 part /

Run the following command to view the disk information:

sudo fdisk -l

Information similar to the following output is returned:

Disk /dev/vdd: 5905.6 GB, 5905580032000 bytes, 11534336000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

Record the device name and mount point of the damaged disk in the returned information in the preceding two steps. You need to use the device name and mount point to replace $device_name and $mount_path in subsequent commands.
In this example, the device name is /dev/vdd, and the mount point is /mnt/disk3.

Step 2: Isolate the damaged local disk

Stop the services that read data from or write data to the damaged disk.
Find the desired cluster in the EMR console and go to the Services tab of the cluster. On the Services tab, find the services that read data from or write data to the damaged local disk and then stop the services. In most cases, you need to stop storage services such as Hadoop Distributed File System (HDFS), HBase, and Kudu. To stop a service, perform the following operations: On the Services tab, find the desired service, move the pointer over the icon and select Stop. In the dialog box that appears, configure the Execution Reason parameter and click OK. In the Confirm message, click OK.
You can run the sudo fuser -mv $device_name command on the node where the damaged local disk is deployed to view processes that occupy the disk and then stop the related services in the EMR console.
Run the following command to deny the read and write operations on the damaged local disk at the application layer:
```
sudo chmod 000 $mount_path
```
Run the following command to unmount the damaged local disk:
```
sudo umount $device_name;sudo chmod 000 $mount_path
```
Important
The device name of the local disk changes after it is repaired. Therefore, if you do not unmount the local disk in this step, services may read data from or write data to a wrong disk after the local disk is repaired and used as a new disk.
Update the fstab file.
1. Back up the fstab file in the /etc directory.
2. Delete records related to the damaged local disk from the fstab file in the /etc directory.
  In this example, records related to the disk dev/vdd are deleted.
Start the services that you stopped.
Find the services that you stopped in Step 2 on the Services tab of the desired cluster and then start the services. To start a service, perform the following operations: On the Services tab, find the desired service, move the pointer over the icon and select Start. In the dialog box that appears, configure the Execution Reason parameter and click OK. In the Confirm message, click OK.

Step 3: Replace the disk

Repair the disk in the ECS console. For more information, see Isolate damaged local disks in the ECS console.

Step 4: Mount the new disk

After the disk is repaired, you need to mount the disk as a new disk.

Run the following command to normalize device names:
```
device_name=`echo "$device_name" | sed 's/x//1'`
```
The preceding command normalizes the directory names such as /dev/xvdk with the letter x removed. The name is changed to /dev/vdk.
Run the following command to create a mount directory:
```
 mkdir -p "$mount_path"
```
Run the following command to mount the disk to the mount directory:
```
mount $device_name $mount_path;sudo chmod 755 $mount_path
```
If the disk fails to be mounted, perform the following steps:
1. Run the following command to format the disk:
```
fdisk $device_name << EOF
n
p
1

wq
EOF
```
2. Run the following command to remount the disk:
```
mount $device_name $mount_path;sudo chmod 755 $mount_path
```
Run the following command to modify the fstab file:
```
echo "$device_name $mount_path $fstype defaults,noatime,nofail 0 0" >> /etc/fstab
```
Note
You can run the which mkfs.ext4 command to check whether ext4 exists. If ext4 exists, replace $fstype with ext4. If ext4 does not exist, replace $fstype with ext3.

Create a script file and select code based on the cluster type.

Hadoop cluster in the original data lake scenario:

while getopts p: opt
do
	case "${opt}" in
  	p) mount_path=${OPTARG};;
  esac
done

mkdir -p $mount_path/data
chown hdfs:hadoop $mount_path/data
chmod 1777 $mount_path/data

mkdir -p $mount_path/hadoop
chown hadoop:hadoop $mount_path/hadoop
chmod 775 $mount_path/hadoop

mkdir -p $mount_path/hdfs
chown hdfs:hadoop $mount_path/hdfs
chmod 755 $mount_path/hdfs

mkdir -p $mount_path/yarn
chown hadoop:hadoop $mount_path/yarn
chmod 755 $mount_path/yarn

mkdir -p $mount_path/kudu/master
chown kudu:hadoop $mount_path/kudu/master
chmod 755 $mount_path/kudu/master

mkdir -p $mount_path/kudu/tserver
chown kudu:hadoop $mount_path/kudu/tserver
chmod 755 $mount_path/kudu/tserver

mkdir -p $mount_path/log
chown hadoop:hadoop $mount_path/log
chmod 775 $mount_path/log

mkdir -p $mount_path/log/hadoop-hdfs
chown hdfs:hadoop $mount_path/log/hadoop-hdfs
chmod 775 $mount_path/log/hadoop-hdfs

mkdir -p $mount_path/log/hadoop-yarn
chown hadoop:hadoop $mount_path/log/hadoop-yarn
chmod 755 $mount_path/log/hadoop-yarn

mkdir -p $mount_path/log/hadoop-mapred
chown hadoop:hadoop $mount_path/log/hadoop-mapred
chmod 755 $mount_path/log/hadoop-mapred

mkdir -p $mount_path/log/kudu
chown kudu:hadoop $mount_path/log/kudu
chmod 755 $mount_path/log/kudu

mkdir -p $mount_path/run
chown hadoop:hadoop $mount_path/run
chmod 777 $mount_path/run

mkdir -p $mount_path/tmp
chown hadoop:hadoop $mount_path/tmp
chmod 777 $mount_path/tmp

Other clusters:

while getopts p: opt
do
	case "${opt}" in
  	p) mount_path=${OPTARG};;
  esac
done

sudo mkdir -p $mount_path/flink
sudo chown flink:hadoop $mount_path/flink
sudo chmod 775 $mount_path/flink

sudo mkdir -p $mount_path/hadoop
sudo chown hadoop:hadoop $mount_path/hadoop
sudo chmod 755 $mount_path/hadoop

sudo mkdir -p $mount_path/hdfs
sudo chown hdfs:hadoop $mount_path/hdfs
sudo chmod 750 $mount_path/hdfs

sudo mkdir -p $mount_path/yarn
sudo chown root:root $mount_path/yarn
sudo chmod 755 $mount_path/yarn

sudo mkdir -p $mount_path/impala
sudo chown impala:hadoop $mount_path/impala
sudo chmod 755 $mount_path/impala

sudo mkdir -p $mount_path/jindodata
sudo chown root:root $mount_path/jindodata
sudo chmod 755 $mount_path/jindodata

sudo mkdir -p $mount_path/jindosdk
sudo chown root:root $mount_path/jindosdk
sudo chmod 755 $mount_path/jindosdk

sudo mkdir -p $mount_path/kafka
sudo chown root:root $mount_path/kafka
sudo chmod 755 $mount_path/kafka

sudo mkdir -p $mount_path/kudu
sudo chown root:root $mount_path/kudu
sudo chmod 755 $mount_path/kudu

sudo mkdir -p $mount_path/mapred
sudo chown root:root $mount_path/mapred
sudo chmod 755 $mount_path/mapred

sudo mkdir -p $mount_path/starrocks
sudo chown root:root $mount_path/starrocks
sudo chmod 755 $mount_path/starrocks

sudo mkdir -p $mount_path/clickhouse
sudo chown clickhouse:clickhouse $mount_path/clickhouse
sudo chmod 755 $mount_path/clickhouse

sudo mkdir -p $mount_path/doris
sudo chown root:root $mount_path/doris
sudo chmod 755 $mount_path/doris

sudo mkdir -p $mount_path/log
sudo chown root:root $mount_path/log
sudo chmod 755 $mount_path/log

sudo mkdir -p $mount_path/log/clickhouse
sudo chown clickhouse:clickhouse $mount_path/log/clickhouse
sudo chmod 755 $mount_path/log/clickhouse

sudo mkdir -p $mount_path/log/kafka
sudo chown kafka:hadoop $mount_path/log/kafka
sudo chmod 755 $mount_path/log/kafka

sudo mkdir -p $mount_path/log/kafka-rest-proxy
sudo chown kafka:hadoop $mount_path/log/kafka-rest-proxy
sudo chmod 755 $mount_path/log/kafka-rest-proxy

sudo mkdir -p $mount_path/log/kafka-schema-registry
sudo chown kafka:hadoop $mount_path/log/kafka-schema-registry
sudo chmod 755 $mount_path/log/kafka-schema-registry

sudo mkdir -p $mount_path/log/cruise-control
sudo chown kafka:hadoop $mount_path/log/cruise-control
sudo chmod 755 $mount_path/log/cruise-control

sudo mkdir -p $mount_path/log/doris
sudo chown doris:doris $mount_path/log/doris
sudo chmod 755 $mount_path/log/doris

sudo mkdir -p $mount_path/log/celeborn
sudo chown hadoop:hadoop $mount_path/log/celeborn
sudo chmod 755 $mount_path/log/celeborn

sudo mkdir -p $mount_path/log/flink
sudo chown flink:hadoop $mount_path/log/flink
sudo chmod 775 $mount_path/log/flink

sudo mkdir -p $mount_path/log/flume
sudo chown root:root $mount_path/log/flume
sudo chmod 755 $mount_path/log/flume

sudo mkdir -p $mount_path/log/gmetric
sudo chown root:root $mount_path/log/gmetric
sudo chmod 777 $mount_path/log/gmetric

sudo mkdir -p $mount_path/log/hadoop-hdfs
sudo chown hdfs:hadoop $mount_path/log/hadoop-hdfs
sudo chmod 755 $mount_path/log/hadoop-hdfs

sudo mkdir -p $mount_path/log/hbase
sudo chown hbase:hadoop $mount_path/log/hbase
sudo chmod 755 $mount_path/log/hbase

sudo mkdir -p $mount_path/log/hive
sudo chown root:root $mount_path/log/hive
sudo chmod 775 $mount_path/log/hive

sudo mkdir -p $mount_path/log/impala
sudo chown impala:hadoop $mount_path/log/impala
sudo chmod 755 $mount_path/log/impala

sudo mkdir -p $mount_path/log/jindodata
sudo chown root:root $mount_path/log/jindodata
sudo chmod 777 $mount_path/log/jindodata

sudo mkdir -p $mount_path/log/jindosdk
sudo chown root:root $mount_path/log/jindosdk
sudo chmod 777 $mount_path/log/jindosdk

sudo mkdir -p $mount_path/log/kyuubi
sudo chown kyuubi:hadoop $mount_path/log/kyuubi
sudo chmod 755 $mount_path/log/kyuubi

sudo mkdir -p $mount_path/log/presto
sudo chown presto:hadoop $mount_path/log/presto
sudo chmod 755 $mount_path/log/presto

sudo mkdir -p $mount_path/log/spark
sudo chown spark:hadoop $mount_path/log/spark
sudo chmod 755 $mount_path/log/spark

sudo mkdir -p $mount_path/log/sssd
sudo chown sssd:sssd $mount_path/log/sssd
sudo chmod 750 $mount_path/log/sssd

sudo mkdir -p $mount_path/log/starrocks
sudo chown starrocks:starrocks $mount_path/log/starrocks
sudo chmod 755 $mount_path/log/starrocks

sudo mkdir -p $mount_path/log/taihao_exporter
sudo chown taihao:taihao $mount_path/log/taihao_exporter
sudo chmod 755 $mount_path/log/taihao_exporter

sudo mkdir -p $mount_path/log/trino
sudo chown trino:hadoop $mount_path/log/trino
sudo chmod 755 $mount_path/log/trino

sudo mkdir -p $mount_path/log/yarn
sudo chown hadoop:hadoop $mount_path/log/yarn
sudo chmod 755 $mount_path/log/yarn

Run the following commands to run the script file, create service catalogs, and then delete the script file. $file_path specifies the path in which the script file is stored.
```
chmod +x $file_path
sudo $file_path -p $mount_path
rm $file_path
```
Use the new disk.
Restart the services that are running on the node in the EMR console and check whether the disk works properly.