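Precautions
We recommend that you resolve this type of issue by removing the abnormal node and adding a new node, so that your business is not affected for an extended period.
After a disk is replaced, the data on that disk is lost. Make sure that the data on the disk has enough replicas, and back it up promptly.
The full replacement procedure consists of stopping services, unmounting the disk, mounting the new disk, and restarting services. Disk replacement is usually completed within five business days. Before you follow this document, evaluate whether the disk usage and cluster load of the affected services can sustain your current workload while the services are stopped.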
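Procedure
You can log on to the ECS console to view the event details, including the instance ID, status, damaged disk ID, event progress, and related operations.
Step 1: Obtain information about the damaged disk
Log on over SSH to the node that hosts the damaged disk. For details, see Log on to a cluster.
Run the following command to view block device information.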
lsblk
Output similar to the following is returned.
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vdd 254:48 0 5.4T 0 disk /mnt/disk3
vdb 254:16 0 5.4T 0 disk /mnt/disk1
vde 254:64 0 5.4T 0 disk /mnt/disk4
vdc 254:32 0 5.4T 0 disk /mnt/disk2
vda 254:0 0 120G 0 disk
└─vda1 254:1 0 120G 0 part /
Run the following command to view disk information.
sudo fdisk -l
Output similar to the following is returned.
Disk /dev/vdd: 5905.6 GB, 5905580032000 bytes, 11534336000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
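Based on the output of the two preceding commands, record the device name $device_name and the mount point $mount_path.
For example, if the device in the damaged-disk event is vdd, the device name is /dev/vdd and the mount point is /mnt/disk3.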
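The lookup above can be sketched in shell. This is only an illustration: mount_point_of is a hypothetical helper, the sample text is a shortened copy of the lsblk output shown earlier, and the column positions assume the default lsblk layout (NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT).

```shell
# Hypothetical helper: print the mount point that lsblk reports for a disk.
# Reads lsblk output on stdin; $1 is the bare device name, for example vdd.
mount_point_of() {
    awk -v dev="$1" '$1 == dev && $6 == "disk" { print $7 }'
}

device=vdd                      # device reported in the damaged-disk event
device_name="/dev/${device}"

# Sample lsblk output copied from this document; on the node, pipe real
# lsblk output into the helper instead.
sample='NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vdd 254:48 0 5.4T 0 disk /mnt/disk3
vdb 254:16 0 5.4T 0 disk /mnt/disk1'

mount_path="$(printf '%s\n' "$sample" | mount_point_of "$device")"
echo "$device_name $mount_path"   # prints: /dev/vdd /mnt/disk3
```

On a real node the last assignment would read `mount_path="$(lsblk | mount_point_of "$device")"`.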
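Step 2: Isolate the damaged local disk
Stop the applications that read from or write to the damaged disk.
In the EMR console, click the cluster that hosts the damaged disk. On the Cluster Services tab, find the EMR services that read from or write to the damaged disk, which are typically storage services such as HDFS, HBase, and Kudu, and stop each service from the section of that service.
You can also run the sudo fuser -mv $device_name command on the node to view the complete list of processes that are using the disk, and then stop the listed services in the EMR console.
Run the following command to set application-layer read/write isolation on the local disk.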
sudo chmod 000 $mount_path
Run the following command to unmount the local disk.
sudo umount $device_name;sudo chmod 000 $mount_path
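Important: If you do not unmount the disk, the device name of this local disk changes after the damaged disk is repaired and the isolation is lifted, which may cause applications to read from or write to the wrong disk.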
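Because a skipped unmount has these consequences, it may be worth confirming that the mount point is no longer active before you continue. The following is a minimal sketch, assuming a Linux node where /proc/mounts lists active mounts. is_mounted is a hypothetical helper and is not part of EMR.

```shell
# Hypothetical helper: succeed if a mount point appears in a mounts table.
# $1: mount point to look for
# $2: mounts table to read (defaults to /proc/mounts)
is_mounted() {
    awk -v p="$1" '$2 == p { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Usage on the node:
#   if is_mounted "$mount_path"; then
#       echo "$mount_path is still mounted" >&2
#   fi
```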
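Update the fstab file.
Back up the existing /etc/fstab file.
Delete the record for the damaged disk from the /etc/fstab file.
For example, the damaged disk in this document is /dev/vdd, so you must delete the record for that disk.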
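The backup and deletion can be sketched as a small function. This is an illustration under assumptions, not a command from the procedure: remove_fstab_entry is a hypothetical helper, it assumes GNU sed (for sed -i), and it assumes the record starts with the device name. If your fstab identifies the disk by UUID or label, edit the file by hand instead.

```shell
# Hypothetical helper: back up an fstab file, then drop the record for one device.
# $1: path to the fstab file
# $2: device name as recorded in Step 1, for example /dev/vdd
remove_fstab_entry() {
    fstab_file="$1"
    dev="$2"
    # Keep a copy next to the original before changing anything.
    cp -p "$fstab_file" "${fstab_file}.bak" || return 1
    # Delete only lines whose first field is exactly the device name.
    sed -i "\|^${dev}[[:space:]]|d" "$fstab_file"
}

# Usage on the node (as root):
#   remove_fstab_entry /etc/fstab "$device_name"
```

Matching on the device name followed by whitespace keeps records for other devices with a longer name, such as a partition /dev/vdd1, untouched.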
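Start the applications that you stopped.
On the Cluster Services tab of the cluster that hosts the damaged disk, find the EMR services that you stopped in Step 2, and start each service from the section of that service.
Step 3: Replace the disk
Step 4: Mount the disk
After the disk is repaired, you must remount it so that the new disk can be used.
Run the following command to normalize the device name.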
device_name=`echo "$device_name" | sed 's/x//1'`
This command normalizes a device name such as /dev/xvdk by removing the x, which changes it to /dev/vdk.
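Note that s/x//1 removes the first x anywhere in the string, not only the one in the xvd prefix. As a hedged alternative, the sketch below rewrites only a leading /dev/xvd. normalize_device_name is a hypothetical helper and is not the command this procedure uses.

```shell
# Hypothetical helper: rewrite /dev/xvdX to /dev/vdX and leave other names alone.
normalize_device_name() {
    printf '%s\n' "$1" | sed 's|^/dev/xvd|/dev/vd|'
}

normalize_device_name /dev/xvdk   # prints /dev/vdk
normalize_device_name /dev/vdd    # prints /dev/vdd
```

On the node this would be used as `device_name="$(normalize_device_name "$device_name")"`.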
Run the following command to create the mount directory.
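No command is shown for this step. A minimal sketch, assuming the mount directory is simply the mount point recorded in Step 1, might look like the following. create_mount_dir is a hypothetical helper. It refuses an empty path so that an unset $mount_path does not silently succeed.

```shell
# Assumption: the mount directory is the mount point recorded in Step 1.
# Hypothetical helper; run as root on the node.
create_mount_dir() {
    if [ -z "$1" ]; then
        echo "mount path is empty" >&2
        return 1
    fi
    # -p creates missing parents and does not fail if the directory exists.
    mkdir -p "$1"
}

# Usage on the node:
#   create_mount_dir "$mount_path"
```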
Run the following command to mount the disk.
mount $device_name $mount_path;sudo chmod 755 $mount_path
If the disk fails to mount, perform the following steps:
Run the following command to format the disk.
# The two empty lines after "1" accept fdisk's default first and last sectors.
fdisk $device_name << EOF
n
p
1


wq
EOF
Run the following command to remount the disk.
mount $device_name $mount_path;sudo chmod 755 $mount_path
Run the following command to update the fstab file.
echo "$device_name $mount_path $fstype defaults,noatime,nofail 0 0" >> /etc/fstab
Note: You can run the which mkfs.ext4 command to check whether ext4 is available. If it is, $fstype is ext4. Otherwise, $fstype is ext3.
Create a script file, and choose the script code that matches your cluster type.
DataLake, DataFlow, OLAP, DataServing, and Custom clusters
while getopts p: opt
do
case "${opt}" in
p) mount_path=${OPTARG};;
esac
done
sudo mkdir -p $mount_path/flink
sudo chown flink:hadoop $mount_path/flink
sudo chmod 775 $mount_path/flink
sudo mkdir -p $mount_path/hadoop
sudo chown hadoop:hadoop $mount_path/hadoop
sudo chmod 755 $mount_path/hadoop
sudo mkdir -p $mount_path/hdfs
sudo chown hdfs:hadoop $mount_path/hdfs
sudo chmod 750 $mount_path/hdfs
sudo mkdir -p $mount_path/yarn
sudo chown root:root $mount_path/yarn
sudo chmod 755 $mount_path/yarn
sudo mkdir -p $mount_path/impala
sudo chown impala:hadoop $mount_path/impala
sudo chmod 755 $mount_path/impala
sudo mkdir -p $mount_path/jindodata
sudo chown root:root $mount_path/jindodata
sudo chmod 755 $mount_path/jindodata
sudo mkdir -p $mount_path/jindosdk
sudo chown root:root $mount_path/jindosdk
sudo chmod 755 $mount_path/jindosdk
sudo mkdir -p $mount_path/kafka
sudo chown root:root $mount_path/kafka
sudo chmod 755 $mount_path/kafka
sudo mkdir -p $mount_path/kudu
sudo chown root:root $mount_path/kudu
sudo chmod 755 $mount_path/kudu
sudo mkdir -p $mount_path/mapred
sudo chown root:root $mount_path/mapred
sudo chmod 755 $mount_path/mapred
sudo mkdir -p $mount_path/starrocks
sudo chown root:root $mount_path/starrocks
sudo chmod 755 $mount_path/starrocks
sudo mkdir -p $mount_path/clickhouse
sudo chown clickhouse:clickhouse $mount_path/clickhouse
sudo chmod 755 $mount_path/clickhouse
sudo mkdir -p $mount_path/doris
sudo chown root:root $mount_path/doris
sudo chmod 755 $mount_path/doris
sudo mkdir -p $mount_path/log
sudo chown root:root $mount_path/log
sudo chmod 755 $mount_path/log
sudo mkdir -p $mount_path/log/clickhouse
sudo chown clickhouse:clickhouse $mount_path/log/clickhouse
sudo chmod 755 $mount_path/log/clickhouse
sudo mkdir -p $mount_path/log/kafka
sudo chown kafka:hadoop $mount_path/log/kafka
sudo chmod 755 $mount_path/log/kafka
sudo mkdir -p $mount_path/log/kafka-rest-proxy
sudo chown kafka:hadoop $mount_path/log/kafka-rest-proxy
sudo chmod 755 $mount_path/log/kafka-rest-proxy
sudo mkdir -p $mount_path/log/kafka-schema-registry
sudo chown kafka:hadoop $mount_path/log/kafka-schema-registry
sudo chmod 755 $mount_path/log/kafka-schema-registry
sudo mkdir -p $mount_path/log/cruise-control
sudo chown kafka:hadoop $mount_path/log/cruise-control
sudo chmod 755 $mount_path/log/cruise-control
sudo mkdir -p $mount_path/log/doris
sudo chown doris:doris $mount_path/log/doris
sudo chmod 755 $mount_path/log/doris
sudo mkdir -p $mount_path/log/celeborn
sudo chown hadoop:hadoop $mount_path/log/celeborn
sudo chmod 755 $mount_path/log/celeborn
sudo mkdir -p $mount_path/log/flink
sudo chown flink:hadoop $mount_path/log/flink
sudo chmod 775 $mount_path/log/flink
sudo mkdir -p $mount_path/log/flume
sudo chown root:root $mount_path/log/flume
sudo chmod 755 $mount_path/log/flume
sudo mkdir -p $mount_path/log/gmetric
sudo chown root:root $mount_path/log/gmetric
sudo chmod 777 $mount_path/log/gmetric
sudo mkdir -p $mount_path/log/hadoop-hdfs
sudo chown hdfs:hadoop $mount_path/log/hadoop-hdfs
sudo chmod 755 $mount_path/log/hadoop-hdfs
sudo mkdir -p $mount_path/log/hbase
sudo chown hbase:hadoop $mount_path/log/hbase
sudo chmod 755 $mount_path/log/hbase
sudo mkdir -p $mount_path/log/hive
sudo chown root:root $mount_path/log/hive
sudo chmod 775 $mount_path/log/hive
sudo mkdir -p $mount_path/log/impala
sudo chown impala:hadoop $mount_path/log/impala
sudo chmod 755 $mount_path/log/impala
sudo mkdir -p $mount_path/log/jindodata
sudo chown root:root $mount_path/log/jindodata
sudo chmod 777 $mount_path/log/jindodata
sudo mkdir -p $mount_path/log/jindosdk
sudo chown root:root $mount_path/log/jindosdk
sudo chmod 777 $mount_path/log/jindosdk
sudo mkdir -p $mount_path/log/kyuubi
sudo chown kyuubi:hadoop $mount_path/log/kyuubi
sudo chmod 755 $mount_path/log/kyuubi
sudo mkdir -p $mount_path/log/presto
sudo chown presto:hadoop $mount_path/log/presto
sudo chmod 755 $mount_path/log/presto
sudo mkdir -p $mount_path/log/spark
sudo chown spark:hadoop $mount_path/log/spark
sudo chmod 755 $mount_path/log/spark
sudo mkdir -p $mount_path/log/sssd
sudo chown sssd:sssd $mount_path/log/sssd
sudo chmod 750 $mount_path/log/sssd
sudo mkdir -p $mount_path/log/starrocks
sudo chown starrocks:starrocks $mount_path/log/starrocks
sudo chmod 755 $mount_path/log/starrocks
sudo mkdir -p $mount_path/log/taihao_exporter
sudo chown taihao:taihao $mount_path/log/taihao_exporter
sudo chmod 755 $mount_path/log/taihao_exporter
sudo mkdir -p $mount_path/log/trino
sudo chown trino:hadoop $mount_path/log/trino
sudo chmod 755 $mount_path/log/trino
sudo mkdir -p $mount_path/log/yarn
sudo chown hadoop:hadoop $mount_path/log/yarn
sudo chmod 755 $mount_path/log/yarn
Data lake (Hadoop) clusters
while getopts p: opt
do
case "${opt}" in
p) mount_path=${OPTARG};;
esac
done
mkdir -p $mount_path/data
chown hdfs:hadoop $mount_path/data
chmod 1777 $mount_path/data
mkdir -p $mount_path/hadoop
chown hadoop:hadoop $mount_path/hadoop
chmod 775 $mount_path/hadoop
mkdir -p $mount_path/hdfs
chown hdfs:hadoop $mount_path/hdfs
chmod 755 $mount_path/hdfs
mkdir -p $mount_path/yarn
chown hadoop:hadoop $mount_path/yarn
chmod 755 $mount_path/yarn
mkdir -p $mount_path/kudu/master
chown kudu:hadoop $mount_path/kudu/master
chmod 755 $mount_path/kudu/master
mkdir -p $mount_path/kudu/tserver
chown kudu:hadoop $mount_path/kudu/tserver
chmod 755 $mount_path/kudu/tserver
mkdir -p $mount_path/log
chown hadoop:hadoop $mount_path/log
chmod 775 $mount_path/log
mkdir -p $mount_path/log/hadoop-hdfs
chown hdfs:hadoop $mount_path/log/hadoop-hdfs
chmod 775 $mount_path/log/hadoop-hdfs
mkdir -p $mount_path/log/hadoop-yarn
chown hadoop:hadoop $mount_path/log/hadoop-yarn
chmod 755 $mount_path/log/hadoop-yarn
mkdir -p $mount_path/log/hadoop-mapred
chown hadoop:hadoop $mount_path/log/hadoop-mapred
chmod 755 $mount_path/log/hadoop-mapred
mkdir -p $mount_path/log/kudu
chown kudu:hadoop $mount_path/log/kudu
chmod 755 $mount_path/log/kudu
mkdir -p $mount_path/run
chown hadoop:hadoop $mount_path/run
chmod 777 $mount_path/run
mkdir -p $mount_path/tmp
chown hadoop:hadoop $mount_path/tmp
chmod 777 $mount_path/tmp
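Run the following commands to execute the script file, which creates the service directories, and then delete the script. $file_path is the path of the script file.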
chmod +x $file_path
sudo $file_path -p $mount_path
rm $file_path
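After the script has run, you may want to spot-check that the directories have the expected owner and mode. The following is a minimal sketch, assuming GNU stat as found on typical Linux nodes. show_perms is a hypothetical helper.

```shell
# Hypothetical helper: print "mode owner:group path" for each argument.
show_perms() {
    stat -c '%a %U:%G %n' "$@"
}

# Usage on the node, for example:
#   show_perms "$mount_path"/hdfs "$mount_path"/log/hadoop-hdfs
```

Compare each line against the chown and chmod values in the script for your cluster type.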
Use the new disk.
In the EMR console, restart the services that run on the node, and check whether the disk works as expected.
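One way to check the disk from the node is to confirm that the file system reports a sensible usage figure. The following is a minimal sketch that relies on the POSIX output format of df -P. disk_usage_pct is a hypothetical helper, and what counts as a healthy value is for you to decide.

```shell
# Hypothetical helper: print the used-capacity percentage (without the % sign)
# of the file system that contains the given path.
disk_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage on the node:
#   echo "usage of $mount_path: $(disk_usage_pct "$mount_path")%"
```

A freshly mounted replacement disk should report a low figure that grows as the restarted services write data to it.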