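Template name
ACS-ECS-RescueUnreachableInstance-Linux: rescues the damaged Linux system disk of an ECS instance
Template description
When you use an ECS instance, several situations can damage its system disk: the instance is forcibly stopped or restarted, it goes down unexpectedly, a data disk is detached without /etc/fstab being updated, or the /etc/fstab or initrd file is lost or corrupted. An instance in this state may still be shown as Running in the ECS console, but its applications cannot be accessed, its network is unreachable, and no connection can be established through Workbench or SSH. If you can connect to the instance through VNC in the console, you will most likely see a system startup failure message. In that case, consider executing this template to rescue the damaged instance. The rescue process mounts the system disk of the damaged instance on a newly created temporary instance, runs a rescue script on the temporary instance, and finally mounts the repaired system disk back on the original instance.
Template type
Automation
Owner
Alibaba Cloud
Input parameters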
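Parameter | Description | Type | Required | Default | Constraints |
unreachableInstanceId | ID of the ECS instance to rescue | String | Yes | | |
credentialType | Type of login credential for the rescued instance | String | Yes | | KeyPairName or Password |
credentialValue | Login credential: the key pair name or the password to set, depending on credentialType | String | Yes | | |
imagePrefix | Prefix of the name of the image used to back up the ECS instance | String | No | OOSRescueBackup- | |
helperInstanceTypes | Candidate instance types for the temporary instance created during the rescue | List | No | ['ecs.t6-c4m1.large', 'ecs.t5-lc1m1.small', 'ecs.t5-lc1m2.small', 'ecs.s6-c1m1.small', 'ecs.s6-c1m2.small', 'ecs.n1.small', 'ecs.mn4.small', 'ecs.e3.small', 'ecs.e4.small', 'ecs.n2.small', 'ecs.n4.small', 'ecs.t5-lc1m2.large', 'ecs.c6.large', 'ecs.sn2ne.large'] | |
OOSAssumeRole | RAM role that OOS assumes | String | No | "" | |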
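As a rough illustration of how these inputs fit together, the following Python sketch assembles the parameters for one execution and serializes them to JSON. The instance ID and key pair name are placeholders, and `build_rescue_parameters` is a hypothetical helper rather than part of OOS. Only the parameter names, the two credential types, and the default image prefix come from this template.

```python
import json

# Allowed values of credentialType, as declared by the template.
ALLOWED_CREDENTIAL_TYPES = ("KeyPairName", "Password")


def build_rescue_parameters(instance_id, credential_type, credential_value,
                            image_prefix="OOSRescueBackup-"):
    """Return the template's input parameters as a dict (hypothetical helper)."""
    if credential_type not in ALLOWED_CREDENTIAL_TYPES:
        raise ValueError(f"credentialType must be one of {ALLOWED_CREDENTIAL_TYPES}")
    if not instance_id or not credential_value:
        raise ValueError("unreachableInstanceId and credentialValue are required")
    return {
        "unreachableInstanceId": instance_id,
        "credentialType": credential_type,
        "credentialValue": credential_value,
        "imagePrefix": image_prefix,
    }


# Placeholder values for illustration only.
params = build_rescue_parameters("i-exampleinstanceid", "KeyPairName", "my-key-pair")
print(json.dumps(params, indent=2))
```

helperInstanceTypes and OOSAssumeRole are left out of the sketch, so the defaults in the table above would apply to them.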
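Output parameters
Parameter | Description | Type |
diskId | ID of the rescued system disk | String |
imageId | ID of the backup image created before the rescue | String |
rtCommandOutput | Output of the rescue script run on the temporary instance | String |
finalHelperInstanceType | Instance type chosen for the temporary instance | String |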
Permission policy required to execute this template
{
    "Version": "1",
    "Statement": [
        {
            "Action": [
                "ecs:AddTags",
                "ecs:AttachDisk",
                "ecs:CreateImage",
                "ecs:DescribeAvailableResource",
                "ecs:DescribeDisks",
                "ecs:DescribeImages",
                "ecs:DescribeInstances",
                "ecs:DescribeInvocationResults",
                "ecs:DescribeInvocations",
                "ecs:DetachDisk",
                "ecs:RunCommand",
                "ecs:StartInstance",
                "ecs:StopInstance"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "ros:CreateStack",
                "ros:DeleteStack",
                "ros:GetStack"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
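Before running the template, it can help to confirm that the role's policy covers every action listed above. The Python sketch below shows one way to do that. It is a simplification: `missing_actions` is a hypothetical helper that looks only at the Action lists of Allow statements, treats `*` as a wildcard, compares names case-insensitively (an assumption), and ignores Resource, Condition, and Deny entries.

```python
import json
from fnmatch import fnmatchcase

# Actions copied from the policy in this section.
REQUIRED_ACTIONS = [
    "ecs:AddTags", "ecs:AttachDisk", "ecs:CreateImage",
    "ecs:DescribeAvailableResource", "ecs:DescribeDisks", "ecs:DescribeImages",
    "ecs:DescribeInstances", "ecs:DescribeInvocationResults",
    "ecs:DescribeInvocations", "ecs:DetachDisk", "ecs:RunCommand",
    "ecs:StartInstance", "ecs:StopInstance",
    "ros:CreateStack", "ros:DeleteStack", "ros:GetStack",
]


def missing_actions(policy, required):
    """Return the required actions that no Allow statement in `policy` matches."""
    allowed = []
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # a single action may be given as a string
            actions = [actions]
        allowed.extend(actions)
    return [
        action for action in required
        if not any(fnmatchcase(action.lower(), pattern.lower()) for pattern in allowed)
    ]


# Example policy that grants all ECS actions but only one ROS action.
example_policy = json.loads("""
{
  "Version": "1",
  "Statement": [
    {"Action": "ecs:*", "Resource": "*", "Effect": "Allow"},
    {"Action": ["ros:GetStack"], "Resource": "*", "Effect": "Allow"}
  ]
}
""")
print(missing_actions(example_policy, REQUIRED_ACTIONS))
```

With this example policy, the two ROS actions needed to create and delete the helper stack are reported as missing.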
Details
ACS-ECS-RescueUnreachableInstance-Linux details
Template content
FormatVersion: OOS-2019-06-01
Description:
  en: 'When you use an ECS instance, several situations can damage its system disk: the instance is forcibly stopped or restarted, it goes down unexpectedly, a data disk is detached without /etc/fstab being updated, or the /etc/fstab or initrd file is lost or corrupted. An instance in this state may still be shown as Running in the ECS console, but its applications cannot be accessed, its network is unreachable, and no connection can be established through Workbench or SSH. If you can connect to the instance through VNC in the console, you will most likely see a system startup failure message. In that case, consider executing this template to rescue the damaged instance. The rescue process mounts the system disk of the damaged instance on a newly created temporary instance, runs a rescue script on the temporary instance, and finally mounts the repaired system disk back on the original instance.'
  zh-cn: 使用ECS執行個體時,有些情況可能導致系統硬碟損傷,比如執行個體被強制地停止或重啟,抑或突然發生了宕機,以及資料盤被卸載後未更新/etc/fstab,甚至於/etc/fstab或initrd檔案丟失或損壞。當無法訪問執行個體時,該執行個體在ECS執行個體控制台顯示的狀態可能還是運行中,但執行個體內的應用不可訪問,執行個體內的網路不可達,更無法通過workbench或者ssh建立串連。如果您在控制台通過vnc能串連上執行個體,看到的頁面大概是系統啟動失敗的提示資訊。此時您可考慮執行該模板對損傷執行個體進行救治,救治流程主要是損傷的執行個體的系統硬碟將被掛載到新建立的臨時執行個體上,接著在臨時執行個體中會執行一段救治指令碼,最後救治過的系統硬碟將被掛載回原執行個體
  name-en: ACS-ECS-RescueUnreachableInstance-Linux
  name-zh-cn: 自助救治損傷的ECS執行個體Linux系統硬碟
  categories:
    - diagnose
Parameters:
  unreachableInstanceId:
    Label:
      en: UnreachableInstanceId
      zh-cn: 將自救的ECS執行個體ID
    Type: String
    AssociationProperty: ALIYUN::ECS::Instance::InstanceId
    AssociationPropertyMetadata:
      RegionId: '{{ ACS::RegionId }}'
  credentialType:
    Description:
      en: 'Credential type for the root user of the ECS instance after it is rescued. Choose KeyPairName or Password.'
      zh-cn: 當執行自救ECS執行個體後,執行個體root登入憑證類型,可選擇金鑰組或自訂密碼
    Label:
      en: CredentialType
      zh-cn: 登入憑證類型
    Type: String
    AllowedValues:
      - KeyPairName
      - Password
  credentialValue:
    Description:
      en: 'Credential value for the root user of the ECS instance after it is rescued: the key pair name or the password, depending on the credential type.'
      zh-cn: 當執行自救ECS執行個體後,執行個體root登入憑證值,如果憑證類型選擇金鑰組,則此處填寫金鑰組名稱;如果憑證類型選擇了自訂密碼,則此處填寫將設定的密碼
    Type: String
    Label:
      en: Credential
      zh-cn: 登入憑證
  imagePrefix:
    Label:
      en: ImagePrefix
      zh-cn: 用來備份ECS執行個體的鏡像名稱的首碼
    Type: String
    Default: OOSRescueBackup-
  helperInstanceTypes:
    Label:
      en: HelperInstanceTypes
      zh-cn: 自救執行過程中,被建立的臨時執行個體規格的選擇範圍
    Type: List
    Default:
      - ecs.t6-c4m1.large
      - ecs.t5-lc1m1.small
      - ecs.t5-lc1m2.small
      - ecs.s6-c1m1.small
      - ecs.s6-c1m2.small
      - ecs.n1.small
      - ecs.mn4.small
      - ecs.e3.small
      - ecs.e4.small
      - ecs.n2.small
      - ecs.n4.small
      - ecs.t5-lc1m2.large
      - ecs.c6.large
      - ecs.sn2ne.large
  OOSAssumeRole:
    Label:
      en: OOSAssumeRole
      zh-cn: OOS扮演的RAM角色
    Type: String
    Default: ''
RamRole: '{{ OOSAssumeRole }}'
Tasks:
  - Name: checkInstanceReady
    Action: 'ACS::CheckFor'
    Description:
      en: Checks that the ECS instance runs a Linux OS
      zh-cn: 確認將救治ECS執行個體為Linux系統的
    OnError: ACS::END
    Properties:
      Service: ECS
      API: DescribeInstances
      Parameters:
        InstanceIds:
          - '{{ unreachableInstanceId }}'
      DesiredValues:
        - linux
      PropertySelector: 'Instances.Instance[].OSType'
    Outputs:
      status:
        Type: String
        ValueSelector: 'Instances.Instance[].Status'
      vSwitchId:
        Type: String
        ValueSelector: 'Instances.Instance[].VpcAttributes.VSwitchId'
      zoneId:
        Type: String
        ValueSelector: 'Instances.Instance[].ZoneId'
      oSNameEn:
        Type: String
        ValueSelector: 'Instances.Instance[].OSNameEn'
      oSName:
        Type: String
        ValueSelector: 'Instances.Instance[].OSName'
      imageID:
        Type: String
        ValueSelector: 'Instances.Instance[].ImageId'
  - Name: querySystemDisks
    Action: 'ACS::CheckFor'
    Description:
      en: Checks the system disk of the ECS instance
      zh-cn: 檢查將救治的系統硬碟情況
    OnError: ACS::END
    Properties:
      Service: ECS
      API: DescribeDisks
      Parameters:
        InstanceId: '{{ unreachableInstanceId }}'
        DiskType: system
      DesiredValues:
        - 'true'
      PropertySelector: '.Disks.Disk[] as $disk|$disk.Category|startswith("cloud") and ($disk.Encrypted|not)|tostring'
    Outputs:
      diskId:
        Type: String
        ValueSelector: 'Disks.Disk[].DiskId'
      category:
        Type: String
        ValueSelector: 'Disks.Disk[].Category'
      encrypted:
        Type: String
        ValueSelector: 'Disks.Disk[].Encrypted'
  - Name: whetherStopInstance
    Action: 'ACS::Choice'
    Description:
      en: Chooses the next task based on the instance status
      zh-cn: 根據執行個體狀態選擇要執行的任務
    Properties:
      DefaultTask: stopInstance
      Choices:
        - When:
            'Fn::Equals':
              - Stopped
              - '{{ checkInstanceReady.status }}'
          NextTask: checkAvailableInstanceTypesExist
  - Name: stopInstance
    Action: 'ACS::ExecuteAPI'
    Description:
      en: Stops the ECS instance
      zh-cn: 停止執行個體
    Properties:
      Service: ECS
      API: StopInstance
      Parameters:
        InstanceId: '{{ unreachableInstanceId }}'
  - Name: Sleep3Minutes
    Description:
      en: Waits for the instance to stop
      zh-cn: 等待執行個體停止成功
    Action: 'ACS::Sleep'
    Properties:
      Duration: PT3M
  - Name: queryUnreachableInstanceStatus
    Action: 'ACS::ExecuteAPI'
    Description:
      en: Queries the status of the unreachable instance
      zh-cn: 查詢損傷系統硬碟的執行個體狀態
    Properties:
      Service: ECS
      API: DescribeInstances
      Parameters:
        InstanceIds:
          - '{{ unreachableInstanceId }}'
    Outputs:
      status:
        Type: String
        ValueSelector: 'Instances.Instance[].Status'
  - Name: whetherForceStopInstance
    Action: 'ACS::Choice'
    Description:
      en: Chooses the next task based on the instance status
      zh-cn: 根據執行個體狀態選擇要執行的任務
    Properties:
      DefaultTask: forceStopInstance
      Choices:
        - When:
            'Fn::Equals':
              - Stopped
              - '{{ queryUnreachableInstanceStatus.status }}'
          NextTask: checkAvailableInstanceTypesExist
  - Name: forceStopInstance
    Action: 'ACS::ExecuteAPI'
    Description:
      en: Stops the ECS instance forcibly
      zh-cn: 強制停止執行個體
    Properties:
      Service: ECS
      API: StopInstance
      Parameters:
        InstanceId: '{{ unreachableInstanceId }}'
        ForceStop: 'true'
  - Name: untilStopUnreachableInstanceSuccess
    Action: 'ACS::WaitFor'
    Description:
      en: Waits for the ECS instance to enter the Stopped status
      zh-cn: 等待執行個體停止
    Properties:
      Service: ECS
      API: DescribeInstances
      Parameters:
        InstanceIds:
          - '{{ unreachableInstanceId }}'
      DesiredValues:
        - Stopped
      PropertySelector: 'Instances.Instance[].Status'
  - Name: checkAvailableInstanceTypesExist
    Action: 'ACS::Template'
    OnError: ACS::END
    Description:
      en: Queries the instance types currently available for creating the helper instance in the zone of the unreachable instance
      zh-cn: 查詢將建立的目標臨時執行個體規格是否有庫存
    Properties:
      TemplateName: 'ACS::ECS::CheckAvailableInstanceTypes'
      Parameters:
        zoneId: '{{ checkInstanceReady.zoneId }}'
        instanceTypes: '{{ helperInstanceTypes }}'
    Outputs:
      availableInstanceType:
        Type: String
        ValueSelector: '.availableInstanceTypes[0]'
  - Name: createImage
    Action: 'ACS::ExecuteAPI'
    Description:
      en: Creates a custom image
      zh-cn: 建立一個自訂鏡像
    Properties:
      Service: ECS
      API: CreateImage
      Parameters:
        ImageName: '{{imagePrefix}}{{ ACS::ExecutionId }}'
        InstanceId: '{{ unreachableInstanceId }}'
        DetectionStrategy: Standard
        Tag:
          - Key: 'instance_to_rescue'
            Value: '{{unreachableInstanceId}}'
          - Key: 'oos_exec'
            Value: '{{ ACS::ExecutionId }}'
    Outputs:
      imageId:
        Type: String
        ValueSelector: ImageId
  - Name: createStack
    Action: 'ACS::ExecuteAPI'
    Description:
      en: Creates a ROS resource stack
      zh-cn: 建立Ros資源棧
    Properties:
      Service: ROS
      API: CreateStack
      Parameters:
        StackName: 'OOS-{{ACS::ExecutionId}}'
        TimeoutInMinutes: 10
        DisableRollback: false
        Parameters:
          - ParameterKey: helperInstanceType
            ParameterValue: '{{checkAvailableInstanceTypesExist.availableInstanceType}}'
          - ParameterKey: zoneId
            ParameterValue: '{{ checkInstanceReady.zoneId }}'
          - ParameterKey: resourcePrefix
            ParameterValue: 'OOS-{{ACS::ExecutionId}}'
          - ParameterKey: imageId
            ParameterValue: 'centos_8_0_x64_20G_alibase_20191225.vhd'
          - ParameterKey: instanceIdToRescue
            ParameterValue: '{{unreachableInstanceId}}'
          - ParameterKey: executionId
            ParameterValue: '{{ ACS::ExecutionId }}'
        TemplateURL: 'https://oos-debug.oss-cn-hangzhou.aliyuncs.com/ros_template.json'
    Outputs:
      StackId:
        Type: String
        ValueSelector: StackId
  - Name: untilImageReady
    Action: ACS::WaitFor
    Description:
      en: Waits for the image to become available
      zh-cn: 等待鏡像建立成功
    OnError: deleteStack
    Properties:
      Service: ECS
      API: DescribeImages
      Parameters:
        ImageId: '{{ createImage.imageId }}'
      DesiredValues:
        - Available
      PropertySelector: Images.Image[].Status
    Retries: 50
    Delay: 36
    DelayType: Constant
  - Name: untilStackReady
    Action: 'ACS::WaitFor'
    OnError: queryStackStatusReason
    OnSuccess: putRTToHelperInstance
    Description:
      en: Waits for the stack to reach the CREATE_COMPLETE status.
      zh-cn: 等待資源棧建立成功。
    Properties:
      Service: ROS
      API: GetStack
      Parameters:
        StackId: '{{createStack.StackId}}'
      DesiredValues:
        - CREATE_COMPLETE
      StopRetryValues:
        - CREATE_FAILED
        - CHECK_FAILED
        - ROLLBACK_FAILED
        - ROLLBACK_COMPLETE
        - CREATE_ROLLBACK_COMPLETE
      PropertySelector: Status
    Outputs:
      helperInstanceId:
        Type: String
        ValueSelector: 'Outputs[0].OutputValue'
      statusReason:
        Type: String
        ValueSelector: 'StatusReason'
  - Name: queryStackStatusReason
    Action: ACS::ExecuteAPI
    OnError: deleteStack
    OnSuccess: deleteStack
    Description:
      en: Queries the reason why the stack failed to be created.
      zh-cn: 查詢資源棧未建立成功的原因。
    Properties:
      Service: ROS
      API: GetStack
      Parameters:
        StackId: '{{createStack.StackId}}'
    Outputs:
      statusReason:
        Type: String
        ValueSelector: 'StatusReason'
  - Name: putRTToHelperInstance
    Action: 'ACS::ECS::RunCommand'
    OnError: deleteStack
    Description:
      en: Runs a Cloud Assistant command on the helper instance to download the rescue scripts
      zh-cn: 在執行個體中運行雲助手命令下載修複指令碼
    Properties:
      commandContent: 'cd /tmp ; wget https://oos-debug.oss-cn-hangzhou.aliyuncs.com/guestos-scripts-0.0.1.tar.gz; tar -zxvf guestos-scripts-0.0.1.tar.gz'
      commandType: RunShellScript
      instanceId: '{{ untilStackReady.helperInstanceId }}'
  - Name: addTags
    Action: ACS::ExecuteAPI
    OnError: deleteStack
    Description:
      en: Tags the instance to rescue with the ID of its system disk
      zh-cn: 給要救治的執行個體添加上其系統硬碟資訊的標籤
    Properties:
      Service: ECS
      API: AddTags
      Parameters:
        ResourceType: instance
        ResourceId: '{{ unreachableInstanceId }}'
        Tag:
          - Key: 'source_sys_disk'
            Value: '{{ querySystemDisks.diskId }}'
  - Name: detachDisk
    Action: 'ACS::ECS::DetachDisk'
    OnError: deleteStack
    Description:
      en: Detaches the system disk from the unreachable instance
      zh-cn: 卸載有損傷的系統硬碟
    Properties:
      instanceId: '{{ unreachableInstanceId }}'
      diskId: '{{ querySystemDisks.diskId }}'
  - Name: attachAsDataDisk
    Action: 'ACS::ECS::AttachDisk'
    OnError: deleteStack
    Description:
      en: Attaches the system disk to the helper instance as a data disk
      zh-cn: 將損傷的系統硬碟作為資料盤掛載到臨時執行個體上
    Properties:
      instanceId: '{{ untilStackReady.helperInstanceId }}'
      diskId: '{{ querySystemDisks.diskId }}'
  - Name: runCommand
    Action: 'ACS::ECS::RunCommand'
    OnError: deleteStack
    Description:
      en: Runs a Cloud Assistant command on the helper instance to rescue the disk
      zh-cn: 在執行個體中通過雲助手運行救治損傷盤的指令碼
    Properties:
      commandContent: cd /tmp/guestos-scripts-0.0.1;./rescue_system_disk.sh
      commandType: RunShellScript
      instanceId: '{{ untilStackReady.helperInstanceId }}'
    Outputs:
      commandOutput:
        Type: String
        ValueSelector: invocationOutput
  - Name: forceStopHelperInstance
    Action: 'ACS::ExecuteAPI'
    OnError: deleteStack
    Description:
      en: Stops the helper instance forcibly
      zh-cn: 強制停止執行個體
    Properties:
      Service: ECS
      API: StopInstance
      Parameters:
        InstanceId: '{{ untilStackReady.helperInstanceId }}'
        ForceStop: 'true'
  - Name: untilforceStopHelperInstanceSuccess
    Action: 'ACS::WaitFor'
    OnError: deleteStack
    Description:
      en: Waits for the helper instance to enter the Stopped status
      zh-cn: 等待臨時執行個體停止
    Properties:
      Service: ECS
      API: DescribeInstances
      Parameters:
        InstanceIds:
          - '{{ untilStackReady.helperInstanceId }}'
      DesiredValues:
        - Stopped
      PropertySelector: 'Instances.Instance[].Status'
  - Name: detachHelperInstanceDataDisk
    Action: 'ACS::ECS::DetachDisk'
    OnError: deleteStack
    Description:
      en: Detaches the data disk from the helper instance
      zh-cn: 卸載臨時執行個體的資料盤
    Properties:
      instanceId: '{{ untilStackReady.helperInstanceId }}'
      diskId: '{{ querySystemDisks.diskId }}'
  - Name: untilUnreachableInstanceSystemDiskAvailable
    Action: 'ACS::WaitFor'
    OnError: 'ACS::NEXT'
    Description:
      en: Waits for the disk to be detached
      zh-cn: 等待磁碟卸載成功
    Properties:
      Service: ECS
      API: DescribeDisks
      Parameters:
        DiskIds:
          - '{{ querySystemDisks.diskId }}'
      DesiredValues:
        - Available
      PropertySelector: 'Disks.Disk[].Status'
  - Name: deleteStack
    Action: 'ACS::ExecuteAPI'
    OnError: 'ACS::NEXT'
    Description:
      en: Deletes the ROS resource stack
      zh-cn: 刪除Ros資源棧
    Properties:
      Service: ROS
      API: DeleteStack
      Parameters:
        StackId: '{{createStack.StackId}}'
  - Name: untilStackDeleted
    Action: 'ACS::WaitFor'
    OnError: 'ACS::NEXT'
    Description:
      en: Waits for the ROS stack to reach the DELETE_COMPLETE status
      zh-cn: 等待Ros資源棧至刪除成功
    Properties:
      Service: ROS
      API: GetStack
      Parameters:
        StackId: '{{createStack.StackId}}'
      DesiredValues:
        - DELETE_COMPLETE
      StopRetryValues:
        - DELETE_FAILED
        - CHECK_FAILED
      PropertySelector: Status
  - Name: checkForUnreachableInstanceSystemDiskAvailable
    Action: 'ACS::CheckFor'
    OnError: 'ACS::END'
    Description:
      en: Checks that the disk has been detached and can be attached again
      zh-cn: 檢查損傷的系統硬碟是否可掛載
    Properties:
      Service: ECS
      API: DescribeDisks
      Parameters:
        DiskIds:
          - '{{ querySystemDisks.diskId }}'
      DesiredValues:
        - Available
      PropertySelector: 'Disks.Disk[].Status'
  - Name: whetherCredentialTypeIsKeyPairName
    Action: 'ACS::Choice'
    OnError: 'ACS::NEXT'
    Description:
      en: Chooses the next task based on the credential type input
      zh-cn: 根據輸入的登入憑證類型確定後續任務
    Properties:
      DefaultTask: attachAsSysDiskWithKeyPairName
      Choices:
        - When:
            'Fn::Equals':
              - Password
              - '{{ credentialType }}'
          NextTask: attachAsSysDisk
  - Name: attachAsSysDiskWithKeyPairName
    Action: 'ACS::ExecuteAPI'
    OnSuccess: untilDiskAttached
    OnError: 'ACS::NEXT'
    Description:
      en: Attaches the source system disk to the unreachable instance and sets a KeyPairName credential for root
      zh-cn: 將救治過的損傷系統硬碟掛回原執行個體,並且為root設定金鑰組形式的登入憑證
    Properties:
      Service: ECS
      API: AttachDisk
      Parameters:
        DiskId: '{{ querySystemDisks.diskId }}'
        InstanceId: '{{ unreachableInstanceId }}'
        Bootable: 'true'
        KeyPairName: '{{credentialValue}}'
  - Name: attachAsSysDisk
    Action: 'ACS::ExecuteAPI'
    OnError: 'ACS::NEXT'
    Description:
      en: Attaches the source system disk to the unreachable instance and sets a Password credential for root
      zh-cn: 將救治過的損傷系統硬碟掛回原執行個體,並且為root設定自訂密碼形式的登入憑證
    Properties:
      Service: ECS
      API: AttachDisk
      Parameters:
        DiskId: '{{ querySystemDisks.diskId }}'
        InstanceId: '{{ unreachableInstanceId }}'
        Bootable: 'true'
        Password: '{{credentialValue}}'
  - Name: untilDiskAttached
    Action: 'ACS::WaitFor'
    OnError: 'ACS::NEXT'
    Description:
      en: Waits for the system disk to be attached
      zh-cn: 等待系統硬碟掛回原執行個體成功
    Retries: 7
    Properties:
      Service: ECS
      API: DescribeDisks
      Parameters:
        DiskIds:
          - '{{ querySystemDisks.diskId }}'
      DesiredValues:
        - In_use
      PropertySelector: 'Disks.Disk[].Status'
  - Name: whetherStartUnreachableInstance
    Action: 'ACS::Choice'
    OnError: 'ACS::NEXT'
    Description:
      en: Chooses the next task based on the original instance status
      zh-cn: 根據執行個體初始狀態選擇後續任務
    Properties:
      DefaultTask: ACS::END
      Choices:
        - When:
            'Fn::Equals':
              - Running
              - '{{ checkInstanceReady.status }}'
          NextTask: startUnreachableInstance
  - Name: startUnreachableInstance
    Action: 'ACS::ECS::StartInstance'
    Description:
      en: Starts the unreachable instance
      zh-cn: 啟動被救治的執行個體
    Properties:
      instanceId: '{{ unreachableInstanceId }}'
Outputs:
  diskId:
    Type: String
    Value: '{{ querySystemDisks.diskId }}'
  imageId:
    Type: String
    Value: '{{ createImage.imageId }}'
  rtCommandOutput:
    Type: String
    Value: '{{ runCommand.commandOutput }}'
  finalHelperInstanceType:
    Type: String
    Value: '{{checkAvailableInstanceTypesExist.availableInstanceType}}'
Metadata:
  ALIYUN::OOS::Interface:
    ParameterGroups:
      - Parameters:
          - credentialType
          - credentialValue
          - imagePrefix
          - helperInstanceTypes
        Label:
          default:
            zh-cn: 設定參數
            en: Configure Parameters
      - Parameters:
          - unreachableInstanceId
        Label:
          default:
            zh-cn: 選擇執行個體
            en: Select ECS Instance
      - Parameters:
          - OOSAssumeRole
        Label:
          default:
            zh-cn: 進階選項
            en: Control Options
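
The querySystemDisks task in the template accepts a system disk only if its selector returns 'true', that is, the disk category starts with "cloud" and the disk is not encrypted. The Python sketch below restates that condition for readers less familiar with the selector syntax. The response shape follows the selector's own paths (Disks.Disk[]), while the function names and the sample disks are made up for illustration.

```python
def system_disk_is_rescuable(disk):
    """Restate the querySystemDisks condition for a single disk entry."""
    is_cloud_disk = disk["Category"].startswith("cloud")
    # A missing or null Encrypted field counts as "not encrypted".
    is_encrypted = bool(disk.get("Encrypted"))
    return is_cloud_disk and not is_encrypted


def selector_results(response):
    """Return one 'true'/'false' string per disk, like the selector's tostring."""
    return [
        str(system_disk_is_rescuable(disk)).lower()
        for disk in response["Disks"]["Disk"]
    ]


# Made-up sample responses for illustration only.
plain_cloud_disk = {"Disks": {"Disk": [
    {"DiskId": "d-example1", "Category": "cloud_efficiency", "Encrypted": False},
]}}
encrypted_disk = {"Disks": {"Disk": [
    {"DiskId": "d-example2", "Category": "cloud_ssd", "Encrypted": True},
]}}
local_disk = {"Disks": {"Disk": [
    {"DiskId": "d-example3", "Category": "ephemeral_ssd", "Encrypted": False},
]}}

for sample in (plain_cloud_disk, encrypted_disk, local_disk):
    print(selector_results(sample))
```

Only the first sample would pass the check. For the other two, the template ends at querySystemDisks (OnError: ACS::END) before the instance is stopped or modified.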