Template name
ACS-ECS-RescueUnreachableInstance-Linux
Template description
When you use an Elastic Compute Service (ECS) instance, it may become unavailable due to several causes. One possible cause is system disk damage, which can occur if the instance is forcibly stopped or restarted, or experiences a sudden breakdown. Other possible causes include failing to update the /etc/fstab file after a data disk is detached from the instance, or loss or corruption of the /etc/fstab or initrd file. Even if the instance becomes unavailable, it may still appear as Running in the ECS console. However, you may experience the following issues: failure to access applications on the instance, disconnection from the network of the instance, and failure to connect to the instance by using Workbench or SSH. If you can establish a Virtual Network Computing (VNC) connection to the instance in the ECS console, you may encounter a message indicating a system startup failure. In this case, you can execute this template to repair the damaged instance. The repair process involves attaching the system disk of the damaged instance to a temporary instance. A repair script is then run on the temporary instance. Finally, the repaired system disk is reattached to the original instance.
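The stop and restart branching that the template content applies around this repair process can be modeled in a few lines of Python. This is only an illustrative sketch of the Choice tasks in the template content (whetherStopInstance, whetherForceStopInstance, and whetherStartUnreachableInstance), not OOS code; the function names `plan_stop_steps` and `should_restart` are hypothetical.

```python
from typing import List


def plan_stop_steps(initial_status: str, status_after_wait: str) -> List[str]:
    """Return the stop-related tasks that would run, in order.

    initial_status models checkInstanceReady.status, and status_after_wait
    models queryUnreachableInstanceStatus.status (read after the 3-minute sleep).
    """
    if initial_status == "Stopped":
        # whetherStopInstance jumps straight to checkAvailableInstanceTypesExist.
        return []
    steps = ["stopInstance", "Sleep3Minutes", "queryUnreachableInstanceStatus"]
    if status_after_wait != "Stopped":
        # whetherForceStopInstance falls through to its default task.
        steps += ["forceStopInstance", "untilStopUnreachableInstanceSuccess"]
    return steps


def should_restart(initial_status: str) -> bool:
    """whetherStartUnreachableInstance starts only instances that were Running."""
    return initial_status == "Running"


if __name__ == "__main__":
    print(plan_stop_steps("Running", "Stopped"))
    print(should_restart("Stopped"))
```

The restart decision uses the status recorded at the start of the execution, so an instance that was already Stopped remains stopped after its system disk is reattached.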
Template type
Automated
Owner
Alibaba Cloud
Input parameters
Parameter | Description | Type | Required | Default value | Limit |
unreachableInstanceId | The ID of the ECS instance to be repaired. | String | Yes | ||
credentialType | The type of the credential that is used to log on to the ECS instance. | String | Yes | | Valid values: KeyPairName and Password. |
credentialValue | The credential that is used to log on to the ECS instance. | String | Yes | ||
imagePrefix | The name prefix of the image that is used to back up data of the ECS instance. | String | No | OOSRescueBackup- | |
helperInstanceTypes | The instance types that are available to the temporary ECS instance that is used to repair the damaged ECS instance. | List | No | ['ecs.t6-c4m1.large', 'ecs.t5-lc1m1.small', 'ecs.t5-lc1m2.small', 'ecs.s6-c1m1.small', 'ecs.s6-c1m2.small', 'ecs.n1.small', 'ecs.mn4.small', 'ecs.e3.small', 'ecs.e4.small', 'ecs.n2.small', 'ecs.n4.small', 'ecs.t5-lc1m2.large', 'ecs.c6.large', 'ecs.sn2ne.large'] | |
OOSAssumeRole | The Resource Access Management (RAM) role that is assumed by CloudOps Orchestration Service (OOS). | String | No | "" | |
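As a rough illustration of how the input parameters fit together, the following Python sketch assembles them into a dictionary and rejects a credentialType that the template does not allow. The helper name `build_parameters` and the instance ID `i-example` are hypothetical, and how the resulting dictionary is submitted to OOS (console, CLI, or SDK) is outside the scope of this sketch.

```python
ALLOWED_CREDENTIAL_TYPES = {"KeyPairName", "Password"}


def build_parameters(instance_id, credential_type, credential_value,
                     image_prefix="OOSRescueBackup-"):
    """Build the input parameters of the template as a plain dictionary."""
    if not instance_id:
        raise ValueError("unreachableInstanceId is required")
    if credential_type not in ALLOWED_CREDENTIAL_TYPES:
        raise ValueError(
            "credentialType must be one of %s" % sorted(ALLOWED_CREDENTIAL_TYPES)
        )
    if not credential_value:
        raise ValueError("credentialValue is required")
    # helperInstanceTypes and OOSAssumeRole are omitted so that the
    # defaults listed in the table apply.
    return {
        "unreachableInstanceId": instance_id,
        "credentialType": credential_type,
        "credentialValue": credential_value,
        "imagePrefix": image_prefix,
    }


if __name__ == "__main__":
    # "i-example" and "my-key-pair" are placeholders, not real resources.
    print(build_parameters("i-example", "KeyPairName", "my-key-pair"))
```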
Output parameters
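Parameter | Description | Type |
diskId | The ID of the system disk of the repaired instance. | String |
imageId | The ID of the backup image that is created from the instance before the repair. | String |
rtCommandOutput | The output of the repair script that is run on the temporary instance. | String |
finalHelperInstanceType | The instance type that is selected for the temporary instance. | String |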
Permission policy that is required to execute the template
{
  "Version": "1",
  "Statement": [
    {
      "Action": [
        "ecs:AddTags",
        "ecs:AttachDisk",
        "ecs:CreateImage",
        "ecs:DescribeAvailableResource",
        "ecs:DescribeDisks",
        "ecs:DescribeImages",
        "ecs:DescribeInstances",
        "ecs:DescribeInvocationResults",
        "ecs:DescribeInvocations",
        "ecs:DetachDisk",
        "ecs:RunCommand",
        "ecs:StartInstance",
        "ecs:StopInstance"
      ],
      "Resource": "*",
      "Effect": "Allow"
    },
    {
      "Action": [
        "ros:CreateStack",
        "ros:DeleteStack",
        "ros:GetStack"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
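Before you execute the template, it can be useful to confirm that a policy document grants every action listed above. The following Python sketch does this with exact string matching. It is a simplification under stated assumptions: it does not expand wildcards such as `ecs:*`, does not evaluate Deny statements, and ignores Resource scoping. The function name `missing_actions` is hypothetical.

```python
REQUIRED_ACTIONS = {
    "ecs:AddTags", "ecs:AttachDisk", "ecs:CreateImage",
    "ecs:DescribeAvailableResource", "ecs:DescribeDisks",
    "ecs:DescribeImages", "ecs:DescribeInstances",
    "ecs:DescribeInvocationResults", "ecs:DescribeInvocations",
    "ecs:DetachDisk", "ecs:RunCommand", "ecs:StartInstance",
    "ecs:StopInstance",
    "ros:CreateStack", "ros:DeleteStack", "ros:GetStack",
}


def missing_actions(policy, required=REQUIRED_ACTIONS):
    """Return the required actions that no Allow statement lists verbatim."""
    granted = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            # A single action may be given as a string instead of a list.
            actions = [actions]
        granted.update(actions)
    return set(required) - granted


if __name__ == "__main__":
    ros_only = {
        "Version": "1",
        "Statement": [{
            "Action": ["ros:CreateStack", "ros:DeleteStack", "ros:GetStack"],
            "Resource": "*",
            "Effect": "Allow",
        }],
    }
    # Prints the ECS actions that this partial policy does not grant.
    print(sorted(missing_actions(ros_only)))
```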
References
For more information, see ACS-ECS-RescueUnreachableInstance-Linux.yml on GitHub.
Template content
FormatVersion: OOS-2019-06-01
Description:
  en: 'System disk damage can make an ECS instance unreachable. It can occur when the instance is forcibly stopped or restarted or goes down unexpectedly, when /etc/fstab is not updated after a disk is detached, or when the /etc/fstab or initrd file is lost or corrupted. An unreachable instance may still be displayed as Running in the ECS console, but its applications and network cannot be reached and no connection can be established by using Workbench or SSH. If you connect to the instance from the console by using VNC, you will probably see a page that indicates a system startup failure. In this case, you can execute this template to repair the damaged instance. The system disk of the damaged instance is attached to a newly created temporary instance, a repair script is run on the temporary instance, and the repaired system disk is then attached back to the original instance.'
  zh-cn: the description in Chinese
  name-en: ACS-ECS-RescueUnreachableInstance-Linux
  name-zh-cn: the description in Chinese
  categories:
    - diagnose
Parameters:
  unreachableInstanceId:
    Label:
      en: UnreachableInstanceId
      zh-cn: the description in Chinese
    Type: String
    AssociationProperty: ALIYUN::ECS::Instance::InstanceId
    AssociationPropertyMetadata:
      RegionId: '{{ ACS::RegionId }}'
  credentialType:
    Description:
      en: 'The credential type that is set for the unreachable ECS instance after it is rescued. Valid values: KeyPairName and Password.'
      zh-cn: the description in Chinese
    Label:
      en: CredentialType
      zh-cn: the description in Chinese
    Type: String
    AllowedValues:
      - KeyPairName
      - Password
  credentialValue:
    Description:
      en: 'The credential value that is set for the unreachable ECS instance after it is rescued: the key pair name or the password, depending on the credential type.'
      zh-cn: the description in Chinese
    Type: String
    Label:
      en: Credential
      zh-cn: the description in Chinese
  imagePrefix:
    Label:
      en: ImagePrefix
      zh-cn: the description in Chinese
    Type: String
    Default: OOSRescueBackup-
  helperInstanceTypes:
    Label:
      en: HelperInstanceTypes
      zh-cn: the description in Chinese
    Type: List
    Default:
      - ecs.t6-c4m1.large
      - ecs.t5-lc1m1.small
      - ecs.t5-lc1m2.small
      - ecs.s6-c1m1.small
      - ecs.s6-c1m2.small
      - ecs.n1.small
      - ecs.mn4.small
      - ecs.e3.small
      - ecs.e4.small
      - ecs.n2.small
      - ecs.n4.small
      - ecs.t5-lc1m2.large
      - ecs.c6.large
      - ecs.sn2ne.large
  OOSAssumeRole:
    Label:
      en: OOSAssumeRole
      zh-cn: the description in Chinese
    Type: String
    Default: ''
RamRole: '{{ OOSAssumeRole }}'
Tasks:
  - Name: checkInstanceReady
    Action: 'ACS::CheckFor'
    Description:
      en: Checks whether the ECS instance runs Linux
      zh-cn: the description in Chinese
    OnError: ACS::END
    Properties:
      Service: ECS
      API: DescribeInstances
      Parameters:
        InstanceIds:
          - '{{ unreachableInstanceId }}'
      DesiredValues:
        - linux
      PropertySelector: 'Instances.Instance[].OSType'
    Outputs:
      status:
        Type: String
        ValueSelector: 'Instances.Instance[].Status'
      vSwitchId:
        Type: String
        ValueSelector: 'Instances.Instance[].VpcAttributes.VSwitchId'
      zoneId:
        Type: String
        ValueSelector: 'Instances.Instance[].ZoneId'
      oSNameEn:
        Type: String
        ValueSelector: 'Instances.Instance[].OSNameEn'
      oSName:
        Type: String
        ValueSelector: 'Instances.Instance[].OSName'
      imageID:
        Type: String
        ValueSelector: 'Instances.Instance[].ImageId'
  - Name: querySystemDisks
    Action: 'ACS::CheckFor'
    Description:
      en: Checks the system disk of the ECS instance
      zh-cn: the description in Chinese
    OnError: ACS::END
    Properties:
      Service: ECS
      API: DescribeDisks
      Parameters:
        InstanceId: '{{ unreachableInstanceId }}'
        DiskType: system
      DesiredValues:
        - 'true'
      PropertySelector: '.Disks.Disk[] as $disk|$disk.Category|startswith("cloud") and ($disk.Encrypted|not)|tostring'
    Outputs:
      diskId:
        Type: String
        ValueSelector: 'Disks.Disk[].DiskId'
      category:
        Type: String
        ValueSelector: 'Disks.Disk[].Category'
      encrypted:
        Type: String
        ValueSelector: 'Disks.Disk[].Encrypted'
  - Name: whetherStopInstance
    Action: 'ACS::Choice'
    Description:
      en: Chooses the next task based on the instance status
      zh-cn: the description in Chinese
    Properties:
      DefaultTask: stopInstance
      Choices:
        - When:
            'Fn::Equals':
              - Stopped
              - '{{ checkInstanceReady.status }}'
          NextTask: checkAvailableInstanceTypesExist
  - Name: stopInstance
    Action: 'ACS::ExecuteAPI'
    Description:
      en: Stops the ECS instance
      zh-cn: the description in Chinese
    Properties:
      Service: ECS
      API: StopInstance
      Parameters:
        InstanceId: '{{ unreachableInstanceId }}'
  - Name: Sleep3Minutes
    Description:
      en: Waits for the instance to stop
      zh-cn: the description in Chinese
    Action: 'ACS::Sleep'
    Properties:
      Duration: PT3M
  - Name: queryUnreachableInstanceStatus
    Action: 'ACS::ExecuteAPI'
    Description:
      en: Queries the status of the unreachable instance
      zh-cn: the description in Chinese
    Properties:
      Service: ECS
      API: DescribeInstances
      Parameters:
        InstanceIds:
          - '{{ unreachableInstanceId }}'
    Outputs:
      status:
        Type: String
        ValueSelector: 'Instances.Instance[].Status'
  - Name: whetherForceStopInstance
    Action: 'ACS::Choice'
    Description:
      en: Chooses the next task based on the instance status
      zh-cn: the description in Chinese
    Properties:
      DefaultTask: forceStopInstance
      Choices:
        - When:
            'Fn::Equals':
              - Stopped
              - '{{ queryUnreachableInstanceStatus.status }}'
          NextTask: checkAvailableInstanceTypesExist
  - Name: forceStopInstance
    Action: 'ACS::ExecuteAPI'
    Description:
      en: Forcibly stops the ECS instance
      zh-cn: the description in Chinese
    Properties:
      Service: ECS
      API: StopInstance
      Parameters:
        InstanceId: '{{ unreachableInstanceId }}'
        ForceStop: 'true'
  - Name: untilStopUnreachableInstanceSuccess
    Action: 'ACS::WaitFor'
    Description:
      en: Waits for the ECS instance to enter the Stopped state
      zh-cn: the description in Chinese
    Properties:
      Service: ECS
      API: DescribeInstances
      Parameters:
        InstanceIds:
          - '{{ unreachableInstanceId }}'
      DesiredValues:
        - Stopped
      PropertySelector: 'Instances.Instance[].Status'
  - Name: checkAvailableInstanceTypesExist
    Action: 'ACS::Template'
    OnError: ACS::END
    Description:
      en: Queries the instance types that are currently available for creating the helper instance in the zone of the unreachable instance
      zh-cn: the description in Chinese
    Properties:
      TemplateName: 'ACS::ECS::CheckAvailableInstanceTypes'
      Parameters:
        zoneId: '{{ checkInstanceReady.zoneId }}'
        instanceTypes: '{{ helperInstanceTypes }}'
    Outputs:
      availableInstanceType:
        Type: String
        ValueSelector: '.availableInstanceTypes[0]'
  - Name: createImage
    Action: 'ACS::ExecuteAPI'
    Description:
      en: Creates a custom image
      zh-cn: the description in Chinese
    Properties:
      Service: ECS
      API: CreateImage
      Parameters:
        ImageName: '{{imagePrefix}}{{ ACS::ExecutionId }}'
        InstanceId: '{{ unreachableInstanceId }}'
        DetectionStrategy: Standard
        Tag:
          - Key: 'instance_to_rescue'
            Value: '{{unreachableInstanceId}}'
          - Key: 'oos_exec'
            Value: '{{ ACS::ExecutionId }}'
    Outputs:
      imageId:
        Type: String
        ValueSelector: ImageId
  - Name: createStack
    Action: 'ACS::ExecuteAPI'
    Description:
      en: Creates a ROS resource stack
      zh-cn: the description in Chinese
    Properties:
      Service: ROS
      API: CreateStack
      Parameters:
        StackName: 'OOS-{{ACS::ExecutionId}}'
        TimeoutInMinutes: 10
        DisableRollback: false
        Parameters:
          - ParameterKey: helperInstanceType
            ParameterValue: '{{checkAvailableInstanceTypesExist.availableInstanceType}}'
          - ParameterKey: zoneId
            ParameterValue: '{{ checkInstanceReady.zoneId }}'
          - ParameterKey: resourcePrefix
            ParameterValue: 'OOS-{{ACS::ExecutionId}}'
          - ParameterKey: imageId
            ParameterValue: 'centos_8_0_x64_20G_alibase_20191225.vhd'
          - ParameterKey: instanceIdToRescue
            ParameterValue: '{{unreachableInstanceId}}'
          - ParameterKey: executionId
            ParameterValue: '{{ ACS::ExecutionId }}'
        TemplateURL: 'https://oos-debug.oss-cn-hangzhou.aliyuncs.com/ros_template.json'
    Outputs:
      StackId:
        Type: String
        ValueSelector: StackId
  - Name: untilImageReady
    Action: ACS::WaitFor
    Description:
      en: Waits for the image to become available
      zh-cn: the description in Chinese
    OnError: deleteStack
    Properties:
      Service: ECS
      API: DescribeImages
      Parameters:
        ImageId: '{{ createImage.imageId }}'
      DesiredValues:
        - Available
      PropertySelector: Images.Image[].Status
    Retries: 50
    Delay: 36
    DelayType: Constant
  - Name: untilStackReady
    Action: 'ACS::WaitFor'
    OnError: queryStackStatusReason
    OnSuccess: putRTToHelperInstance
    Description:
      en: Waits for the stack to enter the CREATE_COMPLETE state
      zh-cn: the description in Chinese
    Properties:
      Service: ROS
      API: GetStack
      Parameters:
        StackId: '{{createStack.StackId}}'
      DesiredValues:
        - CREATE_COMPLETE
      StopRetryValues:
        - CREATE_FAILED
        - CHECK_FAILED
        - ROLLBACK_FAILED
        - ROLLBACK_COMPLETE
        - CREATE_ROLLBACK_COMPLETE
      PropertySelector: Status
    Outputs:
      helperInstanceId:
        Type: String
        ValueSelector: 'Outputs[0].OutputValue'
      statusReason:
        Type: String
        ValueSelector: 'StatusReason'
  - Name: queryStackStatusReason
    Action: ACS::ExecuteAPI
    OnError: deleteStack
    OnSuccess: deleteStack
    Description:
      en: Queries the reason why the stack failed to be created
      zh-cn: the description in Chinese
    Properties:
      Service: ROS
      API: GetStack
      Parameters:
        StackId: '{{createStack.StackId}}'
    Outputs:
      statusReason:
        Type: String
        ValueSelector: 'StatusReason'
  - Name: putRTToHelperInstance
    Action: 'ACS::ECS::RunCommand'
    OnError: deleteStack
    Description:
      en: Runs a Cloud Assistant command on the helper instance to download rt
      zh-cn: the description in Chinese
    Properties:
      commandContent: 'cd /tmp ; wget https://oos-debug.oss-cn-hangzhou.aliyuncs.com/guestos-scripts-0.0.1.tar.gz; tar -zxvf guestos-scripts-0.0.1.tar.gz'
      commandType: RunShellScript
      instanceId: '{{ untilStackReady.helperInstanceId }}'
  - Name: addTags
    Action: ACS::ExecuteAPI
    OnError: deleteStack
    Description:
      en: Adds a tag that records the system disk ID to the instance to be rescued
      zh-cn: the description in Chinese
    Properties:
      Service: ECS
      API: AddTags
      Parameters:
        ResourceType: instance
        ResourceId: '{{ unreachableInstanceId }}'
        Tag:
          - Key: 'source_sys_disk'
            Value: '{{ querySystemDisks.diskId }}'
  - Name: detachDisk
    Action: 'ACS::ECS::DetachDisk'
    OnError: deleteStack
    Description:
      en: Detaches the system disk from the unreachable instance
      zh-cn: the description in Chinese
    Properties:
      instanceId: '{{ unreachableInstanceId }}'
      diskId: '{{ querySystemDisks.diskId }}'
  - Name: attachAsDataDisk
    Action: 'ACS::ECS::AttachDisk'
    OnError: deleteStack
    Description:
      en: Attaches the system disk to the helper instance as a data disk
      zh-cn: the description in Chinese
    Properties:
      instanceId: '{{ untilStackReady.helperInstanceId }}'
      diskId: '{{ querySystemDisks.diskId }}'
  - Name: runCommand
    Action: 'ACS::ECS::RunCommand'
    OnError: deleteStack
    Description:
      en: Runs a Cloud Assistant command on the helper instance to rescue the disk
      zh-cn: the description in Chinese
    Properties:
      commandContent: cd /tmp/guestos-scripts-0.0.1;./rescue_system_disk.sh
      commandType: RunShellScript
      instanceId: '{{ untilStackReady.helperInstanceId }}'
    Outputs:
      commandOutput:
        Type: String
        ValueSelector: invocationOutput
  - Name: forceStopHelperInstance
    Action: 'ACS::ExecuteAPI'
    OnError: deleteStack
    Description:
      en: Forcibly stops the helper instance
      zh-cn: the description in Chinese
    Properties:
      Service: ECS
      API: StopInstance
      Parameters:
        InstanceId: '{{ untilStackReady.helperInstanceId }}'
        ForceStop: 'true'
  - Name: untilforceStopHelperInstanceSuccess
    Action: 'ACS::WaitFor'
    OnError: deleteStack
    Description:
      en: Waits for the helper instance to enter the Stopped state
      zh-cn: the description in Chinese
    Properties:
      Service: ECS
      API: DescribeInstances
      Parameters:
        InstanceIds:
          - '{{ untilStackReady.helperInstanceId }}'
      DesiredValues:
        - Stopped
      PropertySelector: 'Instances.Instance[].Status'
  - Name: detachHelperInstanceDataDisk
    Action: 'ACS::ECS::DetachDisk'
    OnError: deleteStack
    Description:
      en: Detaches the data disk from the helper instance
      zh-cn: the description in Chinese
    Properties:
      instanceId: '{{ untilStackReady.helperInstanceId }}'
      diskId: '{{ querySystemDisks.diskId }}'
  - Name: untilUnreachableInstanceSystemDiskAvailable
    Action: 'ACS::WaitFor'
    OnError: 'ACS::NEXT'
    Description:
      en: Waits for the disk to be detached
      zh-cn: the description in Chinese
    Properties:
      Service: ECS
      API: DescribeDisks
      Parameters:
        DiskIds:
          - '{{ querySystemDisks.diskId }}'
      DesiredValues:
        - Available
      PropertySelector: 'Disks.Disk[].Status'
  - Name: deleteStack
    Action: 'ACS::ExecuteApi'
    OnError: 'ACS::NEXT'
    Description:
      en: Deletes the ROS resource stack
      zh-cn: the description in Chinese
    Properties:
      Service: ROS
      API: DeleteStack
      Parameters:
        StackId: '{{createStack.StackId}}'
  - Name: untilStackDeleted
    Action: 'ACS::WaitFor'
    OnError: 'ACS::NEXT'
    Description:
      en: Waits for the ROS stack to enter the DELETE_COMPLETE state
      zh-cn: the description in Chinese
    Properties:
      Service: ROS
      API: GetStack
      Parameters:
        StackId: '{{createStack.StackId}}'
      DesiredValues:
        - DELETE_COMPLETE
      StopRetryValues:
        - DELETE_FAILED
        - CHECK_FAILED
      PropertySelector: Status
  - Name: checkForUnreachableInstanceSystemDiskAvailable
    Action: 'ACS::CheckFor'
    OnError: 'ACS::END'
    Description:
      en: Checks whether the disk is detached
      zh-cn: the description in Chinese
    Properties:
      Service: ECS
      API: DescribeDisks
      Parameters:
        DiskIds:
          - '{{ querySystemDisks.diskId }}'
      DesiredValues:
        - Available
      PropertySelector: 'Disks.Disk[].Status'
  - Name: whetherCredentialTypeIsKeyPairName
    Action: 'ACS::Choice'
    OnError: 'ACS::NEXT'
    Description:
      en: Chooses the next task based on the credential type input
      zh-cn: the description in Chinese
    Properties:
      DefaultTask: attachAsSysDiskWithKeyPairName
      Choices:
        - When:
            'Fn::Equals':
              - Password
              - '{{ credentialType }}'
          NextTask: attachAsSysDisk
  - Name: attachAsSysDiskWithKeyPairName
    Action: 'ACS::ExecuteAPI'
    OnSuccess: untilDiskAttached
    OnError: 'ACS::NEXT'
    Description:
      en: Attaches the source system disk to the unreachable instance and sets the KeyPairName credential for root
      zh-cn: the description in Chinese
    Properties:
      Service: ECS
      API: AttachDisk
      Parameters:
        DiskId: '{{ querySystemDisks.diskId }}'
        InstanceId: '{{ unreachableInstanceId }}'
        Bootable: 'true'
        KeyPairName: '{{credentialValue}}'
  - Name: attachAsSysDisk
    Action: 'ACS::ExecuteAPI'
    OnError: 'ACS::NEXT'
    Description:
      en: Attaches the source system disk to the unreachable instance and sets the Password credential for root
      zh-cn: the description in Chinese
    Properties:
      Service: ECS
      API: AttachDisk
      Parameters:
        DiskId: '{{ querySystemDisks.diskId }}'
        InstanceId: '{{ unreachableInstanceId }}'
        Bootable: 'true'
        Password: '{{credentialValue}}'
  - Name: untilDiskAttached
    Action: 'ACS::WaitFor'
    OnError: 'ACS::NEXT'
    Description:
      en: Waits for the system disk to be attached
      zh-cn: the description in Chinese
    Retries: 7
    Properties:
      Service: ECS
      API: DescribeDisks
      Parameters:
        DiskIds:
          - '{{ querySystemDisks.diskId }}'
      DesiredValues:
        - In_use
      PropertySelector: 'Disks.Disk[].Status'
  - Name: whetherStartUnreachableInstance
    Action: 'ACS::Choice'
    OnError: 'ACS::NEXT'
    Description:
      en: Chooses the next task based on the original instance status
      zh-cn: the description in Chinese
    Properties:
      DefaultTask: ACS::END
      Choices:
        - When:
            'Fn::Equals':
              - Running
              - '{{ checkInstanceReady.status }}'
          NextTask: startUnreachableInstance
  - Name: startUnreachableInstance
    Action: 'ACS::ECS::StartInstance'
    Description:
      en: Starts the unreachable instance
      zh-cn: the description in Chinese
    Properties:
      instanceId: '{{ unreachableInstanceId }}'
Outputs:
  diskId:
    Type: String
    Value: '{{ querySystemDisks.diskId }}'
  imageId:
    Type: String
    Value: '{{ createImage.imageId }}'
  rtCommandOutput:
    Type: String
    Value: '{{ runCommand.commandOutput }}'
  finalHelperInstanceType:
    Type: String
    Value: '{{checkAvailableInstanceTypesExist.availableInstanceType}}'
Metadata:
  ALIYUN::OOS::Interface:
    ParameterGroups:
      - Parameters:
          - credentialType
          - credentialValue
          - imagePrefix
          - helperInstanceTypes
        Label:
          default:
            zh-cn: the description in Chinese
            en: Configure Parameters
      - Parameters:
          - unreachableInstanceId
        Label:
          default:
            zh-cn: the description in Chinese
            en: Select ECS Instance
      - Parameters:
          - OOSAssumeRole
        Label:
          default:
            zh-cn: the description in Chinese
            en: Control Options
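The PropertySelector of the querySystemDisks task above is a jq expression that yields 'true' only for a system disk whose category starts with "cloud" and that is not encrypted. Otherwise the task fails and the execution ends. The following Python sketch mirrors that condition for readers who are less familiar with jq. The function name `system_disk_is_supported` and the sample dictionaries are hypothetical; they only imitate the Category and Encrypted fields that the selector reads from the DescribeDisks response.

```python
def system_disk_is_supported(disk):
    """Mirror the jq check: startswith("cloud") and (Encrypted | not)."""
    return disk["Category"].startswith("cloud") and not disk["Encrypted"]


if __name__ == "__main__":
    # Hypothetical sample disks for illustration only.
    samples = [
        {"DiskId": "d-example1", "Category": "cloud_efficiency", "Encrypted": False},
        {"DiskId": "d-example2", "Category": "cloud_ssd", "Encrypted": True},
        {"DiskId": "d-example3", "Category": "ephemeral_ssd", "Encrypted": False},
    ]
    for disk in samples:
        print(disk["DiskId"], system_disk_is_supported(disk))
```

Under this reading, instances with an encrypted system disk or a local (non-cloud) system disk are rejected before any disk is detached.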