The disaster recovery system is deployed in two regions of Alibaba Cloud. If a fault occurs at the production site due to natural disasters such as tsunamis and earthquakes, the business system is switched to the disaster recovery site. Elastic Compute Service (ECS) disaster recovery provides a highly reliable disaster recovery solution by deploying the production and disaster recovery sites in different regions. To ensure business continuity, the ECS disaster recovery feature allows you to achieve a recovery point objective (RPO) of 1 minute and a recovery time objective (RTO) of 15 minutes. The feature helps you effectively prevent system failures that are caused by regional disasters.
Before you begin
Before you implement cross-region disaster recovery, you must select a region to deploy the disaster recovery system. The region must be different from the region where the production environment is deployed. You must create a virtual private cloud (VPC) in the region. In addition, you must create a vSwitch for replication and a vSwitch for restoration in the VPC.
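The prerequisites above can be sketched as a small pre-flight check. This is an illustrative sketch only: the `VSwitch` record, the function names, and the region and resource IDs are hypothetical and are not part of any Cloud Backup API. The sketch also reports whether the two vSwitches share a zone, because the replication settings later in this topic recommend the same zone for the replication and recovery networks.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VSwitch:
    """Hypothetical record describing a vSwitch (not an SDK type)."""
    vswitch_id: str
    vpc_id: str
    zone_id: str


def check_dr_prerequisites(production_region: str, dr_region: str,
                           dr_vpc_id: str, replication: VSwitch,
                           recovery: VSwitch) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    # The disaster recovery region must differ from the production region.
    if production_region == dr_region:
        problems.append("disaster recovery region must differ from the production region")
    # Both vSwitches must be created in the VPC of the disaster recovery region.
    for role, vsw in (("replication", replication), ("recovery", recovery)):
        if vsw.vpc_id != dr_vpc_id:
            problems.append(f"{role} vSwitch {vsw.vswitch_id} is not in VPC {dr_vpc_id}")
    return problems


def in_same_zone(replication: VSwitch, recovery: VSwitch) -> bool:
    """True if both vSwitches are in the same zone (recommended, not required)."""
    return replication.zone_id == recovery.zone_id
```

An empty result from `check_dr_prerequisites` means the plan satisfies the requirements listed above. `in_same_zone` is kept separate because a zone mismatch is a recommendation issue, not a hard requirement.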
Step 1: Create a disaster recovery site pair
To create a disaster recovery site pair that provides cross-region disaster recovery protection for ECS instances at the production site, perform the following steps:
Log on to the Cloud Backup console.
In the left-side navigation pane, choose .
Click Switch to CDR.
In the upper-right corner of the Disaster Recovery Center page, click + Add.
In the Create Site Pair panel, configure the parameters and then click Create.
Select Cross-region Disaster Recovery for Type.
Configure the production site information.
The production site is used to specify the location of the server that requires disaster recovery on the cloud.
Parameters and descriptions:
Name: Enter a name for the production site. For example, you can enter Hangzhou Production Site. The name can be up to 60 characters in length. The name must meet the following requirements:
The name cannot start with a special character or digit.
The name can contain only the following special characters: periods (.), underscores (_), and hyphens (-).
Region: Select the region where the production site resides from the Region drop-down list. For example, you can select China (Hangzhou).
VPC: Select the VPC that is created for the production site from the VPC drop-down list. For example, you can select defaultvpc.
Configure the disaster recovery site information.
The computing and storage resources that are used by the disaster recovery site are created in the specified VPC.
Parameters and descriptions:
Name: Enter a name for the disaster recovery site. For example, you can enter Shanghai Disaster Recovery Site. The name can be up to 60 characters in length. The name must meet the following requirements:
The name cannot start with a special character or digit.
The name can contain only the following special characters: periods (.), underscores (_), and hyphens (-).
Region: Select the region where the disaster recovery site resides from the Region drop-down list. For example, you can select China (Shanghai).
VPC: Select the VPC where the disaster recovery site resides from the VPC drop-down list. For example, you can select Default VPC.
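The naming rules for the production and disaster recovery sites can be checked before you fill in the panel. The following is a minimal sketch, not the console's actual validation logic. The function name is hypothetical, and two points are assumptions: any Unicode letter counts as a valid first character, and spaces are accepted because the example names in this topic (such as Hangzhou Production Site) contain them.

```python
MAX_NAME_LENGTH = 60
# Special characters that the naming rules explicitly allow.
ALLOWED_SPECIALS = {".", "_", "-"}
# Assumption: spaces are accepted because the documented example names contain them.
ASSUMED_ALLOWED = {" "}


def is_valid_site_name(name: str) -> bool:
    """Check a site name against the documented length and character rules."""
    if not name or len(name) > MAX_NAME_LENGTH:
        return False
    # The name cannot start with a special character or digit.
    if not name[0].isalpha():
        return False
    # Every character must be a letter, a digit, or an allowed special character.
    return all(
        ch.isalnum() or ch in ALLOWED_SPECIALS or ch in ASSUMED_ALLOWED
        for ch in name
    )
```

If the console rejects names that contain spaces, remove the space from `ASSUMED_ALLOWED`.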
Step 2: Add the ECS instances to be protected
After the disaster recovery site pair is created, perform the following steps to add the ECS instances to be protected:
Click the Protected Server tab. In the upper-right corner of this tab, select the disaster recovery site pair that you created in Step 1 from the drop-down list.
On the Protected Server tab, click + Add. Select the ECS instances and click OK.
You can select 1 to 10 ECS instances.
In the Server Status column, the status of the added ECS instances is Agent Installing and then changes to Initialized. If the status of an ECS instance is not Initialized, initialize the instance from the Operation column.
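Because each add operation accepts 1 to 10 ECS instances, a larger fleet has to be added in several rounds. The following sketch shows one way to split a list of instance IDs into batches of that size. The function name and the instance IDs are hypothetical; the code only groups IDs and does not call any Alibaba Cloud API.

```python
from typing import Iterator, List, Sequence

# Each add operation accepts between 1 and 10 ECS instances.
MAX_INSTANCES_PER_ADD = 10


def batch_instances(instance_ids: Sequence[str],
                    size: int = MAX_INSTANCES_PER_ADD) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` instance IDs."""
    if not 1 <= size <= MAX_INSTANCES_PER_ADD:
        raise ValueError(f"batch size must be between 1 and {MAX_INSTANCES_PER_ADD}")
    for start in range(0, len(instance_ids), size):
        yield list(instance_ids[start:start + size])
```

For example, 23 instance IDs are split into batches of 10, 10, and 3, and each batch corresponds to one round of clicking + Add.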
Step 3: Start replication
To enable real-time replication of ECS instances to Alibaba Cloud, perform the following steps:
On the Protected Server tab, find the ECS instance that you want to replicate and, in the Operation column, choose the action that opens the Enable Replication panel.
In the Enable Replication panel, configure the parameters and click Start.
Parameters and descriptions:
Recovery Point Policy: Select the interval at which recovery points are created from the drop-down list. Unit: hours. For example, if you select 1 hour, Cloud Backup creates a recovery point every hour.
Hard Disk Type: You can select Ultra Disk, ESSD, or SSD.
Replication Network: Select a replication network from the drop-down list. Cloud Backup uses this network to replicate disaster recovery data to the cloud. By default, Cloud Backup reads the available vSwitches of the VPC network at the disaster recovery site. If the replication network and the recovery network share the same vSwitch, data can be restored with higher speed. If the replication network and the recovery network are not in the same zone, the RTO becomes longer. We recommend that you configure the same zone for the replication network and the recovery network.
Recovery Network: Select a recovery network from the drop-down list. During disaster recovery, Cloud Backup uses this network to restore disaster recovery data. For example, Cloud Backup uses this network to create an ECS instance during a disaster recovery drill or failover. By default, Cloud Backup reads the available vSwitches of the VPC network at the disaster recovery site. If the replication network and the recovery network share the same vSwitch, data can be restored with higher speed. If the replication network and the recovery network are not in the same zone, the RTO becomes longer. We recommend that you configure the same zone for the replication network and the recovery network.
Automatic restart after replication interruption: Specify whether to automatically resume replication if an interruption occurs. If you select this check box, the replication task is restarted after the replication is interrupted.
The ECS instance then enters the Enabling Replication, Replicating Full Data, and Replicating states in sequence.
Enabling Replication: ECS disaster recovery is scanning data on the ECS instance and evaluating the overall data volume. In most cases, this process takes a few minutes.
Replicating Full Data: ECS disaster recovery is replicating valid data from the ECS instance to Alibaba Cloud. The replication duration depends on factors such as the data volume and the network bandwidth of the ECS instance. The progress bar in the Server Status column shows the replication progress.
Replicating: After all valid data on the ECS instance is replicated to Alibaba Cloud, Aliyun Replication Service (AReS) monitors all write operations that are performed on the disks of the ECS instance and replicates the incremental data to Alibaba Cloud in real time.
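The three states above form a simple linear sequence. The sketch below models that sequence so that a monitoring script could reason about it. The enum and helper names are hypothetical, the state labels are copied from the Server Status values described above, and the code does not query the service.

```python
from enum import Enum


class ReplicationState(Enum):
    """Replication states, in the order an ECS instance passes through them."""
    ENABLING_REPLICATION = "Enabling Replication"
    REPLICATING_FULL_DATA = "Replicating Full Data"
    REPLICATING = "Replicating"


# Enum members iterate in definition order, which matches the documented sequence.
_ORDER = list(ReplicationState)


def is_expected_transition(current: ReplicationState, nxt: ReplicationState) -> bool:
    """True if `nxt` is the state that directly follows `current`."""
    return _ORDER.index(nxt) == _ORDER.index(current) + 1


def ready_for_drill(state: ReplicationState) -> bool:
    """A disaster recovery drill is possible once the instance is Replicating."""
    return state is ReplicationState.REPLICATING
```

A script that reads the Server Status text could map it to a member with `ReplicationState("Replicating Full Data")` and wait until `ready_for_drill` returns `True`.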
(Optional) Perform a disaster recovery drill
After an ECS instance enters the Replicating state, you can perform a disaster recovery drill on the ECS instance.
A disaster recovery drill is an important part of disaster recovery. It allows you to run a protected ECS instance on the cloud to verify that your applications can run as expected. A disaster recovery drill has the following benefits:
Allows you to easily check whether an application can run on a restored ECS instance as expected.
Familiarizes you with the disaster recovery process and helps ensure a smooth failover if the production site encounters a failure.
To perform a disaster recovery drill, perform the following steps:
On the Protected Server tab, find the ECS instance on which you want to perform a disaster recovery drill and click Test Failover in the Operation column.
In the Test Failover panel, configure the following parameters: Recovery Network, IP Address, Use ECS Specification, Hard Disk Type, Recovery Point, Elastic Public Network IP, and Post Script. Then, click Start.
Note: Cloud Backup automatically retains 24 recovery points that are created in the most recent 24 hours for each ECS instance.
If you do not select Use ECS Specification, you must set the CPU and Memory parameters.
Alibaba Cloud then runs the application on a restored ECS instance at the specified time. The disaster recovery drill does not affect real-time data replication.
The disaster recovery drill is typically completed within a few minutes. After the drill is completed, click the link in the Test Failover Information column to verify the restored data and applications.
Clear the drill environment.
After the verification is completed, click Cleanup Test Environment in the Operation column. Then, the restored ECS instance is deleted.
Note: After the restored ECS instance is verified, we recommend that you delete the restored ECS instance at the earliest opportunity to reduce costs.
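When you select a Recovery Point for a drill, the choices are limited by the retention rule noted earlier: 24 recovery points from the most recent 24 hours for each ECS instance. The sketch below is one interpretation of that rule, not the service's actual retention algorithm. The function name and parameters are hypothetical, and the timestamps are plain `datetime` values supplied by the caller.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def retained_recovery_points(points: Iterable[datetime], now: datetime,
                             window: timedelta = timedelta(hours=24),
                             max_points: int = 24) -> List[datetime]:
    """Return the newest `max_points` recovery points created within `window` of `now`."""
    # Keep only points inside the retention window, oldest first.
    recent = sorted(p for p in points if now - window <= p <= now)
    # If more points exist than can be retained, keep the newest ones.
    return recent[-max_points:]
```

With an hourly Recovery Point Policy, this keeps roughly one point for each of the last 24 hours, which is the set you would expect to see when choosing a recovery point.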
Step 4: Perform a failover
Regular disaster recovery drills ensure that you can run your applications on restored ECS instances at any time. If a critical error occurs at the production site, you can switch your workloads to the disaster recovery site.
Failover is applicable to protected ECS instances where a critical error occurs. During the failover, ECS disaster recovery stops real-time data replication. To resume replication for a protected ECS instance, you must choose More > Server Operation > Restart Replication in the Operation column.
To initiate a failover, perform the following steps:
On the Protected Server tab, find the ECS instance and, in the Operation column, choose the action that opens the Failover panel.
In the Failover panel, configure the following parameters: Recovery Network, IP Address, Use ECS Specification, Hard Disk Type, Recovery Point, Elastic Public Network IP, and Post Script. Then, click Start.
Important: You can restore the ECS instance to the current point in time only once.
After the failover is completed, click the link in the Recovered Instance ID/Name column to verify the restored data and applications.
If the applications run as expected after being restored to the current point in time, commit the failover in the Operation column.
Note: After you complete the failover or change the recovery point, and after you verify that the applications restored from the protected ECS instance are serving your business as expected, you can commit the failover to release the cloud resources that are occupied during the failover.
If the applications do not meet the requirements after being restored to the current point in time, change the recovery point from the Operation column before you commit the failover. For example, data in the restored database may be inconsistent with the data in the source database, or dirty data on the source ECS instance may have been synchronized to the restored ECS instance in the destination region.
Note: The procedure for changing the recovery point is similar to that for failover, except that you must select a recovery point earlier than the current point in time.
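When you change the recovery point, you need a point that is earlier than the one that produced the unsatisfactory result. The following sketch picks the most recent recovery point before a given cutoff. The function name is hypothetical, and the list of timestamps stands in for the recovery points shown in the console; the code does not read them from the service.

```python
from datetime import datetime
from typing import Iterable, Optional


def latest_point_before(points: Iterable[datetime],
                        cutoff: datetime) -> Optional[datetime]:
    """Return the newest recovery point strictly earlier than `cutoff`, or None."""
    earlier = [p for p in points if p < cutoff]
    return max(earlier) if earlier else None
```

Choosing the newest earlier point keeps data loss as small as possible. If the restored data is still not acceptable, call the function again with the returned point as the new cutoff to step further back.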
Step 5: Perform a reverse replication
After you replicate applications on a protected ECS instance from Region A to Region B, you can also perform a reverse replication to replicate applications from Region B to Region A.
To perform a reverse replication, perform the following steps:
On the Protected Server tab, find the ECS instance and, in the Operation column, choose the reverse registration action. In the message that appears, confirm that you want to perform a reverse registration on the ECS instance.
In the Operation column, choose the action that opens the Initiate Reverse Replication panel.
In the Initiate Reverse Replication panel, configure the following parameters: Original machine recovery, Replication Network, and Recovery Network. Then, click Start.
Warning: Cross-region disaster recovery and cross-zone disaster recovery allow you to replicate applications back to the original ECS instance. However, when you replicate applications back to the original ECS instance, data on the original ECS instance is overwritten. Perform this operation with caution.
After the ECS instance enters the Reverse Replicating Data in Real Time state, choose the action in the Operation column that opens the Failback panel.
In the Failback panel, configure the following parameters: CPU, Memory, Recovery Network, IP, and Execute script after recovery. Then, click Start.
After the failback is completed, use the Operation column to register the protected ECS instance again.