Use replication pair-consistent groups to implement disaster recovery - Elastic Compute Service

If the disks at the primary site (primary disks) in an activated replication pair-consistent group fail, you can perform a failover to grant read and write permissions on the disks at the secondary site (secondary disks), and then attach the secondary disks to temporary Elastic Compute Service (ECS) instances to keep your business running. After you repair the primary disks, perform a reverse replication to replicate data from the secondary disks back to the primary disks. This topic describes how to use a replication pair-consistent group to implement disaster recovery for multiple disks.

Note

When you perform failover and reverse replication operations by using a replication pair-consistent group, the operations apply to all replication pairs in the replication pair-consistent group.

Prerequisites

Make sure that the primary disks are detached from the associated ECS instances and are in the Unattached state. For more information, see Detach a data disk. Alternatively, make sure that the ECS instances to which the primary disks are attached are in the Stopped state.
Note
This ensures that the primary disks are read-only while the system replicates data from the secondary disks back to the primary disks, which prevents the reverse replication from failing.
We recommend that you create snapshots of the disks to back up disk data. For more information, see Create a snapshot.
Note
Snapshots are billed when you create them. For information about the billing of snapshots, see Snapshots.
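
The first prerequisite can be expressed as a small check. The following is a minimal sketch, not an official tool: the `PrimaryDisk` class, its field names, and the `In_use` and `Running` status labels are assumptions made for illustration. Only the Unattached disk state and the Stopped instance state come from this topic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PrimaryDisk:
    """Hypothetical record describing one primary disk."""
    disk_id: str
    disk_status: str                       # "Unattached", or an assumed label such as "In_use"
    instance_status: Optional[str] = None  # status of the attached instance, if any


def ready_for_reverse_replication(disk: PrimaryDisk) -> bool:
    """A disk qualifies if it is detached, or if its instance is stopped."""
    if disk.disk_status == "Unattached":
        return True
    return disk.instance_status == "Stopped"


def blocking_disks(disks: List[PrimaryDisk]) -> List[str]:
    """Return the IDs of the disks that do not yet meet the prerequisite."""
    return [d.disk_id for d in disks if not ready_for_reverse_replication(d)]


if __name__ == "__main__":
    disks = [
        PrimaryDisk("d-1", "Unattached"),
        PrimaryDisk("d-2", "In_use", "Stopped"),
        PrimaryDisk("d-3", "In_use", "Running"),
    ]
    print(blocking_disks(disks))
```

In this sketch, only the disk attached to a running instance is reported as blocking, because it could still receive writes during the reverse replication.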

(Optional) Step 1: Perform a disaster recovery drill

After you activate async replication, a replication pair-consistent group continuously replicates data from the primary disks at the primary site to the secondary disks at the secondary site. You can use the disaster recovery drill feature to clone data from the secondary disks to new disks (drill disks) to verify the integrity and correctness of applications at the secondary site. A disaster recovery drill does not affect the async replication feature, and a fault of a disk at the primary site does not affect the drill. However, a fault of a disk at the secondary site may cause an exception in the drill.

Log on to the Elastic Block Storage (EBS) console.
In the left-side navigation pane, choose Enterprise-level Features > Replication Pair-Consistent Group.
In the top navigation bar, select the region and resource group to which the resource belongs.
On the Replication Pair-Consistent Group page, find the replication pair-consistent group on which you want to perform a disaster recovery drill and click the group ID.
In the Drills section, click Create Drill.
In the Create Drill message, confirm the region and zone of the replication pair-consistent group and click OK.
After you create a disaster recovery drill, a disk is created for each secondary disk in the zone in which the replication pair-consistent group resides. The new disks have the same configurations as the secondary disks. The number of the new disks is the same as the number of the replication pairs in the replication pair-consistent group. The new disks contain data at the most recent recovery point on the secondary disks. The data can be used to verify the integrity and correctness of the applications.
Note
- You can create multiple disaster recovery drills to back up data at different recovery points based on your business requirements.
- After the disaster recovery drill is complete, we recommend that you delete the drill and the drill disks in the Drills section at the earliest opportunity to reduce costs.
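
The relationship between secondary disks and drill disks described above can be modeled as follows. This is an illustrative sketch only: the class names, field names, and sample values are hypothetical and do not correspond to any console or API object. It shows that a drill yields one new disk per replication pair, with the same configurations as the secondary disk and the data of its most recent recovery point.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SecondaryDisk:
    """Hypothetical description of a secondary disk in the group."""
    disk_id: str
    zone: str
    size_gib: int
    category: str
    latest_recovery_point: str


@dataclass(frozen=True)
class DrillDisk:
    """Hypothetical description of a disk created by a drill."""
    source_disk_id: str
    zone: str
    size_gib: int
    category: str
    recovery_point: str


def plan_drill(secondary_disks: List[SecondaryDisk]) -> List[DrillDisk]:
    """Return one drill disk per secondary disk, copying its configurations."""
    return [
        DrillDisk(
            source_disk_id=d.disk_id,
            zone=d.zone,
            size_gib=d.size_gib,
            category=d.category,
            recovery_point=d.latest_recovery_point,
        )
        for d in secondary_disks
    ]


if __name__ == "__main__":
    group = [
        SecondaryDisk("d-s1", "zone-a", 100, "example-category", "rp-0001"),
        SecondaryDisk("d-s2", "zone-a", 200, "example-category", "rp-0001"),
    ]
    for drill_disk in plan_drill(group):
        print(drill_disk)
```

Because each drill creates as many disks as the group has replication pairs, the cost of leaving drills in place grows with both the number of drills and the size of the group.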

Step 2: Perform a failover

Warning

A failover may suspend async replication. To prevent data loss, perform a failover only if a primary disk fails.

In the top navigation bar, switch to the region in which the secondary site resides. Example: China (Beijing) region.
In the replication pair-consistent group list, find the replication pair-consistent group that contains the failed primary disks and choose Failover in the Actions column.
Note
Alternatively, click the ID of the replication pair-consistent group. On the group details page, click Failover in the upper-right corner of the page.
In the message that appears, read the notes and click OK.
After the failover is performed, Failed Over is displayed in the Status column of the replication pair-consistent group.
Attach the secondary disks to temporary ECS instances to keep your business running.
For more information, see Create an instance on the Custom Launch tab and Attach a data disk.
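
Because a failover on a group applies to every replication pair in the group, its effect can be pictured with a small model. This is a sketch under stated assumptions: `ReplicationPair`, `ConsistentGroup`, and `failover_group` are hypothetical names, and the model does not restrict which states allow a failover. Only the Failed Over status and the read and write permissions on the secondary disks come from this topic.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReplicationPair:
    """Hypothetical model of one primary/secondary disk pair."""
    primary_disk: str
    secondary_disk: str
    secondary_writable: bool = False  # secondary disks are not writable before a failover


@dataclass
class ConsistentGroup:
    """Hypothetical model of a replication pair-consistent group."""
    status: str
    pairs: List[ReplicationPair] = field(default_factory=list)


def failover_group(group: ConsistentGroup) -> List[str]:
    """Apply a failover to all pairs and return the secondary disks to attach."""
    for pair in group.pairs:
        pair.secondary_writable = True  # grant read and write permissions
    group.status = "Failed Over"
    return [pair.secondary_disk for pair in group.pairs]


if __name__ == "__main__":
    group = ConsistentGroup(
        status="Normal",
        pairs=[ReplicationPair("d-p1", "d-s1"), ReplicationPair("d-p2", "d-s2")],
    )
    print(failover_group(group), group.status)
```

The returned list corresponds to the disks that you attach to the temporary ECS instances in the last step.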

Step 3: Perform a reverse replication

Warning

After a reverse replication is performed, the original data on the disks at the primary site is overwritten by the data on the disks at the secondary site. To prevent the loss of historical data, we recommend that you create snapshots of the disks at the primary site before you proceed. For more information, see Create a snapshot.

In the top navigation bar, switch to the region in which the secondary site resides. Example: China (Beijing) region.
In the replication pair-consistent group list, find the replication pair-consistent group on which you performed a failover and choose Reverse Replication in the Actions column.
Note
Alternatively, click the ID of the replication pair-consistent group. On the group details page, click Reverse replication in the upper-right corner of the page.
In the Reverse replication message, read the notes and click Confirm.
The replication pair-consistent group enters the Stopped state, and the primary/secondary relationship between the primary site and the secondary site is reversed.
Note
The original primary site automatically changes to the secondary site, and the original secondary site automatically changes to the primary site. Example:
- Before a reverse replication is performed, the site in the China (Beijing) region is the primary site and the site in the China (Shanghai) region is the secondary site.
- After the reverse replication is performed, the site in the China (Shanghai) region becomes the primary site, and the site in the China (Beijing) region becomes the secondary site.
Find the replication pair-consistent group for which you performed the reverse replication and click Activate in the Actions column.
In this case, async replication is enabled to asynchronously replicate data from the disks at the original secondary site to the disks at the original primary site.
After the replication pair-consistent group enters the Normal state, data on the disks at the original secondary site is asynchronously replicated to the disks at the original primary site. This indicates that disaster recovery is complete.
(Optional) Restore the initial relationship between the primary site and secondary site in the replication pair-consistent group.
The primary/secondary relationship between the primary site and the secondary site in the replication pair-consistent group was reversed when you performed the reverse replication. To restore the initial primary/secondary relationship in your business environment, perform the following operations:
1. View the region information in the Secondary Region/Zone column of the current replication pair-consistent group. In the top navigation bar, switch to the region.
2. Find the replication pair-consistent group on which you performed the reverse replication. In the Actions column, choose Failover to perform a failover.
3. Choose Reverse replication in the Actions column to perform a reverse replication.
4. After the reverse replication is complete, click Activate in the Actions column to re-activate the replication pair-consistent group.
5. In the replication pair-consistent group list, check whether the primary/secondary relationship between the primary site and the secondary site is restored based on the Primary Region/Zone and Secondary Region/Zone columns.
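
The sequence of failover, reverse replication, and activation can be summarized as a small state machine. The following sketch is illustrative only: `GroupState` and the three functions are hypothetical names, the guards on each transition are simplifying assumptions of this sketch, and the initial synchronization that takes place between activation and the Normal state is omitted. It shows that one recovery cycle swaps the primary and secondary sites, and that a second cycle restores the initial relationship.

```python
from dataclasses import dataclass


@dataclass
class GroupState:
    """Hypothetical model of a group's site roles and status."""
    primary_region: str
    secondary_region: str
    status: str = "Normal"


def failover(group: GroupState) -> None:
    """Mark the group as failed over (secondary disks become writable)."""
    group.status = "Failed Over"


def reverse_replication(group: GroupState) -> None:
    """Swap the site roles; the group then waits in the Stopped state."""
    if group.status != "Failed Over":
        raise ValueError("this sketch requires a failover before reverse replication")
    group.primary_region, group.secondary_region = (
        group.secondary_region,
        group.primary_region,
    )
    group.status = "Stopped"


def activate(group: GroupState) -> None:
    """Re-enable async replication in the new direction."""
    if group.status != "Stopped":
        raise ValueError("this sketch activates only a group in the Stopped state")
    group.status = "Normal"  # initial synchronization is skipped in this model


def recovery_cycle(group: GroupState) -> None:
    """Run one full failover, reverse replication, and activation cycle."""
    failover(group)
    reverse_replication(group)
    activate(group)


if __name__ == "__main__":
    group = GroupState("China (Beijing)", "China (Shanghai)")
    recovery_cycle(group)   # roles are now reversed
    print(group)
    recovery_cycle(group)   # roles are restored to the initial relationship
    print(group)
```

Running the cycle twice mirrors the optional restore procedure: the same three operations are repeated from the region that currently hosts the secondary site.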