This topic provides answers to some frequently asked questions (FAQ) about Elastic Block Storage (EBS) async replication and continuous data replication (CDR) for Elastic Compute Service (ECS) disaster recovery.
EBS async replication
What instance specifications are supported by EBS async replication? What are the limits?
EBS async replication is compatible with most instance types.
The EBS async replication feature has the following limits:
Limits on regions and zones: The EBS async replication feature is in public preview, and more regions will become available in the future. The regions and zones that you can select when you create a site pair are the ones that are currently supported. For more information, see Limits on regions and zones.
Limits on disk specifications: Enterprise SSDs (ESSDs) and ESSD AutoPL disks are supported (excluding ESSD entry disks). For more information, see Limits on specifications and ECS disks.
Limits on ECS networks:
Single elastic network interface (ENI):
After a failover, the ENI cannot be automatically configured for some operating systems at the disaster recovery site. After the failover, check and configure the ENI at the disaster recovery site to ensure that the network works properly. For more information, see Configure a secondary ENI.
Multiple ENIs:
After an ECS instance is bound to a secondary ENI, some images cannot automatically identify the IP address of the secondary ENI or add a route. As a result, the secondary ENI cannot work properly.
If an ECS instance is configured with a secondary ENI, check the IP address of the secondary ENI after a failover. This ensures that the secondary ENI works as expected. For more information, see Configure a secondary ENI.
Only ENIs and ECS instances that reside in the same virtual private clouds (VPCs) as the disaster recovery site pair are supported.
Where can I change the IP address of the disaster recovery site for EBS async replication?
On the Network Information tab of the instance details page, you can change the IP address of the disaster recovery site.
In the Preview Basic Information panel, you can change the IP address of the disaster recovery site.
Does EBS async replication support ECS configuration changes?
Before a protection group is replicated for the first time, ECS configuration changes are supported in the following scenarios:
If the number and total capacity of disks in the protection group stay within the limits set during the initial configuration of the protection group, the system automatically synchronizes the existing configurations, and the configurations of newly added disks, to the disaster recovery site.
During system downtime, you can mount, unmount, and scale up disks, change disk names, and roll back disks based on snapshots.
Based on the mappings between vSwitches and security groups, you can add instances to and remove instances from security groups, unbind ENIs from instances, and modify the security groups of ENIs.
ECS configuration changes are not supported in the following scenarios:
After the protection group enters the running state, any configuration change at the production site or the disaster recovery site may affect failover and failback.
If you perform an unsupported operation, system exceptions may occur and alerts may be triggered.
When an exception occurs or an alert is triggered, evaluate your business requirements and proceed with caution. You can refer to the following solutions:
During forward replication
If you need to modify configurations during the forward replication, we recommend that you suspend the replication, remove the affected protected instances, and then add the instances as required to ensure data synchronization consistency.
During reverse replication
During reverse replication, we recommend that you remove the related protected instances and then create a new disaster recovery site to ensure data security and service continuity.
What do I do if I cannot select an instance when I add instances for EBS async replication?
EBS async replication has limits on regions, zones, disk types, networks, and configuration quotas. You can click the icon on the left of the instance ID to view the reason why protection cannot be enabled and troubleshoot the exception accordingly. For more information, see Limits.
What do I do if the instance type of the disaster recovery site is abnormal when I enable EBS async replication?
This issue occurs because the instance type of the protected instance is unavailable or has insufficient stock at the disaster recovery site. We recommend that you perform the Change Instance Type operation in the console to select a different instance type. If an exception occurs on the operating system or IP address, you can perform the Modify Operating System or Modify Disaster Recovery IP operation to change the operating system or IP address as required.
What do I do if a protection group is in the Failed to Enable Replication, Failover Failed, or Failback Failed state?
Assume that the protection group is in the Failover Failed state.
In the console, click the ID of the failed task as prompted. On the Tasks tab, view the detailed error cause.
For example,
Not have any stock of instance type family ...
indicates that the instance type family is out of stock at the disaster recovery site. In this case, perform the Change Instance Type at DR Site operation on the Protected Instances page, and then retry the task.
What are the differences between CDR and EBS async replication for ECS disaster recovery?
EBS async replication is a feature that protects data across regions or across zones within the same region based on the data replication capability of EBS. For more information, see Overview of disk disaster recovery.
The following table describes the differences between CDR and EBS async replication.
Item | CDR | EBS async replication |
Scenarios | Disaster recovery for a single virtual machine (VM). Use CDR if intrusion into the operating system is acceptable. | Disaster recovery that requires consistency across a group of VMs. Use EBS async replication if you want to avoid intrusion into the operating system. |
Intrusive to the system | Yes | No |
Replication implementation | An agent is installed on the operating system of the protected instance, so that Cloud Backup replicates data written to the disks and sends the data to a gateway in real time. The gateway stores the data in an Object Storage Service (OSS) bucket and then writes the data to the disk at the disaster recovery site. | Data is replicated by using the EBS async replication and snapshot features. |
Recovery implementation | Supports multiple recovery points. A shadow ECS instance and a gateway server are created for the protected ECS instance at the disaster recovery site. Cloud Backup reads data from the OSS bucket to the shadow ECS instance, writes the data to the ECS instance at the disaster recovery site, and then creates a recovery point based on the snapshot mechanism. | Supports only a single recovery point. Cloud Backup creates a recovery point by replicating the snapshot to the disaster recovery site. |
Consistency group | Not supported | Supported |
What do I do if the operating system of an ECS instance for disaster recovery or drill fails to start?
Check whether the ECS instance at the production site boots in basic input/output system (BIOS) mode or Unified Extensible Firmware Interface (UEFI) mode. The two boot modes are not interchangeable. To make sure that the operating system starts properly during disaster recovery, create the ECS instance for disaster recovery from an image that uses the same boot mode as the ECS instance at the production site. For more information, see Boot modes of ECS instances.
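As a rough aid for the check described above, the following sketch reports the boot mode from inside a Linux guest. It relies on the common Linux convention that the /sys/firmware/efi directory exists only when the system was booted in UEFI mode. The function name boot_mode is hypothetical, and the sketch is not part of ECS disaster recovery. It does not apply to Windows instances.

```python
import os


def boot_mode(efi_dir="/sys/firmware/efi"):
    """Return "UEFI" if the EFI firmware directory exists, otherwise "BIOS".

    Assumption: a Linux guest, where the kernel exposes /sys/firmware/efi
    only after a UEFI boot.
    """
    return "UEFI" if os.path.isdir(efi_dir) else "BIOS"


if __name__ == "__main__":
    # Run this on the production instance and compare the result with the
    # boot mode of the image that you plan to use at the disaster recovery site.
    print(boot_mode())
```

Run the script on the production instance before you select the image for the disaster recovery site. The boot mode of the image itself still has to be confirmed in the console.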
How do I calculate the amount of data that is replicated during EBS async replication?
EBS async replication is divided into forward replication and reverse replication.
Forward replication
Forward replication includes the initial full replication and subsequent incremental replications.
During the full replication, all the data of a disk is replicated, and the data size is equal to the total storage capacity of the disk.
An incremental replication is performed every 15 minutes after the full replication is complete. Only the data that has been changed during the period is replicated. The size of the replicated data depends on the size of changed data during the period.
Reverse replication
Only data replication to the original instance is supported. Full replication is not supported. Reverse replication is started after a failover. The system performs an incremental replication every 15 minutes. The size of the replicated data depends on the size of changed data during the period.
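The calculation above can be sketched as follows. This is a rough estimate and not a billing formula. The Disk class, the changed_gib_per_cycle field, and both function names are hypothetical, and the average amount of changed data per 15-minute cycle is an input that you must measure for your own workload.

```python
from dataclasses import dataclass

INCREMENTAL_INTERVAL_MINUTES = 15  # incremental replication interval stated in this topic


@dataclass
class Disk:
    capacity_gib: float           # total storage capacity of the disk
    changed_gib_per_cycle: float  # hypothetical: average data changed per 15-minute cycle


def forward_replicated_gib(disks, elapsed_minutes):
    """Estimate forward replication: one full copy plus the incremental cycles."""
    cycles = elapsed_minutes // INCREMENTAL_INTERVAL_MINUTES
    full = sum(d.capacity_gib for d in disks)
    incremental = sum(d.changed_gib_per_cycle for d in disks) * cycles
    return full + incremental


def reverse_replicated_gib(disks, elapsed_minutes):
    """Estimate reverse replication: incremental cycles only, no full copy."""
    cycles = elapsed_minutes // INCREMENTAL_INTERVAL_MINUTES
    return sum(d.changed_gib_per_cycle for d in disks) * cycles


disks = [Disk(capacity_gib=100, changed_gib_per_cycle=0.5),
         Disk(capacity_gib=200, changed_gib_per_cycle=1.0)]
print(forward_replicated_gib(disks, elapsed_minutes=60))  # 306.0
print(reverse_replicated_gib(disks, elapsed_minutes=60))  # 6.0
```

In this example, one hour covers four incremental cycles. Forward replication transfers 300 GiB for the full copy plus 6 GiB of changes, and reverse replication transfers only the 6 GiB of changes.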
What are the retention policies for the ECS instance at the disaster recovery site and the associated disks?
Long-term retention
After an ECS instance at the disaster recovery site and the associated disks are created, they will be retained until the corresponding protected instance is manually removed from the production site. By default, resources are not automatically reclaimed.
Retention policy in case of failover
If the service is activated at the disaster recovery site (the status of the protection group is Failover Completed or Failover Confirmed), the ECS instances and disks at the disaster recovery site are not automatically reclaimed even if you choose to remove the protected instances.
Management of fault drill resources
The ECS instances and disks that are created during a fault drill are automatically reclaimed after the drill group is deleted.
Impact on resources at the production site
ECS disaster recovery does not reclaim the ECS instances and disks at the production site.
Billing methods of ECS instances and disks
ECS billing is related to the stop mode selected when you perform startup, replication, failover, and failback operations. For more information, see Economical mode. For more information about the discounts, see Reserved Instance and Savings Plan.
The EBS async replication feature supports the subscription and pay-as-you-go billing methods. For more information, see Billing.
Can I directly start the ECS instance at the disaster recovery site or the production site without performing a failover or failback?
No, you cannot. When the protection group is in the Reverse Replication state (during replication or after the replication is stopped), the disks at the destination site are automatically set to the read-only mode. This prevents data inconsistency caused by unexpected startup of the destination ECS instance. If you want to verify the effectiveness of the disaster recovery solution, we recommend that you use the disaster recovery drill feature. If you want to restore your workloads quickly, we recommend that you perform a formal failover or failback.
What is the RTO of EBS async replication?
The recovery time objective (RTO) is the time from when a disaster recovery plan is executed after a disaster occurs to when the business system is restored to its preset service level.
Failover and failback in EBS async replication achieve RTOs in minutes, measured from when the failover or failback is started to when the ECS instance is started.
The RTO of a disaster recovery drill increases with the amount of data on the ECS system disk. Depending on the complexity of your services, you must also account for the time required to start the services and complete service detection.
CDR
What are the RPO and RTO of CDR?
Core business data is replicated from the self-managed data centers of enterprises to the cloud in real time, achieving recovery point objectives (RPOs) in seconds to minutes. If a major failure occurs in a self-managed data center, workloads can be switched from the self-managed data center to the cloud within a few minutes, achieving RTOs in minutes.
Which operating systems are supported by CDR?
CDR supports mainstream Windows and Linux operating systems. For more information, see Operating systems.
The following table describes the operating systems that support ECS disaster recovery (CDR).
Only the operating systems listed in the following table are supported. For other operating systems, we recommend that you use the async replication feature.
Operating system | Version |
Windows Server | 2008 R2, 2012, 2012 R2, and 2016 |
Linux | Important: Make sure that the /boot partition and the / partition reside on the same disk. If the partitions do not reside on the same disk, move them to the same disk, and then register the ECS instance for which you want to enable ECS disaster recovery (CDR). |
What are the snapshot retention policies for CDR?
The recovery points for ECS disaster recovery use the snapshot feature of shadow disks to ensure that the servers protected by disaster recovery can be restored to a specified historical version.
The following items show the snapshot retention policies:
If a recovery point has been used for disaster recovery drills or failover, it is not restricted by these snapshot retention policies.
All recovery points of the last day are retained.
For example, the current UTC time is 2020-10-12T17:00:00Z, and the duration of the last day is from 2020-10-11T00:00:00Z to 2020-10-12T17:00:00Z, containing a total of 41 hours.
The last recovery point of each day in the last week is retained.
The last recovery point of each week in the last month is retained.
Recovery points are retained for at most one month. Expired recovery points are deleted.
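The "last day" window in the example above can be reproduced with a short calculation. The sketch assumes, based only on that example, that the window starts at 00:00 UTC of the previous day and ends at the current time. The function names are hypothetical and do not correspond to any Cloud Backup API.

```python
from datetime import datetime, timedelta, timezone


def last_day_window(now):
    """Return (start, end): 00:00 UTC of the previous day up to the current time."""
    now = now.astimezone(timezone.utc)
    midnight_today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight_today - timedelta(days=1), now


def in_last_day(recovery_point, now):
    """Return True if the recovery point timestamp falls inside the last-day window."""
    start, end = last_day_window(now)
    return start <= recovery_point.astimezone(timezone.utc) <= end


now = datetime(2020, 10, 12, 17, 0, tzinfo=timezone.utc)
start, end = last_day_window(now)
hours = (end - start) / timedelta(hours=1)
print(start.isoformat(), hours)  # 2020-10-11T00:00:00+00:00 41.0
```

A recovery point created at 2020-10-11T00:30:00Z falls inside the window and is retained. A recovery point created at 2020-10-10T23:59:00Z falls outside the window and is subject to the daily, weekly, and monthly rules.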
Does CDR support scaling up disks or adding disks on a source ECS instance?
Scale-up and disk adding are supported only for source Linux ECS instances in site pairs for cross-region and cross-zone cloud disaster recovery.
After the source ECS instance is scaled up or disks are added, ECS disaster recovery can detect disk changes within 5 minutes. ECS disaster recovery stops the ongoing server replication, adjusts the capacity of the destination shadow disks, repairs the replication, and then resumes real-time replication. This process depends on the disk size and may last for a long period of time. You can observe the status change from Repair Replication to Replicating in the console. The process is performed automatically.
ECS disaster recovery does not support scale-in or disk reduction on the source ECS instance because the operation may lead to replication errors or data loss.