
Container Service for Kubernetes:FAQs about the backup center

Updated: Dec 20, 2024

This topic provides answers to some frequently asked questions (FAQs) about the backup center.

Table of contents

The FAQs in this topic are grouped into the following categories:

  • Obtain error messages

  • Console

  • Backup

  • StorageClass conversion (FKA snapshot creation)

  • Restoration

  • Others

Note

If you use the kubectl CLI to access the backup center, you need to update the backup component named migrate-controller to the latest version before troubleshooting issues. The update does not affect existing backups. For more information about how to update the component, see Manage components.

If the status of your backup, StorageClass conversion (FKA snapshot creation), or restore task is Failed or PartiallyFailed, you can use the following methods to obtain the error message.

  • Move the pointer over Failed or PartiallyFailed in the Status column to view the brief error information, such as RestoreError: snapshot cross region request failed.

  • To view the detailed error information, run one of the following commands to query the resource events of the task, such as RestoreError: process advancedvolumesnapshot failed avs: snapshot-hz, err: transition canceled with error: the ECS-snapshot related ram policy is missing.

    • Backup tasks

      kubectl -n csdr describe applicationbackup <backup-name> 
    • StorageClass conversion (FKA snapshot creation) tasks

      kubectl -n csdr describe converttosnapshot <backup-name>
    • Restore tasks

      kubectl -n csdr describe applicationrestore <restore-name>

What do I do if the console prompts "The components are abnormal." or "Failed to retrieve the current data."?

Symptoms

The console prompts The components are abnormal. or Failed to retrieve the current data.

Causes

The installation of the backup center component is abnormal.

Solutions

  • Check whether the cluster has nodes on which the backup center component can be installed. If the cluster has no available nodes, the backup center component cannot be installed.

  • Check whether the cluster uses FlexVolume. If the cluster uses FlexVolume, switch to Container Storage Interface (CSI). For more information, see What do I do if the migrate-controller component in a cluster that uses FlexVolume cannot be launched?

  • If you use the kubectl CLI to access the backup center, check whether the YAML configuration contains errors. For more information, see Use kubectl to back up and restore applications.

  • If your cluster is an ACK dedicated cluster or registered cluster, check whether the required permissions are configured. For more information, see ACK dedicated cluster and Registered cluster.

  • Check whether the csdr-controller and csdr-velero Deployments in the csdr namespace fail to be deployed due to resource or scheduling limits. If yes, fix the issue.
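For the last check, you can inspect the two Deployments and their pods directly. For example:

kubectl -n csdr get deploy csdr-controller csdr-velero
kubectl -n csdr describe pod -l control-plane=csdr-controller
kubectl -n csdr describe pod -l component=csdr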

What do I do if the console displays the following error: The name is already used. Change the name and try again?

Symptoms

When you create or delete a backup, StorageClass conversion (FKA snapshot creation), or restore task, the console displays The name is already used. Change the name and try again.

Causes

When you delete a task in the console, a deleterequest resource is created in the cluster. The corresponding component performs multiple delete operations, including deleting the backup resources. For more information about how to use kubectl to perform relevant operations, see Use kubectl to back up and restore data.

If errors occur when the component performs delete operations or processes the deleterequest resource, some resources in the cluster are retained. Consequently, a resource with the same name may exist.

Solutions

  • Delete the resource with the same name as prompted. For example, if the deleterequests.csdr.alibabacloud.com "xxxxx-dbr" already exists error is displayed, you can run the following command to delete the resource:

    kubectl -n csdr delete deleterequests xxxxx-dbr
  • Create a task with a new name.

What do I do if no existing backup can be selected when I restore an application across clusters?

Symptoms

When you restore an application across clusters, no existing backup can be selected to restore the application.

Causes

  • Cause 1: The backup vault is not associated with the current cluster, which means that the backup vault is not initialized.

    When the system initializes the backup vault, it synchronizes the basic information about the backup vault, including the Object Storage Service (OSS) bucket information, to the cluster. Then, the system initializes the backup files from the backup vault in the cluster. You can select a backup file from the backup vault to restore the application only after the backup vault is initialized.

  • Cause 2: The initialization of the backup vault fails, which means that the backuplocation resource in the current cluster is in the Unavailable state.

  • Cause 3: The backup task has not been completed or the backup task failed.

Solutions

  • Solution 1:

In the Create Restoration Task panel, click Initialize Backup Vault to the right of Backup Vaults, wait until the backup vault is initialized, and then select a backup file.

  • Solution 2:

Run the following command to query the status of the backuplocation resource:

kubectl get -ncsdr backuplocation <backuplocation-name> 

Expected output:

NAME                    PHASE       LAST VALIDATED   AGE
<backuplocation-name>   Available   3m36s            38m

If the status is Unavailable, refer to What do I do if the status of the task is Failed and the "VaultError: xxx" error is returned?

  • Solution 3:

Log on to the console of the backup cluster and check whether the status of the backup task is Completed. If the status of the backup task is abnormal, troubleshoot the issue. For more information, see Table of contents.

What do I do if the console prompts that the dependent service-linked role of the current component is not assigned?

Symptoms

When you access the application backup page, the console prompts that the dependent service-linked role of the current component is not assigned. The error code AddonRoleNotAuthorized is displayed.

Causes

The cloud resource authentication logic is optimized in the backup center component migrate-controller 1.8.0 for ACK managed clusters. If this is the first time you install or update to this component version with your Alibaba Cloud account, you must complete cloud resource authorization for your account.

Solutions

  • If you log on to the console with an Alibaba Cloud account, click Copy Authorization Link and click Authorize to grant permissions to your Alibaba Cloud account.

  • If you log on to the console with a Resource Access Management (RAM) user, click Copy Authorization Link and send the link to the corresponding Alibaba Cloud account to complete authorization.

What do I do if my account does not have the required RBAC permissions to perform an operation?

Symptoms

When you perform an operation in the backup center, the console prompts that your account does not have the required role-based access control (RBAC) permissions and instructs you to contact the Alibaba Cloud account owner or the permission administrator to obtain the permissions. The error code is APISERVER.403.

Causes

The console sends tasks such as backup and restoration by interacting with the API server and obtains the task status in real time. The default cluster O&M engineers and developers do not have some permissions on the backup center component. You must obtain the required permissions from the Alibaba Cloud account or permission administrator.

Solutions

Grant the following ClusterRole permissions to the account. For more information, see Use custom RBAC roles to restrict resource operations in a cluster.

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: csdr-console
rules:
  - apiGroups: ["csdr.alibabacloud.com","velero.io"]
    resources: ['*']
    verbs: ["get","create","delete","update","patch","watch","list","deletecollection"]
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get","list"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get","list"]

What do I do if the backup center component fails to be upgraded or uninstalled?

Symptoms

The backup center component fails to be upgraded or uninstalled, and the csdr namespace remains in the Terminating state.

Causes

The backup center encounters errors, such as crashes. As a result, tasks that are stuck in the InProgress state exist in the csdr namespace. The finalizers of these tasks block resource deletion, so the csdr namespace remains in the Terminating state.

Solutions

  • Run the following command to query the reason why the csdr namespace remains in the Terminating state:

    kubectl describe ns csdr

    Delete the finalizers of tasks that you no longer need. A sample command is provided after this list.

  • Perform the following operations after the csdr namespace is deleted:

    • If you want to upgrade the backup center component, re-install the migrate-controller component.

    • If you want to uninstall the backup center component, the component is already deleted.
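The following command is an example of how to clear the finalizers of a task that you no longer need, as mentioned in the first step. The example uses an applicationbackup resource; substitute the resource type and name shown in the namespace events:

kubectl -n csdr patch applicationbackup <backup-name> --type merge -p '{"metadata":{"finalizers":null}}'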

What do I do if the status of the task is Failed and the "internal error" error is returned?

Symptoms

The status of the task is Failed and the "internal error" error is returned.

Causes

The component or underlying dependent cloud services experience exceptions. For example, the cloud services are not available in the selected region.

Solutions

If the "HBR backup/restore internal error" error is returned, log on to the Cloud Backup console to check whether the container backup feature is enabled.

If the issue persists, submit a ticket.

What do I do if the status of the task is Failed and the "create cluster resources timeout" error is returned?

Symptoms

The status of the task is Failed and the "create cluster resources timeout" error is returned.

Causes

When the system runs a StorageClass conversion (FKA snapshot creation) or restore task, it may create temporary pods, persistent volume claims (PVCs), and persistent volumes (PVs). The "create cluster resources timeout" error is returned if these resources remain unavailable for a long period of time.

Solutions

  1. Run the following command to locate the abnormal resource and find the cause based on the events:

    kubectl -ncsdr describe <applicationbackup/converttosnapshot/applicationrestore> <task-name> 

    Expected output:

    ……wait for created tmp pvc default/demo-pvc-for-convert202311151045 for convertion bound time out

    The output indicates that the PVC used to convert the StorageClass remains in a state other than Bound for a long period of time. The namespace of the PVC is default and the name of the PVC is demo-pvc-for-convert202311151045.

  2. Run the following command to query the status of the PVC and locate the cause:

    kubectl -ndefault describe pvc demo-pvc-for-convert202311151045 

    The following list describes the common reasons that cause backup center-relevant issues. For more information, see Storage troubleshooting.

    • Cluster or node resources are insufficient or abnormal.

    • The StorageClass does not exist in the restore cluster. In this case, create a StorageClass conversion (FKA snapshot creation) task to convert the current StorageClass to an existing StorageClass in the restore cluster.

    • The storage resource associated with the StorageClass is unavailable. For example, the specified disk type is not supported in the current zone.

    • The Container Network File System (CNFS) system associated with alibabacloud-cnfs-nas is abnormal. For more information, see Use CNFS to manage NAS file systems (recommended).

    • You selected a StorageClass whose volumeBindingMode is Immediate when you restore an application in a multi-zone cluster.
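    To check the last two points, you can verify that the target StorageClass exists in the restore cluster and how it binds volumes. For example:

    kubectl get storageclass
    kubectl get storageclass <storageclass-name> -o jsonpath='{.volumeBindingMode}'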

What do I do if the status of the task is Failed and the "addon status is abnormal" error is returned?

Symptoms

The status of the task is Failed and the "addon status is abnormal" error is returned.

Causes

The components in the csdr namespace are abnormal.

Solutions

For more information, see Cause 1 and solution: The components in the csdr namespace are abnormal.

What do I do if the status of the task is Failed and the "VaultError: xxx" error is returned?

Symptoms

The status of the backup, restore, or snapshot conversion task is Failed, and the VaultError: backup vault is unavailable: xxx error is displayed.

Causes

  • The specified OSS bucket does not exist.

  • The cluster does not have permissions to access OSS.

  • The network of the OSS bucket is unreachable.

Solutions

  1. Log on to the OSS console. Check whether the OSS bucket that is associated with the backup vault exists.

    If the OSS bucket does not exist, create one and associate it with the backup vault. For more information, see Create buckets.

  2. Check whether the cluster has permissions to access OSS.

    • Container Service for Kubernetes (ACK) Pro clusters: No additional OSS permissions are required. Make sure that the name of the OSS bucket associated with the backup vault is in the cnfs-oss-* format.

    • ACK dedicated clusters and registered clusters: OSS permissions are required. For more information, see Install migrate-controller and grant permissions.

    If you use methods other than the console to install migrate-controller 1.8.0 or update to this version in an ACK managed cluster, OSS permissions may be missing. You can run the following command to check whether a cluster has the permissions to access OSS:

    kubectl get secret -n kube-system | grep addon.aliyuncsmanagedbackuprestorerole.token

    Expected output:

    addon.aliyuncsmanagedbackuprestorerole.token          Opaque                      1      62d

    If the returned content is the same as the preceding expected output, the cluster has permissions to access OSS. You need only to specify an OSS bucket that is named in the cnfs-oss-* format for the cluster.

    If the preceding content is not returned, use the following method to complete authorization.

    • Grant OSS permissions to ACK dedicated clusters and registered clusters. For more information, see Install migrate-controller and grant permissions.

    • If you use an Alibaba Cloud account, click Authorize to complete authorization. You only need to perform the authorization once for each Alibaba Cloud account.

    Note

    You cannot create a backup vault that uses the same name as a deleted one. You cannot associate a backup vault with an OSS bucket that is not named in the cnfs-oss-* format. If your backup vault is already associated with an OSS bucket that is not named in the cnfs-oss-* format, create another backup vault that uses a different name and associate it with an OSS bucket whose name meets the requirement.

  3. Run the following command to check the network configuration of the cluster:

    kubectl get backuplocation <backuplocation-name> -n csdr -o yaml | grep network

    Expected output:

    network: internal
    • When network is set to internal, the backup vault accesses the OSS bucket over the internal network.

    • When network is set to public, the backup vault accesses the OSS bucket over the Internet. If the backup vault accesses the OSS bucket over the Internet and an error indicating that the access times out is returned, check whether the cluster can access the Internet. For more information, see Enable an existing ACS cluster to access the Internet.

    In the following scenarios, you must configure the backup vault to access the OSS bucket over the Internet:

    • The cluster and OSS bucket are deployed in different regions.

    • The cluster is an ACK Edge cluster.

    • The cluster is a registered cluster and is not connected to a virtual private cloud (VPC) through Cloud Enterprise Network (CEN), Express Connect, or VPN connections, or the cluster is a registered cluster connected to a VPC but no route points to the internal network of the region where the OSS bucket resides. In this case, you must configure a route that points to the internal network of the region where the OSS bucket resides.

    To configure the cluster to access the OSS bucket over the Internet, run the following command to enable Internet access for the OSS bucket. Replace <backuplocation-name> with the actual backup vault name and <region-id> with the region ID of the OSS bucket, such as cn-hangzhou.

    kubectl patch -ncsdr backuplocation/<backuplocation-name> --type='json' -p   '[{"op":"add","path":"/spec/config","value":{"network":"public","region":"<region-id>"}}]'
    kubectl patch -ncsdr backupstoragelocation/<backuplocation-name> --type='json' -p   '[{"op":"add","path":"/spec/config","value":{"network":"public","region":"<region-id>"}}]'
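    After you apply the patch, the connectivity test is performed again at intervals of 5 minutes. You can run the following command to confirm that the backup vault changes to the Available state:

    kubectl -ncsdr get backuplocation <backuplocation-name>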

What do I do if the status of the backup, restore, or snapshot conversion task is Failed and the "backup location is not ok, please check access oss" error is returned?

Symptoms

The status of the backup, restore, or snapshot conversion task is Failed, and the "backup location is not ok, please check access oss" error is returned.

Causes and solutions

Kubernetes 1.20 and later

Cause of failure

The version of migrate-controller is outdated.

Solutions

Update migrate-controller to the latest version. For more information, see Manage components.

Kubernetes versions earlier than 1.20

Possible causes

  • The OSS subdirectory that is associated with a backup vault cannot be a parent or child directory of the OSS subdirectory that is associated with another backup vault. For example, you cannot use directories / and /A or directories /A and /A/B at the same time. In addition, the OSS subdirectories that are associated with backup vaults can store only backups generated by the backup center. If you store other data in the OSS subdirectories, the backup vault becomes unavailable.

  • The same cause described in What do I do if the status of the task is Failed and the "VaultError: xxx" error is returned?

Solutions

The OSS subdirectory that is associated with a backup vault cannot be a parent or child directory of the OSS subdirectory that is associated with another backup vault. In addition, the OSS subdirectories that are associated with backup vaults can store only backups generated by the backup center. Run the following command to check the OSS subdirectories. Replace <backuplocation-name> with the actual backup vault name.

kubectl describe backupstoragelocation <backuplocation-name> -n csdr | grep message

Expected output:

Backup store contains invalid top-level directories: ****

The output indicates that other data is stored in the OSS directories associated with the backup vault. To resolve this issue, use one of the following methods:

  • Update the Kubernetes version of the cluster to Kubernetes 1.20 or later and update migrate-controller to the latest version.

  • Create a backup vault that uses a different name and is not associated with an OSS subdirectory. Do not delete an existing backup vault and recreate one with the same name.

What do I do if the backup, restore, or snapshot conversion task remains in the Inprogress state for a long period of time?

Cause 1 and solution: The components in the csdr namespace are abnormal

Check the status of the components and identify the cause.

  1. Run the following command to check whether the components in the csdr namespace are restarted or cannot be launched:

    kubectl get pod -n csdr
  2. Run the following command to locate the cause:

    kubectl describe pod <pod-name> -n csdr
  • If the components are restarted due to an out of memory (OOM) error, perform the following steps:

    • Check whether the OOM error occurs on the csdr-velero-*** pod during the restoration process, and whether a large number of applications, such as applications in dozens of production namespaces, are being restored in the cluster. If yes, the cause may be the informer cache, which Velero uses by default to accelerate the restoration process.

      If the number of cluster resources to be restored is small, or you can tolerate a certain degree of performance degradation during the restoration process, run the following command to disable the informer cache feature:

      kubectl -nkube-system edit deploy migrate-controller

      Add the --disable-informer-cache=true setting to the args parameter of migrate-controller:

              name: migrate-controller
              args:
              - --disable-informer-cache=true
    • For other causes of error, or if you do not want to reduce the restoration speed, adjust the memory limit of the corresponding Deployment with the following command.

      Set <deploy-name> to csdr-controller for csdr-controller-*** pods and to csdr-velero for csdr-velero-*** pods. Replace <container-name> with the name of the container in the Deployment, which you can query with kubectl -ncsdr get deploy <deploy-name> -o jsonpath='{.spec.template.spec.containers[*].name}', and replace <new-limit-memory> with the new memory limit, such as 1Gi.

      kubectl -ncsdr patch deploy <deploy-name> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"<new-limit-memory>"}}}]}}}}'
  • If the components cannot be launched due to insufficient Cloud Backup permissions, perform the following steps:

    1. Make sure that Cloud Backup is activated for the cluster.

      • If Cloud Backup is not activated, activate it. For more information, see Cloud Backup.

      • If Cloud Backup is activated, proceed to the next step.

    2. Make sure that the ACK dedicated cluster or registered cluster has Cloud Backup permissions.

    3. Run the following command to check whether the token required by the Cloud Backup client exists:

      kubectl -n csdr describe pod <hbr-client-***>

      If a couldnt find key HBR_TOKEN event is generated, the token does not exist. Perform the following steps to resolve the issue:

      1. Run the following command to query the node that hosts hbr-client-***:

        kubectl get pod <hbr-client-***> -n csdr -owide
      2. Run the following command to change the value of the csdr.alibabacloud.com/agent-enable label on the node from true to false:

        kubectl label node <node-name> csdr.alibabacloud.com/agent-enable=false --overwrite
        Important
        • When the system reruns the backup or restore task, the system automatically creates a token and launches hbr-client.

        • You cannot launch hbr-client by copying a token from another cluster to the current cluster. You need to delete the copied token and the corresponding hbr-client-*** pod and repeat the preceding steps.

Cause 2 and solution: The cluster does not have snapshot permissions to create disk snapshots

When you back up the disk volume that is mounted to your application, if the backup task remains in the InProgress state for a long period of time, run the following command to query the newly created VolumeSnapshots in the cluster:

kubectl get volumesnapshot -n <backup-namespace>

Expected output:

NAME                    READYTOUSE      SOURCEPVC         SOURCESNAPSHOTCONTENT         ...
<volumesnapshot-name>   true                              <volumesnapshotcontent-name>  ...
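You can also describe a VolumeSnapshot and its VolumeSnapshotContent to view the related error events. For example:

kubectl -n <backup-namespace> describe volumesnapshot <volumesnapshot-name>
kubectl describe volumesnapshotcontent <volumesnapshotcontent-name>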

If the READYTOUSE state of all VolumeSnapshots remains false for a long period of time, perform the following steps:

  1. Log on to the Elastic Compute Service (ECS) console and check whether the disk snapshot feature is enabled.

    • If the feature is disabled, enable the feature in the corresponding region. For more information, see Activate ECS Snapshot.

    • If the feature is enabled, proceed to the next step.

  2. Check whether the CSI component of the cluster runs as expected.

    kubectl -nkube-system get pod -l app=csi-provisioner
  3. Check whether the permissions to use disk snapshots are granted.

    ACK dedicated clusters

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, click Cluster Information.

    3. On the Cluster Information page, click the Cluster Resources tab. Click the link next to Master RAM Role to go to the permission management page.

    4. On the Policies page, check whether the permissions to use disk snapshots are granted.

      • If the k8sMasterRolePolicy-Csi-*** policy exists and contains the snapshot-related actions listed in the following policy document, the required permissions are granted. In this case, submit a ticket.

      • If the k8sMasterRolePolicy-Csi-*** policy does not exist, attach the following policy to the master RAM role to grant the permissions to use disk snapshots. For more information, see Create custom policies and Grant permissions to a RAM role.

        {
            "Version": "1",
            "Statement": [
                {
                    "Action": [
                        "ecs:DescribeDisks",
                        "ecs:DescribeInstances",
                        "ecs:DescribeAvailableResource",
                        "ecs:DescribeInstanceTypes",
                        "nas:DescribeFileSystems",
                        "ecs:CreateSnapshot",
                        "ecs:DeleteSnapshot",
                        "ecs:DescribeSnapshotGroups",
                        "ecs:CreateAutoSnapshotPolicy",
                        "ecs:ApplyAutoSnapshotPolicy",
                        "ecs:CancelAutoSnapshotPolicy",
                        "ecs:DeleteAutoSnapshotPolicy",
                        "ecs:DescribeAutoSnapshotPolicyEX",
                        "ecs:ModifyAutoSnapshotPolicyEx",
                        "ecs:DescribeSnapshots",
                        "ecs:CopySnapshot",
                        "ecs:CreateSnapshotGroup",
                        "ecs:DeleteSnapshotGroup"
                    ],
                    "Resource": [
                        "*"
                    ],
                    "Effect": "Allow"
                }
            ]
        }
    5. If the issue persists after you perform the preceding steps, submit a ticket.

    ACK managed clusters

    1. Log on to the RAM console as a RAM user who has administrative rights.

    2. In the left-side navigation pane, choose Identities > Roles.

    3. On the Roles page, enter AliyunCSManagedCsiRole in the search box. Check whether the policy of the role contains the following content:

      {
          "Version": "1",
          "Statement": [
              {
                  "Action": [
                      "ecs:DescribeDisks",
                      "ecs:DescribeInstances",
                      "ecs:DescribeAvailableResource",
                      "ecs:DescribeInstanceTypes",
                      "nas:DescribeFileSystems",
                      "ecs:CreateSnapshot",
                      "ecs:DeleteSnapshot",
                      "ecs:DescribeSnapshotGroups",
                      "ecs:CreateAutoSnapshotPolicy",
                      "ecs:ApplyAutoSnapshotPolicy",
                      "ecs:CancelAutoSnapshotPolicy",
                      "ecs:DeleteAutoSnapshotPolicy",
                      "ecs:DescribeAutoSnapshotPolicyEX",
                      "ecs:ModifyAutoSnapshotPolicyEx",
                      "ecs:DescribeSnapshots",
                      "ecs:CopySnapshot",
                      "ecs:CreateSnapshotGroup",
                      "ecs:DeleteSnapshotGroup"
                  ],
                  "Resource": [
                      "*"
                  ],
                  "Effect": "Allow"
              }
          ]
      }

    Registered cluster

    The disk snapshot feature is available only for registered clusters that contain only ECS nodes. Check whether you have the permissions to use the CSI plug-in. For more information, see Step 1: Grant a RAM user the permissions to manage the CSI plug-in.

Cause 3 and solution: Volume types other than disk volumes are used

In migrate-controller 1.7.7 and later versions, backups of disk volumes can be restored across regions. Backups of other volume types cannot be restored across regions. If you are using a storage service that supports public access, such as OSS, you can create a statically provisioned PV and PVC and then restore the application. For more information, see Mount a statically provisioned OSS volume.

What do I do if the status of the backup task is Failed and the "backup already exists in OSS bucket" error is returned?

Symptoms

The status of the backup task is Failed and the "backup already exists in OSS bucket" error is returned.

Causes

A backup with the same name is stored in the OSS bucket associated with the backup vault.

Reasons why the backup is invisible in the current cluster:

  • Backups in ongoing backup tasks and failed backup tasks are not synchronized to other clusters.

  • If you delete a backup in a cluster other than the backup cluster, the backup file in the OSS bucket is labeled but not deleted. The labeled backup file will not be synchronized to newly associated clusters.

  • The current cluster is not associated with the backup vault that stores the backup, which means that the backup vault is not initialized.

Solutions

Recreate a backup vault with another name.

What do I do if the status of the backup task is Failed and the "get target namespace failed" error is returned?

Symptoms

The status of the backup task is Failed and the "get target namespace failed" error is returned.

Causes

In most cases, this error occurs in backup tasks that are created at a scheduled time. The cause varies based on how you select namespaces.

  • If you create an include list, the cause is that all selected namespaces are deleted.

  • If you create an exclude list, the cause is that no namespace other than the excluded namespaces exists in the cluster.

Solutions

Modify the backup plan to change the method that is used to select namespaces and change the namespaces that you have selected.

What do I do if the status of the backup task is Failed and the "velero backup process timeout" error is returned?

Symptoms

The status of the backup task is Failed and the "velero backup process timeout" error is returned.

Causes

  • Cause 1: A subtask of the backup task times out. The duration of a subtask varies based on the amount of cluster resources and the response latency of the API server. In migrate-controller 1.7.7 and later, the default timeout period of subtasks is 60 minutes.

  • Cause 2: The storage class of the OSS bucket associated with the backup vault is Archive, Cold Archive, or Deep Cold Archive. To ensure data consistency during the backup process, files that record metadata must be updated by the backup center component on the OSS server. The backup center component cannot update files that are not restored.

Solutions

  • Solution 1: Modify the global timeout period of subtasks in the backup cluster.

    Run the following command to add velero_timeout_minutes to applicationBackup. Unit: minutes.

    kubectl edit -ncsdr cm csdr-config

    For example, the following code block sets the timeout period to 100 minutes:

    apiVersion: v1
    data:
      applicationBackup: |
        ... #Details not shown.
        velero_timeout_minutes: 100

    After you modify the timeout period, run the following command to restart csdr-controller for the modification to take effect:

    kubectl -ncsdr delete pod -l control-plane=csdr-controller
  • Solution 2: Change the storage class of the OSS bucket to Standard.

    If you want to use the Standard storage class for data stored in the OSS bucket, you can configure lifecycle rules to automate storage class conversion. Before you restore the backup file, you must restore the data in the OSS bucket. For more information, see Convert storage classes.

What do I do if the status of the backup task is Failed and the "HBR backup request failed" error is returned?

Symptoms

The status of the backup task is Failed and the "HBR backup request failed" error is returned.

Causes

  • Cause 1: The volume plug-in used by the cluster is incompatible with Cloud Backup.

  • Cause 2: Cloud Backup does not support creating backups for volumes whose volume mode is Block. For more information, see Volume Mode.

  • Cause 3: The Cloud Backup client encounters an exception. In this case, tasks that back up or restore file system volumes, such as OSS volumes, Apsara File Storage NAS (NAS) volumes, CPFS volumes, or local volumes, will time out or fail.

Solutions

  • Solution 1: If your cluster does not use the CSI plug-in or the cluster does not use common Kubernetes volumes, such as Network File System (NFS) volumes or local volumes, submit a ticket.

  • Solution 2: Submit a ticket.

  • Solution 3: Perform the following steps:

    1. Log on to the Cloud Backup console.

    2. In the left-side navigation pane, choose Backup > Container Backup.

    3. In the top navigation bar, select a region.

    4. On the Backup Jobs tab, search for <backup-name>-hbr in the Job Name search box, check the status of the backup task, and identify the cause. For more information, see View cluster backups.

      Note

      To query StorageClass conversion or backup tasks, search for the corresponding backup names.

What do I do if the status of the backup task is Failed and the "HBR get empty backup info" error is returned?

Symptoms

The status of the backup task is Failed and the "HBR get empty backup info" error is returned.

Causes

In hybrid cloud scenarios, the backup center uses the Kubernetes standard volume mount path as the data backup path by default. For example, the default mount path of a standard CSI plug-in is /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount. NFS, FlexVolume, and other plug-ins supported by Kubernetes also use this mount path.

In the preceding mount path, /var/lib/kubelet is the default root path of the kubelet. If you modify this path for your Kubernetes cluster, Cloud Backup may fail to access the data to be backed up.

Solutions

Log on to the node to which the volume is mounted and perform the following steps to troubleshoot the issue:

  1. Check whether the root path of the kubelet of the node is modified.

    1. Run the following command to query the kubelet startup command:

      ps -elf | grep kubelet

      If the startup command contains the --root-dir parameter, the value of this parameter is the root path of the kubelet.

      If the startup command contains the --config parameter, the value of this parameter is the configuration file of the kubelet. Query the configuration file. If the configuration file contains the root-dir field, the value of this field is the root path of the kubelet.

    2. If the startup command does not contain the root path information, query the startup file of the kubelet: /etc/systemd/system/kubelet.service. Check whether the startup file contains the EnvironmentFile field, such as:

      EnvironmentFile=-/etc/kubernetes/kubelet

      If yes, the environment variable configuration file is /etc/kubernetes/kubelet. Query the configuration file and check whether it contains the following content:

      ROOT_DIR="--root-dir=/xxx"

      If yes, /xxx is the root path of the kubelet.

    3. If no path modifications are found, the kubelet root path is the default path: /var/lib/kubelet.

  2. Run the following command to query whether the kubelet root path is a symbolic link of another path:

    ls -al <root-dir>

    Example:

    lrwxrwxrwx   1 root root   26 Dec  4 10:51 kubelet -> /var/lib/container/kubelet

    In the preceding example, the kubelet root path is /var/lib/container/kubelet.

  3. Verify that the root path contains the data of the backup volume.

    Make sure that the mount path of the volume <root-dir>/pods/<pod-uid>/volumes exists and that the subpath of the desired type of volume exists, such as kubernetes.io~csi and kubernetes.io~nfs.

  4. Append the environment variable KUBELET_ROOT_PATH=/var/lib/container/kubelet/pods to the csdr-controller Deployment in the csdr namespace. In this example, /var/lib/container/kubelet is the actual root path of the kubelet that is obtained by querying the configuration file and the symbolic link.
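    For example, assuming the kubelet root path resolved to /var/lib/container/kubelet, you can add the environment variable with the following command (a sketch; adjust the path to match your nodes):

    kubectl -n csdr set env deployment/csdr-controller KUBELET_ROOT_PATH=/var/lib/container/kubelet/pods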

What do I do if the status of the backup task is Failed and the "check backup files in OSS bucket failed", "upload backup files to OSS bucket failed", or "download backup files from OSS bucket failed" error is returned?

Symptoms

The status of the backup task is Failed and the "upload backup files to OSS bucket failed" error is returned.

Causes

The preceding error is returned by the OSS server when the backup center component checks backup, uploads, or downloads backup files in the OSS bucket associated with the backup vault. Possible causes:

  • Cause 1: Data encryption is enabled for the OSS bucket but the backup vault does not have the permissions to access Key Management Service (KMS).

  • Cause 2: You use an ACK dedicated cluster or registered cluster. However, when you install the backup center component, specific read and write permissions are not granted to the component.

  • Cause 3: You use an ACK dedicated cluster or registered cluster. However, the AccessKey pair of the RAM user you use is revoked.

Solutions

Resolve the issue based on the corresponding cause:

  • Cause 1: Grant the backup center the permissions to access KMS, or use an OSS bucket for which data encryption is disabled.

  • Cause 2: Grant the OSS read and write permissions to the backup center component. For more information, see Install migrate-controller and grant permissions.

  • Cause 3: Create a valid AccessKey pair for the RAM user and update the credentials configured for the backup center component. For more information, see Install migrate-controller and grant permissions.

What do I do if the status of the backup task is PartiallyFailed and the "PROCESS velero partially completed" error is returned?

Symptoms

The status of the backup task is PartiallyFailed and the "PROCESS velero partially completed" error is returned.

Causes

When you use the Velero component to back up applications (resources in the cluster), the component fails to back up some resources.

Solutions

Run the following command to query the resources that the component fails to back up and identify the cause:

 kubectl -ncsdr exec -it $(kubectl -ncsdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero describe backup <backup-name>

Fix the issue based on the content of the Errors and Warnings fields in the output.

If the cause of failure is not displayed, run the following command to query the logs of the Velero component:

 kubectl -ncsdr exec -it $(kubectl -ncsdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero backup logs <backup-name>

If you cannot identify the cause, submit a ticket.

What do I do if the status of the backup task is PartiallyFailed and the "PROCESS hbr partially completed" error is returned?

Symptoms

The status of the backup task is PartiallyFailed and the "PROCESS hbr partially completed" error is returned.

Causes

When you use Cloud Backup to back up file system volumes, such as OSS volumes, NAS volumes, CPFS volumes, or local volumes, Cloud Backup fails to back up some resources. Possible causes:

  • Cause 1: The volume plug-in does not support Cloud Backup.

  • Cause 2: Cloud Backup cannot guarantee data consistency. If files are deleted during the backup process, the backup task fails.

Solutions

  1. Log on to the Cloud Backup console.

  2. In the left-side navigation pane, choose Backup > Container Backup.

  3. In the top navigation bar, select a region.

  4. On the Backup Jobs tab, search for <backup-name>-hbr in the Job Name search box and identify the cause of the backup failure. For more information, see the "View cluster backups" section of the Back up ACK clusters topic.

What do I do if the status of the StorageClass conversion (FKA snapshot creation) task is ConvertFailed and the "storageclass xxx not exists" error is returned?

Symptoms

The status of the StorageClass conversion (FKA snapshot creation) task is ConvertFailed and the "storageclass xxx not exists" error is returned.

Causes

The StorageClass to which the current StorageClass is converted does not exist in the current cluster.

Solutions

  1. Run the following command to reset the StorageClass conversion (FKA snapshot creation) task:

    cat << EOF | kubectl apply -f -
    apiVersion: csdr.alibabacloud.com/v1beta1
    kind: DeleteRequest
    metadata:
      name: reset-convert
      namespace: csdr
    spec:
      deleteObjectName: "<backup-name>"
      deleteObjectType: "Convert"
    EOF
  2. Create the desired StorageClass in the current cluster.

  3. Rerun the StorageClass conversion (FKA snapshot creation) task.

What do I do if the status of the StorageClass conversion (FKA snapshot creation) task is ConvertFailed and the "only support convert to storageclass with CSI diskplugin or nasplugin provisioner" error is returned?

Symptoms

The status of the StorageClass conversion (FKA snapshot creation) task is ConvertFailed and the "only support convert to storageclass with CSI diskplugin or nasplugin provisioner" error is returned.

Causes

The StorageClass to which the current StorageClass is converted is not supported by the CSI component.

Solutions

  • The current CSI version supports snapshots of only disk volumes and NAS volumes. If you want to use snapshots of other volume types, submit a ticket.

  • If you are using a storage service that supports public access, such as OSS, you need to create a statically provisioned PVC and PV and then directly restore the application. No StorageClass conversion is needed. For more information, see Mount a statically provisioned OSS volume.

What do I do if the status of the StorageClass conversion (FKA snapshot creation) task is ConvertFailed and the "current cluster is multi-zoned" error is returned?

Symptoms

The status of the StorageClass conversion (FKA snapshot creation) task is ConvertFailed and the "current cluster is multi-zoned" error is returned.

Causes

The current cluster is a multi-zone cluster. The StorageClass to which the current StorageClass is converted provisions disk volumes, and its volumeBindingMode is set to Immediate. In a multi-zone cluster, a disk created with Immediate binding may reside in a zone different from the zone of the node to which the pod is scheduled. As a result, the pod cannot be scheduled and remains in the Pending state after the disk volume is created. For more information about volumeBindingMode, see Disk volume overview.

Solutions

  1. Run the following command to reset the StorageClass conversion (FKA snapshot creation) task:

    cat << EOF | kubectl apply -f -
    apiVersion: csdr.alibabacloud.com/v1beta1
    kind: DeleteRequest
    metadata:
      name: reset-convert
      namespace: csdr
    spec:
      deleteObjectName: "<backup-name>"
      deleteObjectType: "Convert"
    EOF
  2. To convert the StorageClass to disk volume, perform the following steps.

    • To use the console to convert the StorageClass to disk volume, select alicloud-disk. By default, alicloud-disk uses the alicloud-disk-topology-alltype StorageClass.

    • To use the CLI to convert the StorageClass to disk volume, we recommend that you select alicloud-disk-topology-alltype, which is the default StorageClass provided by the CSI plug-in. You can also set volumeBindingMode to WaitForFirstConsumer, as shown in the sample StorageClass after these steps.

  3. Rerun the StorageClass conversion (FKA snapshot creation) task.
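For reference, a StorageClass that provisions disk volumes and uses WaitForFirstConsumer binding might look like the following sketch. The name and parameters are illustrative; diskplugin.csi.alibabacloud.com is the CSI disk provisioner used by ACK:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-disk-wait-example   # illustrative name
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_essd                   # example disk category
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete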

What do I do if the status of the restore task is Failed and the "only disk type PVs support cross-region restore in current version" error is returned?

Symptoms

The status of the restore task is Failed and the "only disk type PVs support cross-region restore in current version" error is returned.

Causes

In migrate-controller 1.7.7 and later versions, backups of disk volumes can be restored across regions. Backups of other volume types cannot be restored across regions.

Solutions

  • If you are using a storage service that supports public access, such as OSS, you can create a statically provisioned PV and PVC and then restore the application. For more information, see Mount a statically provisioned OSS volume.

  • If you want to restore backups of other volume types across regions, submit a ticket.

What do I do if the status of the restore task is Failed and the "ECS snapshot cross region request failed" error is returned?

Symptoms

The status of the restore task is Failed and the "ECS snapshot cross region request failed" error is returned.

Causes

In migrate-controller 1.7.7 and later versions, backups of disk volumes can be restored across regions. However, your cluster is unauthorized to use ECS disk snapshots.

Solutions

If your cluster is an ACK dedicated cluster or a registered cluster that is connected to a self-managed Kubernetes cluster, you need to authorize the cluster to use ECS disk snapshots. For more information, see the "Registered cluster" section of the Install migrate-controller and grant permissions topic.

What do I do if the status of the restore task is Failed and the "accessMode of PVC xxx is xxx" error is returned?

Symptoms

The status of the restore task is Failed and the "accessMode of PVC xxx is xxx" error is returned.

Causes

The AccessMode parameter of the PVC used to provision the disk volume that you want to restore is set to ReadOnlyMany or ReadWriteMany.

When you restore the disk volume, the new volume is mounted by using CSI. Take note of the following items when you use the current version of CSI:

  • If you want to mount a disk volume to multiple ECS instances, you must enable the multi-attach feature for the disk associated with the volume.

  • If the VolumeMode parameter in the PVC is set to Filesystem, a disk that uses the ext4 or xfs file system is mounted. In this case, the disk volume supports only the ReadOnlyMany access mode.

For more information, see Use a dynamically provisioned disk volume.

Solutions

  • If the restore task uses the StorageClass conversion feature to convert a volume that supports multi-attach, such as an OSS or NAS volume, into a disk volume, we recommend that you create a new restore task. Configure the new restore task to convert the StorageClass into alibabacloud-cnfs-nas. This way, the task creates a new CNFS-managed NAS volume to ensure data sharing among different pods of the application to which the volume is mounted. For more information, see Use CNFS to manage NAS file systems (recommended).

  • If the backup disk volume is provisioned by using CSI of a relatively earlier version that does not check the AccessMode parameter of the volume, and the backup volume cannot be created by the CSI version used by the current cluster, we recommend that you modify the original application by using a dynamically provisioned disk volume. This prevents the disk from being detached when the volume is mounted to other nodes. If you have questions about multi-attach scenarios, submit a ticket.

What do I do if the status of the restore task is Completed but some resources are not created in the restore cluster?

Symptoms

The status of the restore task is Completed but some resources are not created in the restore cluster.

Causes

  • Cause 1: No backups are created for the resources.

  • Cause 2: The resources are excluded from the restore list.

  • Cause 3: Some subtasks in the application restore task failed.

  • Cause 4: The resources are restored but then reclaimed due to ownerReferences or business logic.

Solutions

Solution 1:

Run the following command to query backup details:

 kubectl -ncsdr exec -it $(kubectl -ncsdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero describe backup <backup-name> --details

Check whether backups are created for the resources. If no backups are created, make sure that the resources and the namespaces of the resources are specified in the include list of the backup task, or the resources and namespaces are not specified in the exclude list of the backup task. Then, rerun the backup task. Cluster-level pod resources are not backed up if the backup task is not configured to back up the namespace of the pods. To back up all cluster-level resources, see Step 3: Create backups in the backup cluster.

Solution 2:

If the resources are not restored, make sure that the resources and the namespaces of the resources are specified in the include list of the restore task, or the resources and namespaces are not specified in the exclude list of the restore task. Then, rerun the restore task.

Solution 3:

Run the following command to query the resources and identify the cause:

 kubectl -ncsdr exec -it $(kubectl -ncsdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1) -- ./velero describe restore <restore-name> 

Fix the issue based on the content of the Errors and Warnings fields in the output. If you cannot identify the cause, submit a ticket.

Solution 4:

Query the auditing records of the resources and check whether the resources are accidentally deleted.

What do I do if the migrate-controller component in a cluster that uses FlexVolume cannot be launched?

migrate-controller does not support clusters that use FlexVolume. To use the backup center feature, you must migrate from FlexVolume to CSI.

If you want to create backups in the FlexVolume cluster and restore the backups in the CSI cluster when migrating from FlexVolume to CSI, refer to Use the backup center to migrate applications in an ACK cluster that runs an old Kubernetes version.

Can I modify the backup vault?

You cannot modify the backup vault of the backup center. You can only delete the current one and create a backup vault with another name.

Backup vaults are shared resources. Existing backup vaults may be in the Backing Up or Restoring state. If you modify a parameter of the backup vault, the system may fail to find the required data when backing up or restoring an application. Therefore, you cannot modify the backup vault or create backup vaults that use the same name.

Can I associate the backup vault with an OSS bucket that is not named in the "cnfs-oss-*" format?

For clusters other than ACK dedicated clusters and registered clusters, the backup center component has read and write permissions on OSS buckets named in the cnfs-oss-* format by default. To prevent backups from overwriting the original data in the OSS buckets, we recommend that you create an OSS bucket named in the cnfs-oss-* format in the backup center.

  1. To associate an OSS bucket that is not named in the cnfs-oss-* format with the backup vault, you need to grant permissions to the backup center component. For more information, see ACK dedicated clusters.

  2. After you grant permissions, run the following command to restart the component:

    kubectl -ncsdr delete pod -l control-plane=csdr-controller
    kubectl -ncsdr delete pod -l component=csdr

    If an OSS bucket that is not named in the cnfs-oss-* format is already associated with the backup vault, rerun the backup or restore task after the connectivity test is complete and the status of the backup vault changes to Available. The connectivity test is performed at intervals of 5 minutes. You can run the following command to query the status of the backup vault:

    kubectl -ncsdr get backuplocation

    Expected output:

    NAME                    PHASE       LAST VALIDATED   AGE
    a-test-backuplocation   Available   7s               6d1h

How do I specify the backup cycle when creating a backup plan?

You can specify the backup cycle by using a crontab expression, such as 1 4 * * *. You can also directly specify an interval. For example, if you set the backup cycle to 6h30m, the backup operation is performed every 6 hours and 30 minutes.

The asterisks (*) in the crontab expression represent any valid values of the corresponding fields. Valid values of the minute field are 0 to 59. Sample crontab expressions:

  • 1 4 * * *: The backup operation is performed at 4:01 am each day.

  • 0 2 15 * *: The backup operation is performed at 2:00 am on the 15th day of each month.

 *  *  *  *  * 
 |  |  |  |  |
 |  |  |  |  ·----- day of week (0 - 6) (Sun to Sat)
 |  |  |  ·-------- month (1 - 12) 
 |  |  ·----------- day of month (1 - 31)
 |  ·-------------- hour (0 - 23) 
 ·----------------- minute (0 - 59)  
 

What are the default changes in the resource YAML file when I run a restore task?

When you restore resources, the following changes are made to the YAML files of resources:

Change 1:

If the size of a disk volume is less than 20 GiB, the volume size is changed to 20 GiB.

Change 2:

Services are restored based on Service types:

  • NodePort Services: The ports of NodePort Services are retained by default during cross-cluster restoration.

  • LoadBalancer Services: When ExternalTrafficPolicy is set to Local, HealthCheckNodePort uses a random port by default. To retain the port, specify spec.preserveNodePorts: true when you create the restore task.

    • If a Service in the backup cluster uses an existing Server Load Balancer (SLB) instance, the Service restored in the restore cluster still uses the original SLB instance but has all listeners disabled by default. You need to configure the listeners in the SLB console.

    • LoadBalancer Services in the backup cluster are managed by the cloud controller manager (CCM). When the system restores these Services, the CCM will create SLB instances. For more information, see Considerations for configuring a LoadBalancer Service.

How do I view backup resources?

Application-related backup resources

The YAML files in the cluster are stored in the OSS bucket associated with the backup vault. You can use one of the following methods to view backup resources.

  • Run the following command in a cluster to which backup files are synchronized to view backup resources:

    kubectl -ncsdr get pod -l component=csdr | tail -n 1 | cut -d ' ' -f1
    kubectl -ncsdr exec -it <csdr-velero-pod-name> -c velero -- ./velero describe backup <backup-name> --details

    In the second command, replace <csdr-velero-pod-name> with the pod name returned by the first command.
  • View backup resources in the ACK console.

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Operations > Application Backup.

    3. On the Application Backup page, click the Backup Records tab. In the Backup Records column, click the backup record that you want to view.

Disk volume-related backup resources

  1. Log on to the ECS console.

  2. In the left-side navigation pane, choose Storage & Snapshots > Snapshots.

  3. In the top navigation bar, select the region and resource group to which the resource belongs.

  4. On the Snapshots page, query snapshots based on the disk ID.

Other backup resources

  1. Log on to the Cloud Backup console.

  2. In the left-side navigation pane, choose Backup > Container Backup.

  3. In the top navigation bar, select a region.

  4. View the basic information of cluster backups.

    • Clusters: The list of clusters that have been backed up and protected. Click ACK Cluster ID to view the protected persistent volume claims (PVCs). For more information about PVCs, see Persistent volume claim (PVC).

      If Client Status is abnormal, Cloud Backup is not running as expected in the ACK cluster. Go to the DaemonSets page in the ACK console to troubleshoot the issue.

    • Backup Jobs: The status of backup jobs.


If I back up data in a cluster that runs an earlier Kubernetes version, can I restore the data in a cluster that runs a later Kubernetes version?

Yes.

By default, when you back up resources, all API versions supported by the resources are backed up. For example, a Deployment in a cluster that runs Kubernetes 1.16 supports extensions/v1beta1, apps/v1beta1, apps/v1beta2, and apps/v1. When you back up the Deployment, the backup vault stores all four API versions regardless of which version you use when you create the Deployment. The KubernetesConvert feature is used for API version conversion.

When you restore resources, the API version recommended by the restore cluster is used to restore the resources. For example, if you restore the preceding Deployment in a cluster that runs Kubernetes 1.28 and the recommended API version is apps/v1, the restored Deployment uses apps/v1.

Important

If no API version is supported by both clusters, you must manually deploy the resource. For example, Ingresses in clusters that run Kubernetes 1.16 support extensions/v1beta1 and networking.k8s.io/v1beta1. You cannot restore the Ingresses in clusters that run Kubernetes 1.22 or later because Ingresses in these clusters support only networking.k8s.io/v1. For more information about API version migration, see Official documentation. Due to API version compatibility issues, we recommend that you do not use the backup center to migrate applications from clusters of later Kubernetes versions to clusters of earlier Kubernetes versions. We also recommend that you do not migrate applications from clusters of Kubernetes versions earlier than 1.16 to clusters of later Kubernetes versions.
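To check which API versions are available for a resource in each cluster, you can run commands such as the following:

kubectl api-versions | grep networking.k8s.io
kubectl api-resources --api-group=networking.k8s.io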

Is traffic automatically switched to new SLB instances when I run a restore task?

No.

Services are restored based on Service types:

  • NodePort Services: The ports of NodePort Services are retained by default during cross-cluster restoration.

  • LoadBalancer Services: When ExternalTrafficPolicy is set to Local, HealthCheckNodePort uses a random port by default. To retain the port, specify spec.preserveNodePorts: true when you create the restore task, as shown in the sketch below.

    • If a Service in the backup cluster uses an existing Server Load Balancer (SLB) instance, the Service restored in the restore cluster still uses the original SLB instance but has all listeners disabled by default. You need to configure the listeners in the SLB console.

    • LoadBalancer Services in the backup cluster are managed by the cloud controller manager (CCM). When the system restores these Services, the CCM will create SLB instances. For more information, see Considerations for configuring a LoadBalancer Service.

By default, after listeners are disabled or new SLB instances are used, traffic is not automatically switched to the new SLB instances. If you use other cloud services or third-party service discovery and do not want service discovery to switch traffic to new SLB instances, you can exclude Services when you back up resources. You can manually deploy Services when you want to switch traffic.
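
The following YAML is a minimal sketch of a restore task that retains node ports. Only spec.preserveNodePorts is documented in this topic; the apiVersion, the backupName field, and the metadata values are assumptions that you should verify against your cluster, for example by running kubectl api-resources | grep applicationrestore and kubectl explain applicationrestore.spec.

# Minimal sketch; verify the group/version and field names in your cluster first.
apiVersion: csdr.alibabacloud.com/v1beta1   # assumption, not confirmed by this topic
kind: ApplicationRestore
metadata:
  name: restore-keep-nodeports
  namespace: csdr
spec:
  backupName: <backup-name>        # assumed field: the backup record to restore from
  preserveNodePorts: true          # documented field: retain NodePort and HealthCheckNodePort values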

Why are resources in the csdr, kube-system, kube-public, and kube-node-lease namespaces not backed up by default?

csdr is the namespace of the backup center. If you directly back up and restore the namespace, components fail to work in the restore cluster. In addition, the backup and synchronization logic of the backup center does not require you to manually migrate backups to a new cluster.

kube-system, kube-public, and kube-node-lease are the default system namespaces of Kubernetes clusters. Due to differences in cluster parameters and configurations, you cannot restore these namespaces across clusters. The backup center is used to back up and restore applications. Before you run a restore task, you must install and configure system components in the restore cluster. For example, the following add-ons are automatically installed when the system creates a cluster:

  • Container Registry password-free image pulling component: You need to grant the required permissions and configure acr-configuration in the restore cluster.

  • ALB Ingresses: You need to configure ALBConfigs.

Do not directly restore kube-system components in the new cluster. Otherwise, the system components cannot work as expected.

Does the backup center use ECS disk snapshots to back up disks? What is the default snapshot type?

The backup center uses ECS disk snapshots to back up disks when both of the following conditions are met.

  1. The cluster is an ACK managed cluster or ACK dedicated cluster.

  2. The Kubernetes version of the cluster is 1.18 or later, and the cluster CSI plug-in version is 1.18 or later.

If the conditions are not met, the backup center uses Cloud Backup to back up disks.

By default, the instant access feature is enabled for disk snapshots created by the backup center. The validity period of the snapshots is the same as the validity period specified in the backup configuration. From 11:00 (UTC+8) on October 12, 2023, you are no longer charged storage fees or feature usage fees for the instant access feature. For more information, see Use the instant access feature.

Why is the validity period of a disk snapshot created by the backup center different from the validity period specified in the backup configuration?

The creation of disk snapshots depends on the csi-provisioner component or managed-csiprovisioner component of a cluster. If the version of the csi-provisioner component is earlier than 1.20.6, you cannot specify the validity period or enable the instant access feature when you create VolumeSnapshots. In this case, the validity period in the backup configuration does not take effect on disk snapshots.

Therefore, when you back up disk volumes, you need to update the csi-provisioner component to 1.20.6 or later.
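
To check the current version, you can inspect the image tag of the component. The following command is a sketch that assumes csi-provisioner runs as a Deployment named csi-provisioner in the kube-system namespace; if your cluster uses the managed-csiprovisioner component, check its version on the Add-ons page in the console instead.

kubectl -n kube-system get deployment csi-provisioner \
  -o jsonpath='{.spec.template.spec.containers[*].image}'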

If csi-provisioner cannot be updated to this version, you can configure the default snapshot validity period by performing the following steps:

  1. Update the backup center component migrate-controller to v1.7.10 or later.

  2. Run the following command to check whether a VolumeSnapshotClass named csdr-disk-snapshot-with-default-ttl exists in the cluster and whether its retentionDays parameter is set to 30.

    kubectl get volumesnapshotclass csdr-disk-snapshot-with-default-ttl
    • If the VolumeSnapshotClass does not exist, use the following YAML to create a VolumeSnapshotClass named csdr-disk-snapshot-with-default-ttl.

    • If the VolumeSnapshotClass exists, set retentionDays to 30.

      apiVersion: snapshot.storage.k8s.io/v1
      deletionPolicy: Retain
      driver: diskplugin.csi.alibabacloud.com
      kind: VolumeSnapshotClass
      metadata:
        name: csdr-disk-snapshot-with-default-ttl
      parameters:
        retentionDays: "30"
  3. After the configuration is complete, when you back up disk volumes, disk snapshots whose validity period is the same as the value of retentionDays are created.

    Important

    To ensure that the validity period of ECS disk snapshots created by the backup center is the same as the validity period specified in the backup configuration, we recommend that you update the csi-provisioner component to v1.20.6 or later.
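
After the VolumeSnapshotClass is created or updated, you can verify the configured default validity period:

kubectl get volumesnapshotclass csdr-disk-snapshot-with-default-ttl \
  -o jsonpath='{.parameters.retentionDays}'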

What scenarios are suitable for backing up volumes and what do I do if I want to back up volumes?

What is volume backup?

You can use ECS disk snapshots or Cloud Backup to back up data stored in volumes to datastores in the cloud. Then, you can restore the data from the backup files to disks or NAS file systems used by your application. The original application and restored application do not share the data source.

If you do not need to replicate the data or share the data source, you can skip the volume backup step. Make sure that the exclude list of the backup task does not include PVCs or PVs. When you restore the application, directly deploy the YAML file of the original volume in the restore cluster.

What scenarios are suitable for backing up volumes?

  • Disaster recovery and version recording.

  • Only disk volumes are used. A disk can be attached to only one node at a time.

  • Cross-region backup and restoration. In most cases, only OSS supports cross-region access.

  • The data of the application in the backup cluster must be isolated from the data of the application in the restore cluster.

  • The backup and restore clusters use different volume plug-ins, or the plug-in versions differ significantly. In this case, you cannot directly reuse the YAML file of the original volume.

What risks may arise if I do not back up the volumes used by a stateful application?

When you create a backup task to back up a stateful application, if you do not back up the volumes used by the application, the following operations are performed when the application is restored:

  • Volumes that use the Delete reclaim policy:

    If the StorageClass used by the original volume exists in the restore cluster, CSI automatically provisions a new PV based on the StorageClass and the PVC used by the application. This operation is similar to deploying a PVC for the first time. For example, if the stateful application uses a disk volume, a new empty disk is mounted to the new application created by the restore task. If the original volume is statically provisioned without using a StorageClass or the StorageClass used by the original volume does not exist in the restore cluster, the new PVC and pod created by the restore task remain in the Pending state until you manually create a PV or StorageClass.

  • Volumes that use the Retain reclaim policy:

    The PV is restored before the PVC when the restore task restores the application based on its YAML file. If the original volume supports multi-attach, such as an OSS or NAS volume, the new volume can use the original file system or OSS bucket. If the original volume is a disk volume, the disk associated with the volume may be forcefully detached.

To view the reclaim policy of a volume, run the following command:

kubectl get pv -o=custom-columns=CLAIM:.spec.claimRef.name,NAMESPACE:.spec.claimRef.namespace,NAME:.metadata.name,RECLAIMPOLICY:.spec.persistentVolumeReclaimPolicy

Expected output:

CLAIM               NAMESPACE           NAME                                       RECLAIMPOLICY
www-web-0           default             d-2ze53mvwvrt4o3xxxxxx                     Delete
essd-pvc-0          default             d-2ze5o2kq5yg4kdxxxxxx                     Delete
www-web-1           default             d-2ze7plpd4247c5xxxxxx                     Delete
pvc-oss             default             oss-e5923d5a-10c1-xxxx-xxxx-7fdf82xxxxxx   Retain

What do I do if I want to back up volumes?

  • When you create a backup task in the console, select Volume Backup.

  • When you use kubectl to create a backup task, set the spec.pvBackup.defaultPvBackup parameter to true.
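
The following YAML is a minimal sketch of a backup task that backs up volumes. Only spec.pvBackup.defaultPvBackup is documented in this topic; the apiVersion and the other field names are assumptions that you should verify against your cluster, for example by running kubectl api-resources | grep applicationbackup and kubectl explain applicationbackup.spec. For the full schema, see Use kubectl to back up and restore applications.

# Minimal sketch; verify the group/version and field names in your cluster first.
apiVersion: csdr.alibabacloud.com/v1beta1   # assumption, not confirmed by this topic
kind: ApplicationBackup
metadata:
  name: backup-with-volumes
  namespace: csdr
spec:
  includedNamespaces:              # assumed field: namespaces to back up
    - default
  pvBackup:
    defaultPvBackup: true          # documented field: back up the volumes used by the application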

What are the use scenarios of application backup and data protection?

Application backup

  • You want to back up the workloads in your cluster, including applications, Services, and configuration files.

  • Optional: When you back up an application, you also want to back up the volumes mounted to the application.

    Note

    The application backup feature does not back up volumes that are not mounted to pods.

    If you want to back up applications and all volumes, you can create data protection backup tasks.

  • You want to migrate applications between clusters and quickly restore applications for disaster recovery.

Data protection (new)

  • You want to back up only volumes, that is, PVCs and PVs.

  • You want the restored PVCs to be independent of the backup data. When you use the backup center to restore a deleted PVC, a new disk is created and the data on the disk is the same as the data in the backup file. The mount parameters of the new PVC remain unchanged, so the new PVC can be directly mounted to applications.

  • You want to implement data replication and disaster recovery.

Can I enable data encryption for an OSS bucket that is associated with the backup center? When I enable server-side encryption based on KMS, how do I grant permissions to the backup center?

OSS buckets support server-side encryption and client-side encryption. The backup center feature supports only server-side encryption. You can enable server-side encryption for the OSS bucket associated with the backup center and configure the encryption method in the OSS console. For more information, see Server-side encryption.

  • If you use Bring Your Own Key (BYOK) material to generate a customer master key (CMK) and want to use the CMK to encrypt and decrypt data, you must grant the backup center the permissions to access KMS and specify the CMK ID in the policy. Perform the following steps:

    • Create a custom policy based on the following code block. For more information, see Create custom policies.

      {
        "Version": "1",
        "Statement": [
          {
            "Effect": "Allow",
            "Action": [
              "kms:List*",
              "kms:DescribeKey",
              "kms:GenerateDataKey",
              "kms:Decrypt"
            ],
            "Resource": [
              "acs:kms:*:141661496593****:*"
            ]
          }
        ]
      }

      The preceding policy includes the permissions to access all KMS keys that belong to the current Alibaba Cloud account. If you want to configure fine-grained permission control on specific resources, see Authorization information.

    • If you use an ACK dedicated cluster or registered cluster, you need to grant permissions to the RAM user used to install the backup center component. For more information, see Grant permissions to a RAM user. If you use a cluster of other types, you need to grant permissions to the AliyunCSManagedBackupRestoreRole role. For more information, see Grant permissions to a RAM role. A command-line sketch is provided after this list.

  • If you use the default KMS-managed CMK or an OSS-managed key to encrypt or decrypt data, no authorization is required.
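
For the RAM role case, the following Alibaba Cloud CLI command is one way to attach the custom policy; the policy name csdr-kms-access is a placeholder and must match the name of the custom policy that you created. You can also complete this step in the RAM console.

# Attach the custom KMS policy to the backup center role. The policy name is hypothetical.
aliyun ram AttachPolicyToRole \
  --PolicyType Custom \
  --PolicyName csdr-kms-access \
  --RoleName AliyunCSManagedBackupRestoreRole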

How do I change the image of an application when I restore the application from a backup file?

Assume that the application in a backup file uses the docker.io/library/app1:v1 image.

  • Change the image repository address

    In hybrid cloud scenarios, you may need to deploy an application across the clouds of multiple cloud service providers or you may need to migrate an application from the data center to the cloud. In this case, you must upload the image used by the application to an image repository on Container Registry.

    In this case, you must use the imageRegistryMapping field to specify the image repository address. In the following example, the image address is set to registry.cn-beijing.aliyuncs.com/my-registry/app1:v1.

    docker.io/library/: registry.cn-beijing.aliyuncs.com/my-registry/
  • Change the image repository and version

    Changing the image repository and version is an advanced feature. Before you create a restore task, you must specify the change details in a ConfigMap.

    For example, if you want to change the image to app2:v2, create a ConfigMap based on the following code block:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: <ConfigMap name>
      namespace: csdr
      labels:
        velero.io/plugin-config: ""
        velero.io/change-image-name: RestoreItemAction
    data:
      "case1":"app1:v1,app2:v2"
      # If you want to change only the image repository, use the following setting.
      # "case1": "app1,app2"
      # If you want to change only the image version, use the following setting.
      # "case1": "v1:v2"
      # If you want to change only an image in an image repository, use the following setting.
      # "case1": "docker.io/library/app1:v1,registry.cn-beijing.aliyuncs.com/my-registry/app2:v2"

    If you want to change multiple image repositories and versions, add case2, case3, ..., and caseN to the data section.

    After you create the ConfigMap, leave the imageRegistryMapping field empty when you create a restore task.

    Note

    The changes take effect on all restore tasks in the cluster. We recommend that you configure fine-grained modifications based on the preceding description. For example, configure image changes within a single repository. If the ConfigMap is no longer required, delete it.