High availability (HA) is a system design that ensures the reliability and continuity of services. Container Service for Kubernetes (ACK) provides a variety of cluster HA mechanisms based on the Kubernetes architecture to ensure that control planes, nodes, node pools, workloads, and load balancers are highly available. This allows you to build stable, secure, and reliable cluster and application architectures.
About this topic
This topic is intended for developers and administrators of ACK clusters. The suggestions in this topic can help them plan and create HA clusters. The actual configurations may vary based on the environment of your cluster and your businesses. You can refer to the suggested HA configurations for control planes and data planes in this topic.
Configuration | Maintained by | Applicable cluster type
Control plane architecture HA | Maintained by ACK. | Applicable only to the following ACK clusters: ACK managed clusters (Basic and Pro editions), ACK Serverless clusters (Basic and Pro editions), ACK Edge clusters, and ACK Lingjun clusters. You need to maintain the control planes of other clusters, such as ACK dedicated clusters and registered clusters. Therefore, control plane architecture HA is not applicable to these clusters. However, you can find suggestions for these clusters in this topic.
Node pool and virtual node HA, workload HA, load balancer HA, and suggested component configurations | Maintained by you. | Applicable to all types of clusters.
Cluster architecture
An ACK cluster consists of control planes and regular nodes or virtual nodes.
Control planes are in charge of cluster management and coordination, such as workload scheduling and cluster status maintenance. For example, ACK managed clusters adopt the Kubernetes on Kubernetes architecture to host control plane components such as the API server, etcd, and the Kubernetes scheduler.
ACK clusters support Elastic Compute Service (ECS) nodes (regular nodes) and virtual nodes. These nodes are in charge of running workloads and providing resources required for running pods.
You can deploy an ACK cluster across multiple zones to ensure the HA of the cluster. The following figure shows the architecture of an ACK managed cluster.
Control plane architecture HA
The control planes and control plane components (such as the API server, etcd, and Kubernetes scheduler) of ACK managed clusters (Basic and Pro editions), ACK Serverless clusters (Basic and Pro editions), ACK Edge clusters, and ACK Lingjun clusters are managed by ACK.
Multi-zone HA: Each managed component runs in multiple replicated pods that are evenly spread across multiple zones to ensure that the cluster can still function as expected when a zone or node is down.
Single-zone HA: Each managed component runs in multiple replicated pods that are spread across multiple nodes to ensure that the cluster can still function as expected when a node is down.
etcd requires at least three replicated pods and the API server requires at least two replicated pods. Elastic network interfaces (ENIs) are mounted to the replicated pods of the API server so that the pods can communicate with the virtual private cloud (VPC) of the cluster. The kubelet and kube-proxy on nodes connect to the API server through the Classic Load Balancer (CLB) instance of the API server or through ENIs.
The key managed components of ACK scale based on actual CPU and memory usage to dynamically meet the resource demands of the API server and guarantee the SLA.
In addition to the default multi-zone HA architecture used by control planes, you can also configure HA for the data plane by referring to the Node pool and virtual node HA configuration, Workload HA configuration, Load balancer HA configuration, and Suggested component configuration sections.
Node pool and virtual node HA configuration
ACK clusters support ECS nodes (regular nodes) and virtual nodes. You can add nodes to different node pools and then upgrade, scale, or maintain nodes by node pool. If your businesses do not fluctuate or the fluctuations are predictable, you can use ECS nodes. If your businesses fluctuate frequently and the fluctuations are hard to predict, we recommend that you use virtual nodes to handle traffic spikes and reduce computing costs. For more information, see Node pool overview, Overview of managed node pools, and Virtual node.
Node pool HA configuration
You can use node auto scaling, deployment sets, and multi-zone deployment together with the topology spread constraints of the Kubernetes scheduler to ensure resource supply across failure domains and isolate single points of failure. This ensures service continuity when a failure domain is down, reduces the risks of single points of failure, and improves the overall reliability and availability of the system.
Configure node auto scaling
Each node pool corresponds to a scaling group. Nodes in node pools can be manually or automatically scaled based on workload scheduling or cluster resources to reduce resource costs and flexibly allocate elastic computing resources. For more information about the auto scaling solutions provided by ACK, see Auto scaling overview and Enable node auto scaling.
Enable deployment sets
Deployment sets are policies used to control the distribution of ECS instances. A deployment set spreads ECS instances across different physical servers so that the failure of a single physical server does not take down all of your ECS instances. You can specify a deployment set for a node pool to ensure that ECS instances added to the node pool are spread across different physical servers. Then, you can add affinity rules to make your application pods aware of the underlying node topology and spread the pods across different nodes. This ensures that your application is highly available and can recover from major disruptions. For more information about how to enable deployment sets, see Best practices for associating deployment sets with node pools.
Configure multi-zone deployment
ACK supports multi-zone node pools. When you create or edit a node pool, we recommend that you select vSwitches that reside in different zones and set Scaling Policy to Distribution Balancing. This way, ECS instances can be spread across the zones (vSwitches in the VPC) of the scaling group. If nodes cannot be evenly distributed across zones because inventory is insufficient, you can rebalance the distribution later. For more information about how to configure an auto scaling policy, see Step 2: Configure a node pool that has auto scaling enabled.
Enable topology spread constraints
You can use node auto scaling, deployment sets, and multi-zone deployment together with the topology spread constraints of the Kubernetes scheduler to isolate failure domains at different levels. Topology-related labels, such as kubernetes.io/hostname, topology.kubernetes.io/zone, and topology.kubernetes.io/region, are automatically added to nodes in ACK node pools. You can use topology spread constraints to control the distribution of pods among failure domains to enhance the fault tolerance capability of your infrastructure.
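For example, you can view the topology labels on your nodes by using kubectl. The -L flag displays the values of the specified label keys as columns:
kubectl get nodes -L topology.kubernetes.io/region,topology.kubernetes.io/zone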
For more information about how to use topology-aware scheduling in ACK clusters, see Topology-aware scheduling. For example, you can use this feature to retry pod scheduling across multiple topology domains or schedule pods to low-latency ECS instances in a deployment set.
Virtual node HA configuration
You can use virtual nodes to quickly schedule pods to elastic container instances. Elastic Container Instance eliminates the need to purchase or manage ECS instances and allows you to focus on application development instead of infrastructure maintenance. You can create elastic container instances to meet your business requirements. You are charged for resource usage on a per-second basis.
When you horizontally scale businesses to handle traffic spikes or launch large numbers of instances to process jobs, you may encounter issues such as out-of-stock instance types or exhausted vSwitch IP addresses, which cause elastic container instance creation to fail. The multi-zone deployment feature of ACK Serverless clusters can effectively improve the success rate of elastic container instance creation.
You can configure eci-profiles on virtual nodes to specify vSwitches in different zones to implement multi-zone deployment.
Elastic Container Instance distributes pod creation requests across all of the specified vSwitches to balance loads.
If resources required for creating pods are out of stock in a vSwitch, the system automatically switches to another vSwitch.
Run the following command and specify one or more vSwitch IDs in the vSwitchIds field of the kube-system/eci-profile ConfigMap. Separate multiple vSwitch IDs with commas (,). The modification takes effect immediately. For more information, see Create ECIs across zones.
kubectl -n kube-system edit cm eci-profile
apiVersion: v1
data:
  kube-proxy: "true"
  privatezone: "true"
  quota-cpu: "192000"
  quota-memory: 640Ti
  quota-pods: "4000"
  regionId: cn-hangzhou
  resourcegroup: ""
  securitygroupId: sg-xxx
  vpcId: vpc-xxx
  vSwitchIds: vsw-xxx,vsw-yyy,vsw-zzz   # comma-separated vSwitch IDs in different zones
kind: ConfigMap
metadata:
  name: eci-profile
  namespace: kube-system
Workload HA configuration
The HA of workloads ensures that application pods can still run as expected or quickly recover when exceptions occur. You can use topology spread constraints, pod anti-affinity, pod disruption budgets, pod health checks, and pod auto recovery to ensure the HA of application pods.
Configure topology spread constraints
Topology spread constraints are used to control the distribution of pods in Kubernetes clusters. You can use topology spread constraints to evenly spread pods across nodes and zones to improve the availability and stability of applications. This feature applies to Deployments, StatefulSets, DaemonSets, Jobs, and CronJobs.
You can configure the maxSkew and topologyKey fields to control the distribution of pods and deploy pods based on the expected topology. For example, you can evenly spread the pods of the specified workload across zones to improve the availability and reliability of the application. Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-run-per-zone
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app-run-per-zone
  template:
    metadata:
      labels:
        app: app-run-per-zone
    spec:
      containers:
      - name: app-container
        image: app-image
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "topology.kubernetes.io/zone"
        whenUnsatisfiable: DoNotSchedule
        labelSelector:   # required so that the constraint counts this application's pods
          matchLabels:
            app: app-run-per-zone
Configure pod anti-affinity
Pod anti-affinity is a scheduling policy used in Kubernetes clusters. This policy prevents the pods of an application from being scheduled to the same node, which improves the availability of the application and enhances fault isolation. Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-run-per-node
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app-run-per-node
  template:
    metadata:
      labels:
        app: app-run-per-node
    spec:
      containers:
      - name: app-container
        image: app-image
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - app-run-per-node
            topologyKey: "kubernetes.io/hostname"
You can also configure topology spread constraints to run at most one pod of an application on each node. When topologyKey: "kubernetes.io/hostname" is specified, each node is treated as a topology domain. In the following example, maxSkew is set to 1, topologyKey is set to "kubernetes.io/hostname", and whenUnsatisfiable is set to DoNotSchedule. The number of matching pods on any two nodes can then differ by at most one, which forces the scheduler to spread pods across nodes and ensures HA.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app-container
        image: my-app-image
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "kubernetes.io/hostname"
        whenUnsatisfiable: DoNotSchedule
        labelSelector:   # required so that the constraint counts this application's pods
          matchLabels:
            app: my-app
Configure pod disruption budgets
You can configure pod disruption budgets to further improve the availability of applications. A pod disruption budget defines the minimum number of replicated pods of an application that must remain available. When nodes are drained for maintenance or become faulty, the cluster ensures that at least the specified number of pods keep running. Pod disruption budgets also prevent an excessive number of replicated pods from being terminated at the same time. They are suitable for scenarios where multiple replicated pods are deployed to process business traffic. Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-with-pdb
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app-with-pdb
  template:
    metadata:
      labels:
        app: app-with-pdb
    spec:
      containers:
      - name: app-container
        image: app-container-image
        ports:
        - containerPort: 80
---
# policy/v1 replaces the deprecated policy/v1beta1 API, which was removed in Kubernetes 1.25.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-for-app
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: app-with-pdb
Configure pod health check and pod auto recovery
You can use liveness probes, readiness probes, and startup probes to monitor and manage the status and availability of containers. To use them, add the probes and a restart policy to the pod configuration. Example:
apiVersion: v1
kind: Pod
metadata:
  name: app-with-probe
spec:
  containers:
  - name: app-container
    image: app-image
    livenessProbe:
      httpGet:
        path: /health
        port: 80
      initialDelaySeconds: 10
      periodSeconds: 5
    readinessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
    startupProbe:
      exec:
        command:
        - cat
        - /tmp/ready
      initialDelaySeconds: 20
      periodSeconds: 15
  restartPolicy: Always
Load balancer HA configuration
The HA of load balancers is essential to optimizing the stability, response time, and fault isolation capability of services. You can specify primary and secondary zones and enable topology aware hints to ensure the HA of load balancers.
Specify primary and secondary zones for a CLB instance
CLB instances use multi-zone deployment in most regions to implement cross-data center disaster recovery within a region. You can add Service annotations to specify primary and secondary zones for a CLB instance and ensure that the primary and secondary zones are the same as the zones of ECS nodes in a node pool. This reduces cross-zone data transfer and accelerates network communication. For more information about regions and zones that support CLB, see Regions that support CLB. For more information about how to specify primary and secondary zones for a CLB instance, see Specify the primary zone and secondary zone when you create a CLB instance.
Example:
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-master-zoneid: "cn-hangzhou-a"
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-slave-zoneid: "cn-hangzhou-b"
  name: nginx
  namespace: default
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: nginx
  type: LoadBalancer
Enable topology aware hints
To reduce cross-zone data transfer and accelerate network communication, topology aware routing (also known as topology aware hints) was introduced in Kubernetes 1.23 to support topology-aware nearby routing.
You can use this feature in Services. After this feature is enabled, when a zone has sufficient endpoints, the EndpointSlice controller routes traffic to the endpoints closest to the source of the traffic based on the topology hints on the EndpointSlice. In cross-zone communication scenarios, this feature preferentially keeps traffic within the same zone to reduce cross-zone data transfer costs and improve data transfer efficiency. For more information, see Topology Aware Routing.
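The following is a minimal sketch of enabling the feature on a Service through an annotation. In Kubernetes 1.23 to 1.26, the annotation is named service.kubernetes.io/topology-aware-hints; in 1.27 and later, it is renamed service.kubernetes.io/topology-mode. The Service name and selector below are illustrative:
apiVersion: v1
kind: Service
metadata:
  name: nginx
  annotations:
    # Ask the EndpointSlice controller to populate topology hints.
    service.kubernetes.io/topology-aware-hints: Auto
spec:
  selector:
    run: nginx
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80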
Suggested component configuration
ACK provides a variety of components. You can deploy components in newly created or existing clusters to extend the capabilities of the clusters. For more information about ACK components and the release notes, see Release notes for components. For more information about how to update, configure, and uninstall components, see Manage components.
Properly deploy the NGINX Ingress controller
When you deploy the NGINX Ingress controller, make sure that the controller pods are distributed across different nodes. This helps prevent resource contention among controller pods and avoid single points of failure. You can schedule the controller pods to exclusive nodes to ensure the performance and stability of the NGINX Ingress controller. For more information, see Use exclusive nodes to ensure the performance and stability of the NGINX Ingress controller.
We recommend that you do not set resource limits for the NGINX Ingress controller pods. This helps prevent service interruptions caused by out of memory (OOM) errors. If resource limits are required, we recommend that you set the CPU limit to 1,000 millicores or greater and the memory limit to 2 GiB or greater. The CPU limit in the YAML file must be specified in the 1000m format. For more information about how to configure the NGINX Ingress controller, see Best practices for the NGINX Ingress controller.
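If you must set limits, a resources section such as the following sketch satisfies the recommended minimums. The container name is illustrative and may differ in your controller Deployment:
containers:
- name: nginx-ingress-controller
  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
    limits:
      cpu: 1000m   # use the millicore format (1000m), not 1
      memory: 2Gi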
We recommend that you specify multiple zones when you create the Application Load Balancer (ALB) Ingress controller or Microservices Engine (MSE) Ingress controller. For more information, see Create a cloud-native gateway and Access Services by using an ALB Ingress. For more information about the differences between Ingress controllers, see Comparison among NGINX Ingresses, ALB Ingresses, and MSE Ingresses.
Properly deploy CoreDNS
We recommend that you spread CoreDNS pods across different zones and nodes to avoid single points of failure. By default, weak node anti-affinity is configured for CoreDNS, so some or all CoreDNS pods may be scheduled to the same node when node resources are insufficient. If this issue occurs, delete the pods so that they are rescheduled.
Do not deploy CoreDNS pods on cluster nodes whose CPU and memory resources are fully utilized. Otherwise, the DNS queries per second (QPS) and response time are adversely affected. For more information about how to configure CoreDNS, see Best practices for DNS services.
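As a sketch, you can also patch explicit spreading rules into the CoreDNS Deployment instead of relying on the default weak anti-affinity. The example below assumes the default k8s-app: kube-dns pod label and uses ScheduleAnyway so that scheduling is never blocked:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway   # prefer, but do not require, zone balance
  labelSelector:
    matchLabels:
      k8s-app: kube-dns
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway   # prefer one replica per node
  labelSelector:
    matchLabels:
      k8s-app: kube-dns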
References
We recommend that you create worker nodes on ECS instances with high specifications. For more information, see Suggestions on choosing ECS specifications for ACK clusters.
If you use a large ACK Pro cluster, for example, the cluster contains more than 500 nodes or 10,000 pods, view the suggestions in the Suggestions on using large ACK Pro clusters topic.
For more information about best practices for nodes and node pools, see Best practices for nodes and node pools.
For more information about the best practices for auto scaling, see Best practices for auto scaling.