DNS is a critical service in Kubernetes clusters. Under certain conditions, such as improper client configuration or large cluster scale, DNS may experience resolution timeouts and failures. This guide provides best practices to help you avoid these issues.
This topic does not apply to Container Service for Kubernetes (ACK) clusters that use the managed edition of CoreDNS or have Auto Mode enabled. These clusters automatically scale based on load without manual adjustment.
DNS best practices cover both client-side and server-side optimizations:
Client-side: You can reduce resolution latency by optimizing DNS requests, and minimize resolution failures using appropriate container images, node operating systems, and NodeLocal DNSCache.
Server-side: You can identify DNS exceptions and quickly locate their root causes by monitoring CoreDNS's running status, and improve CoreDNS high availability and queries per second (QPS) throughput by adjusting its deployment configuration.
For more details about CoreDNS, see the official CoreDNS documentation.
Optimize DNS resolution requests
DNS resolution is one of the most frequent network operations in a Kubernetes cluster. Many resolution requests can be optimized or avoided to reduce latency and load on the DNS infrastructure:
(Recommended) Use a connection pool: When a containerized application frequently requests the same service, use a connection pool to cache active connections to upstream services in memory. This eliminates the overhead of DNS resolution and TCP handshakes for each request.
Use asynchronous resolution or long polling to retrieve the IP addresses associated with domain names.
Use DNS caching:
(Recommended) If your application cannot be refactored to use connection pools, consider caching DNS resolution results on the application side. For instructions, see Improve stability with NodeLocal DNSCache.
If you cannot use NodeLocal DNSCache, use the built-in Name Service Cache Daemon (NSCD) within your container.
Optimize the resolv.conf file: The ndots and search parameters determine resolution efficiency. For details, see Configure DNS policies and resolve domain names; an ndots example appears after this list.
Optimize domain configurations: Configure domain names following these principles to minimize resolution attempts and reduce latency:
When a pod accesses a Service in the same namespace, use <service-name>.
When a pod accesses a Service in a different namespace, use <service-name>.<namespace-name>.
When a pod accesses a domain outside the cluster, use a Fully Qualified Domain Name (FQDN) by appending a trailing period (.). This forces the resolver to treat the name as absolute, skipping invalid searches through internal cluster domains. For example, use www.aliyun.com. instead of www.aliyun.com.
In clusters running Kubernetes 1.33 or later, set the search domain to a single period (.) (see Issue 125883). This effectively turns all DNS requests into FQDN requests, preventing unnecessary search domain iterations.

Example dnsConfig snippet:

dnsPolicy: None
dnsConfig:
  nameservers: ["192.168.0.10"] ## Replace with the actual ClusterIP of your CoreDNS service.
  searches:
    - .
    - default.svc.cluster.local ## Replace 'default' with your actual namespace.
    - svc.cluster.local
    - cluster.local

Resulting /etc/resolv.conf:

search . default.svc.cluster.local svc.cluster.local cluster.local
nameserver 192.168.0.10

With . as the first search domain, the system immediately recognizes requests as FQDN, prioritizing self-resolution and eliminating invalid recursive searches.
Important: You must set dnsPolicy to None for the preceding configuration to take effect.
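Related to the resolv.conf item above: if you cannot switch to FQDNs, lowering ndots reduces search-list expansion. A minimal sketch using a pod's dnsConfig; the value 2 is an illustrative assumption, not a universal recommendation:

dnsConfig:
  options:
    - name: ndots
      value: "2"  ## Illustrative: names containing 2 or more dots are tried as absolute first.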
Notes about DNS configurations in containers
Different DNS resolvers may exhibit subtle behavioral variations due to their underlying implementations. You might encounter cases where dig <domain> succeeds while ping <domain> fails.
Avoid Alpine images: We strongly recommend using base images such as Debian or CentOS instead of Alpine Linux. The musl libc library used in Alpine has several implementation differences compared to the standard glibc, leading to issues including but not limited to:
TCP fallback: Alpine v3.18 and earlier do not support truncated (TC) bit fallback to TCP.
Search domains: Alpine v3.3 and earlier do not support the search parameter, which breaks service discovery within the cluster.
Optimization conflicts: Alpine concurrently requests all DNS servers listed in /etc/resolv.conf, which may bypass and invalidate NodeLocal DNSCache optimizations.
Conntrack race conditions: Concurrent A and AAAA record requests over the same socket can trigger source port conflicts in outdated Linux kernels, resulting in packet loss and resolution timeouts.
For more issues, see the musl libc documentation.
If you use a Go application, be aware of the differences between the CGO and pure Go DNS resolvers.
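If you want to compare the two Go resolvers quickly, the standard GODEBUG environment variable selects one at runtime; ./your-app below is a placeholder binary:

# Force the pure Go resolver (parses /etc/resolv.conf itself, no libc involved).
GODEBUG=netdns=go ./your-app

# Force the cgo resolver (delegates lookups to the C library, e.g., glibc or musl).
GODEBUG=netdns=cgo ./your-app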
Mitigate intermittent DNS timeouts in IPVS mode
When an ACK cluster uses IPVS as the kube-proxy load-balancing mode, you may experience intermittent DNS resolution timeouts during CoreDNS pod scaling or restarts. This is caused by a known Linux kernel defect. See IPVS commit.
To mitigate the impact of this IPVS flaw, use one of the following methods: install NodeLocal DNSCache (described in the next section) to serve DNS traffic locally, or tune the kube-proxy IPVS UDP session timeout so that stale entries for terminated CoreDNS pods expire sooner.
Improve stability with NodeLocal DNSCache
CoreDNS may occasionally encounter the following issues:
Rare packet loss from concurrent A and AAAA queries, leading to DNS failures.
A full conntrack table on the node, causing packet loss and DNS failures.
To improve DNS stability and performance, install NodeLocal DNSCache. This add-on runs a DNS cache on each cluster node to handle DNS traffic locally.
After installing NodeLocal DNSCache, you must enable injection for your workloads. Run the following command to label a namespace. Any new pods created in this namespace will automatically use the DNS cache configuration.
kubectl label namespace default node-local-dns-injection=enabled

For detailed instructions and other injection methods, see Use NodeLocal DNSCache.
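To confirm that injection took effect, inspect the resolv.conf of a pod created after labeling. This sketch assumes the add-on's common local listen address 169.254.20.10; verify the actual address in your cluster:

kubectl run dns-test -n default --image=nginx --restart=Never
kubectl exec -n default dns-test -- cat /etc/resolv.conf
# Expected: nameserver points at the node-local cache (commonly 169.254.20.10).
kubectl delete pod dns-test -n default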
Maintain CoreDNS versions
CoreDNS maintains strong backward compatibility with Kubernetes. However, it is critical to keep CoreDNS updated to a stable version. Manage, upgrade, and configure CoreDNS via the Add-ons page.
If the ACK console shows an available upgrade for CoreDNS, schedule the upgrade as soon as possible during off-peak hours.
For upgrade instructions, see Automatic upgrade for unmanaged CoreDNS.
For CoreDNS release notes, see CoreDNS.
Risks in CoreDNS versions lower than 1.7.0
Logging crashes: If connectivity to the API server is interrupted (for example, due to API server restarts, migrations, or network jitter), CoreDNS restarts because it fails to write error logs. See Set klog's logtostderr flag.
OOM issues: CoreDNS consumes extra memory at startup. The default memory limit may trigger out-of-memory (OOM) issues in large-scale clusters. In severe cases, this may lead to restart loops. See CoreDNS uses a lot of memory during initialization phase.
Sync failures: CoreDNS has several issues that may affect resolution of headless service domain names and external domain names. For details, see plugin/kubernetes: handle tombstones in default processor and Data is not synced when CoreDNS reconnects to kubernetes api server after protracted disconnection.
If a cluster node becomes abnormal, the default toleration policy in some earlier CoreDNS versions may cause CoreDNS pods to be scheduled onto the abnormal node. These pods cannot be automatically evicted, leading to abnormal domain name resolution.
Recommended minimum CoreDNS versions
Cluster version | Recommended minimum CoreDNS version |
Below 1.14.8 | 1.6.2 (End of life) |
1.14.8 to 1.20.4 | 1.7.0.0-f59c03d-aliyun |
1.20.4 to 1.21.0 | 1.8.4.1-3a376cc-aliyun |
1.21.0 and above | 1.11.3.2-f57ea7ed6-aliyun |
Monitor CoreDNS operational status
Metrics and dashboards
CoreDNS exposes health and performance metrics through a standard Prometheus interface to detect exceptions on CoreDNS and upstream DNS servers.
ACK managed clusters: Managed Service for Prometheus provides built-in metrics monitoring dashboards and alerting rules for CoreDNS. You can enable Prometheus and dashboard features in the ACK console. See Monitor the CoreDNS component.
Self-managed Prometheus: Scrape CoreDNS metrics and configure alerts for critical indicators. See the official CoreDNS Prometheus documentation.
Log analysis
In the event of a DNS exception, CoreDNS logs are essential for root cause diagnosis. We recommend enabling DNS resolution logging and Simple Log Service (SLS) collection. For details, see Analyze and monitor CoreDNS logs.
Kubernetes event delivery
In CoreDNS v1.9.3.6-32932850-aliyun and later, use the k8s_event plugin to send critical logs to Event Center as Kubernetes events. See k8s_event.
Newly deployed CoreDNS enables this feature by default. If upgrading from an earlier version to v1.9.3.6-32932850-aliyun or later, manually modify the configuration file to enable it.
Edit the CoreDNS ConfigMap.
kubectl -n kube-system edit configmap/coredns

Add the kubeapi and k8s_event plugins to the configuration.

apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 15s
        }
        # --- Addition Start (ignore other differences) ---
        kubeapi
        k8s_event {
            level info error warning  # Deliver key logs with the info, error, and warning statuses.
        }
        # --- Addition End ---
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods verified
            fallthrough in-addr.arpa ip6.arpa
        }
        # (Remaining configuration omitted)
    }

Verify the update by checking the CoreDNS pod logs. If the logs contain the word reload, the modification is successful.
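One way to perform that check, assuming CoreDNS pods carry the usual k8s-app=kube-dns label:

kubectl -n kube-system logs -l k8s-app=kube-dns | grep reload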
Ensure CoreDNS high availability
CoreDNS is the authoritative DNS provider for your cluster. Its stability is critical; a failure can result in service resolution errors and widespread application outages. This section explains how to monitor CoreDNS and implement high availability (HA) strategies.
Assess CoreDNS pressure
Use open-source tools such as DNSPerf to benchmark your DNS performance; a sample command follows this list. If you cannot perform a customized assessment, follow these baseline recommendations:
Minimum replicas: Always maintain at least 2 pods for redundancy.
Resource limits: Set resource limits to at least 1 CPU core and 1 GiB of memory per pod.
Performance scaling: CoreDNS performance scales linearly with CPU. With NodeLocal DNSCache enabled, a single CPU core can typically handle over 10,000 QPS.
Replica ratio: If you cannot monitor peak CPU usage, use a conservative 1:8 Pod-to-Node ratio (add one CoreDNS pod for every 8 cluster nodes).
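For the DNSPerf benchmark mentioned above, a minimal invocation might look like the following; the server IP, query file, and durations are placeholders to adapt to your cluster:

# queries.txt holds one query per line, for example:
#   kubernetes.default.svc.cluster.local A
dnsperf -s 192.168.0.10 -d queries.txt -l 60 -c 10
# -s: DNS server under test (e.g., the kube-dns ClusterIP)
# -d: query data file  -l: duration in seconds  -c: concurrent clients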
Scale CoreDNS pods
The number of CoreDNS pods determines available computing resources. Adjust the pod count based on the assessment results.
Because UDP has no retransmission mechanism, if cluster nodes are exposed to the IPVS UDP packet-loss bug, scaling in or restarting CoreDNS pods may cause cluster-wide DNS timeouts lasting up to five minutes. For mitigation strategies, see Troubleshoot DNS resolution errors.
Automatically scale based on the recommended policy
Deploy dns-autoscaler to automatically adjust CoreDNS replicas based on the recommended policy (one pod for every eight cluster nodes).
Formula: replicas = max(ceil(cores × 1/coresPerReplica), ceil(nodes × 1/nodesPerReplica))
This ensures that the 1:8 ratio is maintained as the cluster grows. A sample policy follows.
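For reference, a linear policy matching the 1:8 ratio as it would typically appear in the autoscaler's ConfigMap; the ConfigMap name and minimum replica count are assumptions to verify against your installation:

apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-autoscaler        ## Assumed name; check your installed add-on.
  namespace: kube-system
data:
  linear: |-
    {
      "coresPerReplica": 256,
      "nodesPerReplica": 8,
      "min": 2,
      "preventSinglePointFailure": true
    }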
Manually adjust
Run the following command to manually adjust the number of CoreDNS pods:
kubectl scale --replicas={target} deployment/coredns -n kube-system # Replace {target} with the desired pod count

Do not use HPA or CronHPA
Although workload auto-scaling mechanisms (such as HPA and CronHPA) can automatically adjust pod counts, they perform frequent scaling operations. Due to resolution exceptions caused by pod scale-in (the IPVS UDP issues mentioned above), do not use workload auto-scaling to control CoreDNS pod count.
Optimize CoreDNS pod specifications
Another way to adjust CoreDNS resources is by modifying pod specifications. In an ACK managed Pro cluster, the default memory limit for CoreDNS pods is 2 GiB, with no CPU limit. To ensure consistent performance, set the CPU limit to 4096m (minimum 1024m). You can adjust these resource requests and limits in the console.
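If you run a self-managed CoreDNS deployment (for the managed add-on, prefer the console so your changes are not reverted), an equivalent CLI sketch could be:

kubectl -n kube-system set resources deployment coredns \
  --requests=cpu=1024m,memory=2Gi \
  --limits=cpu=4096m,memory=2Gi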
Schedule CoreDNS pods
Proper scheduling and configuration are critical for the stability of CoreDNS. An improper setup can lead to DNS resolution failure, potentially causing cluster-wide service outages. Before performing this operation, ensure you're familiar with how scheduling works.
Recommended configurations
Multi-zone deployment: Always deploy CoreDNS replicas across different availability zones and nodes to prevent single points of failure.
Anti-affinity: CoreDNS versions earlier than 1.8.4.3 use a "preferred" (soft) node anti-affinity by default. If node resources are insufficient, multiple pods may be scheduled on the same node. If this occurs, upgrade the add-on or manually delete pods to trigger a re-schedule.
Lifecycle management: CoreDNS versions below 1.8 are end-of-life (EOL). Upgrade to the latest version as soon as possible.
Avoid resource exhaustion: Ensure that the nodes running CoreDNS are not saturated (high CPU/memory usage), because such saturation directly impacts DNS query latency and QPS.
Dedicated nodes: For large-scale clusters, consider using custom scheduling parameters (tolerations/nodeAffinity) to host CoreDNS on dedicated nodes for maximum stability; a sketch follows.
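A minimal sketch of such parameters in the CoreDNS Deployment spec, assuming a hypothetical label and taint node-role/dns=true on the dedicated nodes:

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-role/dns        ## Hypothetical label on dedicated DNS nodes.
                    operator: In
                    values: ["true"]
      tolerations:
        - key: node-role/dns                  ## Hypothetical taint keeping other workloads away.
          operator: Equal
          value: "true"
          effect: NoSchedule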
Optimize CoreDNS configurations
ACK provides a default CoreDNS configuration. However, you should tune these parameters based on your business requirements; CoreDNS configuration is highly flexible. For details, see DNS policies and domain name resolution and the official CoreDNS documentation.
Recommended optimizations for legacy versions
If you are running an older cluster, check for the risks addressed in the following subsections.
Alternatively, check CoreDNS configuration files using the inspection and diagnostics features in the Container Intelligence Service. If the inspection indicates an abnormal CoreDNS ConfigMap, review the items listed above.
CoreDNS may consume extra memory when refreshing configuration. After modifying CoreDNS settings, monitor pod status. If pods experience OOM kills after a config change, increase the memory limit in the CoreDNS deployment (recommended: 2 GiB).
Disable the affinity configuration of the kube-dns service
The affinity configuration may cause significant load imbalances between CoreDNS replicas. Disable it by following these steps:
Console
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, find the cluster you want and click its name. In the left navigation pane, choose .
In the kube-system namespace, click Edit YAML for the kube-dns service.
If the value of the sessionAffinity field is None, you do not need to perform the following steps.
If the value of the sessionAffinity field is ClientIP, perform the following steps.
Delete the sessionAffinity and sessionAffinityConfig fields and all their sub-keys. Then, click OK.

# Delete the following
sessionAffinity: ClientIP
sessionAffinityConfig:
  clientIP:
    timeoutSeconds: 10800

Click Edit YAML to the right of the kube-dns Service again and verify that the sessionAffinity field is None. If so, the kube-dns Service has been modified.
CLI
Run the following command to view the kube-dns Service configuration.

kubectl -n kube-system get svc kube-dns -o yaml

If the value of the sessionAffinity field is None, you do not need to perform the following steps.
If the value of the sessionAffinity field is ClientIP, perform the following steps.
Run the following command to open and edit the kube-dns Service.

kubectl -n kube-system edit service kube-dns

Delete the sessionAffinity-related settings (sessionAffinity, sessionAffinityConfig, and all their sub-keys), then save and exit.

# Delete the following
sessionAffinity: ClientIP
sessionAffinityConfig:
  clientIP:
    timeoutSeconds: 10800

After the modification, run the following command again to verify that the sessionAffinity field value is None. If so, the kube-dns Service has been updated.

kubectl -n kube-system get svc kube-dns -o yaml
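As a non-interactive alternative to editing the YAML, a single patch can apply the same change; this is a sketch of the equivalent operation:

kubectl -n kube-system patch service kube-dns \
  -p '{"spec":{"sessionAffinity":"None","sessionAffinityConfig":null}}'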
Disable the autopath plugin
Some early CoreDNS versions enabled the autopath plugin, which can cause resolution errors in specific edge cases. Verify whether it is enabled and edit the configuration file to disable it. For more information, see Autopath issue #3765.
Disabling autopath can increase the DNS query volume (QPS) from clients and resolution latency by up to 3 times. Monitor your CoreDNS load and the impact on services.
Run the following command to open the CoreDNS configuration file.

kubectl -n kube-system edit configmap coredns

Remove the line autopath @kubernetes and save the change.
Check the running status and logs of the CoreDNS pod. If the logs contain the word reload, the modification is successful.
Configure graceful shutdown
The lameduck mechanism ensures that when a CoreDNS process is about to terminate (during updates or restarts), it handles existing requests before exiting.
When the CoreDNS process is about to terminate, it enters lameduck mode.
In lameduck mode, CoreDNS continues to process already-received requests for a specified duration while signaling the system to stop sending new requests.
Console
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of the one you want to change. In the left-side navigation pane, choose .
In the kube-system namespace, click Edit YAML for the coredns ConfigMap.
In the following CoreDNS ConfigMap, ensure the health plugin is enabled, set the lameduck timeout to 15s, then click OK.
.:53 {
errors
# The health plugin may have different settings in different CoreDNS versions.
# Scenario 1: The health plugin is not enabled by default.
# Scenario 2: The health plugin is enabled by default, but the lameduck time is not set.
# health
# Scenario 3: The health plugin is enabled by default, and the lameduck time is set to 5s.
# health {
# lameduck 5s
# }
# For the preceding three scenarios, modify the configuration as follows and set the lameduck parameter to 15s.
health {
lameduck 15s
}
# Other plugins do not need to be modified and are omitted here.
}

A healthy CoreDNS pod status indicates a successful configuration update. If any anomalies occur, refer to the pod events and logs to identify the root cause.
CLI
Run the following command to open the CoreDNS configuration file.
Refer to the following Corefile, ensure the health plugin is enabled, and set the lameduck parameter to 15s.
After modifying the CoreDNS configuration file, save the change.
If CoreDNS runs normally, the graceful shutdown configuration has been successfully updated. If the CoreDNS pod is abnormal, diagnose the cause by reviewing pod events and logs.
kubectl -n kube-system edit configmap/coredns

.:53 {
errors
# The health plugin may have different settings in different CoreDNS versions.
# Scenario 1: The health plugin is not enabled by default.
# Scenario 2: The health plugin is enabled by default, but the lameduck time is not set.
# health
# Scenario 3: The health plugin is enabled by default, and the lameduck time is set to 5s.
# health {
# lameduck 5s
# }
# For the preceding three scenarios, modify the configuration as follows and set the lameduck parameter to 15s.
health {
lameduck 15s
}
# Other plugins do not need to be modified and are omitted here.
}

Optimize upstream protocol (prefer_udp)
When using NodeLocal DNSCache, the communication chain is: Application -> NodeLocal DNSCache (TCP) -> CoreDNS (TCP). By default, CoreDNS will then attempt to reach the upstream VPC DNS (100.100.2.136/138) via TCP.
Problem: VPC DNS has limited support for TCP.
Solution: Modify the forward plugin to use prefer_udp. This ensures CoreDNS communicates with the upstream VPC DNS via UDP even if the incoming request was TCP. For more information, see Manage ConfigMaps.
# Before modification
forward . /etc/resolv.conf
# After modification
forward . /etc/resolv.conf {
prefer_udp
}

Configure the ready plugin for readiness probe
For CoreDNS v1.5.0 and later, the ready plugin is mandatory for the readiness probe to function.
Run the following command to open the CoreDNS ConfigMap.
kubectl -n kube-system edit configmap/coredns

Check whether the file contains the ready directive. If not, add it, press the Esc key, enter :wq!, then press Enter to save the modified configuration file and exit edit mode.

apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 15s
        }
        ready # If this line does not exist, add it. Keep the indentation consistent with the other plugins.
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods verified
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            max_concurrent 1000
            prefer_udp
        }
        cache 30
        loop
        log
        reload
        loadbalance
    }
reload, the modification is successful.
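To spot-check the probe endpoint itself, query the ready plugin's HTTP port (8181 is its documented default) from inside the cluster; <coredns-pod-ip> is a placeholder:

kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide    # Find a CoreDNS pod IP.
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://<coredns-pod-ip>:8181/ready                   # Returns HTTP 200 when ready.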
Enhance resolution performance with the multisocket plugin
CoreDNS v1.12.1 introduced the multisocket plugin. Enable this plugin to allow CoreDNS to listen on the same port using multiple sockets, enhancing performance in high-CPU scenarios. For details, see the community documentation.
Enable multisocket in the coredns ConfigMap:
.:53 {
...
prometheus :9153
multisocket [NUM_SOCKETS]
forward . /etc/resolv.conf
...
NUM_SOCKETS determines the number of sockets listening on the same port.
Recommended configuration: Align NUM_SOCKETS with estimated CPU utilization, CPU resource limits, and available cluster resources. Examples:
If peak consumption is 4 cores with 8 cores available, set NUM_SOCKETS to 2.
If peak consumption is 8 cores with 64 cores available, set NUM_SOCKETS to 8.
To determine the optimal configuration, test QPS and load with different settings.
Default: If not specified, it defaults to GOMAXPROCS, which equals the CoreDNS pod's CPU limit. If the pod's CPU limit is not set, it equals the number of CPU cores on the node where the pod resides.

