Troubleshoot DNS resolution problems in AKS

This article discusses how to create a troubleshooting workflow to fix Domain Name System (DNS) resolution problems in Microsoft Azure Kubernetes Service (AKS).

Prerequisites

  • The kubectl command-line tool, to run commands against the AKS cluster.
  • A DNS lookup utility, such as dig, to run test queries.
  • Optionally, the traffic analysis tools that this article uses, such as Inspektor Gadget, Dumpy, Wireshark, and mergecap.

Troubleshooting checklist

Troubleshooting DNS problems in AKS is typically a complex process. You can easily get lost in the many different steps without ever seeing a clear path forward. To help make the process simpler and more effective, use the "scientific" method to organize the work:

  • Step 1. Collect the facts.

  • Step 2. Develop a hypothesis.

  • Step 3. Create and implement an action plan.

  • Step 4. Observe the results and draw conclusions.

  • Step 5. Repeat as necessary.

Troubleshooting Step 1: Collect the facts

To better understand the context of the problem, gather facts about the specific DNS problem. By using the following baseline questions as a starting point, you can describe the nature of the problem, recognize the symptoms, and identify the scope of the problem.

Question Possible answers
Where does the DNS resolution fail?
  • Pod
  • Node
  • Both pods and nodes
What kind of DNS error do you get?
  • Time-out
  • No such host
  • Other DNS error
How often do the DNS errors occur?
  • Always
  • Intermittently
  • In a specific pattern
Which records are affected?
  • A specific ___domain
  • Any ___domain
Do any custom DNS configurations exist?
  • Custom DNS server configured on the virtual network
  • Custom DNS on CoreDNS configuration
What kind of performance problems are affecting the nodes?
  • CPU
  • Memory
  • I/O throttling
What kind of performance problems are affecting the CoreDNS pods?
  • CPU
  • Memory
  • I/O throttling
What causes DNS latency?
  • DNS queries that take too long to receive a response (more than five seconds)

To get better answers to these questions, follow this three-part process.

Part 1: Generate tests at different levels that reproduce the problem

The DNS resolution process for pods on AKS includes many layers. Review these layers to isolate the problem. The following layers are typical:

  • CoreDNS pods
  • CoreDNS service
  • Nodes
  • Virtual network DNS

To start the process, run tests from a test pod against each layer.

Test the DNS resolution at CoreDNS pod level
  1. Deploy a test pod to run DNS test queries by running the following command:

    cat <<EOF | kubectl apply --filename -
    apiVersion: v1
    kind: Pod
    metadata:
      name: aks-test
    spec:
      containers:
      - name: aks-test
        image: debian:stable
        command: ["/bin/sh"]
        args: ["-c", "apt-get update && apt-get install -y dnsutils && while true; do sleep 1000; done"]
    EOF
    
  2. Retrieve the IP addresses of the CoreDNS pods by running the following kubectl get command:

    kubectl get pod --namespace kube-system --selector k8s-app=kube-dns --output wide
    
  3. Connect to the test pod by running the kubectl exec -it aks-test -- bash command, and then test the DNS resolution against each CoreDNS pod IP address by running the following commands:

    # Placeholder values
    FQDN="<fully-qualified-___domain-name>"  # For example, "db.contoso.com"
    DNS_SERVER="<coredns-pod-ip-address>"
    
    # Test loop
    for i in $(seq 1 1 10)
    do
        echo "host= $(dig +short ${FQDN} @${DNS_SERVER})"
        sleep 1
    done
    

For more information about troubleshooting DNS resolution problems from the pod level, see Troubleshoot DNS resolution failures from inside the pod.

Test the DNS resolution at CoreDNS service level
  1. Retrieve the CoreDNS service IP address by running the following kubectl get command:

    kubectl get service kube-dns --namespace kube-system
    
  2. On the test pod, run the following commands against the CoreDNS service IP address:

    # Placeholder values
    FQDN="<fully-qualified-___domain-name>"  # For example, "db.contoso.com"
    DNS_SERVER="<kubedns-service-ip-address>"
    
    # Test loop
    for i in $(seq 1 1 10)
    do
        echo "host= $(dig +short ${FQDN} @${DNS_SERVER})"
        sleep 1
    done
    
Test the DNS resolution at node level
  1. Connect to the node.

  2. Run the following grep command to retrieve the list of upstream DNS servers that are configured:

    grep ^nameserver /etc/resolv.conf
    
  3. Run the following test commands against each DNS server that's configured on the node:

    # Placeholder values
    FQDN="<fully-qualified-___domain-name>"  # For example, "db.contoso.com"
    DNS_SERVER="<dns-server-in-node-configuration>"
    
    # Test loop
    for i in $(seq 1 1 10)
    do
        echo "host= $(dig +short ${FQDN} @${DNS_SERVER})"
        sleep 1
    done
    
Test the DNS resolution at virtual network DNS level

Examine the DNS server configuration of the virtual network, and determine whether the servers can resolve the record in question.
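
For example, you can list the DNS servers that are configured on the virtual network by using the Azure CLI, and then run the same dig test loop from the test pod against each address that's returned. The resource group and virtual network names are placeholders; an empty result means that the virtual network uses the Azure-provided DNS (168.63.129.16).

# Placeholder values
RESOURCE_GROUP="<resource-group-name>"
VNET_NAME="<virtual-network-name>"

# List the custom DNS servers that are configured on the virtual network
az network vnet show \
    --resource-group ${RESOURCE_GROUP} \
    --name ${VNET_NAME} \
    --query dhcpOptions.dnsServers \
    --output tsv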

Part 2: Review the health and performance of CoreDNS pods and nodes

Review the health and performance of CoreDNS pods

You can use kubectl commands to check the health and performance of CoreDNS pods. To do so, follow these steps:

  1. Verify that the CoreDNS pods are running:

    kubectl get pods -l k8s-app=kube-dns -n kube-system
    
  2. Check if the CoreDNS pods are overused:

    kubectl top pods -n kube-system -l k8s-app=kube-dns
    
    NAME                      CPU(cores)   MEMORY(bytes)
    coredns-dc97c5f55-424f7   3m           23Mi
    coredns-dc97c5f55-wbh4q   3m           25Mi
    
  3. Get the nodes that host the CoreDNS pods:

    kubectl get pods -n kube-system -l k8s-app=kube-dns -o jsonpath='{.items[*].spec.nodeName}'
    
  4. Verify that the nodes aren't overused:

    kubectl top nodes
    
  5. Verify the logs for the CoreDNS pods:

    kubectl logs -l k8s-app=kube-dns -n kube-system
    

Note

To get more debugging information, enable verbose logs in CoreDNS. To do so, see Troubleshooting CoreDNS customization in AKS.
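
As a minimal sketch, you can enable query logging by adding the log plugin through the coredns-custom ConfigMap and then restarting CoreDNS. Review the linked article before you apply this change to a production cluster.

cat <<EOF | kubectl apply --filename -
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  log.override: |
    log
EOF

# Restart the CoreDNS pods so that they load the updated configuration
kubectl rollout restart deployment coredns --namespace kube-system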

Review the health and performance of nodes

You might first notice DNS resolution performance problems as intermittent errors, such as time-outs. The main causes of this problem include resource exhaustion and I/O throttling within nodes that host the CoreDNS pods or the client pod.

To check whether resource exhaustion or I/O throttling is occurring, run the following kubectl describe command together with the grep command on your nodes. This command lets you review the resource requests and limits that are allocated on each node. If the limits percentage exceeds 100 percent for a resource, that resource is overcommitted on the node.

kubectl describe node | grep -A5 '^Name:\|^Allocated resources:' | grep -v '.kubernetes.io\|^Roles:\|Labels:'

The following snippet shows example output from this command:

Name:               aks-nodepool1-17046773-vmss00000m
--
Allocated resources:
  (Total limits might be more than 100 percent.)
  Resource           Requests           Limits
  --------           --------           ------
  cpu                250m (13 percent)  40m (2 percent)
  memory             420Mi (9 percent)  1762Mi (41 percent)
--
Name:               aks-nodepool1-17046773-vmss00000n
--
Allocated resources:
  (Total limits might be more than 100 percent.)
  Resource           Requests            Limits
  --------           --------            ------
  cpu                612m (32 percent)   8532m (449 percent)
  memory             804Mi (18 percent)  6044Mi (140 percent)
--
Name:               aks-nodepool1-17046773-vmss00000o
--
Allocated resources:
  (Total limits might be more than 100 percent.)
  Resource           Requests           Limits
  --------           --------           ------
  cpu                250m (13 percent)  40m (2 percent)
  memory             420Mi (9 percent)  1762Mi (41 percent)
--
Name:               aks-ubuntu-16984727-vmss000008
--
Allocated resources:
  (Total limits might be more than 100 percent.)
  Resource           Requests            Limits
  --------           --------            ------
  cpu                250m (13 percent)   40m (2 percent)
  memory             420Mi (19 percent)  1762Mi (82 percent)

To get a better picture of resource usage at the pod and node level, you can also use Container insights and other cloud-native tools in Azure. For more information, see Monitor Kubernetes clusters using Azure services and cloud native tools.
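
For example, if Container insights isn't already enabled on the cluster, you can enable the monitoring add-on by using the Azure CLI. The resource group and cluster names are placeholders.

az aks enable-addons \
    --resource-group <resource-group-name> \
    --name <cluster-name> \
    --addons monitoring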

Part 3: Analyze DNS traffic and review DNS resolution performance

Analyzing DNS traffic can help you understand how your AKS cluster handles DNS queries. Ideally, you should reproduce the problem on a test pod while you capture the traffic on that test pod and on each of the CoreDNS pods.

There are two main ways to analyze DNS traffic:

  • Using real-time DNS analysis tools, such as Inspektor Gadget, to analyze the DNS traffic in real time.
  • Using traffic capture tools, such as Retina Capture and Dumpy, to collect the DNS traffic and analyze it with a network packet analyzer tool, such as Wireshark.

Both approaches aim to understand the health and performance of DNS responses using DNS response codes, response times, and other metrics. Choose the one that fits your needs best.

Real-time DNS traffic analysis

You can use Inspektor Gadget to analyze the DNS traffic in real time. To install Inspektor Gadget to your cluster, see How to install Inspektor Gadget in an AKS cluster.

To trace DNS traffic across all namespaces, use the following command:

# Get the version of Gadget
GADGET_VERSION=$(kubectl gadget version | grep Server | awk '{print $3}')
# Run the trace_dns gadget
kubectl gadget run trace_dns:$GADGET_VERSION --all-namespaces --fields "src,dst,name,qr,qtype,id,rcode,latency_ns"

In this command, --fields is a comma-separated list of the fields to display. The following list describes the fields that are used in the command:

  • src: The source of the request with Kubernetes information (<kind>/<namespace>/<name>:<port>).
  • dst: The destination of the request with Kubernetes information (<kind>/<namespace>/<name>:<port>).
  • name: The name of the DNS request.
  • qr: The query/response flag.
  • qtype: The type of the DNS request.
  • id: The ID of the DNS request, which is used to match the request and response.
  • rcode: The response code of the DNS request.
  • latency_ns: The latency of the DNS request.

The command output looks like the following:

SRC                                  DST                                  NAME                        QR QTYPE          ID             RCODE           LATENCY_NS
p/default/aks-test:33141             p/kube-system/coredns-57d886c994-r2… db.contoso.com.             Q  A              c215                                  0ns
p/kube-system/coredns-57d886c994-r2… 168.63.129.16:53                     db.contoso.com.             Q  A              323c                                  0ns
168.63.129.16:53                     p/kube-system/coredns-57d886c994-r2… db.contoso.com.             R  A              323c           NameErr…           13.64ms
p/kube-system/coredns-57d886c994-r2… p/default/aks-test:33141             db.contoso.com.             R  A              c215           NameErr…               0ns
p/default/aks-test:56921             p/kube-system/coredns-57d886c994-r2… db.contoso.com.             Q  A              6574                                  0ns
p/kube-system/coredns-57d886c994-r2… p/default/aks-test:56921             db.contoso.com.             R  A              6574           NameErr…               0ns

You can use the ID field to identify whether a query has a response. The RCODE field shows you the response code of the DNS request. The LATENCY_NS field shows you the latency of the DNS request in nanoseconds. These fields can help you understand the health and performance of DNS responses. For more information about real-time DNS analysis, see Troubleshoot DNS failures across an AKS cluster in real time.

Capture DNS traffic

This section demonstrates how to use Dumpy to collect DNS traffic captures from each CoreDNS pod and a client DNS pod (in this case, the aks-test pod).

To collect the captures from the test client pod, run the following command:

kubectl dumpy capture pod aks-test -f "-i any port 53" --name dns-cap1-aks-test

To collect captures for the CoreDNS pods, run the following Dumpy command:

kubectl dumpy capture deploy coredns \
    -n kube-system \
    -f "-i any port 53" \
    --name dns-cap1-coredns

Ideally, you should run the captures while the problem reproduces. This requirement means that different captures might run for different amounts of time, depending on how often you can reproduce the problem. To export the capture files to a local folder, run the following commands:

mkdir dns-captures
kubectl dumpy export dns-cap1-aks-test ./dns-captures
kubectl dumpy export dns-cap1-coredns ./dns-captures -n kube-system

To delete the Dumpy pods, run the following Dumpy commands:

kubectl dumpy delete dns-cap1-coredns -n kube-system
kubectl dumpy delete dns-cap1-aks-test

To merge all the CoreDNS pod captures, use the mergecap command-line tool, which is included with the Wireshark network packet analyzer. Run the following mergecap command:

mergecap -w coredns-cap1.pcap dns-cap1-coredns-<coredns-pod-name-1>.pcap dns-cap1-coredns-<coredns-pod-name-2>.pcap [...]

DNS packet analysis for an individual CoreDNS pod

After you generate and merge your traffic capture files, you can do a DNS packet analysis of the capture files in Wireshark. Follow these steps to view the packet analysis for the traffic of an individual CoreDNS pod:

  1. Select Start, enter Wireshark, and then select Wireshark in the search results.

  2. In the Wireshark window, select the File menu, and then select Open.

  3. Navigate to the folder that contains your capture files, select the client-side capture file that you exported for the aks-test pod (the dns-cap1-aks-test capture), and then select the Open button.

  4. Select the Statistics menu, and then select DNS. The Wireshark - DNS dialog box appears and displays an analysis of the DNS traffic. The contents of the dialog box are shown in the following table.

    Topic / Item Count Average Min Val Max Val Rate (ms) Percent Burst Rate Burst Start
    ▾ Total Packets 1066 0.0017 100% 0.1200 0.000
     ▾ rcode 1066 0.0017 100.00% 0.1200 0.000
       Server failure 17 0.0000 1.59% 0.0100 99.332
       No such name 353 0.0006 33.11% 0.0400 0.000
       No error 696 0.0011 65.29% 0.0800 0.000
     ▾ opcodes 1066 0.0017 100.00% 0.1200 0.000
       Standard query 1066 0.0017 100.00% 0.1200 0.000
     ▾ Query/Response 1066 0.0017 100.00% 0.1200 0.000
       Response 531 0.0009 49.81% 0.0600 0.000
       Query 535 0.0009 50.19% 0.0600 0.000
     ▾ Query Type 1066 0.0017 100.00% 0.1200 0.000
       AAAA 167 0.0003 15.67% 0.0200 0.000
       A 899 0.0015 84.33% 0.1000 0.000
     ▾ Class 1066 0.0017 100.00% 0.1200 0.000
       IN 1066 0.0017 100.00% 0.1200 0.000
    ▾ Service Stats 0 0.0000 100% - -
      request-response time (msec) 531 184.42 0.067000 6308.503906 0.0009 0.0600 0.000
      no. of unsolicited responses 0 0.0000 - -
      no. of retransmissions 0 0.0000 - -
    ▾ Response Stats 0 0.0000 100% - -
      no. of questions 1062 1.00 1 1 0.0017 0.1200 0.000
      no. of authorities 1062 0.82 0 1 0.0017 0.1200 0.000
      no. of answers 1062 0.15 0 1 0.0017 0.1200 0.000
      no. of additionals 1062 0.00 0 0 0.0017 0.1200 0.000
    ▾ Query Stats 0 0.0000 100% - -
      Qname Len 535 32.99 14 66 0.0009 0.0600 0.000
     ▾ Label Stats 0 0.0000 - -
       4th Level or more 365 0.0006 0.0400 0.000
       3rd Level 170 0.0003 0.0200 0.000
       2nd Level 0 0.0000 - -
       1st Level 0 0.0000 - -
     Payload size 1066 92.87 32 194 0.0017 100% 0.1200 0.000

The DNS analysis dialog box in Wireshark shows a total of 1,066 packets. Of these packets, 17 (1.59 percent) caused a server failure. Additionally, the maximum response time was 6,308 milliseconds (6.3 seconds), and no response was received for 0.38 percent of the queries. (This total was calculated by subtracting the 49.81 percent of packets that contained responses from the 50.19 percent of packets that contained queries.)

If you enter (dns.flags.response == 0) && ! dns.response_in as a display filter in Wireshark, this filter displays DNS queries that didn't receive a response, as shown in the following table.

No. Time Source Destination Protocol Length Info
225 2024-04-01 16:50:40.000520 10.0.0.21 172.16.0.10 DNS 80 Standard query 0x2c67 AAAA db.contoso.com
426 2024-04-01 16:52:47.419907 10.0.0.21 172.16.0.10 DNS 132 Standard query 0x8038 A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net
693 2024-04-01 16:55:23.105558 10.0.0.21 172.16.0.10 DNS 132 Standard query 0xbcb0 A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net
768 2024-04-01 16:56:06.512464 10.0.0.21 172.16.0.10 DNS 80 Standard query 0xe330 A db.contoso.com

Additionally, the Wireshark status bar displays the text Packets: 1066 - Displayed: 4 (0.4%). This information means that four of the 1,066 packets, or 0.4 percent, were DNS queries that never received a response. This percentage essentially matches the 0.38 percent total that you calculated earlier.

If you change the display filter to dns.time >= 5, the filter shows query response packets that took five seconds or more to be received, as shown in the updated table.

No. Time Source Destination Protocol Length Info SourcePort Additional RRs dns resp time
213 2024-04-01 16:50:32.644592 172.16.0.10 10.0.0.21 DNS 132 Standard query 0x9312 Server failure A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net 53 0 6.229941
320 2024-04-01 16:51:55.053896 172.16.0.10 10.0.0.21 DNS 80 Standard query 0xe5ce Server failure A db.contoso.com 53 0 6.065555
328 2024-04-01 16:51:55.113619 172.16.0.10 10.0.0.21 DNS 132 Standard query 0x6681 Server failure A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net 53 0 6.029641
335 2024-04-01 16:52:02.553811 172.16.0.10 10.0.0.21 DNS 80 Standard query 0x6cf6 Server failure A db.contoso.com 53 0 6.500504
541 2024-04-01 16:53:53.423838 172.16.0.10 10.0.0.21 DNS 80 Standard query 0x07b3 Server failure AAAA db.contoso.com 53 0 6.022195
553 2024-04-01 16:54:05.165234 172.16.0.10 10.0.0.21 DNS 132 Standard query 0x1ea0 Server failure A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net 53 0 6.007022
774 2024-04-01 16:56:17.553531 172.16.0.10 10.0.0.21 DNS 80 Standard query 0xa20f Server failure AAAA db.contoso.com 53 0 6.014926
891 2024-04-01 16:56:44.442334 172.16.0.10 10.0.0.21 DNS 132 Standard query 0xa279 Server failure A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net 53 0 6.044552

After you change the display filter, the status bar is updated to show the text Packets: 1066 - Displayed: 8 (0.8%). Therefore, eight of the 1,066 packets, or 0.8 percent, were DNS responses that took five seconds or more to be received. However, on most clients, the default DNS time-out value is expected to be five seconds. This expectation means that, although the CoreDNS pods processed and delivered the eight responses, the client already ended the session by issuing a "timed out" error message. The Info column in the filtered results shows that all eight packets caused a server failure.
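
If you prefer the command line, you can apply the same display filters by using tshark, the terminal version of Wireshark. The capture file name in the following example is a placeholder for the client-side capture file that you exported earlier.

# DNS queries that never received a response
tshark -r dns-cap1-aks-test.pcap -Y '(dns.flags.response == 0) && ! dns.response_in'

# DNS responses that took five seconds or more to be received
tshark -r dns-cap1-aks-test.pcap -Y 'dns.time >= 5'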

DNS packet analysis for all CoreDNS pods

In Wireshark, open the capture file of the CoreDNS pods that you merged earlier (coredns-cap1.pcap), and then open the DNS analysis, as described in the previous section. A Wireshark dialog box appears that displays the following table.

Topic / Item Count Average Min Val Max Val Rate (ms) Percent Burst Rate Burst Start
Total Packets 4540233 7.3387 100% 84.7800 592.950
 ▾ rcode 4540233 7.3387 100.00% 84.7800 592.950
   Server failure 121781 0.1968 2.68% 8.4600 599.143
   No such name 574658 0.9289 12.66% 10.9800 592.950
   No error 3843794 6.2130 84.66% 73.2500 592.950
 ▾ opcodes 4540233 7.3387 100.00% 84.7800 592.950
   Standard query 4540233 7.3387 100.00% 84.7800 592.950
 ▾ Query/Response 4540233 7.3387 100.00% 84.7800 592.950
   Response 2135116 3.4512 47.03% 39.0400 581.680
   Query 2405117 3.8876 52.97% 49.1400 592.950
 ▾ Query Type 4540233 7.3387 100.00% 84.7800 592.950
   SRV 3647 0.0059 0.08% 0.1800 586.638
   PTR 554630 0.8965 12.22% 11.5400 592.950
   NS 15918 0.0257 0.35% 0.7200 308.225
   MX 393016 0.6353 8.66% 7.9700 426.930
   AAAA 384032 0.6207 8.46% 8.4700 438.155
   A 3188990 5.1546 70.24% 57.9600 592.950
 ▾ Class 4540233 7.3387 100.00% 84.7800 592.950
   IN 4540233 7.3387 100.00% 84.7800 592.950
▾ Service Stats 0 0.0000 100% - -
  request-response time (msec) 2109677 277.18 0.020000 12000.532227 3.4100 38.0100 581.680
  no. of unsolicited responses 25402 0.0411 5.1400 587.832
  no. of retransmissions 37 0.0001 0.0300 275.702
▾ Response Stats 0 0.0000 100% - -
  no. of questions 4244830 1.00 1 1 6.8612 77.0500 581.680
  no. of authorities 4244830 0.39 0 11 6.8612 77.0500 581.680
  no. of answers 4244830 1.60 0 22 6.8612 77.0500 581.680
  no. of additionals 4244830 0.29 0 26 6.8612 77.0500 581.680
▾ Query Stats 0 0.0000 100% - -
  Qname Len 2405117 20.42 2 113 3.8876 49.1400 592.950
 ▾ Label Stats 0 0.0000 - -
   4th Level or more 836034 1.3513 16.1900 592.950
   3rd Level 1159513 1.8742 23.8900 592.950
   2nd Level 374182 0.6048 8.7800 592.955
   1st Level 35388 0.0572 0.9200 294.492
 Payload size 4540233 89.87 17 1128 7.3387 100% 84.7800 592.950

The dialog box indicates that there were a combined total of about 4.5 million (4,540,233) packets, of which 2.68 percent caused server failure. The difference in query and response packet percentages shows that 5.94 percent of the queries (52.97 percent minus 47.03 percent) didn't receive a response. The maximum response time was 12 seconds (12,000.532227 milliseconds).

If you apply the display filter for query responses that took five seconds or more (dns.time >= 5), most of the packets in the filter results indicate that a server failure occurred. Because these responses took longer than the typical five-second client time-out, the client probably reported a "timed out" error.

The following table is a summary of the capture findings.

Capture review criteria Yes No
Difference between DNS queries and responses exceeds two percent
DNS latency is more than one second

Troubleshooting Step 2: Develop a hypothesis

This section categorizes common problem types to help you narrow down potential problems and identify components that might require adjustments. This approach sets the foundation for creating a targeted action plan to mitigate and resolve these problems effectively.

Common DNS response codes

The following table summarizes the most common DNS return codes.

DNS return code DNS return message Description
RCODE:0 NOERROR The DNS query finished successfully.
RCODE:1 FORMERR A DNS query format error exists.
RCODE:2 SERVFAIL The server didn't complete the DNS request.
RCODE:3 NXDOMAIN The ___domain name doesn't exist.
RCODE:5 REFUSED The server refused to answer the query.
RCODE:8 NOTAUTH The server isn't authoritative for the zone.
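
You can see which return code a query produces by checking the status field in the dig header output from the test pod. For example, a query for a nonexistent record returns output that's similar to the following (the ___domain and query ID are illustrative placeholders):

dig +noall +comments db.contoso.com

# Example output
# ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 50495
# ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1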

General problem types

The following table lists problem type categories that help you break down the problem symptoms.

Problem type Description
Performance DNS resolution performance problems can cause intermittent errors, such as "timed out" errors from a client's perspective. These problems might occur because nodes experience resource exhaustion or I/O throttling. Additionally, constraints on compute resources in CoreDNS pods can cause resolution latency. If CoreDNS latency is high or increases over time, this might indicate a load problem. If CoreDNS instances are overloaded, you might experience DNS name resolution problems and delays, or you might see problems in workloads and Kubernetes internal services.
Configuration Configuration problems can cause incorrect DNS resolution. In this case, you might experience NXDOMAIN or "timed out" errors. Incorrect configurations might occur in CoreDNS, nodes, Kubernetes, routing, virtual network DNS, private DNS zones, firewalls, proxies, and so on.
Network connectivity Network connectivity problems can affect pod-to-pod connectivity (east-west traffic) or pod-and-node connectivity to external resources (north-south traffic). This scenario can cause "timed out" errors. The connectivity problems might occur if the CoreDNS service endpoints aren't up to date (for example, because of kube-proxy problems, routing problems, packet loss, and so on). External resource dependency combined with connectivity problems (for example, dependency on custom DNS servers or external DNS servers) can also contribute to the problem.

Required inputs

Before you formulate a hypothesis of probable causes for the problem, summarize the results from the previous steps of the troubleshooting workflow.

You can collect the results by using the following tables.

Results of the baseline questionnaire template

Question Possible answers
Where does the DNS resolution fail? ☐ Pod
☐ Node
☐ Both pod and node
What type of DNS error do you get? ☐ Timed out
☐ NXDOMAIN
☐ Other DNS error
How often do the DNS errors occur? ☐ Always
☐ Intermittently
☐ In a specific pattern
Which records are affected? ☐ A specific ___domain
☐ Any ___domain
Do any custom DNS configurations exist? ☐ Custom DNS servers on a virtual network
☐ Custom CoreDNS configuration

Results of tests at different levels

Resolution test results Works Fails
From pod to CoreDNS service
From pod to CoreDNS pod IP address
From pod to Azure internal DNS
From pod to virtual network DNS
From node to Azure internal DNS
From node to virtual network DNS

Results of health and performance of the nodes and the CoreDNS pods

Performance review results Healthy Unhealthy
Node performance
CoreDNS pod performance

Results of traffic captures and DNS resolution performance

Capture review criteria Yes No
Difference between DNS queries and responses exceeds two percent
DNS latency is more than one second

Map required inputs to problem types

To develop your first hypothesis, map each of the results from the required inputs to one or more of the problem types. By analyzing these results in the context of problem types, you can develop hypotheses about the potential root causes of the DNS resolution problems. Then, you can create an action plan of targeted investigation and troubleshooting.

Error type mapping pointers

  • If test results show DNS resolution failures at the CoreDNS service, or contain "timed out" errors when trying to reach specific endpoints, then configuration or connectivity problems might exist.

  • Indications of compute resource starvation at CoreDNS pod or node levels might suggest performance problems.

  • DNS captures that have a considerable mismatch between DNS queries and DNS responses can indicate that packets are being lost. This scenario suggests that there are connectivity or performance problems.

  • Custom configurations at the virtual network level or the Kubernetes level might include setups that don't work with AKS and CoreDNS as expected.

Troubleshooting Step 3: Create and implement an action plan

You should now have enough information to create and implement an action plan. The following sections contain extra recommendations to formulate your plan for specific problem types.

Performance problems

If you're dealing with DNS resolution performance problems, review and implement the following best practices and guidance.

Best practice Guidance
Set up a dedicated system node pool that meets minimum sizing requirements. Manage system node pools in Azure Kubernetes Service (AKS)
To avoid disk I/O throttling, use nodes that have Ephemeral OS disks. Default OS disk sizing and GitHub issue 1373 in Azure AKS
Follow best resource management practices on workloads within the nodes. Best practices for application developers to manage resources in Azure Kubernetes Service (AKS)
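
As a sketch of the first two best practices, the following Azure CLI command adds a dedicated system node pool that uses an ephemeral OS disk. The resource names, node count, and VM size are placeholders; choose values that meet the minimum sizing requirements for your cluster.

az aks nodepool add \
    --resource-group <resource-group-name> \
    --cluster-name <cluster-name> \
    --name systempool \
    --mode System \
    --node-count 2 \
    --node-vm-size Standard_D4ds_v5 \
    --node-osdisk-type Ephemeral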

If DNS performance still isn't good enough after you make these changes, consider using Node Local DNS.

Configuration problems

Depending on the component, review and understand the implications of the specific setup. For configuration details, see the documentation for the component in question, such as CoreDNS customization, virtual network DNS settings, private DNS zones, firewalls, and proxies.

Network connectivity problems

  • Bugs that involve the Container Networking Interface (CNI) or other Kubernetes or OS components usually require intervention from AKS support or the AKS product group.

  • Infrastructure problems, such as hardware failures or hypervisor problems, might require collaboration from infrastructure support teams. Alternatively, these problems might have self-healing features.
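
Before you open a support request, you can check whether the CoreDNS service endpoints are stale, which is one of the connectivity causes described earlier. The endpoints of the kube-dns service should match the IP addresses of the running CoreDNS pods:

# Endpoints that the kube-dns service currently routes to
kubectl get endpoints kube-dns --namespace kube-system

# IP addresses of the running CoreDNS pods (the two lists should match)
kubectl get pod --namespace kube-system --selector k8s-app=kube-dns --output wide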

Troubleshooting Step 4: Observe results and draw conclusions

Observe the results of implementing your action plan. At this point, your action plan should be able to fix or mitigate the problem.

Troubleshooting Step 5: Repeat as necessary

If these steps don't resolve the problem, revise your hypothesis based on what you observed, and then repeat the troubleshooting steps as necessary.

Third-party information disclaimer

The third-party products that this article discusses are manufactured by companies that are independent of Microsoft. Microsoft makes no warranty, implied or otherwise, about the performance or reliability of these products.

Third-party contact disclaimer

Microsoft provides third-party contact information to help you find additional information about this topic. This contact information may change without notice. Microsoft does not guarantee the accuracy of third-party contact information.

Contact us for help

If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to Azure feedback community.