This article discusses how to create a troubleshooting workflow to fix Domain Name System (DNS) resolution problems in Microsoft Azure Kubernetes Service (AKS).
Prerequisites
The Kubernetes kubectl command-line tool
Note: To install kubectl by using Azure CLI, run the az aks install-cli command.
The dig command-line tool for DNS lookup
The grep command-line tool for text search
The Wireshark network packet analyzer
Troubleshooting checklist
Troubleshooting DNS problems in AKS is typically a complex process. You can easily get lost in the many different steps without ever seeing a clear path forward. To help make the process simpler and more effective, use the "scientific" method to organize the work:
Step 1. Collect the facts.
Step 2. Develop a hypothesis.
Step 3. Create and implement an action plan.
Step 4. Observe the results and draw conclusions.
Step 5. Repeat as necessary.
Troubleshooting Step 1: Collect the facts
To better understand the context of the problem, gather facts about the specific DNS problem. By using the following baseline questions as a starting point, you can describe the nature of the problem, recognize the symptoms, and identify the scope of the problem.
Question | Possible answers |
---|---|
Where does the DNS resolution fail? | Pod level, node level, or both |
What kind of DNS error do you get? | "Timed out," NXDOMAIN, or another DNS error |
How often do the DNS errors occur? | Always, intermittently, or in a specific pattern |
Which records are affected? | A specific ___domain or any ___domain |
Do any custom DNS configurations exist? | Custom DNS servers on the virtual network, or a custom CoreDNS configuration |
What kind of performance problems are affecting the nodes? | Resource exhaustion or I/O throttling on the nodes that host the CoreDNS pods or the client pod |
What kind of performance problems are affecting the CoreDNS pods? | CPU or memory overuse on the CoreDNS pods |
What causes DNS latency? | DNS queries that take too much time to receive a response (more than five seconds) |
To get better answers to these questions, follow this three-part process.
Part 1: Generate tests at different levels that reproduce the problem
The DNS resolution process for pods on AKS includes many layers. Review these layers to isolate the problem. The following layers are typical:
- CoreDNS pods
- CoreDNS service
- Nodes
- Virtual network DNS
To start the process, run tests from a test pod against each layer.
Test the DNS resolution at CoreDNS pod level
Deploy a test pod to run DNS test queries by running the following command:
cat <<EOF | kubectl apply --filename -
apiVersion: v1
kind: Pod
metadata:
  name: aks-test
spec:
  containers:
  - name: aks-test
    image: debian:stable
    command: ["/bin/sh"]
    args: ["-c", "apt-get update && apt-get install -y dnsutils && while true; do sleep 1000; done"]
EOF
Retrieve the IP addresses of the CoreDNS pods by running the following kubectl get command:
kubectl get pod --namespace kube-system --selector k8s-app=kube-dns --output wide
Connect to the test pod by running the following kubectl exec command:

kubectl exec -it aks-test -- bash

Then test the DNS resolution against each CoreDNS pod IP address by running the following commands:

# Placeholder values
FQDN="<fully-qualified-___domain-name>"   # For example, "db.contoso.com"
DNS_SERVER="<coredns-pod-ip-address>"

# Test loop
for i in $(seq 1 1 10)
do
  echo "host= $(dig +short ${FQDN} @${DNS_SERVER})"
  sleep 1
done
For more information about troubleshooting DNS resolution problems from the pod level, see Troubleshoot DNS resolution failures from inside the pod.
Test the DNS resolution at CoreDNS service level
Retrieve the CoreDNS service IP address by running the following kubectl get command:

kubectl get service kube-dns --namespace kube-system
On the test pod, run the following commands against the CoreDNS service IP address:
# Placeholder values
FQDN="<fully-qualified-___domain-name>"   # For example, "db.contoso.com"
DNS_SERVER="<kubedns-service-ip-address>"

# Test loop
for i in $(seq 1 1 10)
do
  echo "host= $(dig +short ${FQDN} @${DNS_SERVER})"
  sleep 1
done
Test the DNS resolution at node level
Connect to the node.
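If you don't already have a shell session on the node, one option is the kubectl debug command, which runs a privileged debugging pod on the node. The following is a minimal sketch; the node name is a placeholder, and the container image is only an example:

# Start an interactive debugging pod on the node (node name is a placeholder).
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/dotnet/runtime-deps:6.0

# Inside the debugging pod, the node's file system is mounted under /host.
chroot /host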
Run the following grep command to retrieve the list of upstream DNS servers that are configured:

grep ^nameserver /etc/resolv.conf
Run the following test commands against each DNS server that's configured on the node:
# Placeholder values
FQDN="<fully-qualified-___domain-name>"   # For example, "db.contoso.com"
DNS_SERVER="<dns-server-in-node-configuration>"

# Test loop
for i in $(seq 1 1 10)
do
  echo "host= $(dig +short ${FQDN} @${DNS_SERVER})"
  sleep 1
done
Test the DNS resolution at virtual network DNS level
Examine the DNS server configuration of the virtual network, and determine whether the servers can resolve the record in question.
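For example, you can list the DNS servers that are configured on the virtual network by using the Azure CLI, and then run the same dig test against each server from the test pod. This is a sketch; the resource group and virtual network names are placeholders:

# List the custom DNS servers that are configured on the virtual network.
# An empty result means that the virtual network uses the Azure-provided DNS (168.63.129.16).
az network vnet show \
    --resource-group <resource-group-name> \
    --name <virtual-network-name> \
    --query dhcpOptions.dnsServers \
    --output tsv

# From the test pod, check whether each listed server resolves the record in question.
dig +short <fully-qualified-___domain-name> @<virtual-network-dns-server-ip-address>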
Part 2: Review the health and performance of CoreDNS pods and nodes
Review the health and performance of CoreDNS pods
You can use kubectl commands to check the health and performance of the CoreDNS pods. To do so, follow these steps:
Verify that the CoreDNS pods are running:
kubectl get pods -l k8s-app=kube-dns -n kube-system
Check if the CoreDNS pods are overused:
kubectl top pods -n kube-system -l k8s-app=kube-dns
NAME                      CPU(cores)   MEMORY(bytes)
coredns-dc97c5f55-424f7   3m           23Mi
coredns-dc97c5f55-wbh4q   3m           25Mi
Get the nodes that host the CoreDNS pods:
kubectl get pods -n kube-system -l k8s-app=kube-dns -o jsonpath='{.items[*].spec.nodeName}'
Verify that the nodes aren't overused:
kubectl top nodes
Verify the logs for the CoreDNS pods:
kubectl logs -l k8s-app=kube-dns -n kube-system
Note
To get more debugging information, enable verbose logs in CoreDNS. To do so, see Troubleshooting CoreDNS customization in AKS.
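As a sketch of that customization (verify the exact override keys against that article), verbose query logging is typically enabled by adding the CoreDNS log plugin through the coredns-custom ConfigMap and then restarting the CoreDNS pods:

cat <<EOF | kubectl apply --filename -
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  # Enables query logging for all server blocks
  log.override: |
    log
EOF

# Restart the CoreDNS pods so that they load the new configuration.
kubectl rollout restart deployment coredns --namespace kube-system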
Review the health and performance of nodes
You might first notice DNS resolution performance problems as intermittent errors, such as time-outs. The main causes of this problem include resource exhaustion and I/O throttling within nodes that host the CoreDNS pods or the client pod.
To check whether resource exhaustion or I/O throttling is occurring, run the following kubectl describe command together with the grep command on your nodes. This series of commands lets you review the resource requests and limits on each node as a percentage of the node's allocatable capacity. If the limit percentage is more than 100 percent for a resource, that resource is overcommitted.
kubectl describe node | grep -A5 '^Name:\|^Allocated resources:' | grep -v '.kubernetes.io\|^Roles:\|Labels:'
The following snippet shows example output from this command:
Name: aks-nodepool1-17046773-vmss00000m
--
Allocated resources:
(Total limits might be more than 100 percent.)
Resource Requests Limits
-------- -------- ------
cpu 250m (13 percent) 40m (2 percent)
memory 420Mi (9 percent) 1762Mi (41 percent)
--
Name: aks-nodepool1-17046773-vmss00000n
--
Allocated resources:
(Total limits might be more than 100 percent.)
Resource Requests Limits
-------- -------- ------
cpu 612m (32 percent) 8532m (449 percent)
memory 804Mi (18 percent) 6044Mi (140 percent)
--
Name: aks-nodepool1-17046773-vmss00000o
--
Allocated resources:
(Total limits might be more than 100 percent.)
Resource Requests Limits
-------- -------- ------
cpu 250m (13 percent) 40m (2 percent)
memory 420Mi (9 percent) 1762Mi (41 percent)
--
Name: aks-ubuntu-16984727-vmss000008
--
Allocated resources:
(Total limits might be more than 100 percent.)
Resource Requests Limits
-------- -------- ------
cpu 250m (13 percent) 40m (2 percent)
memory 420Mi (19 percent) 1762Mi (82 percent)
To get a better picture of resource usage at the pod and node level, you can also use Container insights and other cloud-native tools in Azure. For more information, see Monitor Kubernetes clusters using Azure services and cloud native tools.
Part 3: Analyze DNS traffic and review DNS resolution performance
Analyzing DNS traffic can help you understand how your AKS cluster handles the DNS queries. Ideally, you should reproduce the problem on a test pod while you capture the traffic from this test pod and on each of the CoreDNS pods.
There are two main ways to analyze DNS traffic:
- Using real-time DNS analysis tools, such as Inspektor Gadget, to analyze the DNS traffic in real time.
- Using traffic capture tools, such as Retina Capture and Dumpy, to collect the DNS traffic and analyze it with a network packet analyzer tool, such as Wireshark.
Both approaches aim to understand the health and performance of DNS responses using DNS response codes, response times, and other metrics. Choose the one that fits your needs best.
Real-time DNS traffic analysis
You can use Inspektor Gadget to analyze the DNS traffic in real time. To install Inspektor Gadget to your cluster, see How to install Inspektor Gadget in an AKS cluster.
To trace DNS traffic across all namespaces, use the following command:
# Get the version of Gadget
GADGET_VERSION=$(kubectl gadget version | grep Server | awk '{print $3}')
# Run the trace_dns gadget
kubectl gadget run trace_dns:$GADGET_VERSION --all-namespaces --fields "src,dst,name,qr,qtype,id,rcode,latency_ns"
The --fields parameter is a comma-separated list of fields to display. The following list describes the fields that are used in the command:

- src: The source of the request with Kubernetes information (<kind>/<namespace>/<name>:<port>).
- dst: The destination of the request with Kubernetes information (<kind>/<namespace>/<name>:<port>).
- name: The name of the DNS request.
- qr: The query/response flag.
- qtype: The type of the DNS request.
- id: The ID of the DNS request, which is used to match the request and response.
- rcode: The response code of the DNS request.
- latency_ns: The latency of the DNS request.
The command output looks like the following:
SRC DST NAME QR QTYPE ID RCODE LATENCY_NS
p/default/aks-test:33141 p/kube-system/coredns-57d886c994-r2… db.contoso.com. Q A c215 0ns
p/kube-system/coredns-57d886c994-r2… 168.63.129.16:53 db.contoso.com. Q A 323c 0ns
168.63.129.16:53 p/kube-system/coredns-57d886c994-r2… db.contoso.com. R A 323c NameErr… 13.64ms
p/kube-system/coredns-57d886c994-r2… p/default/aks-test:33141 db.contoso.com. R A c215 NameErr… 0ns
p/default/aks-test:56921 p/kube-system/coredns-57d886c994-r2… db.contoso.com. Q A 6574 0ns
p/kube-system/coredns-57d886c994-r2… p/default/aks-test:56921 db.contoso.com. R A 6574 NameErr… 0ns
You can use the ID field to identify whether a query received a response. The RCODE field shows the response code of the DNS request, and the LATENCY_NS field shows the latency of the request in nanoseconds. These fields can help you understand the health and performance of DNS responses.
For more information about real-time DNS analysis, see Troubleshoot DNS failures across an AKS cluster in real time.
Capture DNS traffic
This section demonstrates how to use Dumpy to collect DNS traffic captures from each CoreDNS pod and from a client DNS pod (in this case, the aks-test pod).
To collect the captures from the test client pod, run the following command:
kubectl dumpy capture pod aks-test -f "-i any port 53" --name dns-cap1-aks-test
To collect captures for the CoreDNS pods, run the following Dumpy command:
kubectl dumpy capture deploy coredns \
-n kube-system \
-f "-i any port 53" \
--name dns-cap1-coredns
Ideally, you should be running captures while the problem reproduces. This requirement means that different captures might be running for different amounts of time, depending on how often you can reproduce the problem. To collect the captures, run the following commands:
mkdir dns-captures
kubectl dumpy export dns-cap1-aks-test ./dns-captures
kubectl dumpy export dns-cap1-coredns ./dns-captures -n kube-system
To delete the Dumpy pods, run the following Dumpy commands:
kubectl dumpy delete dns-cap1-coredns -n kube-system
kubectl dumpy delete dns-cap1-aks-test
To merge all the CoreDNS pod captures, use the mergecap command-line tool, which is included in the Wireshark network packet analyzer. Run the following mergecap command:
mergecap -w coredns-cap1.pcap dns-cap1-coredns-<coredns-pod-name-1>.pcap dns-cap1-coredns-<coredns-pod-name-2>.pcap [...]
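If you prefer to stay on the command line, tshark (which is installed together with Wireshark) can produce roughly the same statistics and filtered views that the following Wireshark steps describe. This is an optional sketch:

# Summary DNS statistics for the merged capture (counts, rcodes, response times).
tshark -r coredns-cap1.pcap -q -z dns,tree

# DNS queries that never received a response.
tshark -r coredns-cap1.pcap -Y "(dns.flags.response == 0) && ! dns.response_in"

# DNS responses that took five seconds or more to arrive.
tshark -r coredns-cap1.pcap -Y "dns.time >= 5"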
DNS packet analysis for an individual CoreDNS pod
After you generate and merge your traffic capture files, you can do a DNS packet analysis of the capture files in Wireshark. Follow these steps to view the packet analysis for the traffic of an individual CoreDNS pod:
Select Start, enter Wireshark, and then select Wireshark in the search results.
In the Wireshark window, select the File menu, and then select Open.
Navigate to the folder that contains your capture files, select the client-side capture file that you exported earlier (for example, dns-cap1-aks-test.pcap), and then select the Open button.
Select the Statistics menu, and then select DNS. The Wireshark - DNS dialog box appears and displays an analysis of the DNS traffic. The contents of the dialog box are shown in the following table.
Topic / Item | Count | Average | Min Val | Max Val | Rate (ms) | Percent | Burst Rate | Burst Start |
---|---|---|---|---|---|---|---|---|
▾ Total Packets | 1066 | | | | 0.0017 | 100% | 0.1200 | 0.000 |
▾ rcode | 1066 | | | | 0.0017 | 100.00% | 0.1200 | 0.000 |
Server failure | 17 | | | | 0.0000 | 1.59% | 0.0100 | 99.332 |
No such name | 353 | | | | 0.0006 | 33.11% | 0.0400 | 0.000 |
No error | 696 | | | | 0.0011 | 65.29% | 0.0800 | 0.000 |
▾ opcodes | 1066 | | | | 0.0017 | 100.00% | 0.1200 | 0.000 |
Standard query | 1066 | | | | 0.0017 | 100.00% | 0.1200 | 0.000 |
▾ Query/Response | 1066 | | | | 0.0017 | 100.00% | 0.1200 | 0.000 |
Response | 531 | | | | 0.0009 | 49.81% | 0.0600 | 0.000 |
Query | 535 | | | | 0.0009 | 50.19% | 0.0600 | 0.000 |
▾ Query Type | 1066 | | | | 0.0017 | 100.00% | 0.1200 | 0.000 |
AAAA | 167 | | | | 0.0003 | 15.67% | 0.0200 | 0.000 |
A | 899 | | | | 0.0015 | 84.33% | 0.1000 | 0.000 |
▾ Class | 1066 | | | | 0.0017 | 100.00% | 0.1200 | 0.000 |
IN | 1066 | | | | 0.0017 | 100.00% | 0.1200 | 0.000 |
▾ Service Stats | 0 | | | | 0.0000 | 100% | - | - |
request-response time (msec) | 531 | 184.42 | 0.067000 | 6308.503906 | 0.0009 | | 0.0600 | 0.000 |
no. of unsolicited responses | 0 | | | | 0.0000 | | - | - |
no. of retransmissions | 0 | | | | 0.0000 | | - | - |
▾ Response Stats | 0 | | | | 0.0000 | 100% | - | - |
no. of questions | 1062 | 1.00 | 1 | 1 | 0.0017 | | 0.1200 | 0.000 |
no. of authorities | 1062 | 0.82 | 0 | 1 | 0.0017 | | 0.1200 | 0.000 |
no. of answers | 1062 | 0.15 | 0 | 1 | 0.0017 | | 0.1200 | 0.000 |
no. of additionals | 1062 | 0.00 | 0 | 0 | 0.0017 | | 0.1200 | 0.000 |
▾ Query Stats | 0 | | | | 0.0000 | 100% | - | - |
Qname Len | 535 | 32.99 | 14 | 66 | 0.0009 | | 0.0600 | 0.000 |
▾ Label Stats | 0 | | | | 0.0000 | | - | - |
4th Level or more | 365 | | | | 0.0006 | | 0.0400 | 0.000 |
3rd Level | 170 | | | | 0.0003 | | 0.0200 | 0.000 |
2nd Level | 0 | | | | 0.0000 | | - | - |
1st Level | 0 | | | | 0.0000 | | - | - |
Payload size | 1066 | 92.87 | 32 | 194 | 0.0017 | 100% | 0.1200 | 0.000 |
The DNS analysis dialog box in Wireshark shows a total of 1,066 packets. Of these packets, 17 (1.59 percent) caused a server failure. Additionally, the maximum response time was 6,308 milliseconds (6.3 seconds), and no response was received for 0.38 percent of the queries. (This total was calculated by subtracting the 49.81 percent of packets that contained responses from the 50.19 percent of packets that contained queries.)
If you enter (dns.flags.response == 0) && ! dns.response_in as a display filter in Wireshark, the filter displays DNS queries that didn't receive a response, as shown in the following table.
No. | Time | Source | Destination | Protocol | Length | Info |
---|---|---|---|---|---|---|
225 | 2024-04-01 16:50:40.000520 | 10.0.0.21 | 172.16.0.10 | DNS | 80 | Standard query 0x2c67 AAAA db.contoso.com |
426 | 2024-04-01 16:52:47.419907 | 10.0.0.21 | 172.16.0.10 | DNS | 132 | Standard query 0x8038 A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net |
693 | 2024-04-01 16:55:23.105558 | 10.0.0.21 | 172.16.0.10 | DNS | 132 | Standard query 0xbcb0 A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net |
768 | 2024-04-01 16:56:06.512464 | 10.0.0.21 | 172.16.0.10 | DNS | 80 | Standard query 0xe330 A db.contoso.com |
Additionally, the Wireshark status bar displays the text Packets: 1066 - Displayed: 4 (0.4%). This information means that four of the 1,066 packets, or 0.4 percent, were DNS queries that never received a response. This percentage essentially matches the 0.38 percent total that you calculated earlier.
If you change the display filter to dns.time >= 5, the filter shows query response packets that took five seconds or more to be received, as shown in the updated table.
No. | Time | Source | Destination | Protocol | Length | Info | SourcePort | Additional RRs | dns resp time |
---|---|---|---|---|---|---|---|---|---|
213 | 2024-04-01 16:50:32.644592 | 172.16.0.10 | 10.0.0.21 | DNS | 132 | Standard query 0x9312 Server failure A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net | 53 | 0 | 6.229941 |
320 | 2024-04-01 16:51:55.053896 | 172.16.0.10 | 10.0.0.21 | DNS | 80 | Standard query 0xe5ce Server failure A db.contoso.com | 53 | 0 | 6.065555 |
328 | 2024-04-01 16:51:55.113619 | 172.16.0.10 | 10.0.0.21 | DNS | 132 | Standard query 0x6681 Server failure A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net | 53 | 0 | 6.029641 |
335 | 2024-04-01 16:52:02.553811 | 172.16.0.10 | 10.0.0.21 | DNS | 80 | Standard query 0x6cf6 Server failure A db.contoso.com | 53 | 0 | 6.500504 |
541 | 2024-04-01 16:53:53.423838 | 172.16.0.10 | 10.0.0.21 | DNS | 80 | Standard query 0x07b3 Server failure AAAA db.contoso.com | 53 | 0 | 6.022195 |
553 | 2024-04-01 16:54:05.165234 | 172.16.0.10 | 10.0.0.21 | DNS | 132 | Standard query 0x1ea0 Server failure A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net | 53 | 0 | 6.007022 |
774 | 2024-04-01 16:56:17.553531 | 172.16.0.10 | 10.0.0.21 | DNS | 80 | Standard query 0xa20f Server failure AAAA db.contoso.com | 53 | 0 | 6.014926 |
891 | 2024-04-01 16:56:44.442334 | 172.16.0.10 | 10.0.0.21 | DNS | 132 | Standard query 0xa279 Server failure A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net | 53 | 0 | 6.044552 |
After you change the display filter, the status bar is updated to show the text Packets: 1066 - Displayed: 8 (0.8%). Therefore, eight of the 1,066 packets, or 0.8 percent, were DNS responses that took five seconds or more to be received. However, on most clients, the default DNS time-out value is expected to be five seconds. This expectation means that, although the CoreDNS pods processed and delivered the eight responses, the client already ended the session by issuing a "timed out" error message. The Info column in the filtered results shows that all eight packets caused a server failure.
DNS packet analysis for all CoreDNS pods
In Wireshark, open the capture file of the CoreDNS pods that you merged earlier (coredns-cap1.pcap), and then open the DNS analysis, as described in the previous section. A Wireshark dialog box appears that displays the following table.
Topic / Item | Count | Average | Min Val | Max Val | Rate (ms) | Percent | Burst Rate | Burst Start |
---|---|---|---|---|---|---|---|---|
▾ Total Packets | 4540233 | | | | 7.3387 | 100% | 84.7800 | 592.950 |
▾ rcode | 4540233 | | | | 7.3387 | 100.00% | 84.7800 | 592.950 |
Server failure | 121781 | | | | 0.1968 | 2.68% | 8.4600 | 599.143 |
No such name | 574658 | | | | 0.9289 | 12.66% | 10.9800 | 592.950 |
No error | 3843794 | | | | 6.2130 | 84.66% | 73.2500 | 592.950 |
▾ opcodes | 4540233 | | | | 7.3387 | 100.00% | 84.7800 | 592.950 |
Standard query | 4540233 | | | | 7.3387 | 100.00% | 84.7800 | 592.950 |
▾ Query/Response | 4540233 | | | | 7.3387 | 100.00% | 84.7800 | 592.950 |
Response | 2135116 | | | | 3.4512 | 47.03% | 39.0400 | 581.680 |
Query | 2405117 | | | | 3.8876 | 52.97% | 49.1400 | 592.950 |
▾ Query Type | 4540233 | | | | 7.3387 | 100.00% | 84.7800 | 592.950 |
SRV | 3647 | | | | 0.0059 | 0.08% | 0.1800 | 586.638 |
PTR | 554630 | | | | 0.8965 | 12.22% | 11.5400 | 592.950 |
NS | 15918 | | | | 0.0257 | 0.35% | 0.7200 | 308.225 |
MX | 393016 | | | | 0.6353 | 8.66% | 7.9700 | 426.930 |
AAAA | 384032 | | | | 0.6207 | 8.46% | 8.4700 | 438.155 |
A | 3188990 | | | | 5.1546 | 70.24% | 57.9600 | 592.950 |
▾ Class | 4540233 | | | | 7.3387 | 100.00% | 84.7800 | 592.950 |
IN | 4540233 | | | | 7.3387 | 100.00% | 84.7800 | 592.950 |
▾ Service Stats | 0 | | | | 0.0000 | 100% | - | - |
request-response time (msec) | 2109677 | 277.18 | 0.020000 | 12000.532227 | 3.4100 | | 38.0100 | 581.680 |
no. of unsolicited responses | 25402 | | | | 0.0411 | | 5.1400 | 587.832 |
no. of retransmissions | 37 | | | | 0.0001 | | 0.0300 | 275.702 |
▾ Response Stats | 0 | | | | 0.0000 | 100% | - | - |
no. of questions | 4244830 | 1.00 | 1 | 1 | 6.8612 | | 77.0500 | 581.680 |
no. of authorities | 4244830 | 0.39 | 0 | 11 | 6.8612 | | 77.0500 | 581.680 |
no. of answers | 4244830 | 1.60 | 0 | 22 | 6.8612 | | 77.0500 | 581.680 |
no. of additionals | 4244830 | 0.29 | 0 | 26 | 6.8612 | | 77.0500 | 581.680 |
▾ Query Stats | 0 | | | | 0.0000 | 100% | - | - |
Qname Len | 2405117 | 20.42 | 2 | 113 | 3.8876 | | 49.1400 | 592.950 |
▾ Label Stats | 0 | | | | 0.0000 | | - | - |
4th Level or more | 836034 | | | | 1.3513 | | 16.1900 | 592.950 |
3rd Level | 1159513 | | | | 1.8742 | | 23.8900 | 592.950 |
2nd Level | 374182 | | | | 0.6048 | | 8.7800 | 592.955 |
1st Level | 35388 | | | | 0.0572 | | 0.9200 | 294.492 |
Payload size | 4540233 | 89.87 | 17 | 1128 | 7.3387 | 100% | 84.7800 | 592.950 |
The dialog box indicates that there were a combined total of about 4.5 million (4,540,233) packets, of which 2.68 percent caused server failure. The difference in query and response packet percentages shows that 5.94 percent of the queries (52.97 percent minus 47.03 percent) didn't receive a response. The maximum response time was 12 seconds (12,000.532227 milliseconds).
If you apply the display filter for query responses that took five seconds or more (dns.time >= 5), most of the packets in the filter results indicate that a server failure occurred. This is probably because of a client "timed out" error.
The following table is a summary of the capture findings.
Capture review criteria | Yes | No |
---|---|---|
Difference between DNS queries and responses exceeds two percent | ☑ | ☐ |
DNS latency is more than one second | ☑ | ☐ |
Troubleshooting Step 2: Develop a hypothesis
This section categorizes common problem types to help you narrow down potential problems and identify components that might require adjustments. This approach sets the foundation for creating a targeted action plan to mitigate and resolve these problems effectively.
Common DNS response codes
The following table summarizes the most common DNS return codes.
DNS return code | DNS return message | Description |
---|---|---|
RCODE:0 | NOERROR | The DNS query finished successfully. |
RCODE:1 | FORMERR | A DNS query format error exists. |
RCODE:2 | SERVFAIL | The server didn't complete the DNS request. |
RCODE:3 | NXDOMAIN | The ___domain name doesn't exist. |
RCODE:5 | REFUSED | The server refused to answer the query. |
RCODE:8 | NOTAUTH | The server isn't authoritative for the zone. |
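To check which return code a DNS server gives for a specific record, you can examine the status field in the dig output. The following is a minimal sketch that uses placeholder values:

# The "status:" field in the response header shows the return message (NOERROR, SERVFAIL, NXDOMAIN, and so on).
dig <fully-qualified-___domain-name> @<dns-server-ip-address> +noall +comments | grep "status:"

# Example output for a ___domain name that doesn't exist:
# ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 53459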
General problem types
The following table lists problem type categories that help you break down the problem symptoms.
Problem type | Description |
---|---|
Performance | DNS resolution performance problems can cause intermittent errors, such as "timed out" errors from a client's perspective. These problems might occur because nodes experience resource exhaustion or I/O throttling. Additionally, constraints on compute resources in CoreDNS pods can cause resolution latency. If CoreDNS latency is high or increases over time, this might indicate a load problem. If CoreDNS instances are overloaded, you might experience DNS name resolution problems and delays, or you might see problems in workloads and Kubernetes internal services. |
Configuration | Configuration problems can cause incorrect DNS resolution. In this case, you might experience NXDOMAIN or "timed out" errors. Incorrect configurations might occur in CoreDNS, nodes, Kubernetes, routing, virtual network DNS, private DNS zones, firewalls, proxies, and so on. |
Network connectivity | Network connectivity problems can affect pod-to-pod connectivity (east-west traffic) or pod-and-node connectivity to external resources (north-south traffic). This scenario can cause "timed out" errors. The connectivity problems might occur if the CoreDNS service endpoints aren't up to date (for example, because of kube-proxy problems, routing problems, packet loss, and so on). External resource dependency combined with connectivity problems (for example, dependency on custom DNS servers or external DNS servers) can also contribute to the problem. |
Required inputs
Before you formulate a hypothesis of probable causes for the problem, summarize the results from the previous steps of the troubleshooting workflow.
You can collect the results by using the following tables.
Results of the baseline questionnaire template
Question | Possible answers |
---|---|
Where does the DNS resolution fail? | ☐ Pod ☐ Node ☐ Both pod and node |
What type of DNS error do you get? | ☐ Timed out ☐ NXDOMAIN ☐ Other DNS error |
How often do the DNS errors occur? | ☐ Always ☐ Intermittently ☐ In a specific pattern |
Which records are affected? | ☐ A specific ___domain ☐ Any ___domain |
Do any custom DNS configurations exist? | ☐ Custom DNS servers on a virtual network ☐ Custom CoreDNS configuration |
Results of tests at different levels
Resolution test results | Works | Fails |
---|---|---|
From pod to CoreDNS service | ☐ | ☐ |
From pod to CoreDNS pod IP address | ☐ | ☐ |
From pod to Azure internal DNS | ☐ | ☐ |
From pod to virtual network DNS | ☐ | ☐ |
From node to Azure internal DNS | ☐ | ☐ |
From node to virtual network DNS | ☐ | ☐ |
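If the rows for Azure internal DNS are still blank, you can reuse the dig test loop from Part 1 and point it directly at the Azure internal DNS IP address (168.63.129.16, which also appears in the trace output earlier in this article). A minimal sketch:

# Placeholder values
FQDN="<fully-qualified-___domain-name>"   # For example, "db.contoso.com"
DNS_SERVER="168.63.129.16"                # Azure internal DNS

# Test loop
for i in $(seq 1 1 10)
do
  echo "host= $(dig +short ${FQDN} @${DNS_SERVER})"
  sleep 1
done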
Results of health and performance of the nodes and the CoreDNS pods
Performance review results | Healthy | Unhealthy |
---|---|---|
Nodes performance | ☐ | ☐ |
CoreDNS pods performance | ☐ | ☐ |
Results of traffic captures and DNS resolution performance
Capture review criteria | Yes | No |
---|---|---|
Difference between DNS queries and responses exceeds two percent | ☐ | ☐ |
DNS latency is more than one second | ☐ | ☐ |
Map required inputs to problem types
To develop your first hypothesis, map each of the results from the required inputs to one or more of the problem types. By analyzing these results in the context of problem types, you can develop hypotheses about the potential root causes of the DNS resolution problems. Then, you can create an action plan of targeted investigation and troubleshooting.
Error type mapping pointers
- If test results show DNS resolution failures at the CoreDNS service level, or contain "timed out" errors when specific endpoints are queried, configuration or connectivity problems might exist.
- Indications of compute resource starvation at the CoreDNS pod or node level might suggest performance problems.
- DNS captures that have a considerable mismatch between DNS queries and DNS responses can indicate that packets are being lost. This scenario suggests connectivity or performance problems.
- Custom configurations at the virtual network level or the Kubernetes level might include setups that don't work as expected with AKS and CoreDNS.
Troubleshooting Step 3: Create and implement an action plan
You should now have enough information to create and implement an action plan. The following sections contain extra recommendations to formulate your plan for specific problem types.
Performance problems
If you're dealing with DNS resolution performance problems, review and implement the following best practices and guidance.
Best practice | Guidance |
---|---|
Set up a dedicated system node pool that meets minimum sizing requirements. | Manage system node pools in Azure Kubernetes Service (AKS) |
To avoid disk I/O throttling, use nodes that have Ephemeral OS disks. | Default OS disk sizing and GitHub issue 1373 in Azure AKS |
Follow best resource management practices on workloads within the nodes. | Best practices for application developers to manage resources in Azure Kubernetes Service (AKS) |
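As a sketch of the third recommendation, make sure that workload pods declare CPU and memory requests and limits so that the scheduler can keep the nodes that host CoreDNS from becoming overcommitted. The pod name and resource values here are illustrative only:

cat <<EOF | kubectl apply --filename -
apiVersion: v1
kind: Pod
metadata:
  name: resource-managed-app   # Hypothetical example workload
spec:
  containers:
  - name: app
    image: debian:stable
    command: ["/bin/sh", "-c", "sleep infinity"]
    resources:
      requests:
        cpu: 250m        # Capacity that the scheduler reserves for the container
        memory: 256Mi
      limits:
        cpu: 500m        # Upper bound before CPU throttling occurs
        memory: 512Mi
EOF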
If DNS performance still isn't good enough after you make these changes, consider using Node Local DNS.
Configuration problems
Depending on the component, you should review and understand the implications of the specific setup. See the following list of component-specific documentation for configuration details:
- Kubernetes DNS configuration options
- AKS CoreDNS custom configuration options
- Private DNS zones missing a virtual network link
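For the last item in this list, you can quickly check whether a private DNS zone is linked to the cluster's virtual network by using the Azure CLI. This is a sketch with placeholder names:

# List the virtual network links for a private DNS zone.
# If the cluster's virtual network isn't listed, records in that zone can't be resolved from the cluster.
az network private-dns link vnet list \
    --resource-group <resource-group-name> \
    --zone-name <private-dns-zone-name> \
    --output table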
Network connectivity problems
Bugs that involve the Container Networking Interface (CNI) or other Kubernetes or OS components usually require intervention from AKS support or the AKS product group.
Infrastructure problems, such as hardware failures or hypervisor problems, might require collaboration from infrastructure support teams. Alternatively, these problems might have self-healing features.
Troubleshooting Step 4: Observe results and draw conclusions
Observe the results of implementing your action plan. At this point, your action plan should be able to fix or mitigate the problem.
Troubleshooting Step 5: Repeat as necessary
If these troubleshooting steps don't resolve the problem, repeat the troubleshooting steps as necessary.
Third-party information disclaimer
The third-party products that this article discusses are manufactured by companies that are independent of Microsoft. Microsoft makes no warranty, implied or otherwise, about the performance or reliability of these products.
Third-party contact disclaimer
Microsoft provides third-party contact information to help you find additional information about this topic. This contact information may change without notice. Microsoft does not guarantee the accuracy of third-party contact information.
Contact us for help
If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to Azure feedback community.