Troubleshoot DNS resolution problems in AKS

This article discusses how to create a troubleshooting workflow to fix Domain Name System (DNS) resolution problems in Microsoft Azure Kubernetes Service (AKS).

Prerequisites

  • The kubectl command-line tool, to run commands against the AKS cluster.
  • A DNS lookup utility, such as dig, to run test queries.
  • Optionally, the traffic analysis tools that this article uses, such as Inspektor Gadget, Dumpy, Wireshark, and mergecap.

Troubleshooting checklist

Troubleshooting DNS problems in AKS is typically a complex process. You can easily get lost in the many different steps without ever seeing a clear path forward. To help make the process simpler and more effective, use the "scientific" method to organize the work:

  • Step 1. Collect the facts.

  • Step 2. Develop a hypothesis.

  • Step 3. Create and implement an action plan.

  • Step 4. Observe the results and draw conclusions.

  • Step 5. Repeat as necessary.

Troubleshooting Step 1: Collect the facts

To better understand the context of the problem, gather facts about the specific DNS problem. By using the following baseline questions as a starting point, you can describe the nature of the problem, recognize the symptoms, and identify the scope of the problem.

Question Possible answers
Where does the DNS resolution fail?
  • Pod
  • Node
  • Both pods and nodes
What kind of DNS error do you get?
  • Time-out
  • No such host
  • Other DNS error
How often do the DNS errors occur?
  • Always
  • Intermittently
  • In a specific pattern
Which records are affected?
  • A specific ___domain
  • Any ___domain
Do any custom DNS configurations exist?
  • Custom DNS server configured on the virtual network
  • Custom DNS on CoreDNS configuration
What kind of performance problems are affecting the nodes?
  • CPU
  • Memory
  • I/O throttling
What kind of performance problems are affecting the CoreDNS pods?
  • CPU
  • Memory
  • I/O throttling
What causes DNS latency?
  • DNS queries that take too long to receive a response (more than five seconds)

To get better answers to these questions, follow this three-part process.

Part 1: Generate tests at different levels that reproduce the problem

The DNS resolution process for pods on AKS includes many layers. Review these layers to isolate the problem. The following layers are typical:

  • CoreDNS pods
  • CoreDNS service
  • Nodes
  • Virtual network DNS

To start the process, run tests from a test pod against each layer.

Test the DNS resolution at CoreDNS pod level
  1. Deploy a test pod to run DNS test queries by running the following command:

    cat <<EOF | kubectl apply --filename -
    apiVersion: v1
    kind: Pod
    metadata:
      name: aks-test
    spec:
      containers:
      - name: aks-test
        image: debian:stable
        command: ["/bin/sh"]
        args: ["-c", "apt-get update && apt-get install -y dnsutils && while true; do sleep 1000; done"]
    EOF
    
  2. Retrieve the IP addresses of the CoreDNS pods by running the following kubectl get command:

    kubectl get pod --namespace kube-system --selector k8s-app=kube-dns --output wide
    
  3. Connect to the test pod by running the kubectl exec -it aks-test -- bash command, and then test the DNS resolution against each CoreDNS pod IP address by running the following commands:

    # Placeholder values
    FQDN="<fully-qualified-___domain-name>"  # For example, "db.contoso.com"
    DNS_SERVER="<coredns-pod-ip-address>"
    
    # Test loop
    for i in $(seq 1 1 10)
    do
        echo "host= $(dig +short ${FQDN} @${DNS_SERVER})"
        sleep 1
    done
    

For more information about troubleshooting DNS resolution problems from the pod level, see Troubleshoot DNS resolution failures from inside the pod.

Test the DNS resolution at CoreDNS service level
  1. Retrieve the CoreDNS service IP address by running the following kubectl get command:

    kubectl get service kube-dns --namespace kube-system
    
  2. On the test pod, run the following commands against the CoreDNS service IP address:

    # Placeholder values
    FQDN="<fully-qualified-___domain-name>"  # For example, "db.contoso.com"
    DNS_SERVER="<kubedns-service-ip-address>"
    
    # Test loop
    for i in $(seq 1 1 10)
    do
        echo "host= $(dig +short ${FQDN} @${DNS_SERVER})"
        sleep 1
    done
    
Test the DNS resolution at node level
  1. Connect to the node.

  2. Run the following grep command to retrieve the list of upstream DNS servers that are configured:

    grep ^nameserver /etc/resolv.conf
    
  3. Run the following test commands against each DNS server that's configured on the node:

    # Placeholder values
    FQDN="<fully-qualified-___domain-name>"  # For example, "db.contoso.com"
    DNS_SERVER="<dns-server-in-node-configuration>"
    
    # Test loop
    for i in $(seq 1 1 10)
    do
        echo "host= $(dig +short ${FQDN} @${DNS_SERVER})"
        sleep 1
    done
    
Test the DNS resolution at virtual network DNS level

Examine the DNS server configuration of the virtual network, and determine whether the servers can resolve the record in question.
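
For example, you can list the DNS servers that are configured on the virtual network by using the Azure CLI, and then run the same dig test loop from the test pod against each address that's returned. The resource group and virtual network names are placeholders; an empty result means that the virtual network uses the Azure-provided DNS (168.63.129.16).

# Placeholder values
RESOURCE_GROUP="<resource-group-name>"
VNET_NAME="<virtual-network-name>"

# List the custom DNS servers that are configured on the virtual network
az network vnet show \
    --resource-group ${RESOURCE_GROUP} \
    --name ${VNET_NAME} \
    --query dhcpOptions.dnsServers \
    --output tsv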

Part 2: Review the health and performance of CoreDNS pods and nodes

Review the health and performance of CoreDNS pods

You can use kubectl commands to check the health and performance of CoreDNS pods. To do so, follow these steps:

  1. Verify that the CoreDNS pods are running:

    kubectl get pods -l k8s-app=kube-dns -n kube-system
    
  2. Check if the CoreDNS pods are overused:

    kubectl top pods -n kube-system -l k8s-app=kube-dns
    
    NAME                      CPU(cores)   MEMORY(bytes)
    coredns-dc97c5f55-424f7   3m           23Mi
    coredns-dc97c5f55-wbh4q   3m           25Mi
    
  3. Get the nodes that host the CoreDNS pods:

    kubectl get pods -n kube-system -l k8s-app=kube-dns -o jsonpath='{.items[*].spec.nodeName}'
    
  4. Verify that the nodes aren't overused:

    kubectl top nodes
    
  5. Verify the logs for the CoreDNS pods:

    kubectl logs -l k8s-app=kube-dns -n kube-system
    

Note

To get more debugging information, enable verbose logs in CoreDNS. To do so, see Troubleshooting CoreDNS customization in AKS.
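
As a minimal sketch, you can enable query logging by adding the log plugin through the coredns-custom ConfigMap and then restarting CoreDNS. Review the linked article before you apply this change to a production cluster.

cat <<EOF | kubectl apply --filename -
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  log.override: |
    log
EOF

# Restart the CoreDNS pods so that they load the updated configuration
kubectl rollout restart deployment coredns --namespace kube-system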

Review the health and performance of nodes

You might first notice DNS resolution performance problems as intermittent errors, such as time-outs. The main causes of this problem include resource exhaustion and I/O throttling within nodes that host the CoreDNS pods or the client pod.

To check whether resource exhaustion or I/O throttling is occurring, run the following kubectl describe command together with the grep command on your nodes. This command lets you review the resource requests and limits that are allocated on each node. If the limits percentage exceeds 100 percent for a resource, that resource is overcommitted on the node.

kubectl describe node | grep -A5 '^Name:\|^Allocated resources:' | grep -v '.kubernetes.io\|^Roles:\|Labels:'

The following snippet shows example output from this command:

Name:               aks-nodepool1-17046773-vmss00000m
--
Allocated resources:
  (Total limits might be more than 100 percent.)
  Resource           Requests           Limits
  --------           --------           ------
  cpu                250m (13 percent)  40m (2 percent)
  memory             420Mi (9 percent)  1762Mi (41 percent)
--
Name:               aks-nodepool1-17046773-vmss00000n
--
Allocated resources:
  (Total limits might be more than 100 percent.)
  Resource           Requests            Limits
  --------           --------            ------
  cpu                612m (32 percent)   8532m (449 percent)
  memory             804Mi (18 percent)  6044Mi (140 percent)
--
Name:               aks-nodepool1-17046773-vmss00000o
--
Allocated resources:
  (Total limits might be more than 100 percent.)
  Resource           Requests           Limits
  --------           --------           ------
  cpu                250m (13 percent)  40m (2 percent)
  memory             420Mi (9 percent)  1762Mi (41 percent)
--
Name:               aks-ubuntu-16984727-vmss000008
--
Allocated resources:
  (Total limits might be more than 100 percent.)
  Resource           Requests            Limits
  --------           --------            ------
  cpu                250m (13 percent)   40m (2 percent)
  memory             420Mi (19 percent)  1762Mi (82 percent)

To get a better picture of resource usage at the pod and node level, you can also use Container insights and other cloud-native tools in Azure. For more information, see Monitor Kubernetes clusters using Azure services and cloud native tools.
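
For example, if Container insights isn't already enabled on the cluster, you can enable the monitoring add-on by using the Azure CLI. The resource group and cluster names are placeholders.

az aks enable-addons \
    --resource-group <resource-group-name> \
    --name <cluster-name> \
    --addons monitoring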

Part 3: Analyze DNS traffic and review DNS resolution performance

Analyzing DNS traffic can help you understand how your AKS cluster handles DNS queries. Ideally, you should reproduce the problem on a test pod while you capture the traffic on that test pod and on each of the CoreDNS pods.

There are two main ways to analyze DNS traffic:

  • Using real-time DNS analysis tools, such as Inspektor Gadget, to analyze the DNS traffic in real time.
  • Using traffic capture tools, such as Retina Capture and Dumpy, to collect the DNS traffic and analyze it with a network packet analyzer tool, such as Wireshark.

Both approaches aim to understand the health and performance of DNS responses using DNS response codes, response times, and other metrics. Choose the one that fits your needs best.

Real-time DNS traffic analysis

You can use Inspektor Gadget to analyze the DNS traffic in real time. To install Inspektor Gadget to your cluster, see How to install Inspektor Gadget in an AKS cluster.

To trace DNS traffic across all namespaces, use the following command:

# Get the version of Gadget
GADGET_VERSION=$(kubectl gadget version | grep Server | awk '{print $3}')
# Run the trace_dns gadget
kubectl gadget run trace_dns:$GADGET_VERSION --all-namespaces --fields "src,dst,name,qr,qtype,id,rcode,latency_ns"

In this command, --fields is a comma-separated list of the fields to display. The following list describes the fields that are used in the command:

  • src: The source of the request with Kubernetes information (<kind>/<namespace>/<name>:<port>).
  • dst: The destination of the request with Kubernetes information (<kind>/<namespace>/<name>:<port>).
  • name: The name of the DNS request.
  • qr: The query/response flag.
  • qtype: The type of the DNS request.
  • id: The ID of the DNS request, which is used to match the request and response.
  • rcode: The response code of the DNS request.
  • latency_ns: The latency of the DNS request.

The command output looks like the following:

SRC                                  DST                                  NAME                        QR QTYPE          ID             RCODE           LATENCY_NS
p/default/aks-test:33141             p/kube-system/coredns-57d886c994-r2… db.contoso.com.             Q  A              c215                                  0ns
p/kube-system/coredns-57d886c994-r2… 168.63.129.16:53                     db.contoso.com.             Q  A              323c                                  0ns
168.63.129.16:53                     p/kube-system/coredns-57d886c994-r2… db.contoso.com.             R  A              323c           NameErr…           13.64ms
p/kube-system/coredns-57d886c994-r2… p/default/aks-test:33141             db.contoso.com.             R  A              c215           NameErr…               0ns
p/default/aks-test:56921             p/kube-system/coredns-57d886c994-r2… db.contoso.com.             Q  A              6574                                  0ns
p/kube-system/coredns-57d886c994-r2… p/default/aks-test:56921             db.contoso.com.             R  A              6574           NameErr…               0ns

You can use the ID field to identify whether a query has a response. The RCODE field shows you the response code of the DNS request. The LATENCY_NS field shows you the latency of the DNS request in nanoseconds. These fields can help you understand the health and performance of DNS responses. For more information about real-time DNS analysis, see Troubleshoot DNS failures across an AKS cluster in real time.

Capture DNS traffic

This section demonstrates how to use Dumpy to collect DNS traffic captures from each CoreDNS pod and a client DNS pod (in this case, the aks-test pod).

To collect the captures from the test client pod, run the following command:

kubectl dumpy capture pod aks-test -f "-i any port 53" --name dns-cap1-aks-test

To collect captures for the CoreDNS pods, run the following Dumpy command:

kubectl dumpy capture deploy coredns \
    -n kube-system \
    -f "-i any port 53" \
    --name dns-cap1-coredns

Ideally, you should run the captures while the problem reproduces. This requirement means that different captures might run for different amounts of time, depending on how often you can reproduce the problem. To export the capture files to a local folder, run the following commands:

mkdir dns-captures
kubectl dumpy export dns-cap1-aks-test ./dns-captures
kubectl dumpy export dns-cap1-coredns ./dns-captures -n kube-system

To delete the Dumpy pods, run the following Dumpy commands:

kubectl dumpy delete dns-cap1-coredns -n kube-system
kubectl dumpy delete dns-cap1-aks-test

To merge all the CoreDNS pod captures, use the mergecap command-line tool, which is included with the Wireshark network packet analyzer. Run the following mergecap command:

mergecap -w coredns-cap1.pcap dns-cap1-coredns-<coredns-pod-name-1>.pcap dns-cap1-coredns-<coredns-pod-name-2>.pcap [...]

DNS packet analysis for an individual CoreDNS pod

After you generate and merge your traffic capture files, you can do a DNS packet analysis of the capture files in Wireshark. Follow these steps to view the packet analysis for the traffic of an individual CoreDNS pod:

  1. Select Start, enter Wireshark, and then select Wireshark in the search results.

  2. In the Wireshark window, select the File menu, and then select Open.

  3. Navigate to the folder that contains your capture files, select the client-side capture file that you exported for the aks-test pod (the dns-cap1-aks-test capture), and then select the Open button.

  4. Select the Statistics menu, and then select DNS. The Wireshark - DNS dialog box appears and displays an analysis of the DNS traffic. The contents of the dialog box are shown in the following table.

    Topic / Item Count Average Min Val Max Val Rate (ms) Percent Burst Rate Burst Start
    ▾ Total Packets 1066 0.0017 100% 0.1200 0.000
     ▾ rcode 1066 0.0017 100.00% 0.1200 0.000
       Server failure 17 0.0000 1.59% 0.0100 99.332
       No such name 353 0.0006 33.11% 0.0400 0.000
       No error 696 0.0011 65.29% 0.0800 0.000
     ▾ opcodes 1066 0.0017 100.00% 0.1200 0.000
       Standard query 1066 0.0017 100.00% 0.1200 0.000
     ▾ Query/Response 1066 0.0017 100.00% 0.1200 0.000
       Response 531 0.0009 49.81% 0.0600 0.000
       Query 535 0.0009 50.19% 0.0600 0.000
     ▾ Query Type 1066 0.0017 100.00% 0.1200 0.000
       AAAA 167 0.0003 15.67% 0.0200 0.000
       A 899 0.0015 84.33% 0.1000 0.000
     ▾ Class 1066 0.0017 100.00% 0.1200 0.000
       IN 1066 0.0017 100.00% 0.1200 0.000
    ▾ Service Stats 0 0.0000 100% - -
      request-response time (msec) 531 184.42 0.067000 6308.503906 0.0009 0.0600 0.000
      no. of unsolicited responses 0 0.0000 - -
      no. of retransmissions 0 0.0000 - -
    ▾ Response Stats 0 0.0000 100% - -
      no. of questions 1062 1.00 1 1 0.0017 0.1200 0.000
      no. of authorities 1062 0.82 0 1 0.0017 0.1200 0.000
      no. of answers 1062 0.15 0 1 0.0017 0.1200 0.000
      no. of additionals 1062 0.00 0 0 0.0017 0.1200 0.000
    ▾ Query Stats 0 0.0000 100% - -
      Qname Len 535 32.99 14 66 0.0009 0.0600 0.000
     ▾ Label Stats 0 0.0000 - -
       4th Level or more 365 0.0006 0.0400 0.000
       3rd Level 170 0.0003 0.0200 0.000
       2nd Level 0 0.0000 - -
       1st Level 0 0.0000 - -
     Payload size 1066 92.87 32 194 0.0017 100% 0.1200 0.000

The DNS analysis dialog box in Wireshark shows a total of 1,066 packets. Of these packets, 17 (1.59 percent) caused a server failure. Additionally, the maximum response time was 6,308 milliseconds (6.3 seconds), and no response was received for 0.38 percent of the queries. (This total was calculated by subtracting the 49.81 percent of packets that contained responses from the 50.19 percent of packets that contained queries.)

If you enter (dns.flags.response == 0) && ! dns.response_in as a display filter in Wireshark, this filter displays DNS queries that didn't receive a response, as shown in the following table.

No. Time Source Destination Protocol Length Info
225 2024-04-01 16:50:40.000520 10.0.0.21 172.16.0.10 DNS 80 Standard query 0x2c67 AAAA db.contoso.com
426 2024-04-01 16:52:47.419907 10.0.0.21 172.16.0.10 DNS 132 Standard query 0x8038 A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net
693 2024-04-01 16:55:23.105558 10.0.0.21 172.16.0.10 DNS 132 Standard query 0xbcb0 A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net
768 2024-04-01 16:56:06.512464 10.0.0.21 172.16.0.10 DNS 80 Standard query 0xe330 A db.contoso.com

Additionally, the Wireshark status bar displays the text Packets: 1066 - Displayed: 4 (0.4%). This information means that four of the 1,066 packets, or 0.4 percent, were DNS queries that never received a response. This percentage essentially matches the 0.38 percent total that you calculated earlier.

If you change the display filter to dns.time >= 5, the filter shows query response packets that took five seconds or more to be received, as shown in the updated table.

No. Time Source Destination Protocol Length Info SourcePort Additional RRs dns resp time
213 2024-04-01 16:50:32.644592 172.16.0.10 10.0.0.21 DNS 132 Standard query 0x9312 Server failure A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net 53 0 6.229941
320 2024-04-01 16:51:55.053896 172.16.0.10 10.0.0.21 DNS 80 Standard query 0xe5ce Server failure A db.contoso.com 53 0 6.065555
328 2024-04-01 16:51:55.113619 172.16.0.10 10.0.0.21 DNS 132 Standard query 0x6681 Server failure A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net 53 0 6.029641
335 2024-04-01 16:52:02.553811 172.16.0.10 10.0.0.21 DNS 80 Standard query 0x6cf6 Server failure A db.contoso.com 53 0 6.500504
541 2024-04-01 16:53:53.423838 172.16.0.10 10.0.0.21 DNS 80 Standard query 0x07b3 Server failure AAAA db.contoso.com 53 0 6.022195
553 2024-04-01 16:54:05.165234 172.16.0.10 10.0.0.21 DNS 132 Standard query 0x1ea0 Server failure A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net 53 0 6.007022
774 2024-04-01 16:56:17.553531 172.16.0.10 10.0.0.21 DNS 80 Standard query 0xa20f Server failure AAAA db.contoso.com 53 0 6.014926
891 2024-04-01 16:56:44.442334 172.16.0.10 10.0.0.21 DNS 132 Standard query 0xa279 Server failure A db.contoso.com.iosffyoulcwehgo1g3egb3m4oc.jx.internal.cloudapp.net 53 0 6.044552

After you change the display filter, the status bar is updated to show the text Packets: 1066 - Displayed: 8 (0.8%). Therefore, eight of the 1,066 packets, or 0.8 percent, were DNS responses that took five seconds or more to be received. However, on most clients, the default DNS time-out value is expected to be five seconds. This expectation means that, although the CoreDNS pods processed and delivered the eight responses, the client already ended the session by issuing a "timed out" error message. The Info column in the filtered results shows that all eight packets caused a server failure.
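
If you prefer the command line, you can apply the same display filters by using tshark, the terminal version of Wireshark. The capture file name in the following example is a placeholder for the client-side capture file that you exported earlier.

# DNS queries that never received a response
tshark -r dns-cap1-aks-test.pcap -Y '(dns.flags.response == 0) && ! dns.response_in'

# DNS responses that took five seconds or more to be received
tshark -r dns-cap1-aks-test.pcap -Y 'dns.time >= 5'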

DNS packet analysis for all CoreDNS pods

In Wireshark, open the capture file of the CoreDNS pods that you merged earlier (coredns-cap1.pcap), and then open the DNS analysis, as described in the previous section. A Wireshark dialog box appears that displays the following table.

Topic / Item Count Average Min Val Max Val Rate (ms) Percent Burst Rate Burst Start
Total Packets 4540233 7.3387 100% 84.7800 592.950
 ▾ rcode 4540233 7.3387 100.00% 84.7800 592.950
   Server failure 121781 0.1968 2.68% 8.4600 599.143
   No such name 574658 0.9289 12.66% 10.9800 592.950
   No error 3843794 6.2130 84.66% 73.2500 592.950
 ▾ opcodes 4540233 7.3387 100.00% 84.7800 592.950
   Standard query 4540233 7.3387 100.00% 84.7800 592.950
 ▾ Query/Response 4540233 7.3387 100.00% 84.7800 592.950
   Response 2135116 3.4512 47.03% 39.0400 581.680
   Query 2405117 3.8876 52.97% 49.1400 592.950
 ▾ Query Type 4540233 7.3387 100.00% 84.7800 592.950
   SRV 3647 0.0059 0.08% 0.1800 586.638
   PTR 554630 0.8965 12.22% 11.5400 592.950
   NS 15918 0.0257 0.35% 0.7200 308.225
   MX 393016 0.6353 8.66% 7.9700 426.930
   AAAA 384032 0.6207 8.46% 8.4700 438.155
   A 3188990 5.1546 70.24% 57.9600 592.950
 ▾ Class 4540233 7.3387 100.00% 84.7800 592.950
   IN 4540233 7.3387 100.00% 84.7800 592.950
▾ Service Stats 0 0.0000 100% - -
  request-response time (msec) 2109677 277.18 0.020000 12000.532227 3.4100 38.0100 581.680
  no. of unsolicited responses 25402 0.0411 5.1400 587.832
  no. of retransmissions 37 0.0001 0.0300 275.702
▾ Response Stats 0 0.0000 100% - -
  no. of questions 4244830 1.00 1 1 6.8612 77.0500 581.680
  no. of authorities 4244830 0.39 0 11 6.8612 77.0500 581.680
  no. of answers 4244830 1.60 0 22 6.8612 77.0500 581.680
  no. of additionals 4244830 0.29 0 26 6.8612 77.0500 581.680
▾ Query Stats 0 0.0000 100% - -
  Qname Len 2405117 20.42 2 113 3.8876 49.1400 592.950
 ▾ Label Stats 0 0.0000 - -
   4th Level or more 836034 1.3513 16.1900 592.950
   3rd Level 1159513 1.8742 23.8900 592.950
   2nd Level 374182 0.6048 8.7800 592.955
   1st Level 35388 0.0572 0.9200 294.492
 Payload size 4540233 89.87 17 1128 7.3387 100% 84.7800 592.950

The dialog box indicates that there were a combined total of about 4.5 million (4,540,233) packets, of which 2.68 percent caused server failure. The difference in query and response packet percentages shows that 5.94 percent of the queries (52.97 percent minus 47.03 percent) didn't receive a response. The maximum response time was 12 seconds (12,000.532227 milliseconds).

If you apply the display filter for query responses that took five seconds or more (dns.time >= 5), most of the packets in the filter results indicate that a server failure occurred. Because these responses took longer than the typical five-second client time-out, the client probably reported a "timed out" error.

The following table is a summary of the capture findings.

Capture review criteria Yes No
Difference between DNS queries and responses exceeds two percent
DNS latency is more than one second

Troubleshooting Step 2: Develop a hypothesis

This section categorizes common problem types to help you narrow down potential problems and identify components that might require adjustments. This approach sets the foundation for creating a targeted action plan to mitigate and resolve these problems effectively.

Common DNS response codes

The following table summarizes the most common DNS return codes.

DNS return code DNS return message Description
RCODE:0 NOERROR The DNS query finished successfully.
RCODE:1 FORMERR A DNS query format error exists.
RCODE:2 SERVFAIL The server didn't complete the DNS request.
RCODE:3 NXDOMAIN The ___domain name doesn't exist.
RCODE:5 REFUSED The server refused to answer the query.
RCODE:8 NOTAUTH The server isn't authoritative for the zone.
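
You can see which return code a query produces by checking the status field in the dig header output from the test pod. For example, a query for a nonexistent record returns output that's similar to the following (the ___domain and query ID are illustrative placeholders):

dig +noall +comments db.contoso.com

# Example output
# ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 50495
# ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1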

General problem types

The following table lists problem type categories that help you break down the problem symptoms.

Problem type Description
Performance DNS resolution performance problems can cause intermittent errors, such as "timed out" errors from a client's perspective. These problems might occur because nodes experience resource exhaustion or I/O throttling. Additionally, constraints on compute resources in CoreDNS pods can cause resolution latency. If CoreDNS latency is high or increases over time, this might indicate a load problem. If CoreDNS instances are overloaded, you might experience DNS name resolution problems and delays, or you might see problems in workloads and Kubernetes internal services.
Configuration Configuration problems can cause incorrect DNS resolution. In this case, you might experience NXDOMAIN or "timed out" errors. Incorrect configurations might occur in CoreDNS, nodes, Kubernetes, routing, virtual network DNS, private DNS zones, firewalls, proxies, and so on.
Network connectivity Network connectivity problems can affect pod-to-pod connectivity (east-west traffic) or pod-and-node connectivity to external resources (north-south traffic). This scenario can cause "timed out" errors. The connectivity problems might occur if the CoreDNS service endpoints aren't up to date (for example, because of kube-proxy problems, routing problems, packet loss, and so on). External resource dependency combined with connectivity problems (for example, dependency on custom DNS servers or external DNS servers) can also contribute to the problem.

Required inputs

Before you formulate a hypothesis of probable causes for the problem, summarize the results from the previous steps of the troubleshooting workflow.

You can collect the results by using the following tables.

Results of the baseline questionnaire template

Question Possible answers
Where does the DNS resolution fail? ☐ Pod
☐ Node
☐ Both pod and node
What type of DNS error do you get? ☐ Timed out
☐ NXDOMAIN
☐ Other DNS error
How often do the DNS errors occur? ☐ Always
☐ Intermittently
☐ In a specific pattern
Which records are affected? ☐ A specific ___domain
☐ Any ___domain
Do any custom DNS configurations exist? ☐ Custom DNS servers on a virtual network
☐ Custom CoreDNS configuration

Results of tests at different levels

Resolution test results Works Fails
From pod to CoreDNS service
From pod to CoreDNS pod IP address
From pod to Azure internal DNS
From pod to virtual network DNS
From node to Azure internal DNS
From node to virtual network DNS

Results of health and performance of the nodes and the CoreDNS pods

Performance review results Healthy Unhealthy
Node performance
CoreDNS pod performance

Results of traffic captures and DNS resolution performance

Capture review criteria Yes No
Difference between DNS queries and responses exceeds two percent
DNS latency is more than one second

Map required inputs to problem types

To develop your first hypothesis, map each of the results from the required inputs to one or more of the problem types. By analyzing these results in the context of problem types, you can develop hypotheses about the potential root causes of the DNS resolution problems. Then, you can create an action plan of targeted investigation and troubleshooting.

Error type mapping pointers

  • If test results show DNS resolution failures at the CoreDNS service, or contain "timed out" errors when trying to reach specific endpoints, then configuration or connectivity problems might exist.

  • Indications of compute resource starvation at CoreDNS pod or node levels might suggest performance problems.

  • DNS captures that have a considerable mismatch between DNS queries and DNS responses can indicate that packets are being lost. This scenario suggests that there are connectivity or performance problems.

  • Custom configurations at the virtual network level or the Kubernetes level might include setups that don't work with AKS and CoreDNS as expected.

Troubleshooting Step 3: Create and implement an action plan

You should now have enough information to create and implement an action plan. The following sections contain extra recommendations to formulate your plan for specific problem types.

Performance problems

If you're dealing with DNS resolution performance problems, review and implement the following best practices and guidance.

Best practice Guidance
Set up a dedicated system node pool that meets minimum sizing requirements. Manage system node pools in Azure Kubernetes Service (AKS)
To avoid disk I/O throttling, use nodes that have Ephemeral OS disks. Default OS disk sizing and GitHub issue 1373 in Azure AKS
Follow best resource management practices on workloads within the nodes. Best practices for application developers to manage resources in Azure Kubernetes Service (AKS)
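
As a sketch of the first two best practices, the following Azure CLI command adds a dedicated system node pool that uses an ephemeral OS disk. The resource names, node count, and VM size are placeholders; choose values that meet the minimum sizing requirements for your cluster.

az aks nodepool add \
    --resource-group <resource-group-name> \
    --cluster-name <cluster-name> \
    --name systempool \
    --mode System \
    --node-count 2 \
    --node-vm-size Standard_D4ds_v5 \
    --node-osdisk-type Ephemeral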

If DNS performance still isn't good enough after you make these changes, consider using Node Local DNS.

Configuration problems

Depending on the component, review and understand the implications of the specific setup. For configuration details, see the documentation for the component in question, such as CoreDNS customization, virtual network DNS settings, private DNS zones, firewalls, and proxies.

Network connectivity problems

  • Bugs that involve the Container Networking Interface (CNI) or other Kubernetes or OS components usually require intervention from AKS support or the AKS product group.

  • Infrastructure problems, such as hardware failures or hypervisor problems, might require collaboration from infrastructure support teams. Alternatively, these problems might have self-healing features.
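
Before you open a support request, you can check whether the CoreDNS service endpoints are stale, which is one of the connectivity causes described earlier. The endpoints of the kube-dns service should match the IP addresses of the running CoreDNS pods:

# Endpoints that the kube-dns service currently routes to
kubectl get endpoints kube-dns --namespace kube-system

# IP addresses of the running CoreDNS pods (the two lists should match)
kubectl get pod --namespace kube-system --selector k8s-app=kube-dns --output wide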

Troubleshooting Step 4: Observe results and draw conclusions

Observe the results of implementing your action plan. At this point, your action plan should be able to fix or mitigate the problem.

Troubleshooting Step 5: Repeat as necessary

If these steps don't resolve the problem, revise your hypothesis based on what you observed, and then repeat the troubleshooting steps as necessary.

Third-party information disclaimer

The third-party products that this article discusses are manufactured by companies that are independent of Microsoft. Microsoft makes no warranty, implied or otherwise, about the performance or reliability of these products.

Third-party contact disclaimer

Microsoft provides third-party contact information to help you find additional information about this topic. This contact information may change without notice. Microsoft does not guarantee the accuracy of third-party contact information.

Contact us for help

If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to Azure feedback community.