Many Pod Probes Failing After a Kubernetes Upgrade

Background

After upgrading the Kubernetes cluster from version 1.18 to 1.23, many previously running pods are experiencing restarts due to liveness probe failures, and some are failing readiness probes as well.

The first suspicion was that the restarts were not caused directly by the upgrade: the container hash algorithm did not change between 1.18 and 1.23, so upgrading kubelet should not cause already running containers to be recreated. Further investigation confirmed that although the restarts began after the Kubernetes upgrade, they were not triggered by the upgrade itself, which ruled out the kubelet upgrade as the cause of the restarts.

Comparing the logs of the old and new kubelet versions showed that the 1.23 kubelet logged many errors like the following, roughly one every few minutes:

text

I0915 19:09:04.534948  607529 prober.go:121] "Probe failed" probeType="Readiness" pod="default/nginx-5dfb7c896c-x2q2w" podUID=3daea6ef-f286-43f2-8825-51edaa3994d3 containerName="nginx" probeResult=failure output="Get \"http://10.23.77.138:80/\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

Investigation Process

The error message “context deadline exceeded (Client.Timeout exceeded while awaiting headers)” in the log comes from the net/http library’s client.

Searching for causes of this type of error turned up two possibilities:

  1. The connection has been established, but the server response is slow.
  2. Connection establishment timed out.

Additional references on HTTP client connection pools and connection reuse were also reviewed.

Based on this analysis, it was suspected that the issue might be related to connection establishment timeouts.
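
To see how this error text arises, here is a minimal Go sketch (my own illustration, not kubelet's actual prober code) of an http.Client with a request timeout; the target address 10.255.255.1 is assumed to be unreachable, so the TCP handshake never completes within the timeout:

go

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// 2 seconds mirrors the probe's timeoutSeconds below; the address is a
	// placeholder that is assumed to drop SYNs, so connection establishment
	// never finishes in time.
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://10.255.255.1:80/")
	if err != nil {
		// Typically prints something like:
		// Get "http://10.255.255.1:80/": context deadline exceeded
		// (Client.Timeout exceeded while awaiting headers)
		fmt.Println(err)
		return
	}
	resp.Body.Close()
}

Both a slow server and a stalled handshake produce the same wrapped error, which is why a packet capture was needed to tell them apart.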

An Nginx Deployment with a readiness probe configured was created for testing. Why Nginx? Because Nginx writes access and error logs, which makes it possible to verify whether kubelet actually performed each probe.

text

        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 80
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2

Additionally, packets were captured inside the pod's network namespace (entering the netns of PID 972063 via nsenter, rotating the capture file every 10 minutes with -G 600) while waiting for a probe failure to occur:

text

(nsenter -t 972063 -n tcpdump -i eth0 -s0 -G 600 -Z root -w %Y_%m%d_%H%M_%S.pcap &)

After considerable persistence, packets corresponding to a failed probe were finally captured.

text

No.     Time                          Source                Destination           Protocol Length Info
    649 2022-09-15 19:09:02.533940    10.23.77.129          10.23.77.138          TCP      66     36704 → 80 [SYN] Seq=4142039474 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128

Frame 649: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039474, Len: 0

No.     Time                          Source                Destination           Protocol Length Info
    650 2022-09-15 19:09:02.533962    10.23.77.138          10.23.77.129          TCP      66     80 → 36704 [SYN, ACK] Seq=1523494022 Ack=4142039475 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128

Frame 650: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05), Dst: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5)
Internet Protocol Version 4, Src: 10.23.77.138, Dst: 10.23.77.129
Transmission Control Protocol, Src Port: 80, Dst Port: 36704, Seq: 1523494022, Ack: 4142039475, Len: 0

No.     Time                          Source                Destination           Protocol Length Info
    651 2022-09-15 19:09:02.534203    10.23.77.129          10.23.77.138          TCP      54     36704 → 80 [RST] Seq=4142039475 Win=0 Len=0

Frame 651: 54 bytes on wire (432 bits), 54 bytes captured (432 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039475, Len: 0

No.     Time                          Source                Destination           Protocol Length Info
    652 2022-09-15 19:09:03.564712    10.23.77.129          10.23.77.138          TCP      66     [TCP Retransmission] 36704 → 80 [SYN] Seq=4142039474 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128

Frame 652: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039474, Len: 0

No.     Time                          Source                Destination           Protocol Length Info
    653 2022-09-15 19:09:03.564737    10.23.77.138          10.23.77.129          TCP      66     [TCP Previous segment not captured] [TCP Port numbers reused] 80 → 36704 [SYN, ACK] Seq=1539599868 Ack=4142039475 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128

Frame 653: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05), Dst: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5)
Internet Protocol Version 4, Src: 10.23.77.138, Dst: 10.23.77.129
Transmission Control Protocol, Src Port: 80, Dst Port: 36704, Seq: 1539599868, Ack: 4142039475, Len: 0

No.     Time                          Source                Destination           Protocol Length Info
    654 2022-09-15 19:09:03.564939    10.23.77.129          10.23.77.138          TCP      54     36704 → 80 [RST] Seq=4142039475 Win=0 Len=0

Frame 654: 54 bytes on wire (432 bits), 54 bytes captured (432 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039475, Len: 0

nginx access log

text

10.23.77.129 - - [15/Sep/2022:11:08:42 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.23.77.129 - - [15/Sep/2022:11:08:52 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.23.77.129 - - [15/Sep/2022:11:09:12 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.23.77.129 - - [15/Sep/2022:11:09:22 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"

kubelet log

text

I0915 19:09:04.534948  607529 prober.go:121] "Probe failed" probeType="Readiness" pod="default/nginx-5dfb7c896c-x2q2w" podUID=3daea6ef-f286-43f2-8825-51edaa3994d3 containerName="nginx" probeResult=failure output="Get \"http://10.23.77.138:80/\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

The probe is scheduled every 10 seconds, so there should have been an Nginx access log entry at 15/Sep/2022:11:09:02 (the Nginx log uses UTC, while the kubelet log uses local time, so the kubelet entry at 19:09:04 corresponds to 11:09:04 UTC). No such access log entry was recorded, and the kubelet log shows that the probe at that time failed.

The first packet from kubelet is a SYN to Nginx, and Nginx answers with a SYN,ACK. The third packet, however, is not the expected ACK but an RST from the kubelet side. About one second later the SYN is retransmitted, Nginx again answers with a SYN,ACK, and again the exchange ends with an RST.

The key to the problem is understanding why the kubelet side sends an RST after receiving Nginx's SYN,ACK. The initial suspicion was a difference between the Kubernetes versions or the Go versions they were built with (Kubernetes 1.18 used Go 1.13, while Kubernetes 1.23 used Go 1.17).

A version comparison was made between Kubernetes 1.18.20 and 1.23.10, with a focus on analyzing any modifications related to probes. It was found that the core probe logic remained unchanged, thus ruling out Kubernetes version-related issues.

A comparison was also made between Go 1.16.5 and 1.17.13 to analyze whether there were any changes in the net library. It was observed that the net library had undergone minimal changes.

At this point the Go net library itself came under consideration: could a failed connection attempt cause the code to actively abort the connection and send an RST? The code was too involved to answer that quickly, so an online search for “golang RST connection” turned up the fact that net.TCPConn.SetLinger() can be used to make a close send an RST.

Forcefully Close TCP Connections in Golang
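
For reference, a minimal sketch (not kubelet code) of the SetLinger() behavior: with the linger timeout set to zero, Close() aborts the connection with an RST instead of the normal FIN handshake.

go

package main

import (
	"log"
	"net"
)

func main() {
	// Throwaway local listener so the sketch is self-contained; in the
	// incident the peer would be the nginx pod.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	// Linger timeout 0: Close() discards unsent data and sends an RST rather
	// than a FIN (observable with tcpdump on the loopback interface).
	if err := conn.(*net.TCPConn).SetLinger(0); err != nil {
		log.Fatal(err)
	}
	conn.Close()
}

This shows that user-space Go code can produce an RST, but it does not by itself explain why kubelet's probe would do so.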

After three days of investigation, gut instinct said this direction was probably wrong:

  1. If it were a Kubernetes bug, there should have been numerous community reports, but extensive searches yielded only a few unresolved issues with unknown causes, and they had not been fixed for a long time.
  2. If it were a Go bug, there should have been relevant issues by now. However, there were no such issues, and the net library code showed minimal changes.

The conntrack table was checked next, to see whether reusing an identical 4-tuple could trigger an RST when a new connection collided with a stale record; it was only a guess that the kernel might behave this way. The kernel parameters net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120 and net.ipv4.tcp_tw_reuse = 1 mean that a port can be reused within two minutes, so a new connection could in principle match a lingering TIME_WAIT record. However, examining the packets captured in the 10 minutes before the probe failure turned up no duplicate source ports.

The ipvs table was also examined, to check whether the RST packets were being sent by ipvs because the port in question was one that ipvs was handling.

The ipvs table did contain a number of high-numbered local ports, and the source port 36704 seen in the packet capture of the failed probe was among them.

text

# ipvsadm -l -n | grep TCP | less
TCP  169.254.20.10:20869 rr
TCP  169.254.20.10:21980 rr
TCP  169.254.20.10:23593 rr
TCP  169.254.20.10:24795 rr
TCP  169.254.20.10:24855 rr
TCP  169.254.20.10:28796 rr
TCP  169.254.20.10:29385 rr
TCP  169.254.20.10:29515 rr
TCP  169.254.20.10:31221 rr
TCP  169.254.20.10:32080 rr
TCP  169.254.20.10:32443 rr
TCP  169.254.20.10:35723 rr
TCP  169.254.20.10:35934 rr
TCP  169.254.20.10:36321 rr
TCP  169.254.20.10:36704 rr
TCP  169.254.20.10:37629 rr
TCP  169.254.20.10:38820 rr
TCP  169.254.20.10:39080 rr
TCP  10.10.85.235:20869 rr
TCP  10.10.85.235:21980 rr
TCP  10.10.85.235:23593 rr
TCP  10.10.85.235:24795 rr
TCP  10.10.85.235:24855 rr
TCP  10.10.85.235:28796 rr
TCP  10.10.85.235:29385 rr
TCP  10.10.85.235:29515 rr
TCP  10.10.85.235:31221 rr
TCP  10.10.85.235:32080 rr
TCP  10.10.85.235:32443 rr
TCP  10.10.85.235:35723 rr
TCP  10.10.85.235:35934 rr
TCP  10.10.85.235:36321 rr
TCP  10.10.85.235:36704 rr

These ports are NodePort ports for services:

text

# kubectl  get svc -A |grep :
fe-node-online                        bos-fe-xapi-node-online-1                                         LoadBalancer   10.23.137.123   10.10.112.20    7071:35934/TCP               193d
fe-node-online                        bos-fe-xapi-node-online-2                                         LoadBalancer   10.23.227.91    10.10.121.167   7071:28796/TCP               193d
kube-system                           kubernetes-dashboard                                              NodePort       10.23.254.52    <none>          443:36704/TCP                2y205d
kube-system                           traefik-ingress                                                   LoadBalancer   10.23.192.176   10.10.249.145   80:32080/TCP,443:32443/TCP   287d
kube-system                           traefik-ingress2                                                  LoadBalancer   10.23.209.239   10.10.250.197   80:24855/TCP,443:35723/TCP   211d
kube-system                           traefik-ingress3                                                  LoadBalancer   10.23.216.1     10.10.250.174   80:37629/TCP,443:31221/TCP   211d
kube-system                           traefik-ingress4                                                  LoadBalancer   10.23.214.152   10.10.248.130   80:36321/TCP,443:24795/TCP   211d
kube-system                           traefik-ingress5                                                  LoadBalancer   10.23.134.18    10.10.250.119   80:21980/TCP,443:20869/TCP   211d

Checking the NodePort range configured for the API server:

text

--service-node-port-range=20000-40000

Checking the local port range:

text

net.ipv4.ip_local_port_range = 1024	65000

Because the local port range overlaps the NodePort range, kubelet, acting as the probe client, can be assigned an already-allocated NodePort as its source port. Nginx still answers the SYN with a SYN,ACK, but that SYN,ACK is then reset by ipvs (presumably because, from ipvs's point of view, a SYN,ACK arriving on a virtual-service port with no preceding SYN is invalid, so it responds with an RST).
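
A hypothetical way to reproduce this (a sketch built on the assumption above, not anything from the Kubernetes code base) is to force the client's source port to one of the allocated NodePorts and dial the pod, using the addresses from this incident:

go

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// 36704 is the NodePort that collided in this incident; 10.23.77.138 is
	// the nginx pod IP from the capture. Both are environment-specific.
	dialer := &net.Dialer{
		Timeout:   2 * time.Second,
		LocalAddr: &net.TCPAddr{Port: 36704}, // force the source port
	}
	conn, err := dialer.Dial("tcp", "10.23.77.138:80")
	if err != nil {
		// On an affected node the returning SYN,ACK is reset by ipvs, so the
		// handshake never completes and the dial times out, just like the probe.
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected; this node/port combination is not affected")
}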

But why is kubelet more likely to use already-allocated NodePort ports as probe source ports after the upgrade?

Looking at TIME_WAIT connections, the 10.10.85.105 and 10.23.77.129 entries had existed before the Kubernetes upgrade, whereas the 169.254.20.10 connections were new:

text

# ss -tan  | grep "TIME-WAIT"|  awk '{print $4}' |awk -F':' '{print $1}' |sort |uniq -c
    161 10.10.85.105  # These are all related to Zabbix
    849 10.23.77.129 # These are all related to kubelet probes
     69 169.254.20.10

text

# ss -tan  | grep "TIME-WAIT" |grep 169.254.20.10
TIME-WAIT  0      0      169.254.20.10:50967              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:56855              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:51293              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:57665              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:49517              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:51675              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:54261              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:53603              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:51469              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:56591              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:56233              169.254.20.10:8080    

Packet capturing was performed on the lo interface, and it was found that a process was using short-lived connections to continuously request 169.254.20.10:8080.

text

# tcpdump -i lo -nn ip host 169.254.20.10 -A
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
23:59:40.058507 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [S], seq 2226425339, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4.)@.@......
...
................|7..............
23:59:40.058520 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [S.], seq 2917381382, ack 2226425340, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4..@.@......
...
................|7..............
23:59:40.058528 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [.], ack 1, win 512, length 0
E..(.*@.@......
...
............P...|+..
23:59:40.058647 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [P.], seq 1:100, ack 1, win 512, length 99: HTTP: GET / HTTP/1.1
E....+@.@......
...
............P...|...GET / HTTP/1.1
Host: 169.254.20.10:8080
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


23:59:40.058660 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [.], ack 100, win 511, length 0
E..(..@.@......
...
..........._P...|+..
23:59:40.058754 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [P.], seq 1:177, ack 100, win 512, length 176: HTTP: HTTP/1.1 404 Not Found
E.....@.@......
...
..........._P...|...HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Fri, 16 Sep 2022 15:59:40 GMT
Content-Length: 19

404 page not found

23:59:40.058766 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [.], ack 177, win 511, length 0
E..(.,@.@......
...
......._....P...|+..
23:59:40.058810 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [F.], seq 100, ack 177, win 512, length 0
E..(.-@.@......
...
......._....P...|+..
23:59:40.058837 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [F.], seq 177, ack 101, win 512, length 0
E..(..@.@......
...
...........`P...|+..
23:59:40.058842 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [.], ack 178, win 512, length 0
E..(..@.@......
...
.......`....P...|+..
23:59:41.059195 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [S], seq 1100978375, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4:]@.@......
...
.3..A...........|7..............
23:59:41.059208 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [S.], seq 1286559291, ack 1100978376, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4..@.@......
...
...3L.V;A.......|7..............
23:59:41.059215 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [.], ack 1, win 512, length 0
E..(:^@.@......
...
.3..A...L.V<P...|+..
23:59:41.059325 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [P.], seq 1:100, ack 1, win 512, length 99: HTTP: GET / HTTP/1.1
E...:_@.@......
...
.3..A...L.V<P...|...GET / HTTP/1.1
Host: 169.254.20.10:8080
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


23:59:41.059340 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [.], ack 100, win 511, length 0
E..(..@.@......
...
...3L.V<A..+P...|+..
23:59:41.059409 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [P.], seq 1:177, ack 100, win 512, length 176: HTTP: HTTP/1.1 404 Not Found
E.....@.@......
...
...3L.V<A..+P...|...HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Fri, 16 Sep 2022 15:59:41 GMT
Content-Length: 19

404 page not found

23:59:41.059418 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [.], ack 177, win 511, length 0
E..(:`@.@.._...
...
.3..A..+L.V.P...|+..
23:59:41.059479 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [F.], seq 100, ack 177, win 512, length 0
E..(:a@.@..^...
...
.3..A..+L.V.P...|+..
23:59:41.059501 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [F.], seq 177, ack 101, win 512, length 0
E..(..@.@......
...
...3L.V.A..,P...|+..
23:59:41.059505 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [.], ack 178, win 512, length 0

The address 169.254.20.10:8080 is listened on by node-local-dns (deployed as a DaemonSet with host networking). However, the client process accessing this address could not be found with tools such as ss and lsof. Eventually, after stracing every local process one by one, it turned out that node-local-dns was continuously making requests to itself:

text

381230 18:49:16 <... nanosleep resumed> NULL) = 0
381230 18:49:16 futex(0xc00021f150, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
381210 18:49:16 write(9, "GET / HTTP/1.1\r\nHost: 169.254.20.10:8080\r\nUser-Agent: Go-http-client/1.1\r\nAccept-Encoding: gzip\r\n\r\n", 99 <unfinished ...>
381203 18:49:16 <... nanosleep resumed> NULL) = 0
381210 18:49:16 <... write resumed> )   = 99
381203 18:49:16 nanosleep({0, 20000},  <unfinished ...>
381190 18:49:16 <... epoll_pwait resumed> [{EPOLLIN|EPOLLOUT, {u32=1030661536, u64=139639007389088}}], 128, 201, NULL, 7) = 1
381210 18:49:16 futex(0xc00021f150, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
381190 18:49:16 read(16,  <unfinished ...>
381210 18:49:16 <... futex resumed> )   = 1
381230 18:49:16 <... futex resumed> )   = 0
381190 18:49:16 <... read resumed> "GET / HTTP/1.1\r\nHost: 169.254.20.10:8080\r\nUser-Agent: Go-http-client/1.1\r\nAccept-Encoding: gzip\r\n\r\n", 4096) = 99

The immediate cause, then, is that node-local-dns continuously connects to itself, consuming local ports so quickly that the kernel's ephemeral port allocation cycles around and lands on already-allocated NodePort ports much sooner.
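
As an illustration of how quickly this burns through ports (a sketch of the observed behavior, not node-local-dns source code), a once-per-second health check over non-reused connections consumes a fresh ephemeral port on every request and leaves the previous one in TIME-WAIT:

go

package main

import (
	"net/http"
	"time"
)

func main() {
	// Keep-alives disabled: every request opens a brand-new TCP connection
	// and therefore takes a fresh ephemeral source port.
	client := &http.Client{
		Timeout:   time.Second,
		Transport: &http.Transport{DisableKeepAlives: true},
	}
	for {
		resp, err := client.Get("http://169.254.20.10:8080/") // health endpoint from this incident
		if err == nil {
			resp.Body.Close()
		}
		// Each closed connection lingers in TIME-WAIT, so this loop alone
		// cycles through roughly one local port per second.
		time.Sleep(time.Second)
	}
}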

node-local-dns has a health configuration that continuously requests the configured 169.254.20.10:8080 endpoint in order to record health response times as metrics. Its health check differs from CoreDNS: CoreDNS requests /health, whereas node-local-dns requests / directly (which returns a 404, making the health check meaningless). This was fixed in CoreDNS version 1.8.1.

text

cluster.local:53 {
    errors
    cache {
            success 9984 30
            denial 9984 5
    }
    reload
    loop
    bind 169.254.20.10 
    forward . 10.23.252.2 {
            force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
}

The fix is to adjust the system's local port configuration so that it no longer collides with the NodePort range. In this case, net.ipv4.ip_local_reserved_ports was set to the NodePort range, so the kernel no longer hands those ports out as ephemeral source ports. Alternatively, net.ipv4.ip_local_port_range can be changed so that it does not overlap the NodePort range at all.
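
As a sanity check after applying the change, something along the lines of the following sketch (a hypothetical helper, not an existing tool) can flag nodes whose ephemeral port range still overlaps the NodePort range; the 20000-40000 range is taken from this cluster's --service-node-port-range flag:

go

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// NodePort range from this cluster's API server flag; adjust as needed.
	nodePortLow, nodePortHigh := 20000, 40000

	raw, err := os.ReadFile("/proc/sys/net/ipv4/ip_local_port_range")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read ip_local_port_range:", err)
		os.Exit(1)
	}
	fields := strings.Fields(string(raw)) // e.g. ["1024", "65000"]
	if len(fields) != 2 {
		fmt.Fprintln(os.Stderr, "unexpected ip_local_port_range format:", string(raw))
		os.Exit(1)
	}
	localLow, _ := strconv.Atoi(fields[0])
	localHigh, _ := strconv.Atoi(fields[1])

	if localLow <= nodePortHigh && nodePortLow <= localHigh {
		fmt.Printf("WARNING: ephemeral range %d-%d overlaps NodePort range %d-%d; "+
			"reserve it via net.ipv4.ip_local_reserved_ports or adjust the range\n",
			localLow, localHigh, nodePortLow, nodePortHigh)
	} else {
		fmt.Println("no overlap between ephemeral ports and NodePorts")
	}
}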

Immediate Cause of the Problem: node-local-dns, running with host networking and with its health check enabled, continuously accessed its own port 8080 over short-lived connections. This churned through local ports rapidly and increased the probability that kubelet would be assigned an already-allocated NodePort as the source port for a probe.

Root Cause: The system's local port range overlapped with the NodePort range. When a kubelet probe happened to use an already-allocated NodePort as its source port, the connection could not be established (during connection establishment, the pod's reply was answered with an RST).

Issues:

  1. Overlapping of the system’s local port range with the NodePort range.
  2. Performing the Kubernetes upgrade and the node-local-dns rollout at the same time.

Lessons Learned:

  1. Keep changes small (keep individual changes and commits small) to narrow the scope of troubleshooting when problems appear afterwards and to speed up resolution.
  2. When a bug cannot be pinned down, investigate environmental factors first before suspecting a software bug.
