Many Pod Probes Failing After Upgrading Kubernetes
Background
After upgrading the Kubernetes cluster from version 1.18 to 1.23, many previously running pods are experiencing restarts due to liveness probe failures, and some are failing readiness probes as well.
Initially, it was suspected that the restarts were not caused directly by the Kubernetes upgrade: the container hash algorithm did not change between versions 1.18 and 1.23, so upgrading kubelet should not cause already running containers to be recreated. Further investigation confirmed that although the restarts began after the Kubernetes upgrade, they were not triggered by the upgrade itself, which ruled out the kubelet upgrade as the cause of the restarts.
By comparing the logs of the old and new versions of kubelet, it was observed that the kubelet 1.23 logs contained numerous errors similar to the following, with roughly one such error every few minutes:
I0915 19:09:04.534948 607529 prober.go:121] "Probe failed" probeType="Readiness" pod="default/nginx-5dfb7c896c-x2q2w" podUID=3daea6ef-f286-43f2-8825-51edaa3994d3 containerName="nginx" probeResult=failure output="Get \"http://10.23.77.138:80/\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Investigation Process
1 Analyzing Kubernetes Source Code
From the error message in the logs, it was determined that the string “context deadline exceeded (Client.Timeout exceeded while awaiting headers)” originates from the client in the net/http library.
Searching for the causes of this type of error revealed the following possibilities:
- The connection has been established, but the server response is slow.
- Connection establishment timed out.
Additional references related to HTTP client connection pools and reuse were also considered:
- Golang HTTP Client Connection Pool
- Issues with Golang HTTP Client Connection Pool Not Reusing Connections
- Golang net/http Timeout Causing a Large Number of FIN-WAIT2 States
Based on this analysis, it was suspected that the issue might be related to connection establishment timeouts.
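The error message itself is easy to reproduce. The following is a minimal sketch (not kubelet's actual prober code, although the log output above shows kubelet also hits an http.Client Timeout): a client whose Timeout expires before the server returns its response headers, for example because the server is slow to respond or the handshake stalls, fails with exactly this error.

package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// A test server that sleeps longer than the client timeout before replying.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(3 * time.Second)
	}))
	defer srv.Close()

	// A short overall timeout, comparable to a probe's timeoutSeconds.
	client := &http.Client{Timeout: 2 * time.Second}

	_, err := client.Get(srv.URL)
	fmt.Println(err)
	// Prints something like:
	// Get "http://127.0.0.1:PORT": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
}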
2 Packet Capture for Reproduction
An Nginx Deployment with a readiness probe configured was created. Why Nginx? Because Nginx produces access logs and error logs, which make it possible to determine whether kubelet actually performed each probe.
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /
    port: 80
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 2
Additionally, packet capturing was performed within the pod’s network namespace, waiting for probe failures to occur:
(nsenter -t 972063 -n tcpdump -i eth0 -s0 -G 600 -Z root -w %Y_%m%d_%H%M_%S.pcap &)
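(For reference: -s0 captures full packets, -G 600 rotates the capture file every 600 seconds using the strftime pattern given to -w, and -Z root is the user tcpdump switches to after opening the first file, kept as root here so that subsequent rotated files can still be created.)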
With some persistence, packets corresponding to a probe failure were finally captured.
No. Time Source Destination Protocol Length Info
649 2022-09-15 19:09:02.533940 10.23.77.129 10.23.77.138 TCP 66 36704 → 80 [SYN] Seq=4142039474 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128
Frame 649: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039474, Len: 0
No. Time Source Destination Protocol Length Info
650 2022-09-15 19:09:02.533962 10.23.77.138 10.23.77.129 TCP 66 80 → 36704 [SYN, ACK] Seq=1523494022 Ack=4142039475 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128
Frame 650: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05), Dst: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5)
Internet Protocol Version 4, Src: 10.23.77.138, Dst: 10.23.77.129
Transmission Control Protocol, Src Port: 80, Dst Port: 36704, Seq: 1523494022, Ack: 4142039475, Len: 0
No. Time Source Destination Protocol Length Info
651 2022-09-15 19:09:02.534203 10.23.77.129 10.23.77.138 TCP 54 36704 → 80 [RST] Seq=4142039475 Win=0 Len=0
Frame 651: 54 bytes on wire (432 bits), 54 bytes captured (432 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039475, Len: 0
No. Time Source Destination Protocol Length Info
652 2022-09-15 19:09:03.564712 10.23.77.129 10.23.77.138 TCP 66 [TCP Retransmission] 36704 → 80 [SYN] Seq=4142039474 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128
Frame 652: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039474, Len: 0
No. Time Source Destination Protocol Length Info
653 2022-09-15 19:09:03.564737 10.23.77.138 10.23.77.129 TCP 66 [TCP Previous segment not captured] [TCP Port numbers reused] 80 → 36704 [SYN, ACK] Seq=1539599868 Ack=4142039475 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128
Frame 653: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05), Dst: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5)
Internet Protocol Version 4, Src: 10.23.77.138, Dst: 10.23.77.129
Transmission Control Protocol, Src Port: 80, Dst Port: 36704, Seq: 1539599868, Ack: 4142039475, Len: 0
No. Time Source Destination Protocol Length Info
654 2022-09-15 19:09:03.564939 10.23.77.129 10.23.77.138 TCP 54 36704 → 80 [RST] Seq=4142039475 Win=0 Len=0
Frame 654: 54 bytes on wire (432 bits), 54 bytes captured (432 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039475, Len: 0
nginx access log
10.23.77.129 - - [15/Sep/2022:11:08:42 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.23.77.129 - - [15/Sep/2022:11:08:52 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.23.77.129 - - [15/Sep/2022:11:09:12 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.23.77.129 - - [15/Sep/2022:11:09:22 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
kubelet log
I0915 19:09:04.534948 607529 prober.go:121] "Probe failed" probeType="Readiness" pod="default/nginx-5dfb7c896c-x2q2w" podUID=3daea6ef-f286-43f2-8825-51edaa3994d3 containerName="nginx" probeResult=failure output="Get \"http://10.23.77.138:80/\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2.1 Analysis of the Phenomenon
The probe is scheduled to run every 10 seconds, so there should have been an Nginx access log entry at 15/Sep/2022:11:09:02. However, no access log was recorded in Nginx at that time, and the kubelet log shows that the probe at 15/Sep/2022:11:09:02 failed.
The first packet sent by kubelet is a SYN to Nginx, followed by a SYN,ACK response from Nginx. However, the third packet is a direct RST. After a 1-second delay, kubelet retries with a SYN, and Nginx responds with another SYN,ACK, but the final packet is again an RST.
The key breakthrough in resolving the issue lies in understanding why kubelet sends an RST after receiving a SYN,ACK response from Nginx. It was suspected that this might be due to differences in Kubernetes versions or in Go versions (Kubernetes 1.18 used Go 1.13, while Kubernetes 1.23 used Go 1.17).
3 Comparison of Kubernetes and Go Source Code
A version comparison was made between Kubernetes 1.18.20 and 1.23.10, with a focus on analyzing any modifications related to probes. It was found that the core probe logic remained unchanged, thus ruling out Kubernetes version-related issues.
A comparison was also made between Go 1.16.5 and 1.17.13 to analyze whether there were any changes in the net library. It was observed that the net library had undergone minimal changes.
At this point, a closer look at the Go net library was considered, with the question being whether a failed connection setup could cause the code to actively cancel the connection and send an RST. However, because of the complexity of the code, an online search for “golang RST connection” was performed instead, which revealed that net.TCPConn.SetLinger() can be used to send an RST:
Forcefully Close TCP Connections in Golang
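A minimal sketch of that technique (purely illustrative; nothing in kubelet was found to do this): setting SO_LINGER to zero via SetLinger(0) makes Close() drop the connection immediately and emit an RST instead of the normal FIN handshake. The address below is just a placeholder.

package main

import (
	"log"
	"net"
)

func main() {
	// Dial any listening TCP service; the address here is only a placeholder.
	conn, err := net.Dial("tcp", "127.0.0.1:80")
	if err != nil {
		log.Fatal(err)
	}

	// SetLinger(0) sets SO_LINGER with a zero timeout: on Close() the kernel
	// discards unsent data and sends an RST rather than a FIN.
	if tcpConn, ok := conn.(*net.TCPConn); ok {
		if err := tcpConn.SetLinger(0); err != nil {
			log.Fatal(err)
		}
	}

	if err := conn.Close(); err != nil { // this Close emits the RST
		log.Fatal(err)
	}
}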
After three days of investigation, a gut feeling suggested that the direction might be incorrect.
- If it were a Kubernetes bug, there should have been numerous community reports, but extensive searches yielded only a few unresolved issues with unknown causes, and they had not been fixed for a long time.
- If it were a Go bug, there should have been relevant issues by now. However, there were no such issues, and the net library code showed minimal changes.
4 Kernel Investigation
The conntrack table was checked to investigate whether reusing an identical 4-tuple could lead to an RST when a new request overlapped an old record; it was only speculation that the kernel mechanism might work this way. The kernel parameters net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120 and net.ipv4.tcp_tw_reuse = 1 were checked, which means a port could be reused within two minutes and a new connection could match a TIME_WAIT record. However, examining the packets sent within the 10 minutes preceding the probe failure showed no duplicate ports.
The IPVS table was also examined to check whether the RST packets were being sent by IPVS because the port in question was bound by IPVS. The IPVS table did contain a set of high-numbered local ports, and the source port 36704 observed in the packet capture of the failed probe was among them:
# ipvsadm -l -n | grep TCP | less
TCP 169.254.20.10:20869 rr
TCP 169.254.20.10:21980 rr
TCP 169.254.20.10:23593 rr
TCP 169.254.20.10:24795 rr
TCP 169.254.20.10:24855 rr
TCP 169.254.20.10:28796 rr
TCP 169.254.20.10:29385 rr
TCP 169.254.20.10:29515 rr
TCP 169.254.20.10:31221 rr
TCP 169.254.20.10:32080 rr
TCP 169.254.20.10:32443 rr
TCP 169.254.20.10:35723 rr
TCP 169.254.20.10:35934 rr
TCP 169.254.20.10:36321 rr
TCP 169.254.20.10:36704 rr
TCP 169.254.20.10:37629 rr
TCP 169.254.20.10:38820 rr
TCP 169.254.20.10:39080 rr
TCP 10.10.85.235:20869 rr
TCP 10.10.85.235:21980 rr
TCP 10.10.85.235:23593 rr
TCP 10.10.85.235:24795 rr
TCP 10.10.85.235:24855 rr
TCP 10.10.85.235:28796 rr
TCP 10.10.85.235:29385 rr
TCP 10.10.85.235:29515 rr
TCP 10.10.85.235:31221 rr
TCP 10.10.85.235:32080 rr
TCP 10.10.85.235:32443 rr
TCP 10.10.85.235:35723 rr
TCP 10.10.85.235:35934 rr
TCP 10.10.85.235:36321 rr
TCP 10.10.85.235:36704 rr
4.1 Phenomenon Analysis
These ports are NodePort ports for services:
# kubectl get svc -A |grep :
fe-node-online bos-fe-xapi-node-online-1 LoadBalancer 10.23.137.123 10.10.112.20 7071:35934/TCP 193d
fe-node-online bos-fe-xapi-node-online-2 LoadBalancer 10.23.227.91 10.10.121.167 7071:28796/TCP 193d
kube-system kubernetes-dashboard NodePort 10.23.254.52 <none> 443:36704/TCP 2y205d
kube-system traefik-ingress LoadBalancer 10.23.192.176 10.10.249.145 80:32080/TCP,443:32443/TCP 287d
kube-system traefik-ingress2 LoadBalancer 10.23.209.239 10.10.250.197 80:24855/TCP,443:35723/TCP 211d
kube-system traefik-ingress3 LoadBalancer 10.23.216.1 10.10.250.174 80:37629/TCP,443:31221/TCP 211d
kube-system traefik-ingress4 LoadBalancer 10.23.214.152 10.10.248.130 80:36321/TCP,443:24795/TCP 211d
kube-system traefik-ingress5 LoadBalancer 10.23.134.18 10.10.250.119 80:21980/TCP,443:20869/TCP 211d
Checking the NodePort range configured for the API server:
--service-node-port-range=20000-40000
Checking the local port range:
net.ipv4.ip_local_port_range = 1024 65000
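Putting the two settings side by side shows the problem: the NodePort range 20000-40000 (roughly 20,000 ports) sits entirely inside the local ephemeral range 1024-65000 (roughly 64,000 ports), so about one in three automatically assigned source ports can land on a port that is already bound as a NodePort.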
5 Root Cause Analysis
Because the local port range overlaps the NodePort range, kubelet, acting as the probe client, could be assigned an already allocated NodePort as its source port. Nginx still responded with a SYN,ACK, but that SYN,ACK was then reset (RST) by IPVS. (Presumably IPVS considered a SYN,ACK for which it had seen no corresponding SYN to be invalid and answered it with an RST.)
5.1 Why Did This Issue Become Evident After the Upgrade?
Was there a higher probability after the upgrade that kubelet, acting as the probe client, would use an already allocated NodePort port as its source port?
Looking at the TIME_WAIT connections, the entries for 10.10.85.105 and 10.23.77.129 had existed before the Kubernetes upgrade, but the 169.254.20.10 connections were new:
# ss -tan | grep "TIME-WAIT"| awk '{print $4}' |awk -F':' '{print $1}' |sort |uniq -c
161 10.10.85.105 # These are all related to Zabbix
849 10.23.77.129 # These are all related to kubelet probes
69 169.254.20.10
# ss -tan | grep "TIME-WAIT" |grep 169.254.20.10
TIME-WAIT 0 0 169.254.20.10:50967 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:56855 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:51293 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:57665 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:49517 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:51675 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:54261 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:53603 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:51469 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:56591 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:56233 169.254.20.10:8080
Packet capturing was performed on the lo interface, and it was found that a process was continuously requesting 169.254.20.10:8080 over short-lived connections.
# tcpdump -i lo -nn ip host 169.254.20.10 -A
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
23:59:40.058507 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [S], seq 2226425339, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4.)@.@......
...
................|7..............
23:59:40.058520 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [S.], seq 2917381382, ack 2226425340, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4..@.@......
...
................|7..............
23:59:40.058528 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [.], ack 1, win 512, length 0
E..(.*@.@......
...
............P...|+..
23:59:40.058647 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [P.], seq 1:100, ack 1, win 512, length 99: HTTP: GET / HTTP/1.1
E....+@[email protected]...
...
............P...|...GET / HTTP/1.1
Host: 169.254.20.10:8080
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
23:59:40.058660 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [.], ack 100, win 511, length 0
E..(..@[email protected]....
...
..........._P...|+..
23:59:40.058754 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [P.], seq 1:177, ack 100, win 512, length 176: HTTP: HTTP/1.1 404 Not Found
E.....@[email protected]....
...
..........._P...|...HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Fri, 16 Sep 2022 15:59:40 GMT
Content-Length: 19
404 page not found
23:59:40.058766 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [.], ack 177, win 511, length 0
E..(.,@.@......
...
......._....P...|+..
23:59:40.058810 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [F.], seq 100, ack 177, win 512, length 0
E..(.-@.@......
...
......._....P...|+..
23:59:40.058837 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [F.], seq 177, ack 101, win 512, length 0
E..(..@[email protected]....
...
...........`P...|+..
23:59:40.058842 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [.], ack 178, win 512, length 0
E..(..@.@......
...
.......`....P...|+..
23:59:41.059195 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [S], seq 1100978375, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4:]@[email protected]...
...
.3..A...........|7..............
23:59:41.059208 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [S.], seq 1286559291, ack 1100978376, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4..@.@......
...
...3L.V;A.......|7..............
23:59:41.059215 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [.], ack 1, win 512, length 0
E..(:^@[email protected]...
...
.3..A...L.V<P...|+..
23:59:41.059325 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [P.], seq 1:100, ack 1, win 512, length 99: HTTP: GET / HTTP/1.1
E...:_@.@......
...
.3..A...L.V<P...|...GET / HTTP/1.1
Host: 169.254.20.10:8080
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
23:59:41.059340 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [.], ack 100, win 511, length 0
E..(..@.@......
...
...3L.V<A..+P...|+..
23:59:41.059409 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [P.], seq 1:177, ack 100, win 512, length 176: HTTP: HTTP/1.1 404 Not Found
E.....@[email protected]...
...
...3L.V<A..+P...|...HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Fri, 16 Sep 2022 15:59:41 GMT
Content-Length: 19
404 page not found
23:59:41.059418 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [.], ack 177, win 511, length 0
E..(:`@.@.._...
...
.3..A..+L.V.P...|+..
23:59:41.059479 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [F.], seq 100, ack 177, win 512, length 0
E..(:a@.@..^...
...
.3..A..+L.V.P...|+..
23:59:41.059501 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [F.], seq 177, ack 101, win 512, length 0
E..(..@.@......
...
...3L.V.A..,P...|+..
23:59:41.059505 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [.], ack 178, win 512, length 0
The address 169.254.20.10:8080 is listened on by node-local-dns (deployed as a DaemonSet with host networking enabled). However, the client process accessing this address could not be found with tools such as ss and lsof. Eventually, after an exhaustive pass over the local processes with strace, it was discovered that node-local-dns was continuously making requests to itself:
381230 18:49:16 <... nanosleep resumed> NULL) = 0
381230 18:49:16 futex(0xc00021f150, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
381210 18:49:16 write(9, "GET / HTTP/1.1\r\nHost: 169.254.20.10:8080\r\nUser-Agent: Go-http-client/1.1\r\nAccept-Encoding: gzip\r\n\r\n", 99 <unfinished ...>
381203 18:49:16 <... nanosleep resumed> NULL) = 0
381210 18:49:16 <... write resumed> ) = 99
381203 18:49:16 nanosleep({0, 20000}, <unfinished ...>
381190 18:49:16 <... epoll_pwait resumed> [{EPOLLIN|EPOLLOUT, {u32=1030661536, u64=139639007389088}}], 128, 201, NULL, 7) = 1
381210 18:49:16 futex(0xc00021f150, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
381190 18:49:16 read(16, <unfinished ...>
381210 18:49:16 <... futex resumed> ) = 1
381230 18:49:16 <... futex resumed> ) = 0
381190 18:49:16 <... read resumed> "GET / HTTP/1.1\r\nHost: 169.254.20.10:8080\r\nUser-Agent: Go-http-client/1.1\r\nAccept-Encoding: gzip\r\n\r\n", 4096) = 99
The trigger, then, is that node-local-dns continuously accesses itself, consuming local ephemeral ports rapidly; the port allocator therefore cycles through the range much faster and rolls onto already allocated NodePort ports sooner.
5.2 Why Was Node-local-dns Continuously Accessing Itself?
node-local-dns has a health configuration that continuously requests the configured 169.254.20.10:8080 endpoint in order to record health-check response times as metrics. Its health check differs from CoreDNS's: CoreDNS requests /health, whereas node-local-dns requests / directly (which returns a 404, making the health check effectively meaningless). This issue was fixed in CoreDNS version 1.8.1.
cluster.local:53 {
    errors
    cache {
        success 9984 30
        denial 9984 5
    }
    reload
    loop
    bind 169.254.20.10
    forward . 10.23.252.2 {
        force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
}
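For illustration only (a simplified sketch, not the actual node-local-dns source), a self-check loop of the following shape would produce the same kind of traffic seen in the capture above: a fresh ephemeral port, a GET /, a 404 response, and a socket left in TIME_WAIT roughly every second. Disabling keep-alives here is an assumption made so the sketch matches the observed short-lived connections.

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// healthAddr mirrors the `health 169.254.20.10:8080` line in the Corefile.
	// Note the path is "/" rather than "/health".
	const healthAddr = "http://169.254.20.10:8080/"

	// Disabling keep-alives forces a brand-new TCP connection (and hence a new
	// local ephemeral port) for every request, which is what burns through ports.
	client := &http.Client{
		Transport: &http.Transport{DisableKeepAlives: true},
		Timeout:   time.Second,
	}

	for range time.Tick(time.Second) {
		start := time.Now()
		resp, err := client.Get(healthAddr)
		if err != nil {
			log.Printf("health check failed: %v", err)
			continue
		}
		resp.Body.Close()
		// The measured latency would be exported as a metric; the 404 status
		// is not acted upon, which is why the check is effectively meaningless.
		log.Printf("health check: %s in %v", resp.Status, time.Since(start))
	}
}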
6 Solution
Modify the system's local port configuration so that it does not overlap with the NodePort range. In this case, net.ipv4.ip_local_reserved_ports was set to the NodePort range. Alternatively, net.ipv4.ip_local_port_range can be changed directly so that it no longer overlaps with the NodePort range.
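For example, matching the --service-node-port-range=20000-40000 shown earlier, the reservation would look something like this (exact values depend on the cluster):

net.ipv4.ip_local_reserved_ports = 20000-40000

Note that ip_local_reserved_ports only removes these ports from automatic (ephemeral) allocation; processes can still bind them explicitly, so NodePort services keep working.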
7 Summary
Immediate Cause of the Problem: node-local-dns, running with host networking and with its health check enabled, continuously used short-lived connections to access its own port 8080, which consumed local ephemeral ports rapidly and increased the probability of kubelet using already allocated NodePort ports as source ports for probes.
Root Cause: The system's local port range overlapped with the NodePort range, so when kubelet used an already allocated NodePort port as the source port for a probe, the connection could not be established (during connection establishment, the pod's response was answered with a reset, or RST).
Issues:
- Overlapping of the system’s local port range with the NodePort range.
- The Kubernetes upgrade and the node-local-dns changes being carried out at the same time.
Lessons Learned:
- Keep changes small (keep code commits minimal in size) to reduce the complexity of troubleshooting issues that arise after a change and to speed up their resolution.
- When the cause of an issue cannot be found, investigate the environment first before suspecting software bugs.