Many pod probe failures after a Kubernetes upgrade
Background
After upgrading a Kubernetes cluster from 1.18 to 1.23, many pods that had been running normally were restarted because their liveness probes failed, and some others started failing their readiness probes.
At first I doubted the restarts had anything to do with the Kubernetes upgrade itself: the container hash algorithm did not change between 1.18 and 1.23, so upgrading the kubelet should not cause already-running containers to be recreated. Further checking showed that the restarts happened some time after the upgrade, not at the moment of the upgrade, so the kubelet upgrade itself could be ruled out as the cause.
Comparing the kubelet logs of the old and new versions, the 1.23 kubelet logged many errors like the following, roughly one every few minutes:
I0915 19:09:04.534948 607529 prober.go:121] "Probe failed" probeType="Readiness" pod="default/nginx-5dfb7c896c-x2q2w" podUID=3daea6ef-f286-43f2-8825-51edaa3994d3 containerName="nginx" probeResult=failure output="Get \"http://10.23.77.138:80/\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Investigation
1 Analyzing the Kubernetes source code
Starting from the file location in the log message, I traced the error context deadline exceeded (Client.Timeout exceeded while awaiting headers) to the HTTP client in Go's net/http library.
Searching for the usual causes of this error turned up two candidates:
The connection was established, but the server responded too slowly
The connection could not be established before the timeout
Related reading: golang net/http超时引发大量fin-wait2 (golang net/http timeouts causing large numbers of FIN_WAIT_2 connections)
I also skimmed the client and transport code in net/http, and began to suspect that connection establishment was timing out.
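To see what this error actually means, here is a minimal sketch (not the kubelet's prober code, just an illustration with a local throwaway listener) that reproduces the same message: a net/http client with a short timeout pointed at a server that accepts the TCP connection but never sends response headers.
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	// A TCP server that accepts connections but never writes a response,
	// so the client sits waiting for response headers.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			go func(c net.Conn) {
				time.Sleep(10 * time.Second) // longer than the client timeout
				c.Close()
			}(c)
		}
	}()

	client := &http.Client{Timeout: 2 * time.Second}
	_, err = client.Get("http://" + ln.Addr().String() + "/")
	fmt.Println(err)
	// Prints: Get "http://127.0.0.1:xxxxx/": context deadline exceeded
	// (Client.Timeout exceeded while awaiting headers)
}
The same message appears whether the handshake never completes or the server simply fails to answer in time, which is why it alone cannot distinguish the two candidate causes above.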
2 Reproducing with a packet capture
I deployed an nginx Deployment configured with a readiness probe. Why nginx? Because nginx has both an access log and an error log, which makes it easy to see whether the kubelet actually performed the probe.
readinessProbe:
failureThreshold: 3
httpGet:
path: /
port: 80
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 2
At the same time, I captured packets inside the pod's network namespace and waited for a probe failure to show up:
(nsenter -t 972063 -n tcpdump -i eth0 -s0 -G 600 -Z root -w %Y_%m%d_%H%M_%S.pcap &)
The patience paid off: a failing probe was finally captured.
No. Time Source Destination Protocol Length Info
649 2022-09-15 19:09:02.533940 10.23.77.129 10.23.77.138 TCP 66 36704 → 80 [SYN] Seq=4142039474 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128
Frame 649: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039474, Len: 0
No. Time Source Destination Protocol Length Info
650 2022-09-15 19:09:02.533962 10.23.77.138 10.23.77.129 TCP 66 80 → 36704 [SYN, ACK] Seq=1523494022 Ack=4142039475 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128
Frame 650: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05), Dst: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5)
Internet Protocol Version 4, Src: 10.23.77.138, Dst: 10.23.77.129
Transmission Control Protocol, Src Port: 80, Dst Port: 36704, Seq: 1523494022, Ack: 4142039475, Len: 0
No. Time Source Destination Protocol Length Info
651 2022-09-15 19:09:02.534203 10.23.77.129 10.23.77.138 TCP 54 36704 → 80 [RST] Seq=4142039475 Win=0 Len=0
Frame 651: 54 bytes on wire (432 bits), 54 bytes captured (432 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039475, Len: 0
No. Time Source Destination Protocol Length Info
652 2022-09-15 19:09:03.564712 10.23.77.129 10.23.77.138 TCP 66 [TCP Retransmission] 36704 → 80 [SYN] Seq=4142039474 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128
Frame 652: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039474, Len: 0
No. Time Source Destination Protocol Length Info
653 2022-09-15 19:09:03.564737 10.23.77.138 10.23.77.129 TCP 66 [TCP Previous segment not captured] [TCP Port numbers reused] 80 → 36704 [SYN, ACK] Seq=1539599868 Ack=4142039475 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128
Frame 653: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05), Dst: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5)
Internet Protocol Version 4, Src: 10.23.77.138, Dst: 10.23.77.129
Transmission Control Protocol, Src Port: 80, Dst Port: 36704, Seq: 1539599868, Ack: 4142039475, Len: 0
No. Time Source Destination Protocol Length Info
654 2022-09-15 19:09:03.564939 10.23.77.129 10.23.77.138 TCP 54 36704 → 80 [RST] Seq=4142039475 Win=0 Len=0
Frame 654: 54 bytes on wire (432 bits), 54 bytes captured (432 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039475, Len: 0
nginx access log
10.23.77.129 - - [15/Sep/2022:11:08:42 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.23.77.129 - - [15/Sep/2022:11:08:52 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.23.77.129 - - [15/Sep/2022:11:09:12 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.23.77.129 - - [15/Sep/2022:11:09:22 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
kubelet log
I0915 19:09:04.534948 607529 prober.go:121] "Probe failed" probeType="Readiness" pod="default/nginx-5dfb7c896c-x2q2w" podUID=3daea6ef-f286-43f2-8825-51edaa3994d3 containerName="nginx" probeResult=failure output="Get \"http://10.23.77.138:80/\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2.1 Analysis
The probe runs every 10 seconds, so there should have been an access log entry at 15/Sep/2022:11:09:02, but nginx has none for that time. Meanwhile the kubelet log shows that the probe executed around that moment failed.
In the capture, the first packet is the SYN from the kubelet to nginx, the second is nginx's SYN-ACK, and the third is an immediate RST from the kubelet side. About 1 second later the kubelet retransmits the SYN, nginx again replies with a SYN-ACK, and the final packet is again an RST.
So the key question became: why does the kubelet side RST nginx's SYN-ACK? I suspected either a Kubernetes version difference or a Go version difference (the 1.18 kubelet was built with Go 1.13, which we had rebuilt with Go 1.16.5 to fix the 5xx errors caused by GOAWAY, while the 1.23 kubelet is built with Go 1.17).
3 Comparing the Kubernetes and Go source code
I diffed Kubernetes 1.18.20 against 1.23.10 and reviewed the probe-related changes: the core probe logic had not changed, which ruled out a Kubernetes version issue.
I also diffed Go 1.16.5 against 1.17.13: the net library had essentially no changes.
At that point I went back and read through Go's net package, wondering whether a failed connection attempt could make the code actively cancel the connection and send an RST. The code was too complex to follow to the end, so I searched for how a Go program can send an RST and found that net.TCPConn.SetLinger() can do it.
Forcefully Close TCP Connections in Golang
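For reference, a minimal sketch of what that looks like: with SetLinger(0), Close() tears the connection down with an RST instead of the normal FIN handshake (this line of investigation turned out to be a dead end, as described next, but it is how a Go program would deliberately send one).
package main

import (
	"fmt"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		c.Read(make([]byte, 1)) // block until the peer closes
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	tcp := conn.(*net.TCPConn)
	// SO_LINGER with a zero timeout: Close() discards unsent data and
	// resets the connection (RST) instead of sending a FIN.
	if err := tcp.SetLinger(0); err != nil {
		panic(err)
	}
	tcp.Close() // observe the RST with tcpdump on lo
	fmt.Println("connection closed with RST")
}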
By this point the investigation had already taken three days, and intuition said this was the wrong direction:
If this were a Kubernetes bug, the community should be full of reports, yet all I found were a few unexplained issues that had gone unfixed for a long time.
If this were a Go bug, there should long since have been an issue, yet there was none, and the net library code had barely changed.
4 Investigating the kernel
First I checked the conntrack table: could a reused source port with an identical 4-tuple make a new request collide with an old entry and get RST? (This was a guess at how the kernel might behave.) The relevant kernel parameters were net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120 and net.ipv4.tcp_tw_reuse = 1, so reusing a port within two minutes would make the new connection match an existing TIME_WAIT entry. However, in the packets from the 10 minutes before the failed probe there were no reused ports.
Next I checked the ipvs table: was the port claimed by ipvs, making ipvs the sender of the RST? The ipvs table did contain a number of what looked like random local ports, and the source port of the failed probe, 36704, was among them.
# ipvsadm -l -n | grep TCP | less
TCP 169.254.20.10:20869 rr
TCP 169.254.20.10:21980 rr
TCP 169.254.20.10:23593 rr
TCP 169.254.20.10:24795 rr
TCP 169.254.20.10:24855 rr
TCP 169.254.20.10:28796 rr
TCP 169.254.20.10:29385 rr
TCP 169.254.20.10:29515 rr
TCP 169.254.20.10:31221 rr
TCP 169.254.20.10:32080 rr
TCP 169.254.20.10:32443 rr
TCP 169.254.20.10:35723 rr
TCP 169.254.20.10:35934 rr
TCP 169.254.20.10:36321 rr
TCP 169.254.20.10:36704 rr
TCP 169.254.20.10:37629 rr
TCP 169.254.20.10:38820 rr
TCP 169.254.20.10:39080 rr
TCP 10.10.85.235:20869 rr
TCP 10.10.85.235:21980 rr
TCP 10.10.85.235:23593 rr
TCP 10.10.85.235:24795 rr
TCP 10.10.85.235:24855 rr
TCP 10.10.85.235:28796 rr
TCP 10.10.85.235:29385 rr
TCP 10.10.85.235:29515 rr
TCP 10.10.85.235:31221 rr
TCP 10.10.85.235:32080 rr
TCP 10.10.85.235:32443 rr
TCP 10.10.85.235:35723 rr
TCP 10.10.85.235:35934 rr
TCP 10.10.85.235:36321 rr
TCP 10.10.85.235:36704 rr
4.1 Analysis
These ports are the NodePort ports of Services:
# kubectl get svc -A |grep :
fe-node-online bos-fe-xapi-node-online-1 LoadBalancer 10.23.137.123 10.10.112.20 7071:35934/TCP 193d
fe-node-online bos-fe-xapi-node-online-2 LoadBalancer 10.23.227.91 10.10.121.167 7071:28796/TCP 193d
kube-system kubernetes-dashboard NodePort 10.23.254.52 <none> 443:36704/TCP 2y205d
kube-system traefik-ingress LoadBalancer 10.23.192.176 10.10.249.145 80:32080/TCP,443:32443/TCP 287d
kube-system traefik-ingress2 LoadBalancer 10.23.209.239 10.10.250.197 80:24855/TCP,443:35723/TCP 211d
kube-system traefik-ingress3 LoadBalancer 10.23.216.1 10.10.250.174 80:37629/TCP,443:31221/TCP 211d
kube-system traefik-ingress4 LoadBalancer 10.23.214.152 10.10.248.130 80:36321/TCP,443:24795/TCP 211d
kube-system traefik-ingress5 LoadBalancer 10.23.134.18 10.10.250.119 80:21980/TCP,443:20869/TCP 211d
The NodePort range configured on the apiserver:
--service-node-port-range=20000-40000
The node's local port range:
net.ipv4.ip_local_port_range = 1024 65000
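As a quick sanity check, a small script can confirm the overlap between the kernel's ephemeral port range and the NodePort range. A rough sketch (the NodePort range is hard-coded from the apiserver flag above):
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// NodePort range from --service-node-port-range=20000-40000 (see above).
	nodePortMin, nodePortMax := 20000, 40000

	// Ephemeral (client) port range the kernel allocates from.
	data, err := os.ReadFile("/proc/sys/net/ipv4/ip_local_port_range")
	if err != nil {
		panic(err)
	}
	var localMin, localMax int
	if _, err := fmt.Sscanf(strings.TrimSpace(string(data)), "%d %d", &localMin, &localMax); err != nil {
		panic(err)
	}

	if localMin <= nodePortMax && nodePortMin <= localMax {
		fmt.Printf("OVERLAP: local ports %d-%d vs NodePorts %d-%d\n",
			localMin, localMax, nodePortMin, nodePortMax)
	} else {
		fmt.Println("no overlap")
	}
}
Here the ranges 1024-65000 and 20000-40000 clearly overlap.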
5 Root cause analysis
Because the local port range overlaps the NodePort range, the kubelet, acting as the client for the probe, can be handed a source port that has already been allocated as a NodePort. nginx's SYN-ACK reply is then RST by ipvs (from ipvs's point of view: "I never saw a SYN, yet here comes a SYN-ACK; this reply is invalid, reset it").
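This can be reproduced directly by forcing the client's source port to collide with a NodePort. A sketch using the addresses and the conflicting source port 36704 from the capture above (these values are specific to that environment; on an unaffected node the dial simply succeeds):
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Bind the client side to a source port that is also an ipvs NodePort
	// (36704 in the capture above), then dial the pod the way the probe does.
	d := net.Dialer{
		Timeout:   2 * time.Second,
		LocalAddr: &net.TCPAddr{IP: net.ParseIP("10.23.77.129"), Port: 36704},
	}
	conn, err := d.Dial("tcp", "10.23.77.138:80")
	if err != nil {
		// Expected on an affected node: the pod's SYN-ACK is RST by ipvs,
		// so the handshake never completes and the dial times out.
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected; this source port does not conflict on this node")
}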
5.1 Why did the problem surface after the upgrade?
In other words, why did the probability of the kubelet probing from an already-allocated NodePort rise so sharply?
Looking at TIME-WAIT connections: the ones from 10.10.85.105 and 10.23.77.129 had existed since before the Kubernetes upgrade, while the ones from 169.254.20.10 were new.
# ss -tan | grep "TIME-WAIT"| awk '{print $4}' |awk -F':' '{print $1}' |sort |uniq -c
161 10.10.85.105 # these are all zabbix
849 10.23.77.129 # these are all from kubelet probes
69 169.254.20.10
# ss -tan | grep "TIME-WAIT" |grep 169.254.20.10
TIME-WAIT 0 0 169.254.20.10:50967 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:56855 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:51293 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:57665 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:49517 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:51675 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:54261 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:53603 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:51469 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:56591 169.254.20.10:8080
TIME-WAIT 0 0 169.254.20.10:56233 169.254.20.10:8080
Capturing on lo showed that some process was constantly making short-lived requests to 169.254.20.10:8080:
# tcpdump -i lo -nn ip host 169.254.20.10 -A
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
23:59:40.058507 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [S], seq 2226425339, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4.)@.@......
...
................|7..............
23:59:40.058520 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [S.], seq 2917381382, ack 2226425340, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4..@.@......
...
................|7..............
23:59:40.058528 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [.], ack 1, win 512, length 0
E..(.*@.@......
...
............P...|+..
23:59:40.058647 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [P.], seq 1:100, ack 1, win 512, length 99: HTTP: GET / HTTP/1.1
E....+@[email protected]...
...
............P...|...GET / HTTP/1.1
Host: 169.254.20.10:8080
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
23:59:40.058660 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [.], ack 100, win 511, length 0
E..(..@[email protected]....
...
..........._P...|+..
23:59:40.058754 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [P.], seq 1:177, ack 100, win 512, length 176: HTTP: HTTP/1.1 404 Not Found
E.....@[email protected]....
...
..........._P...|...HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Fri, 16 Sep 2022 15:59:40 GMT
Content-Length: 19
404 page not found
23:59:40.058766 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [.], ack 177, win 511, length 0
E..(.,@.@......
...
......._....P...|+..
23:59:40.058810 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [F.], seq 100, ack 177, win 512, length 0
E..(.-@.@......
...
......._....P...|+..
23:59:40.058837 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [F.], seq 177, ack 101, win 512, length 0
E..(..@[email protected]....
...
...........`P...|+..
23:59:40.058842 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [.], ack 178, win 512, length 0
E..(..@.@......
...
.......`....P...|+..
23:59:41.059195 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [S], seq 1100978375, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4:]@[email protected]...
...
.3..A...........|7..............
23:59:41.059208 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [S.], seq 1286559291, ack 1100978376, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4..@.@......
...
...3L.V;A.......|7..............
23:59:41.059215 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [.], ack 1, win 512, length 0
E..(:^@[email protected]...
...
.3..A...L.V<P...|+..
23:59:41.059325 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [P.], seq 1:100, ack 1, win 512, length 99: HTTP: GET / HTTP/1.1
E...:_@.@......
...
.3..A...L.V<P...|...GET / HTTP/1.1
Host: 169.254.20.10:8080
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip
23:59:41.059340 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [.], ack 100, win 511, length 0
E..(..@.@......
...
...3L.V<A..+P...|+..
23:59:41.059409 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [P.], seq 1:177, ack 100, win 512, length 176: HTTP: HTTP/1.1 404 Not Found
E.....@[email protected]...
...
...3L.V<A..+P...|...HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Fri, 16 Sep 2022 15:59:41 GMT
Content-Length: 19
404 page not found
23:59:41.059418 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [.], ack 177, win 511, length 0
E..(:`@.@.._...
...
.3..A..+L.V.P...|+..
23:59:41.059479 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [F.], seq 100, ack 177, win 512, length 0
E..(:a@.@..^...
...
.3..A..+L.V.P...|+..
23:59:41.059501 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [F.], seq 177, ack 101, win 512, length 0
E..(..@.@......
...
...3L.V.A..,P...|+..
23:59:41.059505 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [.], ack 178, win 512, length 0
169.254.20.10:8080 is listened on by node-local-dns (deployed as a DaemonSet with host networking), but the client could never be caught with ss or lsof. Finally, by strace-ing every local process one by one, I found that it was node-local-dns talking to itself:
381230 18:49:16 <... nanosleep resumed> NULL) = 0
381230 18:49:16 futex(0xc00021f150, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
381210 18:49:16 write(9, "GET / HTTP/1.1\r\nHost: 169.254.20.10:8080\r\nUser-Agent: Go-http-client/1.1\r\nAccept-Encoding: gzip\r\n\r\n", 99 <unfinished ...>
381203 18:49:16 <... nanosleep resumed> NULL) = 0
381210 18:49:16 <... write resumed> ) = 99
381203 18:49:16 nanosleep({0, 20000}, <unfinished ...>
381190 18:49:16 <... epoll_pwait resumed> [{EPOLLIN|EPOLLOUT, {u32=1030661536, u64=139639007389088}}], 128, 201, NULL, 7) = 1
381210 18:49:16 futex(0xc00021f150, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
381190 18:49:16 read(16, <unfinished ...>
381210 18:49:16 <... futex resumed> ) = 1
381230 18:49:16 <... futex resumed> ) = 0
381190 18:49:16 <... read resumed> "GET / HTTP/1.1\r\nHost: 169.254.20.10:8080\r\nUser-Agent: Go-http-client/1.1\r\nAccept-Encoding: gzip\r\n\r\n", 4096) = 99
So the root cause is that node-local-dns keeps connecting to itself, which consumes local ports much faster (the ephemeral port allocation rolls forward quickly) and therefore rolls onto already-allocated NodePort ports much sooner.
5.2 Why does node-local-dns keep accessing itself?
node-local-dns is configured with a health block, so it keeps requesting the configured 169.254.20.10:8080 to measure the health response time for a metric. node-local-dns's health check differs from CoreDNS's: CoreDNS requests /health, whereas node-local-dns requests / (which simply returns 404, making this health check meaningless; node-local-dns 1.21.1 is built on the CoreDNS 1.7 code, and this was fixed in CoreDNS 1.8.1).
cluster.local:53 {
errors
cache {
success 9984 30
denial 9984 5
}
reload
loop
bind 169.254.20.10
forward . 10.23.252.2 {
force_tcp
}
prometheus :9253
health 169.254.20.10:8080
}
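For intuition, this is roughly the kind of loop the health check amounts to (a sketch only, not the actual node-local-dns/CoreDNS plugin code, and the non-reuse of connections is an assumption based on the capture above): one short-lived HTTP request per tick, each consuming a fresh ephemeral source port.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Timeout: time.Second,
		// Force a brand-new connection (and therefore a new ephemeral
		// source port) for every probe instead of reusing one.
		Transport: &http.Transport{DisableKeepAlives: true},
	}
	for range time.Tick(time.Second) {
		start := time.Now()
		resp, err := client.Get("http://169.254.20.10:8080/")
		if err != nil {
			fmt.Println("health probe error:", err)
			continue
		}
		resp.Body.Close()
		// The latency is what gets recorded as a metric; with the 1.7-era
		// code the request goes to "/" and the 404 is accepted anyway.
		fmt.Println("health probe latency:", time.Since(start))
	}
}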
6 Solution
Make the system's local port range no longer overlap the NodePort range. In my case I set net.ipv4.ip_local_reserved_ports to the NodePort range (20000-40000 here), so ephemeral port allocation skips those ports. Alternatively, net.ipv4.ip_local_port_range can be changed directly so that it does not overlap the NodePort range.
7 Summary
The immediate trigger: node-local-dns, with health enabled and host networking, keeps hitting its own port 8080 over short-lived connections, which speeds up ephemeral port rollover and increases the probability that the kubelet probes from an already-allocated NodePort.
The root cause: the system's local port range overlaps the NodePort range, so when the kubelet probes from an already-allocated NodePort, the connection cannot be established (during the handshake, every reply from the pod is RST).
Underlying problems
- The system's local port range overlaps the NodePort range
- The Kubernetes upgrade and the introduction of node-local-dns were rolled out at the same time
Lessons learned
- Minimize each change (much like minimizing code commits); it reduces the complexity of diagnosing problems that appear after a change and speeds up resolution.
- When no matching issue can be found upstream, investigate the environment first, then look for software bugs.