Many pod probe failures after a Kubernetes upgrade

Background

After upgrading the Kubernetes cluster from 1.18 to 1.23, many pods that had been running fine started restarting because their liveness probes failed, and some others began failing their readiness probes.

At first we doubted that the restarts were caused by the Kubernetes upgrade itself: the container hash algorithm did not change between 1.18 and 1.23, so upgrading the kubelet does not recreate containers that are already running. Further checking showed that the restarts happened some time after the upgrade, not at the moment of the upgrade, which ruled out restarts triggered by the kubelet upgrade.

Comparing kubelet logs from the old and new versions, the 1.23 kubelet contained many errors like the one below, roughly one every few minutes:

text

I0915 19:09:04.534948  607529 prober.go:121] "Probe failed" probeType="Readiness" pod="default/nginx-5dfb7c896c-x2q2w" podUID=3daea6ef-f286-43f2-8825-51edaa3994d3 containerName="nginx" probeResult=failure output="Get \"http://10.23.77.138:80/\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

Investigation

Following the source location in the log message, the error context deadline exceeded (Client.Timeout exceeded while awaiting headers) was traced to the client in the net/http library.

Searching for the causes of this error turned up two scenarios:

  1. The connection was established, but the server responded slowly

    Troubleshooting the Go http client

    Troubleshooting HTTP timeouts in Go

  2. The connection could not be established before the timeout

    Troubleshooting HTTP timeouts in Go

    Verifying TCP SYN timeout and retransmission in Go

golang http client connection pool

golang http client connection pool not reusing connections

golang net/http timeouts causing a large number of FIN-WAIT-2 connections

After skimming the client and transport code in the net/http library, I began to suspect a connection-establishment timeout.

I deployed an nginx Deployment with a readiness probe configured. Why nginx? Because nginx has both an access log and an error log, which makes it easy to verify whether the kubelet actually performed the probe.

text

        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 80
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
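
For reference, the whole test Deployment can be created with something like the following sketch (the image tag and names are placeholders; only the readinessProbe block above comes from the actual manifest):

text

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.21        # placeholder image
        ports:
        - containerPort: 80
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 80
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
EOF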

At the same time I captured packets inside the pod's network namespace and waited for a probe failure to show up:

text

(nsenter -t 972063 -n tcpdump -i eth0 -s0 -G 600 -Z root -w %Y_%m%d_%H%M_%S.pcap &)
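
The 972063 here is the PID of a process inside the target pod; with a Docker runtime it can be looked up roughly like this (the grep pattern and <container-id> are placeholders):

text

# locate the pod's container and get its PID to pass to nsenter -t
docker ps | grep nginx-5dfb7c896c-x2q2w
docker inspect -f '{{.State.Pid}}' <container-id>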

The patience paid off: the packets of a failed probe were finally captured.

text

No.     Time                          Source                Destination           Protocol Length Info
    649 2022-09-15 19:09:02.533940    10.23.77.129          10.23.77.138          TCP      66     36704 → 80 [SYN] Seq=4142039474 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128

Frame 649: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039474, Len: 0

No.     Time                          Source                Destination           Protocol Length Info
    650 2022-09-15 19:09:02.533962    10.23.77.138          10.23.77.129          TCP      66     80 → 36704 [SYN, ACK] Seq=1523494022 Ack=4142039475 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128

Frame 650: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05), Dst: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5)
Internet Protocol Version 4, Src: 10.23.77.138, Dst: 10.23.77.129
Transmission Control Protocol, Src Port: 80, Dst Port: 36704, Seq: 1523494022, Ack: 4142039475, Len: 0

No.     Time                          Source                Destination           Protocol Length Info
    651 2022-09-15 19:09:02.534203    10.23.77.129          10.23.77.138          TCP      54     36704 → 80 [RST] Seq=4142039475 Win=0 Len=0

Frame 651: 54 bytes on wire (432 bits), 54 bytes captured (432 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039475, Len: 0

No.     Time                          Source                Destination           Protocol Length Info
    652 2022-09-15 19:09:03.564712    10.23.77.129          10.23.77.138          TCP      66     [TCP Retransmission] 36704 → 80 [SYN] Seq=4142039474 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128

Frame 652: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039474, Len: 0

No.     Time                          Source                Destination           Protocol Length Info
    653 2022-09-15 19:09:03.564737    10.23.77.138          10.23.77.129          TCP      66     [TCP Previous segment not captured] [TCP Port numbers reused] 80 → 36704 [SYN, ACK] Seq=1539599868 Ack=4142039475 Win=64240 Len=0 MSS=1460 SACK_PERM=1 WS=128

Frame 653: 66 bytes on wire (528 bits), 66 bytes captured (528 bits)
Ethernet II, Src: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05), Dst: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5)
Internet Protocol Version 4, Src: 10.23.77.138, Dst: 10.23.77.129
Transmission Control Protocol, Src Port: 80, Dst Port: 36704, Seq: 1539599868, Ack: 4142039475, Len: 0

No.     Time                          Source                Destination           Protocol Length Info
    654 2022-09-15 19:09:03.564939    10.23.77.129          10.23.77.138          TCP      54     36704 → 80 [RST] Seq=4142039475 Win=0 Len=0

Frame 654: 54 bytes on wire (432 bits), 54 bytes captured (432 bits)
Ethernet II, Src: 26:ed:79:5c:ea:b5 (26:ed:79:5c:ea:b5), Dst: 7e:35:4e:01:11:05 (7e:35:4e:01:11:05)
Internet Protocol Version 4, Src: 10.23.77.129, Dst: 10.23.77.138
Transmission Control Protocol, Src Port: 36704, Dst Port: 80, Seq: 4142039475, Len: 0

nginx access log

text

10.23.77.129 - - [15/Sep/2022:11:08:42 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.23.77.129 - - [15/Sep/2022:11:08:52 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.23.77.129 - - [15/Sep/2022:11:09:12 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.23.77.129 - - [15/Sep/2022:11:09:22 +0000] "GET / HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"

kubelet log

text

I0915 19:09:04.534948  607529 prober.go:121] "Probe failed" probeType="Readiness" pod="default/nginx-5dfb7c896c-x2q2w" podUID=3daea6ef-f286-43f2-8825-51edaa3994d3 containerName="nginx" probeResult=failure output="Get \"http://10.23.77.138:80/\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

The probe runs every 10 seconds, so there should have been an access log entry at 15/Sep/2022:11:09:02, yet nginx has none. The kubelet log also shows that the probe executed at 15/Sep/2022:11:09:02 failed.

In the first packet the kubelet sends a SYN to nginx; in the second packet nginx replies with SYN,ACK; the third packet is an immediate RST. One second later the kubelet retransmits the SYN, nginx again replies with SYN,ACK, and the final packet is again an RST.
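
To pull just these frames out of a rotating capture, a display filter along the following lines helps (the pcap file name is a placeholder matching the -G/-w rotation pattern above):

text

# show only the handshake and RST frames of the failed probe's source port
tshark -r 2022_0915_1900_00.pcap -Y 'tcp.port == 36704 && (tcp.flags.syn == 1 || tcp.flags.reset == 1)'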

So the key question became: why does the kubelet side respond to nginx's SYN,ACK with an RST? I suspected either a Kubernetes version difference or a Go version difference (the 1.18 kubelet was built with Go 1.13 — we had rebuilt it with Go 1.16.5 to fix the 5xx errors caused by GOAWAY — while 1.23 is built with Go 1.17).

Comparing Kubernetes 1.18.20 and 1.23.10 and reviewing the probe-related changes showed that the core probe logic had not changed, which ruled out a Kubernetes-level regression.

Comparing Go 1.16.5 and 1.17.13 showed essentially no changes in the net library.
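
The comparison itself was just a diff between release tags, roughly as follows (the path filters are the obvious candidates, not an exhaustive list):

text

# probe-related changes between the two Kubernetes releases
git clone https://github.com/kubernetes/kubernetes && cd kubernetes
git diff v1.18.20 v1.23.10 -- pkg/kubelet/prober pkg/probe

# net library changes between the two Go releases
git clone https://go.googlesource.com/go && cd go
git diff go1.16.5 go1.17.13 -- src/net src/net/http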

At that point I went back and read the Go net library again, wondering whether a failed connection attempt would make the code actively cancel the connection and send an RST.

The code was too complex to follow all the way through, so I searched for how Go can send an RST and found that net.TCPConn.SetLinger() can force one.

Forcefully Close TCP Connections in Golang

Three days into the investigation, intuition told me I was probably heading in the wrong direction:

  1. If this were a Kubernetes bug, there should be plenty of community reports, but searching only turned up a few issues with unknown causes that had sat unfixed for a long time:

    https://github.com/kubernetes/kubernetes/issues/89898

    https://github.com/kubernetes/kubernetes/issues/110996

  2. If it were a Go bug, an issue should have been filed long ago, yet there was none, and the net library code had barely changed.

Next I checked the conntrack table: could port reuse with an identical four-tuple make the new connection collide with an old entry and get RST? (This was a guess about how the kernel behaves.) The relevant kernel parameters were net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120 and net.ipv4.tcp_tw_reuse = 1, meaning that reusing a port within two minutes would make the new connection match an existing TIME-WAIT entry. I went through the packets from the 10 minutes before the failed probe and found no reused ports.
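
The conntrack side of that check looked roughly like this sketch (requires conntrack-tools):

text

# TIME-WAIT related kernel parameters
sysctl net.netfilter.nf_conntrack_tcp_timeout_time_wait net.ipv4.tcp_tw_reuse

# conntrack entries towards the probed pod; look for a four-tuple matching the failed probe
conntrack -L -p tcp --dst 10.23.77.138 --dport 80 2>/dev/null | grep TIME_WAIT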

Then I checked the IPVS table: was the port being listened on by IPVS, with the RST actually sent by IPVS?

The IPVS table did contain a number of random-looking local ports, and 36704, the source port of the failed probe in the capture, was among them:

text

# ipvsadm -l -n | grep TCP | less
TCP  169.254.20.10:20869 rr
TCP  169.254.20.10:21980 rr
TCP  169.254.20.10:23593 rr
TCP  169.254.20.10:24795 rr
TCP  169.254.20.10:24855 rr
TCP  169.254.20.10:28796 rr
TCP  169.254.20.10:29385 rr
TCP  169.254.20.10:29515 rr
TCP  169.254.20.10:31221 rr
TCP  169.254.20.10:32080 rr
TCP  169.254.20.10:32443 rr
TCP  169.254.20.10:35723 rr
TCP  169.254.20.10:35934 rr
TCP  169.254.20.10:36321 rr
TCP  169.254.20.10:36704 rr
TCP  169.254.20.10:37629 rr
TCP  169.254.20.10:38820 rr
TCP  169.254.20.10:39080 rr
TCP  10.10.85.235:20869 rr
TCP  10.10.85.235:21980 rr
TCP  10.10.85.235:23593 rr
TCP  10.10.85.235:24795 rr
TCP  10.10.85.235:24855 rr
TCP  10.10.85.235:28796 rr
TCP  10.10.85.235:29385 rr
TCP  10.10.85.235:29515 rr
TCP  10.10.85.235:31221 rr
TCP  10.10.85.235:32080 rr
TCP  10.10.85.235:32443 rr
TCP  10.10.85.235:35723 rr
TCP  10.10.85.235:35934 rr
TCP  10.10.85.235:36321 rr
TCP  10.10.85.235:36704 rr

These ports are the NodePorts of Services:

text

# kubectl  get svc -A |grep :
fe-node-online                        bos-fe-xapi-node-online-1                                         LoadBalancer   10.23.137.123   10.10.112.20    7071:35934/TCP               193d
fe-node-online                        bos-fe-xapi-node-online-2                                         LoadBalancer   10.23.227.91    10.10.121.167   7071:28796/TCP               193d
kube-system                           kubernetes-dashboard                                              NodePort       10.23.254.52    <none>          443:36704/TCP                2y205d
kube-system                           traefik-ingress                                                   LoadBalancer   10.23.192.176   10.10.249.145   80:32080/TCP,443:32443/TCP   287d
kube-system                           traefik-ingress2                                                  LoadBalancer   10.23.209.239   10.10.250.197   80:24855/TCP,443:35723/TCP   211d
kube-system                           traefik-ingress3                                                  LoadBalancer   10.23.216.1     10.10.250.174   80:37629/TCP,443:31221/TCP   211d
kube-system                           traefik-ingress4                                                  LoadBalancer   10.23.214.152   10.10.248.130   80:36321/TCP,443:24795/TCP   211d
kube-system                           traefik-ingress5                                                  LoadBalancer   10.23.134.18    10.10.250.119   80:21980/TCP,443:20869/TCP   211d
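
A more direct way to confirm that a given source port is an allocated NodePort is to dump every Service's nodePorts and grep for it (a sketch):

text

kubectl get svc -A -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,NODEPORTS:.spec.ports[*].nodePort' | grep 36704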

The NodePort range configured on the apiserver:

text

--service-node-port-range=20000-40000

The node's local port range:

text

net.ipv4.ip_local_port_range = 1024	65000

Because the local port range overlaps with the NodePort range, the kubelet, acting as the client for the probe, can be assigned a source port that is already allocated as a NodePort. The SYN,ACK that nginx sends back is then RST by IPVS (from IPVS's point of view: "I never saw a SYN, yet here comes a SYN,ACK; this reply is invalid, RST it").

But why did the probability of the kubelet probing from an already-allocated NodePort increase so much?

Looking at TIME-WAIT connections: 10.10.85.105 and 10.23.77.129 had existed all along, before the Kubernetes upgrade, while the connections from 169.254.20.10 were new:

text

# ss -tan  | grep "TIME-WAIT"|  awk '{print $4}' |awk -F':' '{print $1}' |sort |uniq -c
    161 10.10.85.105  # all from zabbix
    849 10.23.77.129  # all from kubelet probes
     69 169.254.20.10

text

# ss -tan  | grep "TIME-WAIT" |grep 169.254.20.10
TIME-WAIT  0      0      169.254.20.10:50967              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:56855              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:51293              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:57665              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:49517              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:51675              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:54261              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:53603              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:51469              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:56591              169.254.20.10:8080               
TIME-WAIT  0      0      169.254.20.10:56233              169.254.20.10:8080    

Capturing on lo showed a process constantly making short-lived requests to 169.254.20.10:8080:

text

# tcpdump -i lo -nn ip host 169.254.20.10 -A
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
23:59:40.058507 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [S], seq 2226425339, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4.)@.@......
...
................|7..............
23:59:40.058520 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [S.], seq 2917381382, ack 2226425340, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4..@.@......
...
................|7..............
23:59:40.058528 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [.], ack 1, win 512, length 0
E..(.*@.@......
...
............P...|+..
23:59:40.058647 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [P.], seq 1:100, ack 1, win 512, length 99: HTTP: GET / HTTP/1.1
E....+@[email protected]...
...
............P...|...GET / HTTP/1.1
Host: 169.254.20.10:8080
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


23:59:40.058660 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [.], ack 100, win 511, length 0
E..(..@[email protected]....
...
..........._P...|+..
23:59:40.058754 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [P.], seq 1:177, ack 100, win 512, length 176: HTTP: HTTP/1.1 404 Not Found
E.....@[email protected]....
...
..........._P...|...HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Fri, 16 Sep 2022 15:59:40 GMT
Content-Length: 19

404 page not found

23:59:40.058766 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [.], ack 177, win 511, length 0
E..(.,@.@......
...
......._....P...|+..
23:59:40.058810 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [F.], seq 100, ack 177, win 512, length 0
E..(.-@.@......
...
......._....P...|+..
23:59:40.058837 IP 169.254.20.10.8080 > 169.254.20.10.47507: Flags [F.], seq 177, ack 101, win 512, length 0
E..(..@[email protected]....
...
...........`P...|+..
23:59:40.058842 IP 169.254.20.10.47507 > 169.254.20.10.8080: Flags [.], ack 178, win 512, length 0
E..(..@.@......
...
.......`....P...|+..
23:59:41.059195 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [S], seq 1100978375, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4:]@[email protected]...
...
.3..A...........|7..............
23:59:41.059208 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [S.], seq 1286559291, ack 1100978376, win 65495, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
E..4..@.@......
...
...3L.V;A.......|7..............
23:59:41.059215 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [.], ack 1, win 512, length 0
E..(:^@[email protected]...
...
.3..A...L.V<P...|+..
23:59:41.059325 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [P.], seq 1:100, ack 1, win 512, length 99: HTTP: GET / HTTP/1.1
E...:_@.@......
...
.3..A...L.V<P...|...GET / HTTP/1.1
Host: 169.254.20.10:8080
User-Agent: Go-http-client/1.1
Accept-Encoding: gzip


23:59:41.059340 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [.], ack 100, win 511, length 0
E..(..@.@......
...
...3L.V<A..+P...|+..
23:59:41.059409 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [P.], seq 1:177, ack 100, win 512, length 176: HTTP: HTTP/1.1 404 Not Found
E.....@[email protected]...
...
...3L.V<A..+P...|...HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Fri, 16 Sep 2022 15:59:41 GMT
Content-Length: 19

404 page not found

23:59:41.059418 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [.], ack 177, win 511, length 0
E..(:`@.@.._...
...
.3..A..+L.V.P...|+..
23:59:41.059479 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [F.], seq 100, ack 177, win 512, length 0
E..(:a@.@..^...
...
.3..A..+L.V.P...|+..
23:59:41.059501 IP 169.254.20.10.8080 > 169.254.20.10.47667: Flags [F.], seq 177, ack 101, win 512, length 0
E..(..@.@......
...
...3L.V.A..,P...|+..
23:59:41.059505 IP 169.254.20.10.47667 > 169.254.20.10.8080: Flags [.], ack 178, win 512, length 0

169.254.20.10:8080 is listened on by node-local-dns (deployed as a DaemonSet with host networking), but the client could never be caught with ss or lsof. In the end, by strace-ing every local process in turn, it turned out that node-local-dns was requesting itself.
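
Since the connections are too short-lived for ss or lsof to catch, the brute-force approach was to attach strace to every process for a few seconds and look for the request; a rough sketch:

text

# attach briefly to every PID and look for writes containing the suspicious Host header
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    if timeout 5 strace -f -p "$pid" -e trace=write -s 200 2>&1 | grep -q '169.254.20.10:8080'; then
        echo "found it: pid $pid"; ps -p "$pid" -o pid,comm,args
    fi
done

The strace output that finally caught it: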

text

381230 18:49:16 <... nanosleep resumed> NULL) = 0
381230 18:49:16 futex(0xc00021f150, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
381210 18:49:16 write(9, "GET / HTTP/1.1\r\nHost: 169.254.20.10:8080\r\nUser-Agent: Go-http-client/1.1\r\nAccept-Encoding: gzip\r\n\r\n", 99 <unfinished ...>
381203 18:49:16 <... nanosleep resumed> NULL) = 0
381210 18:49:16 <... write resumed> )   = 99
381203 18:49:16 nanosleep({0, 20000},  <unfinished ...>
381190 18:49:16 <... epoll_pwait resumed> [{EPOLLIN|EPOLLOUT, {u32=1030661536, u64=139639007389088}}], 128, 201, NULL, 7) = 1
381210 18:49:16 futex(0xc00021f150, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
381190 18:49:16 read(16,  <unfinished ...>
381210 18:49:16 <... futex resumed> )   = 1
381230 18:49:16 <... futex resumed> )   = 0
381190 18:49:16 <... read resumed> "GET / HTTP/1.1\r\nHost: 169.254.20.10:8080\r\nUser-Agent: Go-http-client/1.1\r\nAccept-Encoding: gzip\r\n\r\n", 4096) = 99

So the root cause chain is: node-local-dns keeps requesting itself, which consumes local ports quickly (the ephemeral port allocation rolls forward faster) and therefore rolls onto already-allocated NodePort ports much sooner.

node-local-dns has a health configuration: it continuously requests the configured 169.254.20.10:8080 to measure the health response time for a metric. Its health check differs from CoreDNS's: CoreDNS requests /health, while node-local-dns requests / (which simply returns 404, so this health check is meaningless; node-local-dns 1.21.1 is based on CoreDNS 1.7 code, and the problem was fixed in CoreDNS 1.8.1).

text

cluster.local:53 {
    errors
    cache {
            success 9984 30
            denial 9984 5
    }
    reload
    loop
    bind 169.254.20.10 
    forward . 10.23.252.2 {
            force_tcp
    }
    prometheus :9253
    health 169.254.20.10:8080
    }
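
A quick check from the node confirms that / on the health endpoint returns 404, matching the captures:

text

# node-local-dns health endpoint; / is not a registered handler, so it returns 404
curl -s -o /dev/null -w '%{http_code}\n' http://169.254.20.10:8080/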

The fix is to make the system's local port range no longer overlap with the NodePort range. Here I set net.ipv4.ip_local_reserved_ports to the NodePort range; alternatively, net.ipv4.ip_local_port_range can be changed so that it does not overlap with the NodePort range.
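
A minimal sketch of the reserved-ports approach (the sysctl.d file name is arbitrary; 20000-40000 matches the --service-node-port-range above):

text

cat <<'EOF' > /etc/sysctl.d/99-reserved-nodeports.conf
# keep the kernel from handing out NodePort-range ports as ephemeral source ports
net.ipv4.ip_local_reserved_ports = 20000-40000
EOF
sysctl --system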

Direct cause: node-local-dns runs with health enabled and host networking, continuously making short-lived connections to its own port 8080. This speeds up local port rollover and increases the probability that the kubelet probes from an already-allocated NodePort.

Root cause: the system's local port range overlaps with the NodePort range, so when the kubelet probes from an already-allocated NodePort, the connection cannot be established (during the handshake, every packet the pod sends back is RST by IPVS).

Problems

  1. The system's local port range overlaps with the NodePort range
  2. The Kubernetes upgrade and the rollout of node-local-dns were performed at the same time

Lessons learned

  1. Minimize each change (similar to minimal code commits) to reduce the complexity of debugging whatever breaks afterwards and to speed up resolution.
  2. When no matching issue can be found, rule out environmental problems first before suspecting a software bug.

Related content