kubenet IP leak

Recently, after a Docker upgrade, some pods stayed in the Pending state; the reason was that they could not get an IP. The root cause turned out to be that the Docker upgrade had been performed incorrectly, which caused a kubenet IP leak and eventually left no IPs available to allocate.

Environment: Kubernetes 1.18.8, kubenet network mode, a maximum of 125 pods per node, and a /25 podCIDR per node.
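
The node's podCIDR and pod capacity can be confirmed with kubectl (the node name here is the one that appears in the events below):

shell

# prints the node's podCIDR and its allocatable pod count
kubectl get node 10.12.97.31 -o jsonpath='{.spec.podCIDR}{"\n"}{.status.allocatable.pods}{"\n"}'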

The pod's events show that creating the sandbox keeps failing; the sandbox is created and destroyed over and over, and describe pod shows the pod stuck in Pending:

shell

Warning  FailedCreatePodSandBox  3m20s (x30245 over 9h)  kubelet, 10.12.97.31  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "db90a3a26c158a70e4d251336fa62f9f32f7b0643a6ad23d52cdfea5e96c3412" network for pod "saas-o2o-public-notification-task-tomcat-dev-7c76c6984d-hjn4l": networkPlugin kubenet failed to set up pod "saas-o2o-public-notification-task-tomcat-dev-7c76c6984d-hjn4l_saas-o2o-public-tomcat-dev" network: error adding container to network: failed to allocate for range 0: no IP addresses available in range set: 10.253.6.129-10.253.6.254

Based on the message above, it looks like the node has run out of pod IPs.

Check how many pod IPs are actually in use on the node:

shell

# kubectl get pod -A -o custom-columns=:status.podIP --no-headers  --field-selector=spec.nodeName=10.12.97.31 |grep -v 10.12.97.31|grep -v "<none>" |sort |wc -l
118

So it looks like the IPs are not actually exhausted: a /25 podCIDR contains 128 addresses, three of which cannot be used (the network address, the gateway and the broadcast address), leaving 125 allocatable, yet only 118 are in use.

To troubleshoot this, we first need to understand how kubenet allocates IPs.

kubenet is a special CNI plugin built into kubelet that only supports the Docker runtime; it wraps the bridge, lo (loopback), host-local and portmap plugins into a single CNI plugin.

kubenet uses the host-local plugin for IP allocation. host-local creates a file named after each allocated IP under /var/lib/cni/networks/kubenet, alongside last_reserved_ip.0 and lock:

  • IP file: one per allocated IP; it contains the container ID and the interface name.
  • last_reserved_ip.0: records the last allocated IP; the next allocation starts searching for a free IP from the address after it.
  • lock: a lock file that prevents concurrent allocations from conflicting.

The CNI plugin also writes two files per container under /var/lib/cni/cache/results/, named kubenet-{container id}-eth0 and kubenet-loopback-{container id}-lo, which record the IP information of each of the container's interfaces.

For example, if 10.253.9.21 was allocated:

shell

# cat /var/lib/cni/networks/kubenet/10.253.9.21
1cfead4f783bbd3928acd6b23e283e9d434443df74743fe65206b569d088a48e
eth0

# cat /var/lib/cni/cache/results/kubenet-1cfead4f783bbd3928acd6b23e283e9d434443df74743fe65206b569d088a48e-eth0
{"cniVersion":"0.2.0","ip4":{"ip":"10.253.9.21/25","gateway":"10.253.9.1","routes":[{"dst":"0.0.0.0/0"}]},"dns":{}}

# cat /var/lib/cni/cache/results/kubenet-loopback-1cfead4f783bbd3928acd6b23e283e9d434443df74743fe65206b569d088a48e-lo
{"cniVersion":"0.2.0","ip4":{"ip":"127.0.0.1/8"},"dns":{}}

With kubenet's allocation mechanism understood, look at the IPs already allocated on the node:

shell

# ls /var/lib/cni/networks/kubenet/
10.253.6.130  10.253.6.141  10.253.6.152  10.253.6.163  10.253.6.174  10.253.6.185  10.253.6.196  10.253.6.207  10.253.6.218  10.253.6.229  10.253.6.240  10.253.6.251
10.253.6.131  10.253.6.142  10.253.6.153  10.253.6.164  10.253.6.175  10.253.6.186  10.253.6.197  10.253.6.208  10.253.6.219  10.253.6.230  10.253.6.241  10.253.6.252
10.253.6.132  10.253.6.143  10.253.6.154  10.253.6.165  10.253.6.176  10.253.6.187  10.253.6.198  10.253.6.209  10.253.6.220  10.253.6.231  10.253.6.242  10.253.6.253
10.253.6.133  10.253.6.144  10.253.6.155  10.253.6.166  10.253.6.177  10.253.6.188  10.253.6.199  10.253.6.210  10.253.6.221  10.253.6.232  10.253.6.243  10.253.6.254
10.253.6.134  10.253.6.145  10.253.6.156  10.253.6.167  10.253.6.178  10.253.6.189  10.253.6.200  10.253.6.211  10.253.6.222  10.253.6.233  10.253.6.244  last_reserved_ip.0
10.253.6.135  10.253.6.146  10.253.6.157  10.253.6.168  10.253.6.179  10.253.6.190  10.253.6.201  10.253.6.212  10.253.6.223  10.253.6.234  10.253.6.245  lock
10.253.6.136  10.253.6.147  10.253.6.158  10.253.6.169  10.253.6.180  10.253.6.191  10.253.6.202  10.253.6.213  10.253.6.224  10.253.6.235  10.253.6.246
10.253.6.137  10.253.6.148  10.253.6.159  10.253.6.170  10.253.6.181  10.253.6.192  10.253.6.203  10.253.6.214  10.253.6.225  10.253.6.236  10.253.6.247
10.253.6.138  10.253.6.149  10.253.6.160  10.253.6.171  10.253.6.182  10.253.6.193  10.253.6.204  10.253.6.215  10.253.6.226  10.253.6.237  10.253.6.248
10.253.6.139  10.253.6.150  10.253.6.161  10.253.6.172  10.253.6.183  10.253.6.194  10.253.6.205  10.253.6.216  10.253.6.227  10.253.6.238  10.253.6.249
10.253.6.140  10.253.6.151  10.253.6.162  10.253.6.173  10.253.6.184  10.253.6.195  10.253.6.206  10.253.6.217  10.253.6.228  10.253.6.239  10.253.6.250

So indeed every allocatable IP has been handed out, but the number of IPs actually in use does not match the number allocated; in other words, some IPs were never reclaimed.
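
Counting the allocation files directly makes the mismatch concrete (125 files here, versus the 118 in-use pod IPs counted above):

shell

# number of IP files host-local has created on this node (excluding last_reserved_ip.0 and lock)
ls /var/lib/cni/networks/kubenet/ | grep -vE "last_reserved_ip.0|lock" | wc -l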

Find the IPs that are allocated but not in use:

shell

# requires kubectl on the node; not recommended
ips=$(kubectl get pod -A -o custom-columns=:status.podIP --no-headers  --field-selector=spec.nodeName=10.12.97.31 |grep -v 10.12.97.31|grep -v "<none>"); \
alloc_ips=$(ls /var/lib/cni/networks/kubenet/ |grep -v -E "last_reserved_ip.0|lock" ); \
for ip in $alloc_ips;do \
    if ! echo $ips | grep "$ip" &>/dev/null;then \
    echo $ip; \
    fi; \
done

# recommended
all_containers=$(docker ps -a -q);for ip in $(ls /var/lib/cni/networks/kubenet/ |grep -v -E "last_reserved_ip.0|lock");do docker_id=$(head -n 1 /var/lib/cni/networks/kubenet/$ip| sed 's/\r//');  if [ -z "${docker_id}" -o -z "$(echo ${all_containers} | grep "${docker_id:0:8}")" ];then echo $ip;fi;done

# output
10.253.6.130
10.253.6.131
...

Take one of the reported IPs and check whether its container still exists; the container is indeed gone:

sh

cat /var/lib/cni/networks/kubenet/10.253.6.130
950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f
eth0

docker ps -a |grep 950b9

IP reclamation happens when kubelet destroys a pod: during teardown, kubelet calls the CNI plugin to release the IP. If IPs are leaking, this step must have failed.

In kubelet, pod destruction is triggered by GC, by eviction, and by pod deletion.

Searching the kubelet logs turns up errors right after kubelet started:

There are two key log lines:

Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: kubenet does not have netConfig. This is most likely due to lack of PodCIDR

StopPodSandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" from runtime service failed: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods

They show that when kubelet restarts, the network plugin is not yet initialized (the podCIDR has not been obtained yet), but kubelet already starts stopping the existing pods on the node; kubenet needs the podCIDR to tear a pod down, so the teardown fails.

sh

# these log lines are printed right after kubelet starts
I0326 19:16:59.441752  340113 server.go:393] Adding debug handlers to kubelet server.
E0326 19:16:59.452246  340113 kubelet.go:2188] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: kubenet does not have netConfig. This is most likely due to lack of PodCIDR # kubelet has detected that the kubenet network plugin has not finished initializing
I0326 19:16:59.452348  340113 clientconn.go:106] parsed scheme: "unix"
I0326 19:16:59.452354  340113 clientconn.go:106] scheme "unix" not registered, fallback to default scheme
I0326 19:16:59.452403  340113 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}
I0326 19:16:59.452408  340113 clientconn.go:933] ClientConn switching balancer to "pick_first"
I0326 19:16:59.452509  340113 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000dcfb80, {CONNECTING <nil>}
I0326 19:16:59.452644  340113 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000dcfb80, {READY <nil>}
I0326 19:16:59.453288  340113 factory.go:137] Registering containerd factory
I0326 19:16:59.455480  340113 kubelet_network_linux.go:150] Not using `--random-fully` in the MASQUERADE rule for iptables because the local version of iptables does not support it
I0326 19:16:59.457684  340113 status_manager.go:158] Starting to sync pod status with apiserver
I0326 19:16:59.457711  340113 kubelet.go:1822] Starting kubelet main sync loop.
E0326 19:16:59.457760  340113 kubelet.go:1846] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]

I0326 19:16:59.457873  340113 reflector.go:175] Starting reflector *v1beta1.RuntimeClass (0s) from k8s.io/client-go/informers/factory.go:135
I0326 19:16:59.473898  340113 factory.go:356] Registering Docker factory
I0326 19:16:59.473909  340113 factory.go:54] Registering systemd factory
I0326 19:16:59.474102  340113 factory.go:101] Registering Raw factory
I0326 19:16:59.474274  340113 manager.go:1158] Started watching for new ooms in manager
I0326 19:16:59.475292  340113 manager.go:272] Starting recovery of all containers
I0326 19:16:59.486891  340113 manager.go:277] Recovery completed
E0326 19:16:59.533375  340113 remote_runtime.go:128] StopPodSandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" from runtime service failed: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.533393  340113 kuberuntime_gc.go:170] Failed to stop sandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods

Find all the containers whose teardown failed and check whether they appear in the allocated IP files:

sh

# grep "Failed to stop sandbox" /data/log/kubernetes/kubelet.INFO
E0326 19:16:59.527369  340113 kuberuntime_gc.go:170] Failed to stop sandbox "decef236193c498235ab5efc33498d06abc34bea58ee7a68d1110228e4e59df2" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "debug-agent-2r94g_debug" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.528089  340113 kuberuntime_gc.go:170] Failed to stop sandbox "d54dd8336cdffd70a9266c24ab3aeff70a4fd1ab902b58afbb925449aba5a2bd" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "saas-mcloud-companyserver-ops-python-dev-555d77d9c7-kd5q5_saas-mcloud-python-dev" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.528802  340113 kuberuntime_gc.go:170] Failed to stop sandbox "cf15243eabb67c87bd93078e7a88b833996dda18d39c479363c62d0dd6ae391d" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "saas-jcpt-open-message-producer-service-tomcat-dev-6cc7fd6n8knt_saas-jcpt-tomcat-dev" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.530857  340113 kuberuntime_gc.go:170] Failed to stop sandbox "a1c4b1a54172d325df761de068e1ccb37040bfd7c175539912fa60232eca9b5e" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "fluentd-cz8qq_kube-system" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.532710  340113 kuberuntime_gc.go:170] Failed to stop sandbox "9b5d8a1367e4280237bf9f56d0648e0c279b161c19a72b290c3dcaf21a0fcad1" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "saas-xiaoke-libreoffice-service-other-dev-6bcc5d8879-zvvgg_saas-xiaoke-other-dev" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.533393  340113 kuberuntime_gc.go:170] Failed to stop sandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.536454  340113 kuberuntime_gc.go:170] Failed to stop sandbox "8190a101707a17793e8cfd35485785a9610d9c524f7f041b7dced457e79268e5" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "push-netstat-htjvj_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.537102  340113 kuberuntime_gc.go:170] Failed to stop sandbox "7e7a27ecd60f42446fe5ac4709e444f125ac88d6810de9dfd71e5721fdad0d71" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "k8s-sla-wqm6b_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.537724  340113 kuberuntime_gc.go:170] Failed to stop sandbox "79dee3630c194240c2fcb631e5fa560fc89ba52820ef924092dd3ae980e85df3" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "kibana-k8s-844d67476b-djqbf_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.538371  340113 kuberuntime_gc.go:170] Failed to stop sandbox "4988eaaf02d8cd2164f81a94b264a7b6e03cf87cb0b3a76ae74679f1bd5d3e97" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "filebeat-applog-collection-rndhr_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.539003  340113 kuberuntime_gc.go:170] Failed to stop sandbox "0a917f395c84f42f6d060bee9bcbac403c396dceec88e7d4c9301493a7ad9233" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "filebeat-opslog-collection-c59hf_xxx-ops" network: kubenet needs a PodCIDR to tear down pods



# grep "decef236193c498235ab5efc33498d06abc34bea58ee7a68d1110228e4e59df2" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.135:decef236193c498235ab5efc33498d06abc34bea58ee7a68d1110228e4e59df2
[root@sh-saas-k8s1-node-dev-11 ~]# grep "d54dd8336cdffd70a9266c24ab3aeff70a4fd1ab902b58afbb925449aba5a2bd" /var/lib/cni/networks/kubenet/ -r
[root@sh-saas-k8s1-node-dev-11 ~]# grep "cf15243eabb67c87bd93078e7a88b833996dda18d39c479363c62d0dd6ae391d" /var/lib/cni/networks/kubenet/ -r
[root@sh-saas-k8s1-node-dev-11 ~]# grep "a1c4b1a54172d325df761de068e1ccb37040bfd7c175539912fa60232eca9b5e" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.217:a1c4b1a54172d325df761de068e1ccb37040bfd7c175539912fa60232eca9b5e
[root@sh-saas-k8s1-node-dev-11 ~]# grep "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.130:950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f
[root@sh-saas-k8s1-node-dev-11 ~]# grep "8190a101707a17793e8cfd35485785a9610d9c524f7f041b7dced457e79268e5" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.134:8190a101707a17793e8cfd35485785a9610d9c524f7f041b7dced457e79268e5
[root@sh-saas-k8s1-node-dev-11 ~]# grep "7e7a27ecd60f42446fe5ac4709e444f125ac88d6810de9dfd71e5721fdad0d71" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.131:7e7a27ecd60f42446fe5ac4709e444f125ac88d6810de9dfd71e5721fdad0d71
[root@sh-saas-k8s1-node-dev-11 ~]# grep "79dee3630c194240c2fcb631e5fa560fc89ba52820ef924092dd3ae980e85df3" /var/lib/cni/networks/kubenet/ -r
[root@sh-saas-k8s1-node-dev-11 ~]# grep "4988eaaf02d8cd2164f81a94b264a7b6e03cf87cb0b3a76ae74679f1bd5d3e97" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.132:4988eaaf02d8cd2164f81a94b264a7b6e03cf87cb0b3a76ae74679f1bd5d3e97
[root@sh-saas-k8s1-node-dev-11 ~]# grep "0a917f395c84f42f6d060bee9bcbac403c396dceec88e7d4c9301493a7ad9233" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.235:0a917f395c84f42f6d060bee9bcbac403c396dceec88e7d4c9301493a7ad9233

Indeed, some of the allocated IP files contain container IDs that appear in these error logs.

Is this error what caused the IP leak?

The location kuberuntime_gc.go:170 tells us the error comes from kubelet's GC code.

By default kubelet runs image and container GC every minute. Container GC works off the list of files under /var/lib/dockershim/sandbox/ (the list of sandbox container IDs): it removes exited containers, containers not associated with any pod, containers whose pod is being deleted, and also removes sandboxes that no longer exist.
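
A quick way to look at that list (the path is the default dockershim checkpoint directory; the sandbox ID below is a placeholder):

shell

# each file under /var/lib/dockershim/sandbox/ is a checkpoint named after a sandbox container ID
ls /var/lib/dockershim/sandbox/
# a checkpoint is a small JSON blob recording the sandbox's pod name, namespace and network settings
cat /var/lib/dockershim/sandbox/<sandbox-container-id>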

Why do the sandbox containers in these logs no longer exist?

With the old Docker version, containerd ran as a child of dockerd; in the new version containerd runs standalone, and Docker has live-restore enabled. The running containers therefore had to be stopped before the upgrade, otherwise they would become orphan containers that could no longer be managed afterwards. Here all containers were stopped and deleted; stopping them alone would not be enough, because containerd's working directory changes after the upgrade and docker ps would no longer show containers stopped before the upgrade.
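
To check whether live-restore is enabled on a node, docker info exposes a LiveRestoreEnabled field (shown here only as a quick sanity check, not part of the original troubleshooting):

shell

# prints true if live-restore is enabled in the docker daemon configuration
docker info --format '{{.LiveRestoreEnabled}}'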

Since these sandboxes were all stopped and removed manually with docker commands, the sandbox containers no longer exist.

kubelet's code for removing a sandbox that no longer exists:

pkg/kubelet/kuberuntime/kuberuntime_gc.go

go

// removeSandbox removes the sandbox by sandboxID.
func (cgc *containerGC) removeSandbox(sandboxID string) {
	klog.V(4).Infof("Removing sandbox %q", sandboxID)
	// In normal cases, kubelet should've already called StopPodSandbox before
	// GC kicks in. To guard against the rare cases where this is not true, try
	// stopping the sandbox before removing it.
	if err := cgc.client.StopPodSandbox(sandboxID); err != nil {
		klog.Errorf("Failed to stop sandbox %q before removing: %v", sandboxID, err)
		return
	}
	if err := cgc.client.RemovePodSandbox(sandboxID); err != nil {
		klog.Errorf("Failed to remove sandbox %q: %v", sandboxID, err)
	}
}

cgc.client.StopPodSandbox calls the CRI StopPodSandbox API. Our CRI is dockershim, and when dockershim stops a sandbox it also calls the CNI plugin to release the IP.

pkg/kubelet/dockershim/docker_sandbox.go

go

// StopPodSandbox stops the sandbox. If there are any running containers in the
// sandbox, they should be force terminated.
// TODO: This function blocks sandbox teardown on networking teardown. Is it
// better to cut our losses assuming an out of band GC routine will cleanup
// after us?
func (ds *dockerService) StopPodSandbox(ctx context.Context, r *runtimeapi.StopPodSandboxRequest) (*runtimeapi.StopPodSandboxResponse, error) {
	.......
	// WARNING: The following operations made the following assumption:
	// 1. kubelet will retry on any error returned by StopPodSandbox.
	// 2. tearing down network and stopping sandbox container can succeed in any sequence.
	// This depends on the implementation detail of network plugin and proper error handling.
	// For kubenet, if tearing down network failed and sandbox container is stopped, kubelet
	// will retry. On retry, kubenet will not be able to retrieve network namespace of the sandbox
	// since it is stopped. With empty network namespace, CNI bridge plugin will conduct best
	// effort clean up and will not return error.
	errList := []error{}
	ready, ok := ds.getNetworkReady(podSandboxID)
	if !hostNetwork && (ready || !ok) {
		// Only tear down the pod network if we haven't done so already
		cID := kubecontainer.BuildContainerID(runtimeName, podSandboxID)
		// this is where the CNI plugin is called to release the IP
		err := ds.network.TearDownPod(namespace, name, cID)
		if err == nil {
			ds.setNetworkReady(podSandboxID, false)
		} else {
			errList = append(errList, err)
		}
	}
	if err := ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod); err != nil {
		// Do not return error if the container does not exist
		if !libdocker.IsContainerNotFoundError(err) {
			klog.Errorf("Failed to stop sandbox %q: %v", podSandboxID, err)
			errList = append(errList, err)
		} else {
			// remove the checkpoint for any sandbox that is not found in the runtime
			ds.checkpointManager.RemoveCheckpoint(podSandboxID)
		}
	}

Before tearing a pod down, kubenet checks whether netConfig has been initialized, i.e. whether dockershim has called UpdateRuntimeConfig with the podCIDR:

pkg/kubelet/dockershim/network/kubenet/kubenet_linux.go

go

func (plugin *kubenetNetworkPlugin) TearDownPod(namespace string, name string, id kubecontainer.ContainerID) error {
	start := time.Now()
	defer func() {
		klog.V(4).Infof("TearDownPod took %v for %s/%s", time.Since(start), namespace, name)
	}()

	if plugin.netConfig == nil {
		return fmt.Errorf("kubenet needs a PodCIDR to tear down pods")
	}

	if err := plugin.teardown(namespace, name, id); err != nil {
		return err
	}

	// Need to SNAT outbound traffic from cluster
	if err := plugin.ensureMasqRule(); err != nil {
		klog.Errorf("Failed to ensure MASQ rule: %v", err)
	}
	return nil
}

GC runs in a loop, so one would expect that once kubenet finishes initializing, a later GC pass would reclaim the IPs it skipped while kubenet was uninitialized. Yet even long afterwards, the IP files were still there.

Why are these IPs never reclaimed?

Look at dockershim's StopPodSandbox again. When ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod) is called, Docker returns a container-not-found error, and dockershim then runs ds.checkpointManager.RemoveCheckpoint(podSandboxID), which deletes /var/lib/dockershim/sandbox/{container id}. In other words, the checkpoint file is removed regardless of whether the network teardown succeeded. Subsequent GC passes therefore no longer see this sandbox and never retry the teardown, so the IP is never reclaimed.

go

	ready, ok := ds.getNetworkReady(podSandboxID)
	if !hostNetwork && (ready || !ok) {
		// Only tear down the pod network if we haven't done so already
		cID := kubecontainer.BuildContainerID(runtimeName, podSandboxID)
		// this is where the CNI plugin is called to release the IP
		err := ds.network.TearDownPod(namespace, name, cID)
		if err == nil {
			ds.setNetworkReady(podSandboxID, false)
		} else {
			errList = append(errList, err)
		}
	}
	// since the container no longer exists, this returns ContainerNotFoundError
	if err := ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod); err != nil {
		// Do not return error if the container does not exist
		if !libdocker.IsContainerNotFoundError(err) {
			klog.Errorf("Failed to stop sandbox %q: %v", podSandboxID, err)
			errList = append(errList, err)
		} else {
			// remove the checkpoint for any sandbox that is not found in the runtime
			ds.checkpointManager.RemoveCheckpoint(podSandboxID)
		}
	}

Some container IDs from the error logs have no corresponding IP file even though those pods are not using host networking. Why? Possibly a later GC pass reclaimed them successfully; the exact reason has not been found yet.

Manually reclaiming the leaked IPs

shell

# requires kubectl on the node
ips=$(kubectl get pod -A -o custom-columns=:status.podIP --no-headers  --field-selector=spec.nodeName=10.11.96.29 |grep -v 10.11.96.29|grep -v "<none>"); \
alloc_ips=$(ls /var/lib/cni/networks/kubenet/ |grep -v -E "last_reserved_ip.0|lock" ); \
for ip in $alloc_ips;do \
    if ! echo $ips | grep "$ip" &>/dev/null;then \
    echo $ip; \
    docker_id=$(head -n 1 /var/lib/cni/networks/kubenet/$ip |sed 's/\r//'); \
    if [ -n "$docker_id" ];then \
        rm -f /var/lib/cni/cache/results/kubenet-${docker_id}-eth0 /var/lib/cni/cache/results/kubenet-loopback-${docker_id}-lo; \
    fi; \
    rm -f /var/lib/cni/networks/kubenet/$ip; \
    fi; \
done

# recommended
all_containers=$(docker ps -a -q);for ip in $(ls /var/lib/cni/networks/kubenet/ |grep -v -E "last_reserved_ip.0|lock");do docker_id=$(head -n 1 /var/lib/cni/networks/kubenet/$ip| sed 's/\r//');  if [ -z "${docker_id}" -o -z "$(echo ${all_containers} | grep "${docker_id:0:8}")" ];then echo $ip; if [ -n "${docker_id}" ];then rm -f /var/lib/cni/cache/results/kubenet-${docker_id}-eth0 /var/lib/cni/cache/results/kubenet-loopback-${docker_id}-lo;fi; rm -f /var/lib/cni/networks/kubenet/$ip; fi;done

From the kubelet side

The cleanest fix would obviously be for kubelet to wait until the CNI plugin has received the podCIDR before it starts stopping containers.

Alternatively, dockershim could return when stopping the sandbox goes wrong and not delete the checkpoint file, so that a later GC pass could still reclaim the IP.

From the workaround side

After manually stopping and deleting the containers, also delete the /var/lib/cni/cache/results and /var/lib/cni/networks/kubenet directories to forcibly release all IPs; see the sketch below.

There is no need to worry about this operation: the CNI plugin recreates these directories when it initializes.
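
A minimal sketch of that forced cleanup, assuming kubelet is already stopped and all containers on the node have been stopped and removed:

shell

# with kubelet stopped and every container removed, wipe host-local's allocation state;
# kubenet recreates these directories the next time it allocates an IP
rm -rf /var/lib/cni/networks/kubenet /var/lib/cni/cache/results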

Auxiliary measures

Write a controller, deployed as a DaemonSet, that compares the allocated IPs with the IPs actually in use on each node and reclaims any that have leaked. This races with kubelet's own CNI calls, though, so the edge cases have to be thought through.

To summarize: the kubenet IP leak happens because, at kubelet startup, kubenet initialization and GC run concurrently; if GC runs before the kubenet plugin is initialized, kubenet cannot release the IPs. And because dockershim deletes the checkpoint file after stopping the sandbox container regardless of whether the teardown succeeded, later GC passes never revisit that sandbox, so the IP file stays around forever.

Trigger scenario

The Docker upgrade procedure: first evict the pods on the node, then stop kubelet, then stop and delete all Docker containers, then upgrade Docker; a rough sketch follows below.

Draining the node does not evict DaemonSet pods, so all remaining containers on the node have to be stopped manually.

Stopping containers with docker directly does not go through the CNI plugin to release their IPs, nor does it delete the dockershim checkpoint files.
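
A rough sketch of that procedure as commands (the node name comes from this environment; the actual Docker upgrade step depends on your package manager and is only indicated as a comment):

shell

# 1. evict what can be evicted (DaemonSet pods stay behind)
kubectl drain 10.12.97.31 --ignore-daemonsets --delete-local-data
# 2. stop kubelet so it does not try to manage containers during the upgrade
systemctl stop kubelet
# 3. stop and delete every remaining container, including DaemonSet pods and their sandboxes
docker stop $(docker ps -q)
docker rm $(docker ps -a -q)
# 4. upgrade the docker packages here (yum/apt/...), then bring the services back
systemctl restart docker kubelet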

It is conceivable that a node going down could trigger a similar situation; this has not been tested.

Related issues

There is some discussion of "network plugin is not ready: kubenet does not have netConfig. This is most likely due to lack of PodCIDR" at https://github.com/kubernetes/kubernetes/issues/32900#issuecomment-249086286. Roughly, upstream's position is that it is not worth reordering things just for kubenet, since other CNI plugins do not depend on kubelet setting the podCIDR.

Creating a pod and deleting it immediately can also trigger an IP leak, and no fix has landed for it:

https://github.com/kubernetes/kubernetes/issues/86944

IP leak in kubenet when Docker is restarted:

https://github.com/kubernetes/kubernetes/issues/34278

Should the scheduler filter nodes on the number of allocatable pod IPs:

https://github.com/kubernetes/kubernetes/issues/21656

Further thoughts

The trigger condition requires running dockershim together with kubenet.

Since dockershim is being deprecated, there is essentially no chance upstream will fix this.

In the long run it is better to move to another container runtime such as containerd. But changing the container runtime also means changing the CNI plugin, because kubenet only supports Docker.
