kubenet IP Leak
Recently, after a Docker version upgrade, some pods remained in the Pending state because they could not obtain an IP address. Investigation showed that the upgrade had been performed incorrectly, causing kubenet to leak IPs until there were none left to allocate.
The cluster ran Kubernetes 1.18.8 with the network mode set to kubenet, a maximum of 125 pods per node, and a /25 pod CIDR per node.
1 Symptoms
Describing an affected pod showed it stuck in Pending, and its events showed that sandbox creation kept failing, with the sandbox being continuously created and destroyed:
Warning FailedCreatePodSandBox 3m20s (x30245 over 9h) kubelet, 10.12.97.31 (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "db90a3a26c158a70e4d251336fa62f9f32f7b0643a6ad23d52cdfea5e96c3412" network for pod "saas-o2o-public-notification-task-tomcat-dev-7c76c6984d-hjn4l": networkPlugin kubenet failed to set up pod "saas-o2o-public-notification-task-tomcat-dev-7c76c6984d-hjn4l_saas-o2o-public-tomcat-dev" network: error adding container to network: failed to allocate for range 0: no IP addresses available in range set: 10.253.6.129-10.253.6.254
Based on the above information, it appeared that the IP addresses on the node had been exhausted.
To investigate further, the number of pod IPs actually in use on the node was counted:
# kubectl get pod -A -o custom-columns=:status.podIP --no-headers --field-selector=spec.nodeName=10.12.97.31 |grep -v 10.12.97.31|grep -v "<none>" |sort |wc -l
118
So the IP addresses were not actually exhausted by running pods: a /25 contains 2^(32-25) = 128 addresses, of which three (the network address, the gateway, and the broadcast address) cannot be assigned, leaving 125 usable addresses, yet only 118 were in use.
To troubleshoot this issue, it was necessary to understand how kubenet allocated IP addresses.
2 How Kubenet Allocates IPs
Kubenet is a special network plugin built into the kubelet, and it only supports the Docker runtime. Internally it wraps the bridge, loopback, host-local, and portmap CNI plugins into a single plugin.
Kubenet uses the host-local plugin for IP allocation. host-local creates files under /var/lib/cni/networks/kubenet whose filenames are the allocated IP addresses, alongside two bookkeeping files, last_reserved_ip.0 and lock.
- Each IP file records the allocation: the container ID and the network interface name.
- last_reserved_ip.0 stores the last allocated IP; the next allocation starts from this IP.
- The lock file avoids conflicts when IPs are allocated to multiple containers at the same time.
For each container, the CNI plugin also creates two files in /var/lib/cni/cache/results/, with filenames like kubenet-{container-id}-eth0 and kubenet-loopback-{container-id}-lo. These files record the IP configuration of the container's network interfaces.
For example, if IP address 10.253.9.21 was allocated:
# cat /var/lib/cni/networks/kubenet/10.253.9.21
1cfead4f783bbd3928acd6b23e283e9d434443df74743fe65206b569d088a48e
eth0
# cat /var/lib/cni/cache/results/kubenet-1cfead4f783bbd3928acd6b23e283e9d434443df74743fe65206b569d088a48e-eth0
{"cniVersion":"0.2.0","ip4":{"ip":"10.253.9.21/25","gateway":"10.253.9.1","routes":[{"dst":"0.0.0.0/0"}]},"dns":{}}
# cat /var/lib/cni/cache/results/kubenet-loopback-1cfead4f783bbd3928acd6b23e283e9d434443df74743fe65206b569d088a48e-lo
{"cniVersion":"0.2.0","ip4":{"ip":"127.0.0.1/8"},"dns":{}}
With that understanding of kubenet, the next step was to check which IP addresses had been allocated on the node:
# ls /var/lib/cni/networks/kubenet/
10.253.6.130 10.253.6.141 10.253.6.152 10.253.6.163 10.253.6.174 10.253.6.185 10.253.6.196 10.253.6.207 10.253.6.218 10.253.6.229 10.253.6.240 10.253.6.251
10.253.6.131 10.253.6.142 10.253.6.153 10.253.6.164 10.253.6.175 10.253.6.186 10.253.6.197 10.253.6.208 10.253.6.219 10.253.6.230 10.253.6.241 10.253.6.252
10.253.6.132 10.253.6.143 10.253.6.154 10.253.6.165 10.253.6.176 10.253.6.187 10.253.6.198 10.253.6.209 10.253.6.220 10.253.6.231 10.253.6.242 10.253.6.253
10.253.6.133 10.253.6.144 10.253.6.155 10.253.6.166 10.253.6.177 10.253.6.188 10.253.6.199 10.253.6.210 10.253.6.221 10.253.6.232 10.253.6.243 10.253.6.254
10.253.6.134 10.253.6.145 10.253.6.156 10.253.6.167 10.253.6.178 10.253.6.189 10.253.6.200 10.253.6.211 10.253.6.222 10.253.6.233 10.253.6.244 last_reserved_ip.0
10.253.6.135 10.253.6.146 10.253.6.157 10.253.6.168 10.253.6.179 10.253.6.190 10.253.6.201 10.253.6.212 10.253.6.223 10.253.6.234 10.253.6.245 lock
10.253.6.136 10.253.6.147 10.253.6.158 10.253.6.169 10.253.6.180 10.253.6.191 10.253.6.202 10.253.6.213 10.253.6.224 10.253.6.235 10.253.6.246
10.253.6.137 10.253.6.148 10.253.6.159 10.253.6.170 10.253.6.181 10.253.6.192 10.253.6.203 10.253.6.214 10.253.6.225 10.253.6.236 10.253.6.247
10.253.6.138 10.253.6.149 10.253.6.160 10.253.6.171 10.253.6.182 10.253.6.193 10.253.6.204 10.253.6.215 10.253.6.226 10.253.6.237 10.253.6.248
10.253.6.139 10.253.6.150 10.253.6.161 10.253.6.172 10.253.6.183 10.253.6.194 10.253.6.205 10.253.6.216 10.253.6.227 10.253.6.238 10.253.6.249
10.253.6.140 10.253.6.151 10.253.6.162 10.253.6.173 10.253.6.184 10.253.6.195 10.253.6.206 10.253.6.217 10.253.6.228 10.253.6.239 10.253.6.250
It turned out that all assignable IP addresses had indeed been allocated, but the number of IPs actually in use did not match the number allocated, meaning some allocated IPs had never been reclaimed.
To find the IP addresses that had been allocated but were no longer in use:
# Requires kubectl to be available on the node
ips=$(kubectl get pod -A -o custom-columns=:status.podIP --no-headers --field-selector=spec.nodeName=10.12.97.31 |grep -v 10.12.97.31|grep -v "<none>"); \
alloc_ips=$(ls /var/lib/cni/networks/kubenet/ |grep -v -E "last_reserved_ip.0|lock" ); \
for ip in $alloc_ips;do \
if ! echo $ips | grep "$ip" &>/dev/null;then \
echo $ip; \
fi; \
done
# Recommended method: check against Docker containers rather than pod IPs
all_containers=$(docker ps -a -q)
for ip in $(ls /var/lib/cni/networks/kubenet/ | grep -v -E "last_reserved_ip.0|lock"); do
    docker_id=$(head -n 1 /var/lib/cni/networks/kubenet/$ip | sed 's/\r//')
    if [ -z "${docker_id}" -o -z "$(echo ${all_containers} | grep "${docker_id:0:8}")" ]; then
        echo $ip
    fi
done
# Output
10.253.6.130
10.253.6.131
...
Based on the list of IPs found, it was confirmed that the containers associated with those IPs did not exist:
# cat /var/lib/cni/networks/kubenet/10.253.6.130
950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f
eth0
# docker ps -a | grep 950b9    # no output: the container no longer exists
3 Why Were the IP Addresses Not Reclaimed?
IP addresses are reclaimed by the kubelet when it tears down pods: during teardown, the kubelet invokes the CNI plugin to release the pod's IP. For an IP to leak, something must have failed at this stage.
Kubelet triggers pod destruction during garbage collection (GC), eviction, and pod deletion.
Checking the kubelet logs revealed error messages from right after kubelet startup. Two key messages were Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: kubenet does not have netConfig. This is most likely due to lack of PodCIDR, and StopPodSandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" from runtime service failed: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods.
These logs make it clear that when the kubelet had just started, the kubenet plugin had not yet finished initializing because it had not received the PodCIDR. The kubelet nevertheless began stopping existing pods on the node, and since kubenet needs the PodCIDR to tear down pods, those teardowns failed.
# These logs were printed when kubelet had just started
I0326 19:16:59.441752 340113 server.go:393] Adding debug handlers to kubelet server.
E0326 19:16:59.452246 340113 kubelet.go:2188] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: kubenet does not have netConfig. This is most likely due to lack of PodCIDR #kubelet detected that the kubenet network plugin had not completed initialization
I0326 19:16:59.452348 340113 clientconn.go:106] parsed scheme: "unix"
I0326 19:16:59.452354 340113 clientconn.go:106] scheme "unix" not registered, fallback to default scheme
I0326 19:16:59.452403 340113 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}
I0326 19:16:59.452408 340113 clientconn.go:933] ClientConn switching balancer to "pick_first"
I0326 19:16:59.452509 340113 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000dcfb80, {CONNECTING <nil>}
I0326 19:16:59.452644 340113 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000dcfb80, {READY <nil>}
I0326 19:16:59.453288 340113 factory.go:137] Registering containerd factory
I0326 19:16:59.455480 340113 kubelet_network_linux.go:150] Not using `--random-fully` in the MASQUERADE rule for iptables because the local version of iptables does not support it
I0326 19:16:59.457684 340113 status_manager.go:158] Starting to sync pod status with apiserver
I0326 19:16:59.457711 340113 kubelet.go:1822] Starting kubelet main sync loop.
E0326 19:16:59.457760 340113 kubelet.go:1846] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
I0326 19:16:59.457873 340113 reflector.go:175] Starting reflector *v1beta1.RuntimeClass (0s) from k8s.io/client-go/informers/factory.go:135
I0326 19:16:59.473898 340113 factory.go:356] Registering Docker factory
I0326 19:16:59.473909 340113 factory.go:54] Registering systemd factory
I0326 19:16:59.474102 340113 factory.go:101] Registering Raw factory
I0326 19:16:59.474274 340113 manager.go:1158] Started watching for new ooms in manager
I0326 19:16:59.475292 340113 manager.go:272] Starting recovery of all containers
I0326 19:16:59.486891 340113 manager.go:277] Recovery completed
E0326 19:16:59.533375 340113 remote_runtime.go:128] StopPodSandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" from runtime service failed: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.533393 340113 kuberuntime_gc.go:170] Failed to stop sandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods
Next, find all sandboxes that failed to be torn down and check whether they still appear in the allocated-IP files:
# grep "Failed to stop sandbox" /data/log/kubernetes/kubelet.INFO
E0326 19:16:59.527369 340113 kuberuntime_gc.go:170] Failed to stop sandbox "decef236193c498235ab5efc33498d06abc34bea58ee7a68d1110228e4e59df2" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "debug-agent-2r94g_debug" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.528089 340113 kuberuntime_gc.go:170] Failed to stop sandbox "d54dd8336cdffd70a9266c24ab3aeff70a4fd1ab902b58afbb925449aba5a2bd" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "saas-mcloud-companyserver-ops-python-dev-555d77d9c7-kd5q5_saas-mcloud-python-dev" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.528802 340113 kuberuntime_gc.go:170] Failed to stop sandbox "cf15243eabb67c87bd93078e7a88b833996dda18d39c479363c62d0dd6ae391d" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "saas-jcpt-open-message-producer-service-tomcat-dev-6cc7fd6n8knt_saas-jcpt-tomcat-dev" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.530857 340113 kuberuntime_gc.go:170] Failed to stop sandbox "a1c4b1a54172d325df761de068e1ccb37040bfd7c175539912fa60232eca9b5e" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "fluentd-cz8qq_kube-system" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.532710 340113 kuberuntime_gc.go:170] Failed to stop sandbox "9b5d8a1367e4280237bf9f56d0648e0c279b161c19a72b290c3dcaf21a0fcad1" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "saas-xiaoke-libreoffice-service-other-dev-6bcc5d8879-zvvgg_saas-xiaoke-other-dev" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.533393 340113 kuberuntime_gc.go:170] Failed to stop sandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.536454 340113 kuberuntime_gc.go:170] Failed to stop sandbox "8190a101707a17793e8cfd35485785a9610d9c524f7f041b7dced457e79268e5" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "push-netstat-htjvj_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.537102 340113 kuberuntime_gc.go:170] Failed to stop sandbox "7e7a27ecd60f42446fe5ac4709e444f125ac88d6810de9dfd71e5721fdad0d71" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "k8s-sla-wqm6b_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.537724 340113 kuberuntime_gc.go:170] Failed to stop sandbox "79dee3630c194240c2fcb631e5fa560fc89ba52820ef924092dd3ae980e85df3" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "kibana-k8s-844d67476b-djqbf_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.538371 340113 kuberuntime_gc.go:170] Failed to stop sandbox "4988eaaf02d8cd2164f81a94b264a7b6e03cf87cb0b3a76ae74679f1bd5d3e97" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "filebeat-applog-collection-rndhr_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.539003 340113 kuberuntime_gc.go:170] Failed to stop sandbox "0a917f395c84f42f6d060bee9bcbac403c396dceec88e7d4c9301493a7ad9233" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "filebeat-opslog-collection-c59hf_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
# grep "decef236193c498235ab5efc33498d06abc34bea58ee7a68d1110228e4e59df2" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.135:decef236193c498235ab5efc33498d06abc34bea58ee7a68d1110228e4e59df2
[root@sh-saas-k8s1-node-dev-11 ~]# grep "d54dd8336cdffd70a9266c24ab3aeff70a4fd1ab902b58afbb925449aba5a2bd" /var/lib/cni/networks/kubenet/ -r
[root@sh-saas-k8s1-node-dev-11 ~]# grep "cf15243eabb67c87bd93078e7a88b833996dda18d39c479363c62d0dd6ae391d" /var/lib/cni/networks/kubenet/ -r
[root@sh-saas-k8s1-node-dev-11 ~]# grep "a1c4b1a54172d325df761de068e1ccb37040bfd7c175539912fa60232eca9b5e" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.217:a1c4b1a54172d325df761de068e1ccb37040bfd7c175539912fa60232eca9b5e
[root@sh-saas-k8s1-node-dev-11 ~]# grep "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.130:950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f
[root@sh-saas-k8s1-node-dev-11 ~]# grep "8190a101707a17793e8cfd35485785a9610d9c524f7f041b7dced457e79268e5" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.134:8190a101707a17793e8cfd35485785a9610d9c524f7f041b7dced457e79268e5
[root@sh-saas-k8s1-node-dev-11 ~]# grep "7e7a27ecd60f42446fe5ac4709e444f125ac88d6810de9dfd71e5721fdad0d71" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.131:7e7a27ecd60f42446fe5ac4709e444f125ac88d6810de9dfd71e5721fdad0d71
[root@sh-saas-k8s1-node-dev-11 ~]# grep "79dee3630c194240c2fcb631e5fa560fc89ba52820ef924092dd3ae980e85df3" /var/lib/cni/networks/kubenet/ -r
[root@sh-saas-k8s1-node-dev-11 ~]# grep "4988eaaf02d8cd2164f81a94b264a7b6e03cf87cb0b3a76ae74679f1bd5d3e97" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.132:4988eaaf02d8cd2164f81a94b264a7b6e03cf87cb0b3a76ae74679f1bd5d3e97
[root@sh-saas-k8s1-node-dev-11 ~]# grep "0a917f395c84f42f6d060bee9bcbac403c396dceec88e7d4c9301493a7ad9233" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.235:0a917f395c84f42f6d060bee9bcbac403c396dceec88e7d4c9301493a7ad9233
Indeed, several of the container IDs from the teardown errors still appear in the allocated-IP files.
Is this error the cause of the IP leak? The source location kuberuntime_gc.go:170 shows that it comes from the kubelet's garbage collection (GC) code.
By default, the kubelet performs image and container garbage collection every minute. Container GC deletes containers that have already exited and containers that are no longer associated with any pod (or whose pod is being deleted), and it also removes sandboxes that no longer exist in the runtime, which are tracked as checkpoint files under /var/lib/dockershim/sandbox/.
Why do the sandbox containers in the logs not exist?
This is a consequence of the Docker upgrade. In the old Docker version, containerd ran as a child process of dockerd; in the new version, containerd runs independently, and Docker has live restore enabled. During the upgrade, all running containers therefore have to be stopped, otherwise they become orphaned and unmanageable afterwards. In our procedure, all containers were stopped and removed. (Stopping alone is not enough: after the upgrade, containerd's working directory changes, Docker can no longer locate the previously stopped containers, and docker ps would not show them.) Since all sandboxes were stopped and removed manually through Docker, the sandbox containers referenced in the logs no longer exist.
The kubelet removes these non-existent sandboxes with the following code:
pkg/kubelet/kuberuntime/kuberuntime_gc.go
// removeSandbox removes the sandbox by sandboxID.
func (cgc *containerGC) removeSandbox(sandboxID string) {
klog.V(4).Infof("Removing sandbox %q", sandboxID)
// In normal cases, kubelet should've already called StopPodSandbox before
// GC kicks in. To guard against the rare cases where this is not true, try
// stopping the sandbox before removing it.
if err := cgc.client.StopPodSandbox(sandboxID); err != nil {
klog.Errorf("Failed to stop sandbox %q before removing: %v", sandboxID, err)
return
}
if err := cgc.client.RemovePodSandbox(sandboxID); err != nil {
klog.Errorf("Failed to remove sandbox %q: %v", sandboxID, err)
}
}
cgc.client.StopPodSandbox calls the CRI StopPodSandbox API. In our case the runtime is dockershim, and stopping a sandbox in dockershim also triggers the CNI plugin to reclaim the IP.
pkg/kubelet/dockershim/docker_sandbox.go
// StopPodSandbox stops the sandbox. If there are any running containers in the
// sandbox, they should be force terminated.
// TODO: This function blocks sandbox teardown on networking teardown. Is it
// better to cut our losses assuming an out of band GC routine will cleanup
// after us?
func (ds *dockerService) StopPodSandbox(ctx context.Context, r *runtimeapi.StopPodSandboxRequest) (*runtimeapi.StopPodSandboxResponse, error) {
.......
// WARNING: The following operations made the following assumption:
// 1. kubelet will retry on any error returned by StopPodSandbox.
// 2. tearing down network and stopping sandbox container can succeed in any sequence.
// This depends on the implementation detail of network plugin and proper error handling.
// For kubenet, if tearing down network failed and sandbox container is stopped, kubelet
// will retry. On retry, kubenet will not be able to retrieve network namespace of the sandbox
// since it is stopped. With empty network namespace, CNI bridge plugin will conduct best
// effort clean up and will not return error.
errList := []error{}
ready, ok := ds.getNetworkReady(podSandboxID)
if !hostNetwork && (ready || !ok) {
// Only tear down the pod network if we haven't done so already
cID := kubecontainer.BuildContainerID(runtimeName, podSandboxID)
// This is where the CNI plugin is called to reclaim the IP
err := ds.network.TearDownPod(namespace, name, cID)
if err == nil {
ds.setNetworkReady(podSandboxID, false)
} else {
errList = append(errList, err)
}
}
if err := ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod); err != nil {
// Do not return error if the container does not exist
if !libdocker.IsContainerNotFoundError(err) {
klog.Errorf("Failed to stop sandbox %q: %v", podSandboxID, err)
errList = append(errList, err)
} else {
// remove the checkpoint for any sandbox that is not found in the runtime
ds.checkpointManager.RemoveCheckpoint(podSandboxID)
}
}
Before kubenet performs the teardown, it checks whether netConfig has been initialized, i.e., whether dockershim has already called UpdateRuntimeConfig.
pkg/kubelet/dockershim/network/kubenet/kubenet_linux.go
func (plugin *kubenetNetworkPlugin) TearDownPod(namespace string, name string, id kubecontainer.ContainerID) error {
start := time.Now()
defer func() {
klog.V(4).Infof("TearDownPod took %v for %s/%s", time.Since(start), namespace, name)
}()
if plugin.netConfig == nil {
return fmt.Errorf("kubenet needs a PodCIDR to tear down pods")
}
if err := plugin.teardown(namespace, name, id); err != nil {
return err
}
// Need to SNAT outbound traffic from cluster
if err := plugin.ensureMasqRule(); err != nil {
klog.Errorf("Failed to ensure MASQ rule: %v", err)
}
return nil
}
The issue is a race: GC runs concurrently with kubelet initialization, so if GC stops a sandbox before kubenet has received the PodCIDR, the IP is not reclaimed. On top of that, dockershim deletes the sandbox checkpoint file even when the teardown fails, which prevents any subsequent GC run from reclaiming the IP.
Why is the IP never reclaimed by a later GC run?
Let's take a closer look at the StopPodSandbox code in dockershim. Because the sandbox container has already been removed, ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod) returns a "container not found" error, and in that branch dockershim calls ds.checkpointManager.RemoveCheckpoint(podSandboxID), which deletes the /var/lib/dockershim/sandbox/{containerID} file. This happens regardless of whether the earlier ds.network.TearDownPod call succeeded in reclaiming the IP. The missing checkpoint file effectively marks the sandbox as already handled, so subsequent GC runs never attempt to reclaim its IP.
ready, ok := ds.getNetworkReady(podSandboxID)
if !hostNetwork && (ready || !ok) {
// Only tear down the pod network if we haven't done so already
cID := kubecontainer.BuildContainerID(runtimeName, podSandboxID)
// This is where the CNI plugin is called to reclaim the IP
err := ds.network.TearDownPod(namespace, name, cID)
if err == nil {
ds.setNetworkReady(podSandboxID, false)
} else {
errList = append(errList, err)
}
}
// Since the container no longer exists, this call returns a ContainerNotFoundError
if err := ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod); err != nil {
// Do not return error if the container does not exist
if !libdocker.IsContainerNotFoundError(err) {
klog.Errorf("Failed to stop sandbox %q: %v", podSandboxID, err)
errList = append(errList, err)
} else {
// remove the checkpoint for any sandbox that is not found in the runtime
ds.checkpointManager.RemoveCheckpoint(podSandboxID)
}
}
One open question remains: some container IDs from the error logs have no corresponding IP file, even though their pods were not using host networking. Possibly a later GC pass managed to reclaim them, but the exact cause has not been confirmed.
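To observe this mechanism on a node, the allocations can be cross-checked against the dockershim checkpoints. The following is a minimal Go sketch assuming only the two directories described above (host-local allocation files whose first line is the container ID, and checkpoint files named by the sandbox container ID); an allocation whose checkpoint is already gone will never be retried by GC.
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

// Cross-checks kubenet's host-local allocation files against dockershim's
// sandbox checkpoints: an IP file whose sandbox checkpoint has already been
// removed will never be retried by kubelet GC, so it stays leaked.
func main() {
    allocDir := "/var/lib/cni/networks/kubenet"
    checkpointDir := "/var/lib/dockershim/sandbox"

    entries, err := os.ReadDir(allocDir)
    if err != nil {
        panic(err)
    }
    for _, e := range entries {
        ip := e.Name()
        if ip == "lock" || strings.HasPrefix(ip, "last_reserved_ip") {
            continue // bookkeeping files, not allocations
        }
        data, err := os.ReadFile(filepath.Join(allocDir, ip))
        if err != nil {
            continue
        }
        containerID := strings.TrimSpace(strings.SplitN(string(data), "\n", 2)[0])
        if containerID == "" {
            continue
        }
        // If the checkpoint file no longer exists, GC will not call
        // StopPodSandbox for this sandbox again.
        if _, err := os.Stat(filepath.Join(checkpointDir, containerID)); os.IsNotExist(err) {
            fmt.Printf("%s allocated to %s: checkpoint missing, IP will not be reclaimed by GC\n", ip, containerID)
        }
    }
}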
4 Solutions
4.1 Temporary Solution
Manually reclaim IP addresses with the following shell commands:
# Requires kubectl to be available on the node
ips=$(kubectl get pod -A -o custom-columns=:status.podIP --no-headers --field-selector=spec.nodeName=10.11.96.29 |grep -v 10.11.96.29|grep -v "<none>"); \
alloc_ips=$(ls /var/lib/cni/networks/kubenet/ |grep -v -E "last_reserved_ip.0|lock" ); \
for ip in $alloc_ips;do \
if ! echo $ips | grep "$ip" &>/dev/null;then \
echo $ip; \
docker_id=$(head -n 1 /var/lib/cni/networks/kubenet/$ip | sed 's/\r//'); \
if [ -n "$docker_id" ];then \
rm -f /var/lib/cni/cache/results/kubenet-${docker_id}-eth0 /var/lib/cni/cache/results/kubenet-loopback-${docker_id}-lo; \
fi; \
rm -f /var/lib/cni/networks/kubenet/$ip; \
fi; \
done
# Recommended method
all_containers=$(docker ps -a -q)
for ip in $(ls /var/lib/cni/networks/kubenet/ | grep -v -E "last_reserved_ip.0|lock"); do
    docker_id=$(head -n 1 /var/lib/cni/networks/kubenet/$ip | sed 's/\r//')
    if [ -z "${docker_id}" -o -z "$(echo ${all_containers} | grep "${docker_id:0:8}")" ]; then
        echo $ip
        if [ -n "${docker_id}" ]; then
            rm -f /var/lib/cni/cache/results/kubenet-${docker_id}-eth0 /var/lib/cni/cache/results/kubenet-loopback-${docker_id}-lo
        fi
        rm -f /var/lib/cni/networks/kubenet/$ip
    fi
done
4.2 Permanent Solution
From the Perspective of Kubelet:
The optimal solution is for Kubelet to wait until the CNI plugin has successfully set the PodCIDR before proceeding with container stop operations.
Alternatively, dockershim could be modified not to remove the checkpoint file when the network teardown has failed, so that a subsequent GC run could still reclaim the IP.
From a Mitigation Perspective:
After manually stopping and deleting containers, you can also delete the /var/lib/cni/cache/results and /var/lib/cni/networks/kubenet directories to forcibly release all IPs. This should not cause problems, as the CNI plugin regenerates these directories during initialization.
Additional Measures:
Consider writing a controller deployed as a DaemonSet. This controller can compare the allocated IPs on each node with the IPs in use. If any leaks are detected, it can manually reclaim them. However, be aware that this approach may introduce race conditions when competing with Kubelet for IP management, so it requires careful consideration of edge cases.
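A minimal sketch of that idea follows, assuming client-go, in-cluster credentials, the node name injected via the downward API, and host-path access to /var/lib/cni/networks/kubenet; it only reports leaks rather than deleting them, precisely because of the race conditions mentioned above.
package main

import (
    "context"
    "fmt"
    "os"
    "path/filepath"
    "strings"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// Sketch of the DaemonSet idea: every few minutes, compare the IPs kubenet
// has allocated on this node with the IPs of pods actually scheduled here,
// and report any leaked allocations.
func main() {
    nodeName := os.Getenv("NODE_NAME") // injected via the downward API

    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    for {
        inUse := map[string]bool{}
        pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
            FieldSelector: "spec.nodeName=" + nodeName,
        })
        if err == nil {
            for _, p := range pods.Items {
                if p.Status.PodIP != "" {
                    inUse[p.Status.PodIP] = true
                }
            }
            entries, _ := os.ReadDir("/var/lib/cni/networks/kubenet")
            for _, e := range entries {
                ip := e.Name()
                if ip == "lock" || strings.HasPrefix(ip, "last_reserved_ip") || inUse[ip] {
                    continue
                }
                // A likely leak: no pod on this node reports this IP.
                // A real controller should double-check against the container
                // runtime and guard against races with the kubelet before deleting.
                fmt.Printf("leaked IP %s (%s)\n", ip, filepath.Join("/var/lib/cni/networks/kubenet", ip))
            }
        }
        time.Sleep(5 * time.Minute)
    }
}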
5 Summary
The IP leak in kubenet network mode is caused by a race between kubelet startup, the initialization of the kubenet plugin, and garbage collection (GC): if GC tears down a sandbox before kubenet has been initialized with the PodCIDR, the IP is not reclaimed. Furthermore, when dockershim stops a sandbox whose container no longer exists, it removes the checkpoint file regardless of whether the network teardown succeeded. This prevents subsequent GC runs from attempting the teardown again, so the leaked IP files persist.
Trigger Scenarios
The issue is triggered by the Docker upgrade procedure: evict the pods from the node, stop the kubelet, manually stop and remove all Docker containers, then upgrade Docker. Eviction does not remove DaemonSet pods, so their containers have to be stopped manually, and stopping containers directly through Docker neither invokes the CNI plugin to reclaim their IPs nor removes the dockershim checkpoint files.
It’s worth noting that similar issues could occur if a node goes down, although this scenario has not been tested.
Related Issues
Some discussions related to the “network plugin is not ready: kubenet does not have netConfig. This is most likely due to the lack of PodCIDR” issue can be found here: GitHub Issue. In summary, the official stance is that there’s no need to wait for Kubenet and reorder operations because other CNI plugins do not rely on Kubelet setting the PodCIDR.
There have been reports of IP leaks when pods are created and deleted rapidly, but no fixes have been provided: GitHub Issue.
Another issue related to IP leaks in Kubenet occurs during Docker restarts: GitHub Issue.
There’s also a discussion about whether the scheduler should filter nodes based on the availability of assignable IPs: GitHub Issue.
Extensions
The triggering condition involves the use of DockerShim and the Kubenet network mode. Since DockerShim is being deprecated, it’s unlikely that the official Kubernetes project will fix this issue. In the long term, transitioning to a different container runtime like Containerd would require changing the CNI plugin, as Kubenet is specific to Docker.