Recently, after a Docker version upgrade, some pods remained stuck in the Pending state because they could not obtain an IP address. Investigation showed that the upgrade procedure had been performed incorrectly, causing an IP leak in kubenet and leaving no IPs available for allocation.
The Kubernetes version was 1.18.8, the network mode was kubenet, each node allowed a maximum of 125 pods, and each node had a /25 pod CIDR.
Describing an affected pod showed that it was Pending and that sandbox creation kept failing, with the sandbox being created and destroyed repeatedly:
```
Warning FailedCreatePodSandBox 3m20s (x30245 over 9h) kubelet, 10.12.97.31 (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "db90a3a26c158a70e4d251336fa62f9f32f7b0643a6ad23d52cdfea5e96c3412" network for pod "saas-o2o-public-notification-task-tomcat-dev-7c76c6984d-hjn4l": networkPlugin kubenet failed to set up pod "saas-o2o-public-notification-task-tomcat-dev-7c76c6984d-hjn4l_saas-o2o-public-tomcat-dev" network: error adding container to network: failed to allocate for range 0: no IP addresses available in range set: 10.253.6.129-10.253.6.254
```
Based on the above information, it appeared that the IP addresses on the node had been exhausted.
To investigate further, the number of allocated IPs on the node was checked:
```
# kubectl get pod -A -o custom-columns=:status.podIP --no-headers --field-selector=spec.nodeName=10.12.97.31 |grep -v 10.12.97.31|grep -v "<none>" |sort |wc -l
118
```
Only 118 IPs were actually in use, so the range should not have been exhausted: a /25 pod CIDR contains 2^(32-25) = 128 addresses, of which three cannot be assigned to pods (the network address, the gateway IP, and the broadcast address), leaving 125 usable addresses and matching the node's limit of 125 pods.
To troubleshoot this issue, it was necessary to understand how kubenet allocated IP addresses.
Kubenet is a simple network plugin built into the kubelet, and it works only with the Docker (dockershim) runtime. Internally it wraps the standard bridge, loopback, host-local, and portmap CNI plugins into a single built-in plugin.
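Once the kubelet passes kubenet the node's PodCIDR, kubenet builds a bridge/host-local network configuration internally. The block below is only an illustrative sketch of roughly what that configuration corresponds to for this node's /25 range; the exact template and field names differ across Kubernetes versions:

```json
{
  "cniVersion": "0.1.0",
  "name": "kubenet",
  "type": "bridge",
  "bridge": "cbr0",
  "isGateway": true,
  "ipam": {
    "type": "host-local",
    "ranges": [[{ "subnet": "10.253.6.128/25" }]],
    "routes": [{ "dst": "0.0.0.0/0" }]
  }
}
```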
Kubenet uses the host-local plugin for IP allocation. host-local creates files under `/var/lib/cni/networks/kubenet`, named after the allocated IP addresses, alongside two bookkeeping files, `last_reserved_ip.0` and `lock`:

- Each IP-named file records the allocation: the container ID and the network interface name.
- `last_reserved_ip.0` stores the last allocated IP; the next allocation starts searching from this IP.
- The `lock` file prevents conflicts when IPs are allocated to multiple containers concurrently.
The CNI machinery also creates two files per container under `/var/lib/cni/cache/results/`, named `kubenet-{container-id}-eth0` and `kubenet-loopback-{container-id}-lo`. These files record the IP configuration of the container's network interfaces.
For example, if IP address 10.253.9.21 was allocated:
```
# cat /var/lib/cni/networks/kubenet/10.253.9.21
1cfead4f783bbd3928acd6b23e283e9d434443df74743fe65206b569d088a48e
eth0
# cat /var/lib/cni/cache/results/kubenet-1cfead4f783bbd3928acd6b23e283e9d434443df74743fe65206b569d088a48e-eth0
{"cniVersion":"0.2.0","ip4":{"ip":"10.253.9.21/25","gateway":"10.253.9.1","routes":[{"dst":"0.0.0.0/0"}]},"dns":{}}
# cat /var/lib/cni/cache/results/kubenet-loopback-1cfead4f783bbd3928acd6b23e283e9d434443df74743fe65206b569d088a48e-lo
{"cniVersion":"0.2.0","ip4":{"ip":"127.0.0.1/8"},"dns":{}}
```
With kubenet's allocation mechanism understood, the next step was to check which IP addresses were recorded as allocated on the node:
```
# ls /var/lib/cni/networks/kubenet/
10.253.6.130  10.253.6.141  10.253.6.152  10.253.6.163  10.253.6.174  10.253.6.185  10.253.6.196  10.253.6.207  10.253.6.218  10.253.6.229  10.253.6.240  10.253.6.251
10.253.6.131  10.253.6.142  10.253.6.153  10.253.6.164  10.253.6.175  10.253.6.186  10.253.6.197  10.253.6.208  10.253.6.219  10.253.6.230  10.253.6.241  10.253.6.252
10.253.6.132  10.253.6.143  10.253.6.154  10.253.6.165  10.253.6.176  10.253.6.187  10.253.6.198  10.253.6.209  10.253.6.220  10.253.6.231  10.253.6.242  10.253.6.253
10.253.6.133  10.253.6.144  10.253.6.155  10.253.6.166  10.253.6.177  10.253.6.188  10.253.6.199  10.253.6.210  10.253.6.221  10.253.6.232  10.253.6.243  10.253.6.254
10.253.6.134  10.253.6.145  10.253.6.156  10.253.6.167  10.253.6.178  10.253.6.189  10.253.6.200  10.253.6.211  10.253.6.222  10.253.6.233  10.253.6.244  last_reserved_ip.0
10.253.6.135  10.253.6.146  10.253.6.157  10.253.6.168  10.253.6.179  10.253.6.190  10.253.6.201  10.253.6.212  10.253.6.223  10.253.6.234  10.253.6.245  lock
10.253.6.136  10.253.6.147  10.253.6.158  10.253.6.169  10.253.6.180  10.253.6.191  10.253.6.202  10.253.6.213  10.253.6.224  10.253.6.235  10.253.6.246
10.253.6.137  10.253.6.148  10.253.6.159  10.253.6.170  10.253.6.181  10.253.6.192  10.253.6.203  10.253.6.214  10.253.6.225  10.253.6.236  10.253.6.247
10.253.6.138  10.253.6.149  10.253.6.160  10.253.6.171  10.253.6.182  10.253.6.193  10.253.6.204  10.253.6.215  10.253.6.226  10.253.6.237  10.253.6.248
10.253.6.139  10.253.6.150  10.253.6.161  10.253.6.172  10.253.6.183  10.253.6.194  10.253.6.205  10.253.6.216  10.253.6.227  10.253.6.238  10.253.6.249
10.253.6.140  10.253.6.151  10.253.6.162  10.253.6.173  10.253.6.184  10.253.6.195  10.253.6.206  10.253.6.217  10.253.6.228  10.253.6.239  10.253.6.250
```
All available IP addresses had indeed been allocated, but the number of IPs actually in use (118) was lower than the number of allocation files, meaning some allocated IPs had never been reclaimed.
To find the IP addresses that had been allocated but were no longer in use:
```
# requires kubectl
ips=$(kubectl get pod -A -o custom-columns=:status.podIP --no-headers --field-selector=spec.nodeName=10.12.97.31 |grep -v 10.12.97.31|grep -v "<none>"); \
alloc_ips=$(ls /var/lib/cni/networks/kubenet/ |grep -v -E "last_reserved_ip.0|lock" ); \
for ip in $alloc_ips;do \
  if ! echo $ips | grep "$ip" &>/dev/null;then \
    echo $ip; \
  fi; \
done

# recommended method
all_containers=$(docker ps -a -q);for ip in $(ls /var/lib/cni/networks/kubenet/ |grep -v -E "last_reserved_ip.0|lock");do docker_id=$(head -n 1 /var/lib/cni/networks/kubenet/$ip| sed 's/\r//'); if [ -z "${docker_id}" -o -z "$(echo ${all_containers} | grep "${docker_id:0:8}")" ];then echo $ip;fi;done

# output
10.253.6.130
10.253.6.131
...
```
Based on the list of IPs found, it was confirmed that the containers associated with those IPs did not exist:
```
# cat /var/lib/cni/networks/kubenet/10.253.6.130
950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f
eth0
# docker ps -a |grep 950b9
```
IP addresses are reclaimed by the kubelet when it destroys pods: on pod destruction, the kubelet invokes the CNI plugin to release the IP. An IP leak therefore means this step failed.
Kubelet triggers pod destruction during garbage collection (GC), eviction, and pod deletion.
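Before digging into the logs, it helps to see what the release step amounts to on disk: the host-local IPAM walks its allocation directory and deletes the IP files whose contents match the container being torn down. The following standalone Go snippet is a simplified sketch of that behavior under the directory layout shown above (it is an illustration, not the actual plugin code); if teardown is never invoked, these files simply stay behind and the IP leaks.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// releaseByContainerID mimics what the host-local IPAM plugin does on release:
// walk the allocation directory and remove every IP file whose first line
// matches the container ID being torn down.
func releaseByContainerID(dir, containerID string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		name := e.Name()
		// Skip the bookkeeping files kept alongside the IP files.
		if e.IsDir() || name == "lock" || strings.HasPrefix(name, "last_reserved_ip") {
			continue
		}
		path := filepath.Join(dir, name)
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		firstLine := strings.Split(strings.TrimSpace(string(data)), "\n")[0]
		if strings.TrimSpace(firstLine) == containerID {
			fmt.Printf("releasing %s (container %s)\n", name, containerID)
			if err := os.Remove(path); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	// Hypothetical usage: release whatever IP file belongs to this sandbox ID.
	_ = releaseByContainerID("/var/lib/cni/networks/kubenet",
		"950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f")
}
```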
Checking the kubelet logs revealed some error messages at the very start. Two key error messages were `Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: kubenet does not have netConfig. This is most likely due to lack of PodCIDR` and `StopPodSandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" from runtime service failed: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods`.
From these logs it was clear that, while the kubelet was starting up, kubenet had not finished initializing because it did not yet have the PodCIDR. The kubelet nevertheless began stopping existing pods on the node, and since kubenet needs the PodCIDR to tear down a pod, those teardowns failed.
```
# These logs were printed when kubelet had just started
I0326 19:16:59.441752  340113 server.go:393] Adding debug handlers to kubelet server.
E0326 19:16:59.452246  340113 kubelet.go:2188] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: kubenet does not have netConfig. This is most likely due to lack of PodCIDR #kubelet detected that the kubenet network plugin had not completed initialization
I0326 19:16:59.452348  340113 clientconn.go:106] parsed scheme: "unix"
I0326 19:16:59.452354  340113 clientconn.go:106] scheme "unix" not registered, fallback to default scheme
I0326 19:16:59.452403  340113 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}
I0326 19:16:59.452408  340113 clientconn.go:933] ClientConn switching balancer to "pick_first"
I0326 19:16:59.452509  340113 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000dcfb80, {CONNECTING <nil>}
I0326 19:16:59.452644  340113 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000dcfb80, {READY <nil>}
I0326 19:16:59.453288  340113 factory.go:137] Registering containerd factory
I0326 19:16:59.455480  340113 kubelet_network_linux.go:150] Not using `--random-fully` in the MASQUERADE rule for iptables because the local version of iptables does not support it
I0326 19:16:59.457684  340113 status_manager.go:158] Starting to sync pod status with apiserver
I0326 19:16:59.457711  340113 kubelet.go:1822] Starting kubelet main sync loop.
E0326 19:16:59.457760  340113 kubelet.go:1846] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
I0326 19:16:59.457873  340113 reflector.go:175] Starting reflector *v1beta1.RuntimeClass (0s) from k8s.io/client-go/informers/factory.go:135
I0326 19:16:59.473898  340113 factory.go:356] Registering Docker factory
I0326 19:16:59.473909  340113 factory.go:54] Registering systemd factory
I0326 19:16:59.474102  340113 factory.go:101] Registering Raw factory
I0326 19:16:59.474274  340113 manager.go:1158] Started watching for new ooms in manager
I0326 19:16:59.475292  340113 manager.go:272] Starting recovery of all containers
I0326 19:16:59.486891  340113 manager.go:277] Recovery completed
E0326 19:16:59.533375  340113 remote_runtime.go:128] StopPodSandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" from runtime service failed: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.533393  340113 kuberuntime_gc.go:170] Failed to stop sandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods
```
Next, find all sandboxes that failed to be torn down and check whether their container IDs appear in the allocated IP files:
```
# grep "Failed to stop sandbox" /data/log/kubernetes/kubelet.INFO
E0326 19:16:59.527369  340113 kuberuntime_gc.go:170] Failed to stop sandbox "decef236193c498235ab5efc33498d06abc34bea58ee7a68d1110228e4e59df2" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "debug-agent-2r94g_debug" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.528089  340113 kuberuntime_gc.go:170] Failed to stop sandbox "d54dd8336cdffd70a9266c24ab3aeff70a4fd1ab902b58afbb925449aba5a2bd" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "saas-mcloud-companyserver-ops-python-dev-555d77d9c7-kd5q5_saas-mcloud-python-dev" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.528802  340113 kuberuntime_gc.go:170] Failed to stop sandbox "cf15243eabb67c87bd93078e7a88b833996dda18d39c479363c62d0dd6ae391d" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "saas-jcpt-open-message-producer-service-tomcat-dev-6cc7fd6n8knt_saas-jcpt-tomcat-dev" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.530857  340113 kuberuntime_gc.go:170] Failed to stop sandbox "a1c4b1a54172d325df761de068e1ccb37040bfd7c175539912fa60232eca9b5e" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "fluentd-cz8qq_kube-system" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.532710  340113 kuberuntime_gc.go:170] Failed to stop sandbox "9b5d8a1367e4280237bf9f56d0648e0c279b161c19a72b290c3dcaf21a0fcad1" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "saas-xiaoke-libreoffice-service-other-dev-6bcc5d8879-zvvgg_saas-xiaoke-other-dev" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.533393  340113 kuberuntime_gc.go:170] Failed to stop sandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.536454  340113 kuberuntime_gc.go:170] Failed to stop sandbox "8190a101707a17793e8cfd35485785a9610d9c524f7f041b7dced457e79268e5" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "push-netstat-htjvj_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.537102  340113 kuberuntime_gc.go:170] Failed to stop sandbox "7e7a27ecd60f42446fe5ac4709e444f125ac88d6810de9dfd71e5721fdad0d71" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "k8s-sla-wqm6b_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.537724  340113 kuberuntime_gc.go:170] Failed to stop sandbox "79dee3630c194240c2fcb631e5fa560fc89ba52820ef924092dd3ae980e85df3" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "kibana-k8s-844d67476b-djqbf_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.538371  340113 kuberuntime_gc.go:170] Failed to stop sandbox "4988eaaf02d8cd2164f81a94b264a7b6e03cf87cb0b3a76ae74679f1bd5d3e97" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "filebeat-applog-collection-rndhr_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.539003  340113 kuberuntime_gc.go:170] Failed to stop sandbox "0a917f395c84f42f6d060bee9bcbac403c396dceec88e7d4c9301493a7ad9233" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "filebeat-opslog-collection-c59hf_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
# grep "decef236193c498235ab5efc33498d06abc34bea58ee7a68d1110228e4e59df2" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.135:decef236193c498235ab5efc33498d06abc34bea58ee7a68d1110228e4e59df2
[root@sh-saas-k8s1-node-dev-11 ~]# grep "d54dd8336cdffd70a9266c24ab3aeff70a4fd1ab902b58afbb925449aba5a2bd" /var/lib/cni/networks/kubenet/ -r
[root@sh-saas-k8s1-node-dev-11 ~]# grep "cf15243eabb67c87bd93078e7a88b833996dda18d39c479363c62d0dd6ae391d" /var/lib/cni/networks/kubenet/ -r
[root@sh-saas-k8s1-node-dev-11 ~]# grep "a1c4b1a54172d325df761de068e1ccb37040bfd7c175539912fa60232eca9b5e" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.217:a1c4b1a54172d325df761de068e1ccb37040bfd7c175539912fa60232eca9b5e
[root@sh-saas-k8s1-node-dev-11 ~]# grep "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.130:950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f
[root@sh-saas-k8s1-node-dev-11 ~]# grep "8190a101707a17793e8cfd35485785a9610d9c524f7f041b7dced457e79268e5" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.134:8190a101707a17793e8cfd35485785a9610d9c524f7f041b7dced457e79268e5
[root@sh-saas-k8s1-node-dev-11 ~]# grep "7e7a27ecd60f42446fe5ac4709e444f125ac88d6810de9dfd71e5721fdad0d71" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.131:7e7a27ecd60f42446fe5ac4709e444f125ac88d6810de9dfd71e5721fdad0d71
[root@sh-saas-k8s1-node-dev-11 ~]# grep "79dee3630c194240c2fcb631e5fa560fc89ba52820ef924092dd3ae980e85df3" /var/lib/cni/networks/kubenet/ -r
[root@sh-saas-k8s1-node-dev-11 ~]# grep "4988eaaf02d8cd2164f81a94b264a7b6e03cf87cb0b3a76ae74679f1bd5d3e97" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.132:4988eaaf02d8cd2164f81a94b264a7b6e03cf87cb0b3a76ae74679f1bd5d3e97
[root@sh-saas-k8s1-node-dev-11 ~]# grep "0a917f395c84f42f6d060bee9bcbac403c396dceec88e7d4c9301493a7ad9233" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.235:0a917f395c84f42f6d060bee9bcbac403c396dceec88e7d4c9301493a7ad9233
```
Several of the container IDs from the teardown-failure logs do indeed appear in the allocated IP files.
Is this error causing IP leaks?
Judging from the log location `kuberuntime_gc.go:170`, this error comes from the garbage collection (GC) code.
By default, the kubelet performs container garbage collection every minute and image garbage collection every five minutes. Container GC deletes containers that have already exited, containers not associated with any pod, and containers belonging to pods that are being deleted; it also removes sandboxes that no longer exist in the runtime, which are tracked by the checkpoint files under `/var/lib/dockershim/sandbox/`.
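To see which sandboxes GC will keep trying to clean up, the dockershim checkpoint directory can be inspected directly; each file there is named after a sandbox ID. The snippet below is only a minimal illustrative sketch, not kubelet code:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Each checkpoint file under this directory is named after a sandbox ID.
	// As long as a checkpoint exists, kubelet GC keeps trying to stop and
	// remove that sandbox, which normally also triggers the CNI teardown.
	const checkpointDir = "/var/lib/dockershim/sandbox"
	entries, err := os.ReadDir(checkpointDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, e := range entries {
		fmt.Println(e.Name())
	}
}
```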
Why do the sandbox containers in the logs not exist?
In the old Docker version, containerd ran as a child process of dockerd; in the new version, containerd runs as an independent service, and Docker has live restore enabled. Because of this, all running containers had to be stopped before the upgrade, otherwise they would become orphaned and unmanageable afterwards. In our upgrade procedure, all containers were therefore stopped and removed. (Stopping alone is not enough: after the upgrade, containerd's working directory changes, so Docker can no longer locate containers stopped before the upgrade and `docker ps` would not show them.) Since all sandboxes were manually stopped and removed through Docker, those sandbox containers no longer exist.
The kubelet cleans up non-existent sandboxes using this code:
`pkg/kubelet/kuberuntime/kuberuntime_gc.go`
```go
// removeSandbox removes the sandbox by sandboxID.
func (cgc *containerGC) removeSandbox(sandboxID string) {
	klog.V(4).Infof("Removing sandbox %q", sandboxID)
	// In normal cases, kubelet should've already called StopPodSandbox before
	// GC kicks in. To guard against the rare cases where this is not true, try
	// stopping the sandbox before removing it.
	if err := cgc.client.StopPodSandbox(sandboxID); err != nil {
		klog.Errorf("Failed to stop sandbox %q before removing: %v", sandboxID, err)
		return
	}
	if err := cgc.client.RemovePodSandbox(sandboxID); err != nil {
		klog.Errorf("Failed to remove sandbox %q: %v", sandboxID, err)
	}
}
```
`cgc.client.StopPodSandbox` calls the CRI `StopPodSandbox` API. In our case the runtime is dockershim, and stopping a sandbox in dockershim also invokes the CNI plugin to reclaim the IP.
`pkg/kubelet/dockershim/docker_sandbox.go`
```go
// StopPodSandbox stops the sandbox. If there are any running containers in the
// sandbox, they should be force terminated.
// TODO: This function blocks sandbox teardown on networking teardown. Is it
// better to cut our losses assuming an out of band GC routine will cleanup
// after us?
func (ds *dockerService) StopPodSandbox(ctx context.Context, r *runtimeapi.StopPodSandboxRequest) (*runtimeapi.StopPodSandboxResponse, error) {
	.......
	// WARNING: The following operations made the following assumption:
	// 1. kubelet will retry on any error returned by StopPodSandbox.
	// 2. tearing down network and stopping sandbox container can succeed in any sequence.
	// This depends on the implementation detail of network plugin and proper error handling.
	// For kubenet, if tearing down network failed and sandbox container is stopped, kubelet
	// will retry. On retry, kubenet will not be able to retrieve network namespace of the sandbox
	// since it is stopped. With empty network namespace, CNI bridge plugin will conduct best
	// effort clean up and will not return error.
	errList := []error{}
	ready, ok := ds.getNetworkReady(podSandboxID)
	if !hostNetwork && (ready || !ok) {
		// Only tear down the pod network if we haven't done so already
		cID := kubecontainer.BuildContainerID(runtimeName, podSandboxID)
		// This is where the CNI plugin is called to reclaim the IP
		err := ds.network.TearDownPod(namespace, name, cID)
		if err == nil {
			ds.setNetworkReady(podSandboxID, false)
		} else {
			errList = append(errList, err)
		}
	}
	if err := ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod); err != nil {
		// Do not return error if the container does not exist
		if !libdocker.IsContainerNotFoundError(err) {
			klog.Errorf("Failed to stop sandbox %q: %v", podSandboxID, err)
			errList = append(errList, err)
		} else {
			// remove the checkpoint for any sandbox that is not found in the runtime
			ds.checkpointManager.RemoveCheckpoint(podSandboxID)
		}
	}
```
Before kubenet performs the teardown, it checks whether `netConfig` has been initialized, i.e., whether dockershim has called `UpdateRuntimeConfig`.
`pkg/kubelet/dockershim/network/kubenet/kubenet_linux.go`
```go
func (plugin *kubenetNetworkPlugin) TearDownPod(namespace string, name string, id kubecontainer.ContainerID) error {
	start := time.Now()
	defer func() {
		klog.V(4).Infof("TearDownPod took %v for %s/%s", time.Since(start), namespace, name)
	}()

	if plugin.netConfig == nil {
		return fmt.Errorf("kubenet needs a PodCIDR to tear down pods")
	}

	if err := plugin.teardown(namespace, name, id); err != nil {
		return err
	}

	// Need to SNAT outbound traffic from cluster
	if err := plugin.ensureMasqRule(); err != nil {
		klog.Errorf("Failed to ensure MASQ rule: %v", err)
	}
	return nil
}
```
So the issue is a race between GC and kubelet initialization: if GC runs before kubenet has been initialized (i.e., before it has a PodCIDR), the IPs of the sandboxes GC cleans up are not reclaimed. Furthermore, dockershim removes the checkpoint file whenever the sandbox container no longer exists, regardless of whether the network teardown succeeded, which prevents later GC passes from reclaiming the IP.
Why are IPs not being reclaimed?
Let's take a closer look at the StopPodSandbox code in dockershim. When `ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod)` is called, Docker returns a "container not found" error because the container no longer exists. In that branch, regardless of whether the network plugin managed to reclaim the IP in the earlier `ds.network.TearDownPod` call, the code executes `ds.checkpointManager.RemoveCheckpoint(podSandboxID)`, which deletes the `/var/lib/dockershim/sandbox/{containerID}` file. So whether or not the CNI teardown succeeded, the checkpoint file is removed, and subsequent GC runs never attempt to reclaim the IP for that sandbox: the missing checkpoint effectively marks the sandbox as already cleaned up and not worth processing again.
```go
	ready, ok := ds.getNetworkReady(podSandboxID)
	if !hostNetwork && (ready || !ok) {
		// Only tear down the pod network if we haven't done so already
		cID := kubecontainer.BuildContainerID(runtimeName, podSandboxID)
		// This is where the CNI plugin is called to reclaim the IP
		err := ds.network.TearDownPod(namespace, name, cID)
		if err == nil {
			ds.setNetworkReady(podSandboxID, false)
		} else {
			errList = append(errList, err)
		}
	}
	// Since the container no longer exists, this returns a ContainerNotFoundError
	if err := ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod); err != nil {
		// Do not return error if the container does not exist
		if !libdocker.IsContainerNotFoundError(err) {
			klog.Errorf("Failed to stop sandbox %q: %v", podSandboxID, err)
			errList = append(errList, err)
		} else {
			// remove the checkpoint for any sandbox that is not found in the runtime
			ds.checkpointManager.RemoveCheckpoint(podSandboxID)
		}
	}
```
Why do some container IDs in the error log have no corresponding IP file, even though those pods were not using host networking? Possibly a later GC pass succeeded in reclaiming them, but the exact cause has not been established.
Manually reclaim IP addresses with the following shell commands:
```
# Requires kubectl to be available locally
ips=$(kubectl get pod -A -o custom-columns=:status.podIP --no-headers --field-selector=spec.nodeName=10.11.96.29 |grep -v 10.11.96.29|grep -v "<none>"); \
alloc_ips=$(ls /var/lib/cni/networks/kubenet/ |grep -v -E "last_reserved_ip.0|lock" ); \
for ip in $alloc_ips;do \
  if ! echo $ips | grep "$ip" &>/dev/null;then \
    echo $ip; \
    docker_id=$(head -n 1 /var/lib/cni/networks/kubenet/$ip |sed 's/\r//'); \
    if [ -n "$docker_id" ];then \
      rm -f /var/lib/cni/cache/results/kubenet-${docker_id}-eth0 /var/lib/cni/cache/results/kubenet-loopback-${docker_id}-lo; \
    fi; \
    rm -f /var/lib/cni/networks/kubenet/$ip; \
  fi; \
done

# Recommended method
all_containers=$(docker ps -a -q);for ip in $(ls /var/lib/cni/networks/kubenet/ |grep -v -E "last_reserved_ip.0|lock");do docker_id=$(head -n 1 /var/lib/cni/networks/kubenet/$ip| sed 's/\r//'); if [ -z "${docker_id}" -o -z "$(echo ${all_containers} | grep "${docker_id:0:8}")" ];then echo $ip; if [ -n "${docker_id}" ];then rm -f /var/lib/cni/cache/results/kubenet-${docker_id}-eth0 /var/lib/cni/cache/results/kubenet-loopback-${docker_id}-lo;fi; rm -f /var/lib/cni/networks/kubenet/$ip; fi;done
```
From the Perspective of Kubelet:
The ideal fix would be for the kubelet to wait until the network plugin has received the PodCIDR before it starts stopping containers.
Alternatively, dockershim could be modified not to delete the checkpoint file when the network teardown fails, so that subsequent GC runs can retry and eventually reclaim the IP.
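As a sketch of that second option, the checkpoint removal in the excerpt above could be made conditional on the earlier `TearDownPod` call having succeeded. This is only an illustration of the idea, not an actual upstream patch; it reuses the identifiers from the `StopPodSandbox` excerpt shown earlier:

```go
// Sketch: only drop the checkpoint when nothing failed before this point.
if err := ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod); err != nil {
	// Do not return error if the container does not exist
	if !libdocker.IsContainerNotFoundError(err) {
		klog.Errorf("Failed to stop sandbox %q: %v", podSandboxID, err)
		errList = append(errList, err)
	} else if len(errList) == 0 {
		// The container is gone and the network teardown above succeeded,
		// so it is safe to remove the checkpoint. If the teardown failed,
		// keep the checkpoint so a later GC pass retries the teardown.
		ds.checkpointManager.RemoveCheckpoint(podSandboxID)
	}
}
```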
From a Mitigation Perspective:
After manually stopping and deleting containers, you can also delete the `/var/lib/cni/cache/results` and `/var/lib/cni/networks/kubenet` directories to forcibly release all IPs. This operation should not cause issues, as the CNI plugins regenerate these directories during initialization.
Additional Measures:
Consider writing a controller deployed as a DaemonSet. This controller can compare the allocated IPs on each node with the IPs in use. If any leaks are detected, it can manually reclaim them. However, be aware that this approach may introduce race conditions when competing with Kubelet for IP management, so it requires careful consideration of edge cases.
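A minimal sketch of what such an agent's reconciliation loop might look like, assuming it runs as a DaemonSet with the host's /var/lib/cni mounted and the node name injected via the downward API (all names here are illustrative, and it only reports leak candidates rather than deleting them):

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const allocDir = "/var/lib/cni/networks/kubenet"

func main() {
	nodeName := os.Getenv("NODE_NAME") // injected via the downward API in the DaemonSet spec
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	for {
		reconcile(client, nodeName)
		time.Sleep(5 * time.Minute)
	}
}

// reconcile reports allocated IP files that no pod on this node is using.
// A real controller would also double-check against the container runtime
// and only then delete the file, to avoid racing with the kubelet.
func reconcile(client *kubernetes.Clientset, nodeName string) {
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	inUse := map[string]bool{}
	for _, p := range pods.Items {
		if p.Status.PodIP != "" {
			inUse[p.Status.PodIP] = true
		}
	}

	entries, err := os.ReadDir(allocDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, e := range entries {
		ip := e.Name()
		if e.IsDir() || ip == "lock" || ip == "last_reserved_ip.0" {
			continue
		}
		if !inUse[ip] {
			fmt.Printf("possible leaked IP: %s\n", ip)
		}
	}
}
```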
The IP leak in kubenet network mode comes down to a race between kubelet startup, kubenet plugin initialization, and garbage collection: if GC runs before kubenet has been initialized, the IPs of the sandboxes it removes are not reclaimed. In addition, when dockershim stops a sandbox whose container no longer exists, it removes the checkpoint file regardless of whether the network teardown succeeded, so subsequent GC never retries the teardown and the stale IP files persist.
Trigger Scenarios
The issue is triggered by the Docker upgrade procedure: drain (evict) the pods on the node, stop the kubelet, manually stop and remove all Docker containers, then upgrade Docker. Draining does not remove DaemonSet pods, so their containers have to be stopped manually, and stopping containers directly through Docker neither invokes the CNI plugin to reclaim the IPs nor removes the dockershim checkpoint files.
It’s worth noting that similar issues could occur if a node goes down, although this scenario has not been tested.
Related Issues
Some discussions related to the "network plugin is not ready: kubenet does not have netConfig. This is most likely due to the lack of PodCIDR" issue can be found here: GitHub Issue. In summary, the upstream position is that kubelet should not wait for kubenet or reorder its startup, because other CNI plugins do not depend on the kubelet setting the PodCIDR.
There have been reports of IP leaks when pods are created and deleted rapidly, but no fixes have been provided: GitHub Issue.
Another issue related to IP leaks in Kubenet occurs during Docker restarts: GitHub Issue.
There’s also a discussion about whether the scheduler should filter nodes based on the availability of assignable IPs: GitHub Issue.
Extensions
The triggering condition involves the use of DockerShim and the Kubenet network mode. Since DockerShim is being deprecated, it’s unlikely that the official Kubernetes project will fix this issue. In the long term, transitioning to a different container runtime like Containerd would require changing the CNI plugin, as Kubenet is specific to Docker.