kubenet IP leak
Recently, after a Docker upgrade, some pods got stuck in Pending; the reason was that they could not obtain an IP. It turned out the upgrade had been performed the wrong way, which caused kubenet to leak IPs until there were none left to allocate.
Kubernetes version 1.18.8, network mode kubenet, at most 125 pods per node, PodCIDR /25.
1 Symptoms
The pod's events show that it failed to create the sandbox; the sandbox keeps being created and torn down, and kubectl describe shows the pod stuck in Pending:
Warning FailedCreatePodSandBox 3m20s (x30245 over 9h) kubelet, 10.12.97.31 (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "db90a3a26c158a70e4d251336fa62f9f32f7b0643a6ad23d52cdfea5e96c3412" network for pod "saas-o2o-public-notification-task-tomcat-dev-7c76c6984d-hjn4l": networkPlugin kubenet failed to set up pod "saas-o2o-public-notification-task-tomcat-dev-7c76c6984d-hjn4l_saas-o2o-public-tomcat-dev" network: error adding container to network: failed to allocate for range 0: no IP addresses available in range set: 10.253.6.129-10.253.6.254
From the message above, it looks like the node's IP range has been exhausted.
Check how many pod IPs are actually in use on the node:
# kubectl get pod -A -o custom-columns=:status.podIP --no-headers --field-selector=spec.nodeName=10.12.97.31 |grep -v 10.12.97.31|grep -v "<none>" |sort |wc -l
118
So the addresses are not actually used up: a /25 PodCIDR contains 2^7 = 128 addresses, three of which cannot be used (the network address, the gateway IP, and the broadcast address), leaving 125 usable, yet only 118 are in use.
To troubleshoot this, we need to understand how kubenet allocates IPs.
2 How kubenet works
kubenet is a special CNI plugin built into kubelet. It only supports the Docker runtime, and it wraps the bridge, loopback (lo), host-local and portmap CNI plugins into a new plugin.
kubenet uses the host-local plugin for IP allocation. host-local creates files under /var/lib/cni/networks/kubenet: one file named after each allocated IP, plus last_reserved_ip.0 and lock.
- IP files: one per allocated IP, each containing the container ID and the interface name
- last_reserved_ip.0: records the last allocated IP; the next allocation starts searching from the address after it
- lock: a lock file that prevents concurrent allocations from clashing
The CNI plugin also writes two files per container under /var/lib/cni/cache/results/, named kubenet-{container id}-eth0 and kubenet-loopback-{container id}-lo, which record the IP configuration of each of the container's interfaces.
For example, suppose 10.253.9.21 was allocated:
# cat /var/lib/cni/networks/kubenet/10.253.9.21
1cfead4f783bbd3928acd6b23e283e9d434443df74743fe65206b569d088a48e
eth0
# cat /var/lib/cni/cache/results/kubenet-1cfead4f783bbd3928acd6b23e283e9d434443df74743fe65206b569d088a48e-eth0
{"cniVersion":"0.2.0","ip4":{"ip":"10.253.9.21/25","gateway":"10.253.9.1","routes":[{"dst":"0.0.0.0/0"}]},"dns":{}}
# cat /var/lib/cni/cache/results/kubenet-loopback-1cfead4f783bbd3928acd6b23e283e9d434443df74743fe65206b569d088a48e-lo
{"cniVersion":"0.2.0","ip4":{"ip":"127.0.0.1/8"},"dns":{}}
With kubenet's bookkeeping understood, look at the IPs already allocated on this node:
# ls /var/lib/cni/networks/kubenet/
10.253.6.130 10.253.6.141 10.253.6.152 10.253.6.163 10.253.6.174 10.253.6.185 10.253.6.196 10.253.6.207 10.253.6.218 10.253.6.229 10.253.6.240 10.253.6.251
10.253.6.131 10.253.6.142 10.253.6.153 10.253.6.164 10.253.6.175 10.253.6.186 10.253.6.197 10.253.6.208 10.253.6.219 10.253.6.230 10.253.6.241 10.253.6.252
10.253.6.132 10.253.6.143 10.253.6.154 10.253.6.165 10.253.6.176 10.253.6.187 10.253.6.198 10.253.6.209 10.253.6.220 10.253.6.231 10.253.6.242 10.253.6.253
10.253.6.133 10.253.6.144 10.253.6.155 10.253.6.166 10.253.6.177 10.253.6.188 10.253.6.199 10.253.6.210 10.253.6.221 10.253.6.232 10.253.6.243 10.253.6.254
10.253.6.134 10.253.6.145 10.253.6.156 10.253.6.167 10.253.6.178 10.253.6.189 10.253.6.200 10.253.6.211 10.253.6.222 10.253.6.233 10.253.6.244 last_reserved_ip.0
10.253.6.135 10.253.6.146 10.253.6.157 10.253.6.168 10.253.6.179 10.253.6.190 10.253.6.201 10.253.6.212 10.253.6.223 10.253.6.234 10.253.6.245 lock
10.253.6.136 10.253.6.147 10.253.6.158 10.253.6.169 10.253.6.180 10.253.6.191 10.253.6.202 10.253.6.213 10.253.6.224 10.253.6.235 10.253.6.246
10.253.6.137 10.253.6.148 10.253.6.159 10.253.6.170 10.253.6.181 10.253.6.192 10.253.6.203 10.253.6.214 10.253.6.225 10.253.6.236 10.253.6.247
10.253.6.138 10.253.6.149 10.253.6.160 10.253.6.171 10.253.6.182 10.253.6.193 10.253.6.204 10.253.6.215 10.253.6.226 10.253.6.237 10.253.6.248
10.253.6.139 10.253.6.150 10.253.6.161 10.253.6.172 10.253.6.183 10.253.6.194 10.253.6.205 10.253.6.216 10.253.6.227 10.253.6.238 10.253.6.249
10.253.6.140 10.253.6.151 10.253.6.162 10.253.6.173 10.253.6.184 10.253.6.195 10.253.6.206 10.253.6.217 10.253.6.228 10.253.6.239 10.253.6.250
Indeed, every usable IP has been allocated, but the number of IPs actually in use does not match the number allocated; in other words, some IPs were never released.
Find the IPs that are allocated but not in use:
#requires kubectl on the node; not recommended
ips=$(kubectl get pod -A -o custom-columns=:status.podIP --no-headers --field-selector=spec.nodeName=10.12.97.31 |grep -v 10.12.97.31|grep -v "<none>"); \
alloc_ips=$(ls /var/lib/cni/networks/kubenet/ |grep -v -E "last_reserved_ip.0|lock" ); \
for ip in $alloc_ips;do \
if ! echo $ips | grep -w "$ip" &>/dev/null;then \
echo $ip; \
fi; \
done
#recommended
all_containers=$(docker ps -a -q)
for ip in $(ls /var/lib/cni/networks/kubenet/ | grep -v -E "last_reserved_ip.0|lock"); do
  docker_id=$(head -n 1 /var/lib/cni/networks/kubenet/$ip | sed 's/\r//')
  if [ -z "${docker_id}" -o -z "$(echo ${all_containers} | grep "${docker_id:0:8}")" ]; then
    echo $ip
  fi
done
#output
10.253.6.130
10.253.6.131
...
For each IP found, check whether the recorded container still exists. It does not:
cat /var/lib/cni/networks/kubenet/10.253.6.130
950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f
eth0
docker ps -a |grep 950b9
3 Why weren't the IP addresses reclaimed?
IPs are released when kubelet tears a pod down: during pod destruction, kubelet calls the CNI plugin to free the IP. An IP leak means this step must have failed.
Inside kubelet, GC, eviction, and pod deletion all trigger pod destruction.
Searching the kubelet logs shows errors right after startup:
There are two key messages: Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: kubenet does not have netConfig. This is most likely due to lack of PodCIDR
and StopPodSandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" from runtime service failed: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods
They show that when kubelet restarted, the network plugin was not yet initialized (the PodCIDR had not been received), yet kubelet already started stopping the existing pods on the node; kubenet needs the PodCIDR to tear a pod down, hence the error.
#these logs were printed right after kubelet started
I0326 19:16:59.441752 340113 server.go:393] Adding debug handlers to kubelet server.
E0326 19:16:59.452246 340113 kubelet.go:2188] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: kubenet does not have netConfig. This is most likely due to lack of PodCIDR #kubelet has detected that the kubenet network plugin is not yet initialized
I0326 19:16:59.452348 340113 clientconn.go:106] parsed scheme: "unix"
I0326 19:16:59.452354 340113 clientconn.go:106] scheme "unix" not registered, fallback to default scheme
I0326 19:16:59.452403 340113 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock <nil> 0 <nil>}] <nil> <nil>}
I0326 19:16:59.452408 340113 clientconn.go:933] ClientConn switching balancer to "pick_first"
I0326 19:16:59.452509 340113 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000dcfb80, {CONNECTING <nil>}
I0326 19:16:59.452644 340113 balancer_conn_wrappers.go:78] pickfirstBalancer: HandleSubConnStateChange: 0xc000dcfb80, {READY <nil>}
I0326 19:16:59.453288 340113 factory.go:137] Registering containerd factory
I0326 19:16:59.455480 340113 kubelet_network_linux.go:150] Not using `--random-fully` in the MASQUERADE rule for iptables because the local version of iptables does not support it
I0326 19:16:59.457684 340113 status_manager.go:158] Starting to sync pod status with apiserver
I0326 19:16:59.457711 340113 kubelet.go:1822] Starting kubelet main sync loop.
E0326 19:16:59.457760 340113 kubelet.go:1846] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]
I0326 19:16:59.457873 340113 reflector.go:175] Starting reflector *v1beta1.RuntimeClass (0s) from k8s.io/client-go/informers/factory.go:135
I0326 19:16:59.473898 340113 factory.go:356] Registering Docker factory
I0326 19:16:59.473909 340113 factory.go:54] Registering systemd factory
I0326 19:16:59.474102 340113 factory.go:101] Registering Raw factory
I0326 19:16:59.474274 340113 manager.go:1158] Started watching for new ooms in manager
I0326 19:16:59.475292 340113 manager.go:272] Starting recovery of all containers
I0326 19:16:59.486891 340113 manager.go:277] Recovery completed
E0326 19:16:59.533375 340113 remote_runtime.go:128] StopPodSandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" from runtime service failed: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.533393 340113 kuberuntime_gc.go:170] Failed to stop sandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods
Find all containers whose teardown failed and check whether their IDs appear in the allocated-IP files:
# grep "Failed to stop sandbox" /data/log/kubernetes/kubelet.INFO
E0326 19:16:59.527369 340113 kuberuntime_gc.go:170] Failed to stop sandbox "decef236193c498235ab5efc33498d06abc34bea58ee7a68d1110228e4e59df2" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "debug-agent-2r94g_debug" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.528089 340113 kuberuntime_gc.go:170] Failed to stop sandbox "d54dd8336cdffd70a9266c24ab3aeff70a4fd1ab902b58afbb925449aba5a2bd" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "saas-mcloud-companyserver-ops-python-dev-555d77d9c7-kd5q5_saas-mcloud-python-dev" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.528802 340113 kuberuntime_gc.go:170] Failed to stop sandbox "cf15243eabb67c87bd93078e7a88b833996dda18d39c479363c62d0dd6ae391d" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "saas-jcpt-open-message-producer-service-tomcat-dev-6cc7fd6n8knt_saas-jcpt-tomcat-dev" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.530857 340113 kuberuntime_gc.go:170] Failed to stop sandbox "a1c4b1a54172d325df761de068e1ccb37040bfd7c175539912fa60232eca9b5e" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "fluentd-cz8qq_kube-system" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.532710 340113 kuberuntime_gc.go:170] Failed to stop sandbox "9b5d8a1367e4280237bf9f56d0648e0c279b161c19a72b290c3dcaf21a0fcad1" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "saas-xiaoke-libreoffice-service-other-dev-6bcc5d8879-zvvgg_saas-xiaoke-other-dev" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.533393 340113 kuberuntime_gc.go:170] Failed to stop sandbox "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "container-exporter-m9d2r_eventwatcher" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.536454 340113 kuberuntime_gc.go:170] Failed to stop sandbox "8190a101707a17793e8cfd35485785a9610d9c524f7f041b7dced457e79268e5" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "push-netstat-htjvj_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.537102 340113 kuberuntime_gc.go:170] Failed to stop sandbox "7e7a27ecd60f42446fe5ac4709e444f125ac88d6810de9dfd71e5721fdad0d71" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "k8s-sla-wqm6b_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.537724 340113 kuberuntime_gc.go:170] Failed to stop sandbox "79dee3630c194240c2fcb631e5fa560fc89ba52820ef924092dd3ae980e85df3" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "kibana-k8s-844d67476b-djqbf_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.538371 340113 kuberuntime_gc.go:170] Failed to stop sandbox "4988eaaf02d8cd2164f81a94b264a7b6e03cf87cb0b3a76ae74679f1bd5d3e97" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "filebeat-applog-collection-rndhr_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
E0326 19:16:59.539003 340113 kuberuntime_gc.go:170] Failed to stop sandbox "0a917f395c84f42f6d060bee9bcbac403c396dceec88e7d4c9301493a7ad9233" before removing: rpc error: code = Unknown desc = networkPlugin kubenet failed to teardown pod "filebeat-opslog-collection-c59hf_xxx-ops" network: kubenet needs a PodCIDR to tear down pods
# grep "decef236193c498235ab5efc33498d06abc34bea58ee7a68d1110228e4e59df2" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.135:decef236193c498235ab5efc33498d06abc34bea58ee7a68d1110228e4e59df2
[root@sh-saas-k8s1-node-dev-11 ~]# grep "d54dd8336cdffd70a9266c24ab3aeff70a4fd1ab902b58afbb925449aba5a2bd" /var/lib/cni/networks/kubenet/ -r
[root@sh-saas-k8s1-node-dev-11 ~]# grep "cf15243eabb67c87bd93078e7a88b833996dda18d39c479363c62d0dd6ae391d" /var/lib/cni/networks/kubenet/ -r
[root@sh-saas-k8s1-node-dev-11 ~]# grep "a1c4b1a54172d325df761de068e1ccb37040bfd7c175539912fa60232eca9b5e" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.217:a1c4b1a54172d325df761de068e1ccb37040bfd7c175539912fa60232eca9b5e
[root@sh-saas-k8s1-node-dev-11 ~]# grep "950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.130:950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f
[root@sh-saas-k8s1-node-dev-11 ~]# grep "8190a101707a17793e8cfd35485785a9610d9c524f7f041b7dced457e79268e5" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.134:8190a101707a17793e8cfd35485785a9610d9c524f7f041b7dced457e79268e5
[root@sh-saas-k8s1-node-dev-11 ~]# grep "7e7a27ecd60f42446fe5ac4709e444f125ac88d6810de9dfd71e5721fdad0d71" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.131:7e7a27ecd60f42446fe5ac4709e444f125ac88d6810de9dfd71e5721fdad0d71
[root@sh-saas-k8s1-node-dev-11 ~]# grep "79dee3630c194240c2fcb631e5fa560fc89ba52820ef924092dd3ae980e85df3" /var/lib/cni/networks/kubenet/ -r
[root@sh-saas-k8s1-node-dev-11 ~]# grep "4988eaaf02d8cd2164f81a94b264a7b6e03cf87cb0b3a76ae74679f1bd5d3e97" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.132:4988eaaf02d8cd2164f81a94b264a7b6e03cf87cb0b3a76ae74679f1bd5d3e97
[root@sh-saas-k8s1-node-dev-11 ~]# grep "0a917f395c84f42f6d060bee9bcbac403c396dceec88e7d4c9301493a7ad9233" /var/lib/cni/networks/kubenet/ -r
/var/lib/cni/networks/kubenet/10.253.6.235:0a917f395c84f42f6d060bee9bcbac403c396dceec88e7d4c9301493a7ad9233
Indeed, some of the leaked IP files contain container IDs that show up in these error logs.
Is this error what caused the IP leak?
The kuberuntime_gc.go:170 in the log tells us the error is reported by the GC code.
By default kubelet runs image and container GC every minute. Container GC walks the sandbox checkpoint files under /var/lib/dockershim/sandbox/ (one file per sandbox container ID) and removes containers that have exited, containers not associated with any pod or whose pod is being deleted, as well as sandboxes that no longer exist.
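As a rough cross-check (a sketch only; it assumes the file names under /var/lib/dockershim/sandbox/ are full sandbox container IDs, as described above), you can list the checkpoints whose sandbox container Docker no longer knows about:
all_sandboxes=$(docker ps -a -q --no-trunc)
for cid in $(ls /var/lib/dockershim/sandbox/); do
  if ! echo "${all_sandboxes}" | grep -q "${cid}"; then
    echo "checkpoint ${cid} has no backing container"
  fi
done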
Why do the sandbox containers in the logs no longer exist?
With older Docker versions, containerd ran as a child process of dockerd; in newer versions containerd runs standalone, and our Docker has live-restore enabled. So the upgrade required stopping the running containers first, otherwise they would become orphans that could no longer be managed afterwards. Here we stopped and deleted all containers; merely stopping them would not have helped either, because containerd's working directory changes with the upgrade and docker ps would no longer show containers stopped before it.
These sandboxes were all stopped and removed manually with docker commands, which is why none of them exist any more.
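As an aside, whether live-restore is enabled on a node can be checked with the following (assuming a Docker version whose info output exposes the LiveRestoreEnabled field):
docker info --format '{{.LiveRestoreEnabled}}'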
The kubelet code that removes a sandbox which no longer exists:
pkg/kubelet/kuberuntime/kuberuntime_gc.go
// removeSandbox removes the sandbox by sandboxID.
func (cgc *containerGC) removeSandbox(sandboxID string) {
klog.V(4).Infof("Removing sandbox %q", sandboxID)
// In normal cases, kubelet should've already called StopPodSandbox before
// GC kicks in. To guard against the rare cases where this is not true, try
// stopping the sandbox before removing it.
if err := cgc.client.StopPodSandbox(sandboxID); err != nil {
klog.Errorf("Failed to stop sandbox %q before removing: %v", sandboxID, err)
return
}
if err := cgc.client.RemovePodSandbox(sandboxID); err != nil {
klog.Errorf("Failed to remove sandbox %q: %v", sandboxID, err)
}
}
cgc.client.StopPodSandbox calls the CRI StopPodSandbox API. The CRI we use is dockershim, and when dockershim stops a sandbox it also calls the CNI plugin to release the IP.
pkg/kubelet/dockershim/docker_sandbox.go
// StopPodSandbox stops the sandbox. If there are any running containers in the
// sandbox, they should be force terminated.
// TODO: This function blocks sandbox teardown on networking teardown. Is it
// better to cut our losses assuming an out of band GC routine will cleanup
// after us?
func (ds *dockerService) StopPodSandbox(ctx context.Context, r *runtimeapi.StopPodSandboxRequest) (*runtimeapi.StopPodSandboxResponse, error) {
.......
// WARNING: The following operations made the following assumption:
// 1. kubelet will retry on any error returned by StopPodSandbox.
// 2. tearing down network and stopping sandbox container can succeed in any sequence.
// This depends on the implementation detail of network plugin and proper error handling.
// For kubenet, if tearing down network failed and sandbox container is stopped, kubelet
// will retry. On retry, kubenet will not be able to retrieve network namespace of the sandbox
// since it is stopped. With empty network namespace, CNI bridge plugin will conduct best
// effort clean up and will not return error.
errList := []error{}
ready, ok := ds.getNetworkReady(podSandboxID)
if !hostNetwork && (ready || !ok) {
// Only tear down the pod network if we haven't done so already
cID := kubecontainer.BuildContainerID(runtimeName, podSandboxID)
//this is where the CNI plugin is called to release the pod's IP
err := ds.network.TearDownPod(namespace, name, cID)
if err == nil {
ds.setNetworkReady(podSandboxID, false)
} else {
errList = append(errList, err)
}
}
if err := ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod); err != nil {
// Do not return error if the container does not exist
if !libdocker.IsContainerNotFoundError(err) {
klog.Errorf("Failed to stop sandbox %q: %v", podSandboxID, err)
errList = append(errList, err)
} else {
// remove the checkpoint for any sandbox that is not found in the runtime
ds.checkpointManager.RemoveCheckpoint(podSandboxID)
}
}
Before tearing down a pod, kubenet checks whether netConfig has been initialized, i.e. whether dockershim has called UpdateRuntimeConfig:
pkg/kubelet/dockershim/network/kubenet/kubenet_linux.go
func (plugin *kubenetNetworkPlugin) TearDownPod(namespace string, name string, id kubecontainer.ContainerID) error {
start := time.Now()
defer func() {
klog.V(4).Infof("TearDownPod took %v for %s/%s", time.Since(start), namespace, name)
}()
if plugin.netConfig == nil {
return fmt.Errorf("kubenet needs a PodCIDR to tear down pods")
}
if err := plugin.teardown(namespace, name, id); err != nil {
return err
}
// Need to SNAT outbound traffic from cluster
if err := plugin.ensureMasqRule(); err != nil {
klog.Errorf("Failed to ensure MASQ rule: %v", err)
}
return nil
}
GC runs in a loop, so you might expect that although kubenet was not yet initialized and the IP was not reclaimed this time, it would be reclaimed once kubenet finished initializing. But even long afterwards, the IP files were still there.
Why were the IPs never reclaimed?
Look at dockershim's StopPodSandbox again. When it calls ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod), Docker returns a container-not-found error, so dockershim runs ds.checkpointManager.RemoveCheckpoint(podSandboxID), which deletes the /var/lib/dockershim/sandbox/{container id} file. In other words, the checkpoint file is deleted regardless of whether the network plugin call succeeded. As a result, later GC runs no longer see this sandbox at all, and its IP is never reclaimed.
ready, ok := ds.getNetworkReady(podSandboxID)
if !hostNetwork && (ready || !ok) {
// Only tear down the pod network if we haven't done so already
cID := kubecontainer.BuildContainerID(runtimeName, podSandboxID)
//this is where the CNI plugin is called to release the pod's IP
err := ds.network.TearDownPod(namespace, name, cID)
if err == nil {
ds.setNetworkReady(podSandboxID, false)
} else {
errList = append(errList, err)
}
}
// the container no longer exists, so this returns a ContainerNotFoundError
if err := ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod); err != nil {
// Do not return error if the container does not exist
if !libdocker.IsContainerNotFoundError(err) {
klog.Errorf("Failed to stop sandbox %q: %v", podSandboxID, err)
errList = append(errList, err)
} else {
// remove the checkpoint for any sandbox that is not found in the runtime
ds.checkpointManager.RemoveCheckpoint(podSandboxID)
}
}
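This can be confirmed on the node (a quick sketch using one of the leaked container IDs and the paths from this article): the dockershim checkpoint is already gone while the host-local IP file is still present.
id=950b9e02d470d2a3bf7c39100827b0b49ef00f251d4abf354069c78bc25e0a5f
ls /var/lib/dockershim/sandbox/ | grep "$id" || echo "checkpoint already removed"
grep -rl "$id" /var/lib/cni/networks/kubenet/ && echo "IP file still allocated"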
For some of the container IDs in the logs no IP file could be found, even though those pods were not using host networking. Why? Possibly a later GC run reclaimed them; the exact cause has not been found yet.
4 Solutions
4.1 Temporary workaround
Reclaim the leaked IPs by hand:
#requires kubectl on the node
ips=$(kubectl get pod -A -o custom-columns=:status.podIP --no-headers --field-selector=spec.nodeName=10.11.96.29 |grep -v 10.11.96.29|grep -v "<none>"); \
alloc_ips=$(ls /var/lib/cni/networks/kubenet/ |grep -v -E "last_reserved_ip.0|lock" ); \
for ip in $alloc_ips;do \
if ! echo $ips | grep -w "$ip" &>/dev/null;then \
echo $ip; \
docker_id=$(head -n 1 /var/lib/cni/networks/kubenet/$ip |sed 's/\r//'); \
if [ -n "$docker_id" ];then \
rm -f /var/lib/cni/cache/results/kubenet-${docker_id}-eth0 /var/lib/cni/cache/results/kubenet-loopback-${docker_id}-lo; \
fi; \
rm -f /var/lib/cni/networks/kubenet/$ip; \
fi; \
done
#recommended
all_containers=$(docker ps -a -q)
for ip in $(ls /var/lib/cni/networks/kubenet/ | grep -v -E "last_reserved_ip.0|lock"); do
  docker_id=$(head -n 1 /var/lib/cni/networks/kubenet/$ip | sed 's/\r//')
  if [ -z "${docker_id}" -o -z "$(echo ${all_containers} | grep "${docker_id:0:8}")" ]; then
    echo $ip
    if [ -n "${docker_id}" ]; then
      rm -f /var/lib/cni/cache/results/kubenet-${docker_id}-eth0 /var/lib/cni/cache/results/kubenet-loopback-${docker_id}-lo
    fi
    rm -f /var/lib/cni/networks/kubenet/$ip
  fi
done
4.2 Long-term solutions
From the kubelet side:
The cleanest fix would clearly be for kubelet to wait until the CNI PodCIDR has been set before it starts stopping containers.
Alternatively, dockershim could return as soon as stopping the container fails, without deleting the checkpoint file, so that a later GC run could still reclaim the IP.
From the avoidance side:
After manually stopping and removing the containers, also delete the /var/lib/cni/cache/results and /var/lib/cni/networks/kubenet directories to forcibly release all IPs.
There is no need to worry about this operation: the CNI plugin recreates these directories when it initializes. A sketch of that upgrade-time cleanup is shown below.
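Put together, the upgrade-time cleanup could look roughly like this (a sketch of the procedure described above, not a tested script; the systemd unit names are assumptions and may differ per environment):
systemctl stop kubelet
# stop and remove all containers; live-restore cannot carry them across this upgrade
docker ps -q | xargs -r docker stop
docker ps -a -q | xargs -r docker rm
# forcibly release all kubenet IP state; the CNI plugin recreates these directories on init
rm -rf /var/lib/cni/cache/results /var/lib/cni/networks/kubenet
# ... upgrade the docker packages here ...
systemctl start docker kubelet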
Auxiliary measure
Write a controller and deploy it as a DaemonSet that compares the IPs allocated on each node against the IPs actually in use and reclaims any that have leaked. This races with kubelet's own CNI calls, though, so the edge cases need careful handling.
5 Summary
The kubenet IP leak happens because, when kubelet starts, kubenet initialization and GC run concurrently; if GC runs before the kubenet plugin has initialized, kubenet cannot release the IPs. And because dockershim removes the checkpoint file when it stops a sandbox container that is already gone, even though the network teardown failed, later GC runs never revisit that sandbox, so the IP file stays around forever.
Trigger scenario
The Docker upgrade procedure was: drain the pods on the node, stop kubelet, stop and delete all Docker containers, then upgrade Docker.
Draining the node does not evict DaemonSet pods, so all remaining containers on the node had to be stopped by hand.
Stopping Docker containers manually neither calls the CNI plugin to release IPs nor deletes the dockershim checkpoint files.
Presumably a node going down could trigger a similar situation, though this was not tested.
Related issues
There is some discussion of network plugin is not ready: kubenet does not have netConfig. This is most likely due to lack of PodCIDR at https://github.com/kubernetes/kubernetes/issues/32900#issuecomment-249086286; upstream's position is roughly that there is no need to change the ordering just for kubenet, since other CNI plugins do not depend on kubelet setting the PodCIDR.
Creating a pod and deleting it right away can also leak an IP, and no fix has landed:
https://github.com/kubernetes/kubernetes/issues/86944
IP leak in kubenet when Docker restarts:
https://github.com/kubernetes/kubernetes/issues/34278
Whether the scheduler should filter nodes by their remaining allocatable IPs:
https://github.com/kubernetes/kubernetes/issues/21656
Going further
The trigger condition is running dockershim with the kubenet network mode.
Since dockershim is being deprecated, there is essentially no chance upstream will fix this.
In the long run we should move to another container runtime such as containerd; but switching runtimes also means switching the CNI, because kubenet only supports Docker.