About four months ago, while troubleshooting a pod removal failure caused by a CNI plugin bug, I walked through the pod deletion flow of Kubernetes 1.23. The kubelet logs frequently showed the error "an error occurred when try to find container". I used to ignore it whenever it appeared; this time I decided to track down where it comes from.
This post only analyzes the "an error occurred when try to find container" error seen during pod removal. In other scenarios the same message may have a different cause.
Environment: kubelet 1.23.10, log verbosity level 4.
1 Log error samples
Caused by PLEG issuing the deletion twice:
E0731 17:20:16.176440 354472 remote_runtime.go:597] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"248fef457ea64c3bffe8f2371626a5a96c14f148f768d947d9b3057c17e89e7a\": not found" containerID="248fef457ea64c3bffe8f2371626a5a96c14f148f768d947d9b3057c17e89e7a"
I0731 17:20:16.176478 354472 pod_container_deletor.go:52] "DeleteContainer returned error" containerID={Type:containerd ID:248fef457ea64c3bffe8f2371626a5a96c14f148f768d947d9b3057c17e89e7a} err="failed to get container status \"248fef457ea64c3bffe8f2371626a5a96c14f148f768d947d9b3057c17e89e7a\": rpc error: code = NotFound desc = an error occurred when try to find container \"248fef457ea64c3bffe8f2371626a5a96c14f148f768d947d9b3057c17e89e7a\": not found"
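The second delete targets a container ID the runtime has already removed, so the call comes back with a gRPC NotFound error. A minimal toy model of that failure mode (fakeRuntime, errNotFound and the shortened ID are all hypothetical, not kubelet code):

package main

import (
	"errors"
	"fmt"
)

// errNotFound mimics the runtime's answer for an unknown container ID.
var errNotFound = errors.New("an error occurred when try to find container: not found")

// fakeRuntime is a toy runtime that tracks containers by ID.
type fakeRuntime struct{ containers map[string]struct{} }

// DeleteContainer removes the container; a second call for the same ID fails
// because the entry is already gone.
func (r *fakeRuntime) DeleteContainer(id string) error {
	if _, ok := r.containers[id]; !ok {
		return errNotFound
	}
	delete(r.containers, id)
	return nil
}

func main() {
	r := &fakeRuntime{containers: map[string]struct{}{"248fef457ea6": {}}}
	fmt.Println(r.DeleteContainer("248fef457ea6")) // first delete: <nil>
	fmt.Println(r.DeleteContainer("248fef457ea6")) // duplicate delete: not found
}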
housekeeping re-creates the podWorker, and its removal attempt reports an error:
E0731 17:16:43.822332 354472 remote_runtime.go:479] "StopContainer from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"cb2bbdc9ae26e1385e012072a4e0208d91fc027ae1af89cf1432ff3173014d32\": not found" containerID="cb2bbdc9ae26e1385e012072a4e0208d91fc027ae1af89cf1432ff3173014d32"
I0731 17:16:32.394625 354472 kubelet.go:2130] "SyncLoop DELETE" source="api" pods=[default/test-nginx-976fbbd77-m2tqs]
I0731 17:16:32.394654 354472 pod_workers.go:611] "Pod is finished processing, no further updates" pod="default/test-nginx-976fbbd77-m2tqs" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0
podConfig observes the pod's removal from the apiserver:
I0731 17:16:32.398009 354472 kubelet.go:2124] "SyncLoop REMOVE" source="api" pods=[default/test-nginx-976fbbd77-m2tqs]
I0731 17:16:32.398032 354472 kubelet.go:1969] "Pod has been deleted and must be killed" pod="default/test-nginx-976fbbd77-m2tqs" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0
housekeeping removes the podWorker and creates a new one, which runs the terminating-pod flow:
I0731 17:16:33.810883 354472 kubelet.go:2202] "SyncLoop (housekeeping)"
I0731 17:16:33.820842 354472 kubelet_pods.go:1082] "Clean up pod workers for terminated pods"
I0731 17:16:33.820869 354472 pod_workers.go:1258] "Pod has been terminated and is no longer known to the kubelet, remove all history" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0
......
I0731 17:16:33.820908 354472 kubelet_pods.go:1134] "Clean up orphaned pod containers" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0
I0731 17:16:33.820927 354472 pod_workers.go:571] "Pod is being synced for the first time" pod="default/test-nginx-976fbbd77-m2tqs" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0
I0731 17:16:33.820938 354472 pod_workers.go:620] "Pod is orphaned and must be torn down" pod="default/test-nginx-976fbbd77-m2tqs" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0
I0731 17:16:33.821092  354472 pod_workers.go:888] "Processing pod event" pod="default/test-nginx-976fbbd77-m2tqs" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0 updateType=1
I0731 17:16:33.821127 354472 pod_workers.go:1005] "Pod worker has observed request to terminate" pod="default/test-nginx-976fbbd77-m2tqs" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0
........
I0731 17:16:33.821138 354472 kubelet.go:1795] "syncTerminatingPod enter" pod="default/test-nginx-976fbbd77-m2tqs" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0
I0731 17:16:33.821149 354472 kubelet.go:1804] "Pod terminating with grace period" pod="default/test-nginx-976fbbd77-m2tqs" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0 gracePeriod=1
I0731 17:16:33.821198 354472 kuberuntime_container.go:719] "Killing container with a grace period override" pod="default/test-nginx-976fbbd77-m2tqs" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0 containerName="nginx" containerID="containerd://c234646fffe772e4979bc11398e1b0b8612136c08030486ae1fde12238479b9e" gracePeriod=1
I0731 17:16:33.821212 354472 kuberuntime_container.go:723] "Killing container with a grace period" pod="default/test-nginx-976fbbd77-m2tqs" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0 containerName="nginx" containerID="containerd://c234646fffe772e4979bc11398e1b0b8612136c08030486ae1fde12238479b9e" gracePeriod=1
I0731 17:16:33.821305 354472 event.go:294] "Event occurred" object="default/test-nginx-976fbbd77-m2tqs" kind="Pod" apiVersion="v1" type="Normal" reason="Killing" message="Stopping container nginx"
E0731 17:16:33.822425 354472 remote_runtime.go:479] "StopContainer from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"c234646fffe772e4979bc11398e1b0b8612136c08030486ae1fde12238479b9e\": not found" containerID="c234646fffe772e4979bc11398e1b0b8612136c08030486ae1fde12238479b9e"
I0731 17:16:33.822479 354472 kuberuntime_container.go:732] "Container exited normally" pod="default/test-nginx-976fbbd77-m2tqs" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0 containerName="nginx" containerID="containerd://c234646fffe772e4979bc11398e1b0b8612136c08030486ae1fde12238479b9e"
.......
I0731 17:16:33.892542 354472 kubelet.go:1815] "syncTerminatingPod exit" pod="default/test-nginx-976fbbd77-m2tqs" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0
I0731 17:16:33.892552 354472 pod_workers.go:1081] "Pod terminated all orphaned containers successfully and worker can now stop" pod="default/test-nginx-976fbbd77-m2tqs" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0
I0731 17:16:33.892567 354472 pod_workers.go:969] "Processing pod event done" pod="default/test-nginx-976fbbd77-m2tqs" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0 updateType=1
housekeeping triggers again and removes the podWorker:
I0731 17:16:35.810945 354472 kubelet.go:2202] "SyncLoop (housekeeping)"
I0731 17:16:35.820522 354472 kubelet_pods.go:1082] "Clean up pod workers for terminated pods"
I0731 17:16:35.820564 354472 pod_workers.go:1258] "Pod has been terminated and is no longer known to the kubelet, remove all history" podUID=5011243d-6324-41e3-8ee8-9e8bb33f04b0
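The "DeleteContainer returned error" message in the first log sample is printed by the podContainerDeletor's worker goroutine (pkg/kubelet/pod_container_deletor.go), which drains container IDs from a buffered channel and calls runtime.DeleteContainer on each: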
func newPodContainerDeletor(runtime kubecontainer.Runtime, containersToKeep int) *podContainerDeletor {
	buffer := make(chan kubecontainer.ContainerID, containerDeletorBufferLimit)
	go wait.Until(func() {
		for {
			id := <-buffer
			if err := runtime.DeleteContainer(id); err != nil {
				klog.InfoS("DeleteContainer returned error", "containerID", id, "err", err)
			}
		}
	}, 0, wait.NeverStop)

	return &podContainerDeletor{
		worker:           buffer,
		containersToKeep: containersToKeep,
	}
}
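Inside the runtime manager, DeleteContainer removes the container's logs through removeContainerLog, and the ContainerStatus call in there is what wraps the NotFound error into the "failed to get container status" message seen in the log sample: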
// removeContainerLog removes the container log.
func (m *kubeGenericRuntimeManager) removeContainerLog(containerID string) error {
	// Use log manager to remove rotated logs.
	err := m.logManager.Clean(containerID)
	if err != nil {
		return err
	}

	status, err := m.runtimeService.ContainerStatus(containerID)
	if err != nil {
		return fmt.Errorf("failed to get container status %q: %v", containerID, err)
	}
	......
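The "not found" here is carried as the gRPC NotFound status code, so a caller can recognize an already-removed container by the code instead of matching the message text. A minimal sketch (the isNotFound helper is hypothetical, not kubelet's API):

package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isNotFound reports whether err carries the gRPC NotFound code, which is how
// the CRI runtime signals a container it no longer knows about.
func isNotFound(err error) bool {
	return status.Code(err) == codes.NotFound
}

func main() {
	err := status.Error(codes.NotFound, `an error occurred when try to find container "248fef457ea6": not found`)
	fmt.Println(isNotFound(err)) // true
}

The logs above suggest kubelet does something similar when stopping containers: right after the StopContainer NotFound error it prints "Container exited normally", i.e. a NotFound on stop is treated as the container already being gone.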
4.3 Second occurrence
This happens when the first housekeeping run fires and the podWorker tries to stop the container:
E0731 17:16:33.822425 354472 remote_runtime.go:479] "StopContainer from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"c234646fffe772e4979bc11398e1b0b8612136c08030486ae1fde12238479b9e\": not found" containerID="c234646fffe772e4979bc11398e1b0b8612136c08030486ae1fde12238479b9e"
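The code below is the relevant part of HandlePodCleanups (pkg/kubelet/kubelet_pods.go), the housekeeping path that logs "Clean up pod workers for terminated pods" and "Clean up orphaned pod containers" above. It classifies the pods known to the podWorkers, then re-creates a podWorker for any pod that is still in the runtime cache but no longer in config: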
workingPods := kl.podWorkers.SyncKnownPods(allPods)

allPodsByUID := make(map[types.UID]*v1.Pod)
for _, pod := range allPods {
	allPodsByUID[pod.UID] = pod
}

// Identify the set of pods that have workers, which should be all pods
// from config that are not terminated, as well as any terminating pods
// that have already been removed from config. Pods that are terminating
// will be added to possiblyRunningPods, to prevent overly aggressive
// cleanup of pod cgroups.
runningPods := make(map[types.UID]sets.Empty)
possiblyRunningPods := make(map[types.UID]sets.Empty)
restartablePods := make(map[types.UID]sets.Empty)
// Classify the pods in podWorkers (pods no longer in kl.podManager have already been removed).
// runningPods holds pods whose podWorker is in SyncPod state.
// possiblyRunningPods holds pods whose podWorker is in SyncPod or TerminatingPod state.
// restartablePods holds pods whose podWorker is in TerminatedAndRecreatedPod state.
for uid, sync := range workingPods {
	switch sync {
	case SyncPod:
		runningPods[uid] = struct{}{}
		possiblyRunningPods[uid] = struct{}{}
	case TerminatingPod:
		possiblyRunningPods[uid] = struct{}{}
	case TerminatedAndRecreatedPod:
		restartablePods[uid] = struct{}{}
	}
}

// Stop probing pods that are not running
klog.V(3).InfoS("Clean up probes for terminated pods")
// For every worker in kl.probeManager.workers whose pod UID is not in
// possiblyRunningPods, send a signal on worker.stopCh so the periodic probe
// stops and the worker is deleted.
kl.probeManager.CleanupPods(possiblyRunningPods)

// Terminate any pods that are observed in the runtime but not
// present in the list of known running pods from config.
// Fetch the list of running pods from the runtime cache.
runningRuntimePods, err := kl.runtimeCache.GetPods()
if err != nil {
	klog.ErrorS(err, "Error listing containers")
	return err
}
for _, runningPod := range runningRuntimePods {
	switch workerState, ok := workingPods[runningPod.ID]; {
	// The pod running in the runtime also has a podWorker, and that worker
	// is in SyncPod or TerminatingPod state.
	case ok && workerState == SyncPod, ok && workerState == TerminatingPod:
		// if the pod worker is already in charge of this pod, we don't need to do anything
		continue
	default:
		// If the pod isn't in the set that should be running and isn't already terminating, terminate
		// now. This termination is aggressive because all known pods should already be in a known state
		// (i.e. a removed static pod should already be terminating), so these are pods that were
		// orphaned due to kubelet restart or bugs. Since housekeeping blocks other config changes, we
		// know that another pod wasn't started in the background so we are safe to terminate the
		// unknown pods.
		// A pod running in the runtime but absent from kl.podManager (i.e. gone from
		// the apiserver) gets kl.podWorkers.UpdatePod with UpdateType
		// kubetypes.SyncPodKill, driving the podWorker through terminating --> terminated.
		if _, ok := allPodsByUID[runningPod.ID]; !ok {
			klog.V(3).InfoS("Clean up orphaned pod containers", "podUID", runningPod.ID)
			one := int64(1)
			kl.podWorkers.UpdatePod(UpdatePodOptions{
				UpdateType: kubetypes.SyncPodKill,
				RunningPod: runningPod,
				KillPodOptions: &KillPodOptions{
					PodTerminationGracePeriodSecondsOverride: &one,
				},
			})
		}
	}
}
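The PodTerminationGracePeriodSecondsOverride of 1 set here is exactly the gracePeriod=1 that appears in the "Pod terminating with grace period" and "Killing container with a grace period override" log lines: the re-created podWorker kills a container the first podWorker has already removed, so StopContainer fails with NotFound.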