Why Evicted Pods Are Not Deleted and How to Clean Them Up

I recently ran into something unexpected: after scaling a deployment's replicas down to 0, pods that had been evicted by the kubelet were not deleted automatically. Normally, while a deployment still has replicas, its evicted pods are not deleted, which matched my expectation. But I did not expect them to survive scaling down to 0 as well. This article looks at how evicted pods get removed.

The Kubernetes version used in this article is 1.23.

Let me first reproduce the scenario: the deployment's replicas is 0, yet the evicted pod is not deleted.

1. Define a deployment with an ephemeral-storage limit of 200M

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test
  name: test
  namespace: default
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: test
    spec:
      containers:
      - command:
        - tail
        - -f
        - /dev/null
        image: progrium/stress:latest
        imagePullPolicy: Always
        name: stress
        resources:
          limits:
            ephemeral-storage: 200M
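
Save the manifest to a file and apply it (the filename is my own choice):

# kubectl apply -f test-deployment.yaml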

2. Write a 300M file inside the pod's container and wait for the kubelet to evict the pod

# kubectl exec -it  test-d44fbc464-7t77k  bash 
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@test-d44fbc464-7t77k:/# dd if=/dev/zero of=sdad bs=1M count=300
300+0 records in
300+0 records out
314572800 bytes (315 MB) copied, 0.253303 s, 1.2 GB/s
root@test-d44fbc464-7t77k:/# command terminated with exit code 137
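
The eviction reason can be confirmed with kubectl describe (output abridged and illustrative):

# kubectl describe pod test-d44fbc464-7t77k | grep -E 'Status:|Reason:|Message:'
Status:         Failed
Reason:         Evicted
Message:        Pod ephemeral local storage usage exceeds the total limit of containers 200M.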

3. Check the pods

# kubectl get pod  -o wide
test-d44fbc464-7t77k     0/1     Error     0             16h   10.26.124.222   10.11.251.6   <none>           <none>
test-d44fbc464-vk9wz     1/1     Running   0             71s   10.26.124.148   10.11.251.6   <none>           <none>

4. Scale the deployment down to 0

# kubectl scale deployment test --replicas=0

5. Check the pods and the deployment

# kubectl get pod  -o wide
test-d44fbc464-7t77k     0/1     Error     0             16h   10.26.124.222   10.11.251.6   <none>           <none>
# kubectl get deployment test  -o wide 
NAME   READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                   SELECTOR
test   0/0     0            0           19h   stress       progrium/stress:latest   app=test

A pod whose status.phase is Failed and whose status.reason is Evicted is what we call an evicted pod. Its IP has already been released, yet it still appears in status.podIP, so multiple pods can appear to share the same IP. Such a pod was evicted by the kubelet, not through the eviction API on the apiserver (as kubectl drain does).
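
A quick way to see these fields on the pod above (output illustrative):

# kubectl get pod test-d44fbc464-7t77k -o jsonpath='{.status.phase}/{.status.reason}/{.status.podIP}{"\n"}'
Failed/Evicted/10.26.124.222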

There are two situations in which the kubelet evicts a pod:

  1. The pod's resource usage exceeds its limit (for example, a container's disk usage exceeds the ephemeral-storage limit).
  2. The node's remaining resources fall below the --eviction-hard or --eviction-soft thresholds, so the kubelet evicts pods from the node (see the flag sketch after this list).
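
For reference, these thresholds are configured on the kubelet; a sketch of the relevant flags (the values are illustrative, not recommendations):

# kubelet --eviction-hard=memory.available<100Mi,nodefs.available<10% --eviction-soft=memory.available<300Mi --eviction-soft-grace-period=memory.available=1m30s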

The most direct way to get rid of such a pod is, of course, kubectl delete.

To delete all evicted pods in the cluster:

kubectl get pods --all-namespaces -ojson | jq -r '.items[] | select(.status.reason!=null) | select(.status.reason | contains("Evicted")) | .metadata.name + " " +  .metadata.namespace'  |   xargs -n2 -l bash -c 'kubectl delete pods $0 --namespace=$1'
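
If jq is not available, a coarser alternative is to delete every Failed pod, which is a superset of the evicted pods:

kubectl delete pods --all-namespaces --field-selector=status.phase=Failed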

Are there other ways to remove them, though? Would a rollout update of the deployment remove the evicted pod? Would deleting the replicaset that owns the evicted pod remove it?

With these questions in mind, let's find the answers in practice.

Deleting the replicaset does work, because kubectl delete defaults --cascade to "background": kubectl deletes the replicaset first, and then the generic-garbage-collector in kube-controller-manager removes all pods of the deleted replicaset (the pods whose ownerReference points to that replicaset).

# kubectl delete rs test-d44fbc464 
replicaset.apps "test-d44fbc464" deleted
# kubectl get pod  -o wide |grep test-d44fbc464
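
The deletion cascades because of that default. For contrast (not part of the run above), orphaning the pods instead would leave the evicted pod behind even though the replicaset is gone:

# kubectl delete rs test-d44fbc464 --cascade=orphan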

A rollout update works as well, but it may take anywhere from 1 to spec.revisionHistoryLimit rollout updates before the replicaset that owns the evicted pod is deleted. spec.revisionHistoryLimit determines how many replicasets are retained: once the number of non-current replicasets under the deployment exceeds this value, the oldest replicaset is deleted. So when the replicaset that owns the evicted pod becomes the oldest, the evicted pod is deleted along with it.

# kubectl get pod  -o wide |grep test
test-d44fbc464-5m82k     1/1     Running   0             75s     10.26.125.154   10.11.251.6   <none>           <none>
test-d44fbc464-f5tqz     0/1     Error     0             3m36s   10.26.124.222   10.11.251.6   <none>           <none>
# kubectl rollout restart deployment test 
deployment.apps/test restarted
....
# kubectl rollout restart deployment test 
deployment.apps/test restarted
# kubectl get rs  --sort-by=metadata.creationTimestamp |grep test
test-d44fbc464     0         0         0       44m
test-769597d49     0         0         0       21m
test-758bb4d9dc    0         0         0       20m
test-c4b8b4568     0         0         0       18m
test-567f5bf464    0         0         0       17m
test-5c4566f749    0         0         0       15m
test-7fc4c496c4    0         0         0       12m
test-5786555f      0         0         0       11m
test-6c458f479b    0         0         0       11m
test-5ff4795db6    0         0         0       10m
test-fdcd4585f     1         1         1       97s
# kubectl rollout restart deployment test 
deployment.apps/test restarted
# kubectl get rs  --sort-by=metadata.creationTimestamp |grep test
test-769597d49     0         0         0       23m
test-758bb4d9dc    0         0         0       22m
test-c4b8b4568     0         0         0       20m
test-567f5bf464    0         0         0       19m
test-5c4566f749    0         0         0       17m
test-7fc4c496c4    0         0         0       14m
test-5786555f      0         0         0       13m
test-6c458f479b    0         0         0       13m
test-5ff4795db6    0         0         0       12m
test-fdcd4585f     0         0         0       3m45s
test-6d657bcc95    1         1         1       63s
# kubectl get pod  -o wide  -w |grep test-d44fbc464
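
Instead of restarting by hand each time, a small loop does the same job; a sketch assuming revisionHistoryLimit is 10, as in the manifest above (so at most 11 rollouts are needed):

for i in $(seq 1 11); do
    kubectl rollout restart deployment test
    kubectl rollout status deployment test --timeout=120s
done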

Evicted pods are normally left alone until the number of such pods exceeds --terminated-pod-gc-threshold (default 12500); only then does the pod-garbage-collector controller in kube-controller-manager delete them. In other words, the pod-garbage-collector controller only starts deleting these pods once the cluster holds more pods with phase Failed or Succeeded than --terminated-pod-gc-threshold.
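
On a self-managed control plane the threshold can be lowered by adding a flag to the kube-controller-manager command line (100 here is an illustrative value, not a recommendation):

--terminated-pod-gc-threshold=100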

Setting the deployment's replicas to 0 merely scales the current replicaset down to 0 replicas; the replicaset itself is not deleted, so the evicted pod is not deleted either.

A ReplicaSet does not count pods that are being deleted or whose phase is Failed or Succeeded toward status.availableReplicas. Since an evicted pod's phase is Failed, it is simply ignored: when the ReplicaSet controller tallies the pods it controls, it excludes pods that are being deleted and pods whose phase is Failed or Succeeded.

Here filteredPods is the list of pods the ReplicaSet controls; controller.FilterActivePods filters out all inactive pods (pods being deleted or whose phase is Failed or Succeeded).

pkg/controller/replicaset/replica_set.go

	// Ignore inactive pods.
	filteredPods := controller.FilterActivePods(allPods)

	// NOTE: filteredPods are pointing to objects from cache - if you need to
	// modify them, you need to copy it first.
	filteredPods, err = rsc.claimPods(ctx, rs, selector, filteredPods)
	if err != nil {
		return err
	}

	var manageReplicasErr error
	if rsNeedsSync && rs.DeletionTimestamp == nil {
		manageReplicasErr = rsc.manageReplicas(ctx, filteredPods, rs)
	}

pkg/controller/controller_utils.go

// FilterActivePods returns pods that have not terminated.
func FilterActivePods(pods []*v1.Pod) []*v1.Pod {
	var result []*v1.Pod
	for _, p := range pods {
		if IsPodActive(p) {
			result = append(result, p)
		} else {
			klog.V(4).Infof("Ignoring inactive pod %v/%v in state %v, deletion time %v",
				p.Namespace, p.Name, p.Status.Phase, p.DeletionTimestamp)
		}
	}
	return result
}

func IsPodActive(p *v1.Pod) bool {
	return v1.PodSucceeded != p.Status.Phase &&
		v1.PodFailed != p.Status.Phase &&
		p.DeletionTimestamp == nil
}

To sum up, the ways to delete evicted pods are:

  1. Delete the pod directly.
  2. For a deployment, delete the replicaset that owns the pod, or delete the deployment itself (not recommended unless replicas is 0).
  3. For a deployment, trigger rollout updates repeatedly until the deployment controller deletes the replicaset that owns the evicted pod.
  4. Set kube-controller-manager's --terminated-pod-gc-threshold to a smaller value, so the pod-garbage-collector controller deletes pods with phase Failed or Succeeded more readily.
