Why Evicted Pods Are Not Deleted and How to Clean Them Up

Recently, I ran into an unexpected behavior: after scaling a deployment's replicas down to 0, its evicted pods were still not deleted. While a deployment still has replicas, evicted pods are kept around, which matches my expectations; I had assumed, however, that scaling to 0 would clean them up. This article explores how evicted pods are removed and why this happens.

The Kubernetes version used in this article is 1.23.

Here, I will reproduce the scenario where the replicas of a deployment are set to 0, but the evicted pods are not deleted.

  1. Define a deployment with an ephemeral-storage limit of 200M:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test
  name: test
  namespace: default
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: test
    spec:
      containers:
      - command:
        - tail
        - -f
        - /dev/null
        image: progrium/stress:latest
        imagePullPolicy: Always
        name: stress
        resources:
          limits:
            ephemeral-storage: 200M
  2. Write a 300M file inside the container of the pod and wait for the pod to be evicted:
# kubectl exec -it test-d44fbc464-7t77k bash
root@test-d44fbc464-7t77k:/# dd if=/dev/zero of=sdad bs=1M count=300
300+0 records in
300+0 records out
314572800 bytes (315 MB) copied, 0.253303 s, 1.2 GB/s
root@test-d44fbc464-7t77k:/# command terminated with exit code 137
  3. Check the pods:
# kubectl get pod -o wide
test-d44fbc464-7t77k     0/1     Error     0             16h   10.26.124.222   10.11.251.6   <none>           <none>
test-d44fbc464-vk9wz     1/1     Running   0             71s   10.26.124.148   10.11.251.6   <none>           <none>
  4. Scale the deployment down to 0:
# kubectl scale deployment test --replicas=0
  5. Check the pod and deployment:
# kubectl get pod -o wide
test-d44fbc464-7t77k     0/1     Error     0             16h   10.26.124.222   10.11.251.6   <none>           <none>
# kubectl get deployment test -o wide
NAME   READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                   SELECTOR
test   0/0     0            0           19h   stress       progrium/stress:latest   app=test

A pod whose status.phase is Failed and status.reason is Evicted is referred to as an evicted pod. Its IP has already been released and can be reassigned, yet it still appears in status.podIP, so multiple pods may seem to share the same IP. Such pods are evicted by the kubelet itself, not through API-server-initiated eviction such as kubectl drain.
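To confirm that a pod was evicted rather than failed for some other reason, you can read these fields directly; the pod name below is taken from the demo above:

kubectl get pod test-d44fbc464-7t77k -o jsonpath='{.status.phase} {.status.reason} {.status.podIP}{"\n"}'
# for an evicted pod this should print something like: Failed Evicted 10.26.124.222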

There are two situations in which the kubelet evicts pods:

  1. A pod exceeds its configured resource limits (e.g., a container's disk usage surpasses its ephemeral-storage limit), which is the case reproduced above.
  2. The remaining resources on the node fall below the thresholds set by --eviction-hard or --eviction-soft, in which case the kubelet evicts pods on that node to reclaim resources (see the sketch below).
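As a rough illustration of the second case, eviction thresholds can be passed to the kubelet as flags; the values here are arbitrary examples, not tuned recommendations:

# hard thresholds evict immediately; soft thresholds wait out a grace period first
kubelet \
  --eviction-hard='memory.available<200Mi,nodefs.available<10%' \
  --eviction-soft='memory.available<500Mi' \
  --eviction-soft-grace-period='memory.available=1m30s'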

How can evicted pods be cleaned up? The most direct method is to remove them with kubectl delete.

To delete all evicted pods in the cluster:

kubectl get pods --all-namespaces -ojson | jq -r '.items[] | select(.status.reason!=null) | select(.status.reason | contains("Evicted")) | .metadata.name + " " +  .metadata.namespace'  |   xargs -n2 -l bash -c 'kubectl delete pods $0 --namespace=$1'
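Alternatively, if deleting every Failed pod (not just evicted ones) is acceptable in your cluster, a field selector avoids the jq dependency:

# broader than Evicted-only: this removes ALL pods with phase Failed
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed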

Are there other ways to delete evicted pods? Does deleting the ReplicaSet that owns an evicted pod remove it? Does executing a rollout update on the Deployment remove it?

With these questions in mind, let’s find answers through practical experiments.

Deleting the ReplicaSet does remove them. The --cascade option of kubectl delete defaults to background, meaning kubectl deletes only the ReplicaSet; the generic garbage collector in kube-controller-manager then removes all pods whose ownerReference points to the deleted ReplicaSet.
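Before deleting, you can inspect the ownership chain the garbage collector will follow; the pod name is again taken from the demo above:

# print the owner of the evicted pod; expect ReplicaSet/test-d44fbc464
kubectl get pod test-d44fbc464-7t77k -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}{"\n"}'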

# kubectl delete rs test-d44fbc464 
replicaset.apps "test-d44fbc464" deleted
# kubectl get pod  -o wide |grep test-d44fbc464

Rollout updates also remove them, but you may need to execute a rollout update anywhere from one up to spec.revisionHistoryLimit + 1 times, until the ReplicaSet that owns the evicted pod is deleted. spec.revisionHistoryLimit determines how many old ReplicaSets are retained: when the number of ReplicaSets under a Deployment, excluding the current version, exceeds this limit, the oldest ReplicaSet is deleted. Therefore, once the ReplicaSet owning the evicted pod becomes the oldest and falls outside the history window, the evicted pod is deleted along with it. In the demo below, with revisionHistoryLimit set to 10, it takes eleven restarts.

# kubectl get pod  -o wide |grep test
test-d44fbc464-5m82k     1/1     Running   0             75s     10.26.125.154   10.11.251.6   <none>           <none>
test-d44fbc464-f5tqz     0/1     Error     0             3m36s   10.26.124.222   10.11.251.6   <none>           <none>
# kubectl rollout restart deployment test 
deployment.apps/test restarted
....
# kubectl rollout restart deployment test 
deployment.apps/test restarted
# kubectl get rs  --sort-by=metadata.creationTimestamp |grep test
test-d44fbc464     0         0         0       44m
test-769597d49     0         0         0       21m
test-758bb4d9dc    0         0         0       20m
test-c4b8b4568     0         0         0       18m
test-567f5bf464    0         0         0       17m
test-5c4566f749    0         0         0       15m
test-7fc4c496c4    0         0         0       12m
test-5786555f      0         0         0       11m
test-6c458f479b    0         0         0       11m
test-5ff4795db6    0         0         0       10m
test-fdcd4585f     1         1         1       97s
# kubectl rollout restart deployment test 
deployment.apps/test restarted
# kubectl get rs  --sort-by=metadata.creationTimestamp |grep test
test-769597d49     0         0         0       23m
test-758bb4d9dc    0         0         0       22m
test-c4b8b4568     0         0         0       20m
test-567f5bf464    0         0         0       19m
test-5c4566f749    0         0         0       17m
test-7fc4c496c4    0         0         0       14m
test-5786555f      0         0         0       13m
test-6c458f479b    0         0         0       13m
test-5ff4795db6    0         0         0       12m
test-fdcd4585f     0         0         0       3m45s
test-6d657bcc95    1         1         1       63s
# kubectl get pod  -o wide  -w |grep test-d44fbc464
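If you want to script this rather than restarting by hand, a simple loop works; it assumes revisionHistoryLimit is 10 as in the demo Deployment, so limit + 1 restarts suffice in the worst case:

# run revisionHistoryLimit + 1 restarts, waiting for each rollout to complete
for i in $(seq 1 11); do
  kubectl rollout restart deployment test
  kubectl rollout status deployment test
done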

Evicted pods are generally not deleted immediately. They linger until the number of terminated pods in the cluster exceeds --terminated-pod-gc-threshold (default 12500); only then does the pod-garbage-collector controller in kube-controller-manager delete them. In other words, the pod garbage collector performs deletions only once the number of pods with phase Failed or Succeeded surpasses --terminated-pod-gc-threshold.
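How this flag is set depends on your installation. On kubeadm clusters, for example, it can be added to the kube-controller-manager static pod manifest (the path and label below are kubeadm defaults, an assumption for other setups):

# in /etc/kubernetes/manifests/kube-controller-manager.yaml, add to the command list:
#   - --terminated-pod-gc-threshold=100
# the kubelet recreates the static pod automatically; verify afterwards with:
kubectl -n kube-system get pod -l component=kube-controller-manager -o yaml | grep terminated-pod-gc-threshold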

Setting a Deployment's replicas to 0 merely scales the current ReplicaSet down to 0; the ReplicaSet itself is not deleted, and neither are the evicted pods it owns. This explains the behavior observed at the start of this article.
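You can confirm this after the scale-down: the ReplicaSet from the demo survives with zero desired replicas, while the evicted pod it owns is still present:

kubectl get rs test-d44fbc464          # DESIRED/CURRENT/READY are all 0
kubectl get pod test-d44fbc464-7t77k   # still listed, in status Error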

Why does the ReplicaSet controller ignore evicted pods? When counting the pods it controls (including for status fields such as status.availableReplicas), a ReplicaSet considers only active pods: deleted pods and pods with a phase of Failed or Succeeded are excluded. Since an evicted pod's phase is Failed, it is invisible to the controller, so scaling to 0 leaves it untouched.

Here, filteredPods represents the list of pods controlled by the ReplicaSet. The controller.FilterActivePods function filters out all inactive pods (deleted or with a phase of Failed or Succeeded).

pkg/controller/replicaset/replica_set.go

	// Ignore inactive pods.
	filteredPods := controller.FilterActivePods(allPods)

	// NOTE: filteredPods are pointing to objects from cache - if you need to
	// modify them, you need to copy it first.
	filteredPods, err = rsc.claimPods(ctx, rs, selector, filteredPods)
	if err != nil {
		return err
	}

	var manageReplicasErr error
	if rsNeedsSync && rs.DeletionTimestamp == nil {
		manageReplicasErr = rsc.manageReplicas(ctx, filteredPods, rs)
	}

pkg/controller/controller_utils.go

// FilterActivePods returns pods that have not terminated.
func FilterActivePods(pods []*v1.Pod) []*v1.Pod {
	var result []*v1.Pod
	for _, p := range pods {
		if IsPodActive(p) {
			result = append(result, p)
		} else {
			klog.V(4).Infof("Ignoring inactive pod %v/%v in state %v, deletion time %v",
				p.Namespace, p.Name, p.Status.Phase, p.DeletionTimestamp)
		}
	}
	return result
}

func IsPodActive(p *v1.Pod) bool {
	return v1.PodSucceeded != p.Status.Phase &&
		v1.PodFailed != p.Status.Phase &&
		p.DeletionTimestamp == nil
}

To summarize, there are several ways to delete evicted pods:

  1. Delete the evicted pod directly.
  2. For a Deployment, delete the ReplicaSet that owns the evicted pod, or delete the Deployment itself (not recommended unless replicas are already 0, since it also removes running pods).
  3. For a Deployment, trigger enough rollout updates that the Deployment controller deletes the ReplicaSet owning the evicted pod.
  4. Set --terminated-pod-gc-threshold on kube-controller-manager to a smaller value, so the pod-garbage-collector controller deletes pods with a phase of Failed or Succeeded sooner.
