Why Evicted Pods Are Not Deleted and How to Clean Them Up
Recently I ran into something unexpected: after scaling a deployment down to 0 replicas, its evicted pods were still not deleted. That evicted pods remain while the deployment still has replicas matches my expectations, but I assumed that scaling to 0 would remove them. This article discusses when and how evicted pods are removed.
The Kubernetes version used in this article is 1.23.
1 Phenomenon
Here, I will reproduce the scenario where the replicas of a deployment are set to 0, but the evicted pods are not deleted.
- Define a deployment with an ephemeral-storage limit of 200M:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test
  name: test
  namespace: default
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: test
    spec:
      containers:
      - command:
        - tail
        - -f
        - /dev/null
        image: progrium/stress:latest
        imagePullPolicy: Always
        name: stress
        resources:
          limits:
            ephemeral-storage: 200M
- Write a 300M file inside the container of the pod and wait for the pod to be evicted:
# kubectl exec -it test-d44fbc464-7t77k bash
root@test-d44fbc464-7t77k:/# dd if=/dev/zero of=sdad bs=1M count=300
300+0 records in
300+0 records out
314572800 bytes (315 MB) copied, 0.253303 s, 1.2 GB/s
root@test-d44fbc464-7t77k:/# command terminated with exit code 137
- Check the pod:
# kubectl get pod -o wide
test-d44fbc464-7t77k 0/1 Error 0 16h 10.26.124.222 10.11.251.6 <none> <none>
test-d44fbc464-vk9wz 1/1 Running 0 71s 10.26.124.148 10.11.251.6 <none> <none>
- Scale the deployment down to 0:
# kubectl scale deployment test --replicas=0
- Check the pod and deployment:
# kubectl get pod -o wide
test-d44fbc464-7t77k 0/1 Error 0 16h 10.26.124.222 10.11.251.6 <none> <none>
# kubectl get deployment test -o wide
NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
test 0/0 0 0 19h stress progrium/stress:latest app=test
2 What is an Evicted Pod?
A pod with status.phase set to Failed and status.reason set to Evicted is referred to as an evicted pod. Its IP has been released, but it still appears in status.podIP, leading to the possibility of multiple pods sharing the same IP. Such pods are evicted by kubelet rather than being evicted through API server actions like kubectl drain.
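The definition above can be written as a small predicate. This is a minimal sketch that uses a simplified, hypothetical podStatus struct instead of the real v1.PodStatus type; only the fields mentioned above are modelled.

```go
package main

import "fmt"

// podStatus is a simplified stand-in for v1.PodStatus (an assumption made
// for illustration); it keeps only the fields discussed in this section.
type podStatus struct {
	Phase  string // e.g. "Running", "Succeeded", "Failed"
	Reason string // e.g. "Evicted"
	PodIP  string // may still be set after the IP has been released
}

// isEvicted reports whether the status describes an evicted pod:
// phase Failed and reason Evicted.
func isEvicted(s podStatus) bool {
	return s.Phase == "Failed" && s.Reason == "Evicted"
}

func main() {
	evicted := podStatus{Phase: "Failed", Reason: "Evicted", PodIP: "10.26.124.222"}
	running := podStatus{Phase: "Running", PodIP: "10.26.124.148"}
	fmt.Println(isEvicted(evicted), isEvicted(running)) // true false
}
```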
3 Reasons for Generating Evicted Pods
There are two situations in which pods are evicted by kubelet:
- The pod exceeds a local storage limit (e.g., the container's disk usage surpasses the ephemeral-storage limit).
- If the remaining resources on the node fall below the values set by --eviction-hard or --eviction-soft, kubelet will evict pods on that node.
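For the first case, the numbers from the reproduction can be checked by hand. The sketch below assumes Kubernetes quantity suffixes, where M means 10^6 bytes and Mi means 2^20 bytes; exceedsLimit is a hypothetical helper for illustration, not kubelet code.

```go
package main

import "fmt"

const (
	megabyte = 1000 * 1000 // "M" suffix in a Kubernetes quantity
	mebibyte = 1024 * 1024 // "Mi" suffix; also the block size dd uses for bs=1M
)

// exceedsLimit is a hypothetical helper: it reports whether the bytes in use
// are over the configured limit, the condition that leads to eviction here.
func exceedsLimit(usedBytes, limitBytes int64) bool {
	return usedBytes > limitBytes
}

func main() {
	limit := int64(200 * megabyte)   // ephemeral-storage: 200M
	written := int64(300 * mebibyte) // dd bs=1M count=300
	fmt.Println(written, limit, exceedsLimit(written, limit))
	// prints: 314572800 200000000 true
}
```

The 314572800 bytes match what dd reported in the reproduction, and they are well above the 200000000-byte limit.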
4 How to Delete Evicted Pods
The most direct method is kubectl delete.
Delete all evicted pods in the cluster:
kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(.status.reason!=null) | select(.status.reason | contains("Evicted")) | .metadata.name + " " + .metadata.namespace' | xargs -n2 bash -c 'kubectl delete pods "$0" --namespace="$1"'
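If jq is not available, the same filter is easy to express in a small program. The following is a sketch only: it parses the JSON that kubectl get pods -o json produces with the standard library, and the podList type and evictedPods function are illustrative names, not part of any Kubernetes client. The sample input is made up for the demonstration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// podList mirrors only the fields of the pod list JSON that are used here.
type podList struct {
	Items []struct {
		Metadata struct {
			Name      string `json:"name"`
			Namespace string `json:"namespace"`
		} `json:"metadata"`
		Status struct {
			Reason string `json:"reason"`
		} `json:"status"`
	} `json:"items"`
}

// evictedPods returns "namespace/name" for every pod whose status.reason
// contains "Evicted", the same test the jq filter applies.
func evictedPods(data []byte) ([]string, error) {
	var list podList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	var out []string
	for _, p := range list.Items {
		if strings.Contains(p.Status.Reason, "Evicted") {
			out = append(out, p.Metadata.Namespace+"/"+p.Metadata.Name)
		}
	}
	return out, nil
}

func main() {
	// Hypothetical sample input; in practice this would be the output of
	// `kubectl get pods --all-namespaces -o json`.
	sample := []byte(`{"items":[
	  {"metadata":{"name":"pod-a","namespace":"default"},
	   "status":{"phase":"Failed","reason":"Evicted"}},
	  {"metadata":{"name":"pod-b","namespace":"default"},
	   "status":{"phase":"Running"}}]}`)
	names, err := evictedPods(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // [default/pod-a]
}
```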
Are there any other ways to delete evicted pods? Does rolling out new revisions of a deployment remove them? Does deleting the replicaset that owns the evicted pods remove them?
With these questions in mind, let’s find answers through practical experiments.
4.1 Does Deleting the Replicaset Corresponding to Evicted Pods Remove Evicted Pods?
Yes. The --cascade option of kubectl delete defaults to background, which means kubectl deletes only the replicaset itself. The generic garbage collector in kube-controller-manager then removes all pods whose ownerReferences point to the deleted replicaset.
# kubectl delete rs test-d44fbc464
replicaset.apps "test-d44fbc464" deleted
# kubectl get pod -o wide |grep test-d44fbc464
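The garbage collector's decision does not depend on the pod's phase, only on whether its owners still exist. The sketch below models that rule with a simplified, hypothetical object type; it is not the controller's actual code, which works on a dependency graph built from metadata.ownerReferences.

```go
package main

import "fmt"

// object is a simplified stand-in for a Kubernetes object (an assumption):
// it has a UID and the UIDs listed in its metadata.ownerReferences.
type object struct {
	Name      string
	UID       string
	OwnerUIDs []string
}

// danglingDependents returns the names of objects that have owners, none of
// which still exist. existing is the set of UIDs still present in the cluster.
func danglingDependents(objs []object, existing map[string]bool) []string {
	var doomed []string
	for _, o := range objs {
		if len(o.OwnerUIDs) == 0 {
			continue // objects without owners are not collected this way
		}
		hasLiveOwner := false
		for _, uid := range o.OwnerUIDs {
			if existing[uid] {
				hasLiveOwner = true
				break
			}
		}
		if !hasLiveOwner {
			doomed = append(doomed, o.Name)
		}
	}
	return doomed
}

func main() {
	pods := []object{
		{Name: "evicted-pod", UID: "p1", OwnerUIDs: []string{"rs-uid"}},
		{Name: "standalone-pod", UID: "p2"},
	}
	// The replicaset with UID "rs-uid" has been deleted, so it is not in the set.
	fmt.Println(danglingDependents(pods, map[string]bool{})) // [evicted-pod]
}
```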
4.2 Does Executing a Rollout Update on a Deployment Remove Evicted Pods?
Yes, but you need to trigger a rollout (for example with kubectl rollout restart) anywhere from 1 to spec.revisionHistoryLimit + 1 times, until the replicaset that owns the evicted pod is deleted. In the run below the evicted pod belongs to the current replicaset and the limit is 10, so it takes 11 rollouts. spec.revisionHistoryLimit determines how many old replicasets are retained. When the number of replicasets under a deployment, excluding the current version, exceeds this limit, the oldest replicaset is deleted. Therefore, once the replicaset that owns the evicted pod is the oldest and the limit is exceeded, the evicted pod is deleted along with the replicaset.
# kubectl get pod -o wide |grep test
test-d44fbc464-5m82k 1/1 Running 0 75s 10.26.125.154 10.11.251.6 <none> <none>
test-d44fbc464-f5tqz 0/1 Error 0 3m36s 10.26.124.222 10.11.251.6 <none> <none>
# kubectl rollout restart deployment test
deployment.apps/test restarted
....
# kubectl rollout restart deployment test
deployment.apps/test restarted
# kubectl get rs --sort-by=metadata.creationTimestamp |grep test
test-d44fbc464 0 0 0 44m
test-769597d49 0 0 0 21m
test-758bb4d9dc 0 0 0 20m
test-c4b8b4568 0 0 0 18m
test-567f5bf464 0 0 0 17m
test-5c4566f749 0 0 0 15m
test-7fc4c496c4 0 0 0 12m
test-5786555f 0 0 0 11m
test-6c458f479b 0 0 0 11m
test-5ff4795db6 0 0 0 10m
test-fdcd4585f 1 1 1 97s
# kubectl rollout restart deployment test
deployment.apps/test restarted
# kubectl get rs --sort-by=metadata.creationTimestamp |grep test
test-769597d49 0 0 0 23m
test-758bb4d9dc 0 0 0 22m
test-c4b8b4568 0 0 0 20m
test-567f5bf464 0 0 0 19m
test-5c4566f749 0 0 0 17m
test-7fc4c496c4 0 0 0 14m
test-5786555f 0 0 0 13m
test-6c458f479b 0 0 0 13m
test-5ff4795db6 0 0 0 12m
test-fdcd4585f 0 0 0 3m45s
test-6d657bcc95 1 1 1 63s
# kubectl get pod -o wide -w |grep test-d44fbc464
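The cleanup rule can be sketched as follows. This is a simplified model under stated assumptions, not the deployment controller's code: replicaSet is a hypothetical struct, Revision stands for the deployment.kubernetes.io/revision annotation, and the replicaset names are taken from the listing above.

```go
package main

import (
	"fmt"
	"sort"
)

// replicaSet is a simplified stand-in for an old (non-current) replicaset.
type replicaSet struct {
	Name     string
	Revision int // assumed to mirror the deployment.kubernetes.io/revision annotation
	Replicas int
}

// cleanupOldReplicaSets returns the names of the old replicasets that would be
// deleted: the oldest ones beyond limit, skipping any that still have replicas.
func cleanupOldReplicaSets(old []replicaSet, limit int) []string {
	diff := len(old) - limit
	if diff <= 0 {
		return nil
	}
	sorted := append([]replicaSet(nil), old...) // do not reorder the caller's slice
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Revision < sorted[j].Revision })
	var deleted []string
	for _, rs := range sorted[:diff] {
		if rs.Replicas != 0 {
			continue
		}
		deleted = append(deleted, rs.Name)
	}
	return deleted
}

func main() {
	names := []string{"test-d44fbc464", "test-769597d49", "test-758bb4d9dc",
		"test-c4b8b4568", "test-567f5bf464", "test-5c4566f749", "test-7fc4c496c4",
		"test-5786555f", "test-6c458f479b", "test-5ff4795db6", "test-fdcd4585f"}
	var old []replicaSet
	for i, n := range names {
		old = append(old, replicaSet{Name: n, Revision: i + 1})
	}
	fmt.Println(cleanupOldReplicaSets(old[:10], 10)) // []
	fmt.Println(cleanupOldReplicaSets(old, 10))      // [test-d44fbc464]
}
```

With 10 old replicasets and a limit of 10 nothing is removed; the eleventh old replicaset pushes test-d44fbc464 out, which matches the two listings above.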
5 Why Aren’t Evicted Pods Deleted?
Evicted pods are generally not deleted immediately. The pod-garbage-collector controller in kube-controller-manager deletes terminated pods only when the number of pods in the cluster with the phase Failed or Succeeded exceeds --terminated-pod-gc-threshold (default value is 12500). Until that threshold is crossed, evicted pods persist.
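The threshold behaviour can be modelled in a few lines. This is a sketch under assumptions rather than the controller's implementation: the pod struct is hypothetical, and the sketch assumes that the oldest terminated pods are removed first and that a threshold of 0 or less disables this collection.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a simplified stand-in for v1.Pod with only the fields used here.
type pod struct {
	Name    string
	Phase   string
	Created int64 // creation time as a Unix timestamp
}

// gcTerminated returns the names of the pods that would be deleted: when more
// than threshold pods are Failed or Succeeded, the oldest excess ones go.
func gcTerminated(pods []pod, threshold int) []string {
	if threshold <= 0 {
		return nil // assumed: a non-positive threshold disables this collection
	}
	var terminated []pod
	for _, p := range pods {
		if p.Phase == "Failed" || p.Phase == "Succeeded" {
			terminated = append(terminated, p)
		}
	}
	excess := len(terminated) - threshold
	if excess <= 0 {
		return nil
	}
	sort.Slice(terminated, func(i, j int) bool { return terminated[i].Created < terminated[j].Created })
	var deleted []string
	for _, p := range terminated[:excess] {
		deleted = append(deleted, p.Name)
	}
	return deleted
}

func main() {
	pods := []pod{
		{Name: "evicted-old", Phase: "Failed", Created: 100},
		{Name: "job-done", Phase: "Succeeded", Created: 200},
		{Name: "evicted-new", Phase: "Failed", Created: 300},
		{Name: "serving", Phase: "Running", Created: 50},
	}
	fmt.Println(gcTerminated(pods, 12500)) // []
	fmt.Println(gcTerminated(pods, 2))     // [evicted-old]
}
```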
6 Why Doesn't Scaling a Deployment to 0 Delete Evicted Pods?
Setting the replicas of a deployment to 0 merely adjusts the replicas of the current version of the replicaset to 0 without deleting the replicaset. Therefore, evicted pods are not deleted in this scenario.
6.1 Why Doesn't Setting Replicas to 0 on a ReplicaSet Delete Evicted Pods?
The status.availableReplicas of a ReplicaSet does not include pods that are being deleted (deletionTimestamp is set) or pods with a phase of Failed or Succeeded. Since the phase of evicted pods is Failed, they are ignored. In other words, a ReplicaSet counts and manages only the active pods it controls, so scaling it to 0 deletes the active pods and never touches the evicted ones.
Here, filteredPods represents the list of pods controlled by the ReplicaSet. The controller.FilterActivePods function filters out all inactive pods (being deleted or with a phase of Failed or Succeeded).
pkg/controller/replicaset/replica_set.go
	// Ignore inactive pods.
	filteredPods := controller.FilterActivePods(allPods)
	// NOTE: filteredPods are pointing to objects from cache - if you need to
	// modify them, you need to copy it first.
	filteredPods, err = rsc.claimPods(ctx, rs, selector, filteredPods)
	if err != nil {
		return err
	}
	var manageReplicasErr error
	if rsNeedsSync && rs.DeletionTimestamp == nil {
		manageReplicasErr = rsc.manageReplicas(ctx, filteredPods, rs)
	}
pkg/controller/controller_utils.go
// FilterActivePods returns pods that have not terminated.
func FilterActivePods(pods []*v1.Pod) []*v1.Pod {
	var result []*v1.Pod
	for _, p := range pods {
		if IsPodActive(p) {
			result = append(result, p)
		} else {
			klog.V(4).Infof("Ignoring inactive pod %v/%v in state %v, deletion time %v",
				p.Namespace, p.Name, p.Status.Phase, p.DeletionTimestamp)
		}
	}
	return result
}
func IsPodActive(p *v1.Pod) bool {
	return v1.PodSucceeded != p.Status.Phase &&
		v1.PodFailed != p.Status.Phase &&
		p.DeletionTimestamp == nil
}
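To see why the evicted pod survives a scale-down, it helps to trace the replica arithmetic on the two pods from the reproduction. The sketch below is a standalone model with a hypothetical pod struct and a replicaDiff helper; it assumes the controller compares the number of active pods with the desired replica count, deleting pods when the difference is positive and creating pods when it is negative.

```go
package main

import "fmt"

// pod is a simplified stand-in for v1.Pod with only the fields used here.
type pod struct {
	Name     string
	Phase    string
	Deleting bool // true when metadata.deletionTimestamp is set
}

// isActive mirrors the IsPodActive check quoted above on the simplified type.
func isActive(p pod) bool {
	return p.Phase != "Succeeded" && p.Phase != "Failed" && !p.Deleting
}

// replicaDiff returns active pods minus desired replicas. A positive value is
// the number of active pods to delete; a negative value means pods are missing.
func replicaDiff(pods []pod, desired int) int {
	active := 0
	for _, p := range pods {
		if isActive(p) {
			active++
		}
	}
	return active - desired
}

func main() {
	pods := []pod{
		{Name: "test-d44fbc464-7t77k", Phase: "Failed"},  // evicted
		{Name: "test-d44fbc464-vk9wz", Phase: "Running"}, // replacement
	}
	fmt.Println(replicaDiff(pods, 1))     // 0: one active pod, one desired
	fmt.Println(replicaDiff(pods, 0))     // 1: only the running pod is deleted
	fmt.Println(replicaDiff(pods[:1], 0)) // 0: the evicted pod is never counted
}
```

After the running pod is gone, the difference stays at 0 with replicas set to 0, so the controller has no reason to delete anything else.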
7 Summary
Methods to delete evicted pods:
- Directly delete the evicted pod.
- For a deployment, you can delete the replicaset corresponding to the pod or directly delete the deployment (not recommended unless replicas are set to 0).
- For a deployment, trigger multiple rollouts (for example with kubectl rollout restart) to allow the deployment controller to delete the replicaset corresponding to the evicted pod.
- Set --terminated-pod-gc-threshold in kube-controller-manager to a smaller value to more easily trigger the pod-garbage-collector controller to delete pods with a phase of Failed or Succeeded.