Recently I ran into something unexpected: when a deployment's replicas was scaled down to 0, pods that had been evicted by the kubelet were not automatically deleted. Normally, while the deployment still has replicas, evicted pods are not deleted — that matched my expectation. But I assumed that scaling replicas to 0 would clean them up, and it did not, which surprised me. This article digs into how evicted pods get removed.
The Kubernetes version used in this article is 1.23.
First, let's reproduce the situation: the deployment's replica count is 0, yet the evicted pod is not deleted.
1. Define a deployment with an ephemeral-storage limit of 200M
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test
  name: test
  namespace: default
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: test
    spec:
      containers:
      - command:
        - tail
        - -f
        - /dev/null
        image: progrium/stress:latest
        imagePullPolicy: Always
        name: stress
        resources:
          limits:
            ephemeral-storage: 200M
```
2. Write a 300M file inside the pod's container and wait for the kubelet to evict the pod
```shell
# kubectl exec -it test-d44fbc464-7t77k bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@test-d44fbc464-7t77k:/# dd if=/dev/zero of=sdad bs=1M count=300
300+0 records in
300+0 records out
314572800 bytes (315 MB) copied, 0.253303 s, 1.2 GB/s
root@test-d44fbc464-7t77k:/# command terminated with exit code 137
```
3. Check the pods
```shell
# kubectl get pod -o wide
test-d44fbc464-7t77k   0/1   Error     0   16h   10.26.124.222   10.11.251.6   <none>   <none>
test-d44fbc464-vk9wz   1/1   Running   0   71s   10.26.124.148   10.11.251.6   <none>   <none>
```
4. Scale the deployment down to 0
```shell
# kubectl scale deployment test --replicas=0
```
5. Check the pods and the deployment
```shell
# kubectl get pod -o wide
test-d44fbc464-7t77k   0/1   Error   0   16h   10.26.124.222   10.11.251.6   <none>   <none>
# kubectl get deployment test -o wide
NAME   READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES                   SELECTOR
test   0/0     0            0           19h   stress       progrium/stress:latest   app=test
```
A pod whose status.phase is Failed and status.reason is Evicted is called an evicted pod. Its IP has already been released, but it still appears in status.podIP, so several pods may seem to share the same IP. Such a pod was evicted by the kubelet, not via the eviction API on the apiserver (as with kubectl drain).
The kubelet evicts a pod in two situations:
- The pod's resource usage exceeds its limits (for example, a container's disk usage exceeds the ephemeral-storage limit).
- The node's remaining resources fall below --eviction-hard or --eviction-soft, in which case the kubelet evicts pods running on that node.
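As an illustration of the first case, here is a minimal, self-contained sketch of the per-pod ephemeral-storage check. This is not the kubelet's real eviction manager (whose types and accounting are far more involved); all names and numbers below are made up. Note that Kubernetes's `200M` means 200*10^6 bytes, so the 300M file written above clearly exceeds it:

```go
package main

import "fmt"

// podUsage is an illustrative stand-in for the kubelet's per-pod
// ephemeral-storage accounting; it is not a real Kubernetes type.
type podUsage struct {
	name       string
	usageBytes int64 // total ephemeral-storage consumed by the pod
	limitBytes int64 // pod-level ephemeral-storage limit (0 = no limit)
}

// shouldEvict models the decision: a pod with a limit whose usage
// exceeds that limit becomes an eviction candidate.
func shouldEvict(p podUsage) bool {
	return p.limitBytes > 0 && p.usageBytes > p.limitBytes
}

func main() {
	pods := []podUsage{
		{name: "test-7t77k", usageBytes: 300 * 1000 * 1000, limitBytes: 200 * 1000 * 1000},
		{name: "test-vk9wz", usageBytes: 10 * 1000 * 1000, limitBytes: 200 * 1000 * 1000},
	}
	for _, p := range pods {
		fmt.Printf("%s evict=%v\n", p.name, shouldEvict(p))
	}
}
```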
The most direct way to remove an evicted pod is, of course, kubectl delete.
To delete all evicted pods in the cluster:
```shell
kubectl get pods --all-namespaces -ojson | jq -r '.items[] | select(.status.reason!=null) | select(.status.reason | contains("Evicted")) | .metadata.name + " " + .metadata.namespace' | xargs -n2 -l bash -c 'kubectl delete pods $0 --namespace=$1'
```
Are there other ways to remove them? Will a rollout update of the deployment remove the evicted pod? Will deleting the replicaset that owns the evicted pod remove it? Let's answer these questions from practice.
Deleting the owning replicaset does work, because kubectl delete's --cascade defaults to "Background": kubectl deletes the replicaset first, and then the generic-garbage-collector in kube-controller-manager removes all pods belonging to the deleted replicaset (i.e. pods whose ownerReference points to it).
```shell
# kubectl delete rs test-d44fbc464
replicaset.apps "test-d44fbc464" deleted
# kubectl get pod -o wide |grep test-d44fbc464
```
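The ownerReference-driven cleanup just demonstrated can be sketched as follows. The types here are made up for illustration, not kube-controller-manager's real ones: the point is only that once an owner is gone, every pod referencing its UID — evicted pods included — is collected.

```go
package main

import "fmt"

// pod is an illustrative stand-in: only the ownerReference UID matters here.
type pod struct {
	name     string
	ownerUID string // UID of the owning ReplicaSet ("" = no owner)
}

// collectOrphans returns pods whose owner UID is no longer among the live
// owners — roughly what the generic-garbage-collector does after a
// background cascading delete of a ReplicaSet.
func collectOrphans(pods []pod, liveOwners map[string]bool) []string {
	var doomed []string
	for _, p := range pods {
		if p.ownerUID != "" && !liveOwners[p.ownerUID] {
			doomed = append(doomed, p.name)
		}
	}
	return doomed
}

func main() {
	pods := []pod{
		{name: "test-d44fbc464-7t77k", ownerUID: "rs-1"}, // the evicted pod
		{name: "other-pod", ownerUID: "rs-2"},
	}
	// ReplicaSet rs-1 has been deleted; rs-2 is still live.
	fmt.Println(collectOrphans(pods, map[string]bool{"rs-2": true}))
}
```

Note that the collector keys on the owner's UID, which is why the evicted pod goes away even though the replicaset never counted it among its replicas.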
A rollout update also works, but it takes between 1 and spec.revisionHistoryLimit rollouts before the replicaset owning the evicted pod is deleted. spec.revisionHistoryLimit controls how many old replicasets are retained: once the number of non-current replicasets under the deployment exceeds this value, the oldest one is deleted. So when the evicted pod's replicaset becomes the oldest, the pod is removed along with it.
```shell
# kubectl get pod -o wide |grep test
test-d44fbc464-5m82k   1/1   Running   0   75s     10.26.125.154   10.11.251.6   <none>   <none>
test-d44fbc464-f5tqz   0/1   Error     0   3m36s   10.26.124.222   10.11.251.6   <none>   <none>
# kubectl rollout restart deployment test
deployment.apps/test restarted
....
# kubectl rollout restart deployment test
deployment.apps/test restarted
# kubectl get rs --sort-by=metadata.creationTimestamp |grep test
test-d44fbc464    0   0   0   44m
test-769597d49    0   0   0   21m
test-758bb4d9dc   0   0   0   20m
test-c4b8b4568    0   0   0   18m
test-567f5bf464   0   0   0   17m
test-5c4566f749   0   0   0   15m
test-7fc4c496c4   0   0   0   12m
test-5786555f     0   0   0   11m
test-6c458f479b   0   0   0   11m
test-5ff4795db6   0   0   0   10m
test-fdcd4585f    1   1   1   97s
# kubectl rollout restart deployment test
deployment.apps/test restarted
# kubectl get rs --sort-by=metadata.creationTimestamp |grep test
test-769597d49    0   0   0   23m
test-758bb4d9dc   0   0   0   22m
test-c4b8b4568    0   0   0   20m
test-567f5bf464   0   0   0   19m
test-5c4566f749   0   0   0   17m
test-7fc4c496c4   0   0   0   14m
test-5786555f     0   0   0   13m
test-6c458f479b   0   0   0   13m
test-5ff4795db6   0   0   0   12m
test-fdcd4585f    0   0   0   3m45s
test-6d657bcc95   1   1   1   63s
# kubectl get pod -o wide -w |grep test-d44fbc464
```
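The pruning observed above can be sketched like this. It is a deliberate simplification of the deployment controller's history cleanup (the real code sorts old replicasets by revision and also skips ones being deleted); all types here are illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

// replicaSet is an illustrative stand-in for an old (non-current) ReplicaSet.
type replicaSet struct {
	name     string
	created  int // creation order; smaller = older
	replicas int
}

// toDelete returns the names of the oldest empty ReplicaSets that exceed
// the revisionHistoryLimit — roughly what the deployment controller prunes
// after each rollout.
func toDelete(oldRSs []replicaSet, limit int) []string {
	sort.Slice(oldRSs, func(i, j int) bool { return oldRSs[i].created < oldRSs[j].created })
	var doomed []string
	excess := len(oldRSs) - limit
	for i := 0; i < excess; i++ {
		if oldRSs[i].replicas == 0 { // only empty ReplicaSets are cleaned up
			doomed = append(doomed, oldRSs[i].name)
		}
	}
	return doomed
}

func main() {
	old := []replicaSet{
		{"test-d44fbc464", 1, 0}, // owns the evicted pod
		{"test-769597d49", 2, 0},
		{"test-758bb4d9dc", 3, 0},
	}
	// With limit 2, only the oldest ReplicaSet is pruned — and its evicted
	// pod is then garbage-collected along with it.
	fmt.Println(toDelete(old, 2))
}
```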
Normally, evicted pods are not deleted until their number exceeds --terminated-pod-gc-threshold (default 12500); only then does the pod-garbage-collector controller in kube-controller-manager delete them. In other words, the pod-garbage-collector controller starts deleting pods whose phase is Failed or Succeeded only once the cluster-wide count of such pods exceeds --terminated-pod-gc-threshold.
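A rough sketch of that threshold logic, with made-up types (the real controller lives in pkg/controller/podgc and sorts terminated pods by creation time before deleting the excess):

```go
package main

import "fmt"

// terminatedPodsToDelete models the pod GC decision: if the number of
// terminated pods (phase Failed or Succeeded) exceeds the threshold,
// the excess is deleted, oldest first. The input is assumed to be
// sorted oldest-first; this is an illustration, not the real controller.
func terminatedPodsToDelete(terminated []string, threshold int) []string {
	if threshold <= 0 || len(terminated) <= threshold {
		return nil // below the threshold: evicted pods simply linger
	}
	return terminated[:len(terminated)-threshold]
}

func main() {
	pods := []string{"old-evicted-1", "old-evicted-2", "newer-evicted"}
	fmt.Println(terminatedPodsToDelete(pods, 2))     // excess of one: oldest goes
	fmt.Println(terminatedPodsToDelete(pods, 12500)) // default threshold: nothing goes
}
```

This is why, with the default of 12500, a handful of evicted pods can sit in the cluster indefinitely.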
Setting the deployment's replicas to 0 merely sets the current replicaset's replicas to 0; the replicaset itself is not deleted, so the evicted pod is not deleted either.
A ReplicaSet's status.availableReplicas does not count pods that are being deleted or whose phase is Failed or Succeeded. An evicted pod's phase is Failed, so it is simply ignored: when a ReplicaSet counts the pods it controls, it excludes deleted pods and pods in phase Failed or Succeeded.
Here filteredPods is the list of pods controlled by the ReplicaSet; controller.FilterActivePods filters out all inactive pods (pods being deleted, or in phase Failed or Succeeded).
pkg/controller/replicaset/replica_set.go
```go
// Ignore inactive pods.
filteredPods := controller.FilterActivePods(allPods)

// NOTE: filteredPods are pointing to objects from cache - if you need to
// modify them, you need to copy it first.
filteredPods, err = rsc.claimPods(ctx, rs, selector, filteredPods)
if err != nil {
	return err
}

var manageReplicasErr error
if rsNeedsSync && rs.DeletionTimestamp == nil {
	manageReplicasErr = rsc.manageReplicas(ctx, filteredPods, rs)
}
```
pkg/controller/controller_utils.go
```go
// FilterActivePods returns pods that have not terminated.
func FilterActivePods(pods []*v1.Pod) []*v1.Pod {
	var result []*v1.Pod
	for _, p := range pods {
		if IsPodActive(p) {
			result = append(result, p)
		} else {
			klog.V(4).Infof("Ignoring inactive pod %v/%v in state %v, deletion time %v",
				p.Namespace, p.Name, p.Status.Phase, p.DeletionTimestamp)
		}
	}
	return result
}

func IsPodActive(p *v1.Pod) bool {
	return v1.PodSucceeded != p.Status.Phase &&
		v1.PodFailed != p.Status.Phase &&
		p.DeletionTimestamp == nil
}
```
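To see the effect of IsPodActive without pulling in the Kubernetes API types, here is a self-contained restatement using plain strings in place of v1.PodPhase. It shows why an evicted pod (phase Failed) is invisible to the ReplicaSet controller — it is never counted, so it is neither replaced nor deleted:

```go
package main

import "fmt"

// pod is a toy stand-in for v1.Pod: just the fields IsPodActive looks at.
type pod struct {
	name    string
	phase   string // "Pending", "Running", "Succeeded", "Failed"
	deleted bool   // models DeletionTimestamp != nil
}

// isPodActive mirrors the logic of IsPodActive above.
func isPodActive(p pod) bool {
	return p.phase != "Succeeded" && p.phase != "Failed" && !p.deleted
}

func main() {
	evicted := pod{name: "test-d44fbc464-7t77k", phase: "Failed"} // status.reason=Evicted
	running := pod{name: "test-d44fbc464-vk9wz", phase: "Running"}
	fmt.Println(isPodActive(evicted), isPodActive(running))
}
```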
To sum up, the ways to remove an evicted pod:
- Delete the pod directly.
- For a deployment, delete the replicaset that owns the pod, or delete the deployment itself (not recommended unless replicas is 0).
- For a deployment, trigger enough rollout updates that the deployment controller deletes the replicaset owning the evicted pod.
- Set kube-controller-manager's --terminated-pod-gc-threshold to a smaller value, so the pod-garbage-collector controller more readily deletes pods in phase Failed or Succeeded.