Files mounted via subPath are empty inside the container: a bug?

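I recently ran into a very strange problem: in a newly built cluster, a file mounted via subPath was empty inside the container. The manifest syntax was correct and the usage was correct, and this was not the known bug where a subPath that does not exist in the ConfigMap gets mounted as an empty file (issues/54514). Moreover, when the ConfigMap was mounted directly, without subPath, the content of that key was readable inside the container. It looked like I had hit a bug!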
1 Symptom
In the example below, the ConfigMap key config.conf is mounted via subPath, but the mounted file /etc/kubernetes/config.conf is empty inside the container.
# test.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: test
  labels:
    app: test
data:
  config.conf: |-
    test-data
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: test
  name: test
spec:
  selector:
    matchLabels:
      k8s-app: test
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        k8s-app: test
    spec:
      containers:
      - name: test
        image: busybox
        imagePullPolicy: IfNotPresent
        command:
        - tail
        - -f
        - /dev/null
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /etc/kubernetes/config.conf
          name: test-config
          subPath: config.conf
          readOnly: true
      volumes:
      - name: test-config
        configMap:
          name: test
# kubectl apply -f test.yaml
# kubectl exec -it -n kube-system test-26qvd -- sh
/ # ls -li /etc/kubernetes/
total 4
3283365 -rw-r--r-- 1 root root 0 Jun 4 13:29 config.conf
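1.1 Investigation
Because this was a freshly built cluster running Kubernetes 1.33.1, I suspected in turn a Kubernetes bug, a kernel bug, and a containerd bug.
I tried the following:
- Re-read the subPath documentation and searched the Kubernetes issues. Result: the usage was correct and there was no matching issue (had I hit an undiscovered bug?).
- Installed the same Kubernetes version with kind: subPath mounts worked fine there. (That rules out a version-specific bug. A configuration problem, then?)
- Switched the cluster to the kubelet configuration used in kind: the problem persisted. (kind and the new cluster run different kernel versions. A kernel problem, then?)
- Changed the kernel version: the problem persisted.
- Read kubelet's subPath code, added debug logging to the subPath mount path, and set the log verbosity to -v=5. kubelet really did perform the bind mount, and the mount still existed when doBindSubPath returned.
- Replaced runc with the same version kind uses: the problem persisted.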
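2 How subPath mounting works
When kubelet starts a pod's container, it first prepares the mounts the container needs (ConfigMap, Secret, and projected volumes, /etc/resolv.conf, /etc/hosts, subPath mounts, and so on). It then sends a create-container request to the runtime over CRI. The request carries the host path and the container path of each subPath mount, and the runtime mounts the host path into the container.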
Two host paths are involved:
- The directory where kubelet stores Secret and ConfigMap data on the host: {kubelet root}/pods/{pod uid}/volumes/kubernetes.io~{configmap or secret}/{volume name}
- The dedicated host file for each subPath: {kubelet root}/pods/{pod uid}/volume-subpaths/{volume name}/{container name}/{index} (index is the position of the subPath entry in the container's volumeMounts, starting from 0)
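To make the two path templates above concrete, here is a small Python sketch that fills them in. The function names volume_dir and subpath_bind_target are hypothetical helpers invented for illustration; only the path layout itself comes from this post.

```python
from pathlib import PurePosixPath


def volume_dir(kubelet_root, pod_uid, plugin, volume_name):
    """Host directory holding the ConfigMap or Secret data.

    plugin is "configmap" or "secret", as in the template above.
    """
    return (PurePosixPath(kubelet_root) / "pods" / pod_uid / "volumes"
            / f"kubernetes.io~{plugin}" / volume_name)


def subpath_bind_target(kubelet_root, pod_uid, volume_name, container_name, index):
    """Host file that the subPath content is bind-mounted onto.

    index is the position of the subPath entry in volumeMounts, from 0.
    """
    return (PurePosixPath(kubelet_root) / "pods" / pod_uid / "volume-subpaths"
            / volume_name / container_name / str(index))
```

With the values from this article (kubelet root /data/kubernetes/kubelet, volume test-config, container test, index 0), subpath_bind_target yields the .../volume-subpaths/test-config/test/0 path that shows up in the logs later.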
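kubelet first writes the Secret or ConfigMap content into the host directory (the key ends up at ..{time}/{subpath} inside the volume directory), then creates an empty file for the subPath, and finally bind-mounts the content onto that empty file.
The bind mount is done in a somewhat unusual way: kubelet bind-mounts the fd (file descriptor) of the subPath content onto the empty file.
If you are interested, read doBindSubPath in the Kubernetes 1.33 source.
The steps are:
- Open the subPath file under the Secret or ConfigMap directory and obtain an fd
- Bind-mount with this fd as the source and the empty subPath file as the target
- Run a remount on the same target
- Close the fd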
In this example:
- Save the ConfigMap's config.conf data to /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volumes/kubernetes.io~configmap/test-config/..2025_06_04_13_29_02.3910074484/config.conf
- Create the empty file /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volume-subpaths/test-config/test/0
- Open /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volumes/kubernetes.io~configmap/test-config/..2025_06_04_13_29_02.3910074484/config.conf and obtain its fd
- Bind-mount this fd onto /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volume-subpaths/test-config/test/0
- Run the remount
- Close the fd
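The open / bind / remount / close sequence can be sketched as a dry run in Python. This is only an illustration under stated assumptions, not kubelet's code: plan_subpath_bind is a hypothetical helper, it opens the source with O_RDONLY (the flags kubelet actually uses may differ), and it only builds the two mount command lines rather than running them, so it needs no privileges.

```python
import os


def plan_subpath_bind(source, target):
    """Open source and return (fd, commands) for an fd-based bind mount.

    Nothing is mounted: the commands are returned for inspection only.
    The caller is responsible for closing the returned fd.
    """
    # An open fd keeps referring to the same file even if the path is
    # replaced afterwards, which is why the mount source is the fd.
    fd = os.open(source, os.O_RDONLY)  # assumption: kubelet's real flags may differ
    fd_path = f"/proc/{os.getpid()}/fd/{fd}"
    commands = [
        # Step 2: bind the fd onto the empty placeholder file.
        ["mount", "--no-canonicalize", "-o", "bind", fd_path, target],
        # Step 3: remount the same target.
        ["mount", "--no-canonicalize", "-o", "bind,remount", fd_path, target],
    ]
    return fd, commands
```

The caller would run the two commands as root and then close the fd (step 4). Mounts created this way land in the mount namespace of the calling process, which turns out to matter for the problem analysed in this article.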
2.1 Case analysis
The kubelet log shows that the bind mount for the subPath was performed:
Jun 04 13:29:02 kube-master-01 kubelet[20136]: I0604 13:29:02.913922 20136 subpath_linux.go:232] bind mounting "/proc/20136/fd/19" at "/data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volume-subpaths/test-config/test/0"
Jun 04 13:29:02 kube-master-01 kubelet[20136]: I0604 13:29:02.913966 20136 mount_linux.go:260] Mounting cmd (mount) with arguments (--no-canonicalize -o bind /proc/20136/fd/19 /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volume-subpaths/test-config/test/0)
Jun 04 13:29:02 kube-master-01 kubelet[20136]: I0604 13:29:02.917275 20136 mount_linux.go:260] Mounting cmd (mount) with arguments (--no-canonicalize -o bind,remount,noatime /proc/20136/fd/19 /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volume-subpaths/test-config/test/0)
Jun 04 13:29:02 kube-master-01 kubelet[20136]: I0604 13:29:02.920550 20136 subpath_linux.go:238] Bound SubPath /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volumes/kubernetes.io~configmap/test-config/..2025_06_04_13_29_02.3910074484/config.conf into /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volume-subpaths/test-config/test/0
Jun 04 13:29:02 kube-master-01 kubelet[20136]: I0604 13:29:02.921777 20136 subpath_linux.go:203] subpath "/data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volumes/kubernetes.io~configmap/test-config/..2025_06_04_13_29_02.3910074484/config.conf" is still mounted, mounts: [/data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volume-subpaths/test-config/test/0]
Jun 04 13:29:02 kube-master-01 kubelet[20136]: I0604 13:29:02.921855 20136 kubelet_pods.go:382] "Mount has propagation" pod="kube-system/test" containerName="test" volumeMountName="test-config" propagation="PROPAGATION_PRIVATE"
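Using audit to watch the mount and umount system calls
The audit records below cover two runs of the pod: pod UID 88a4f202-... at 13:00 and pod UID 98bc1caf-... at 13:29. Both show the same pattern.
Here /proc/20136/fd/19 is the fd of the file /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volumes/kubernetes.io~configmap/test-config/..2025_06_04_13_29_02.3910074484/config.conf, whose inode is 3283288.
The original /data/kubernetes/kubelet/pods/88a4f202-30e8-43d1-85ab-814a9369e8f9/volume-subpaths/test-config/test/0 has inode 3283365. After the bind mount, the same path resolves to inode 3283288.
But when runc mounts /data/kubernetes/kubelet/pods/88a4f202-30e8-43d1-85ab-814a9369e8f9/volume-subpaths/test-config/test/0 into the container, the inode it sees is 3283365 again (the inode of the empty file before the bind mount).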
# Watch the mount and umount2 system calls
# auditctl -a always,exit -F arch=b64 -S mount,umount2 -F key=mount_operations
# Query the recorded mount operations
# ausearch -k mount_operations | less
# This record is the bind mount
time->Wed Jun 4 13:00:29 2025
type=PROCTITLE msg=audit(1749042029.732:830): proctitle=6D6F756E74002D2D6E6F2D63616E6F6E6963616C697A65002D6F0062696E64002F70726F632F32303133362F66642F3139002F646174612F6B756265726E657465732F6B7562656C65742F706F64732F38386134663230322D333065382D343364312D383561622D3831346139333639653866392F766F6C756D652D73756270
type=PATH msg=audit(1749042029.732:830): item=1 name="/proc/20136/fd/19" inode=3283288 dev=08:11 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1749042029.732:830): item=0 name="/data/kubernetes/kubelet/pods/88a4f202-30e8-43d1-85ab-814a9369e8f9/volume-subpaths/test-config/test/0" inode=3283365 dev=08:11 mode=0100640 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1749042029.732:830): cwd="/"
type=SYSCALL msg=audit(1749042029.732:830): arch=c000003e syscall=165 success=yes exit=0 a0=55b2728bb990 a1=55b2728bb9b0 a2=55b2728bba20 a3=1000 items=2 ppid=20136 pid=27630 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="mount" exe="/usr/bin/mount" subj=unconfined key="mount_operations"
----
# This record is the bind remount
time->Wed Jun 4 13:00:29 2025
type=PROCTITLE msg=audit(1749042029.736:831): proctitle=6D6F756E74002D2D6E6F2D63616E6F6E6963616C697A65002D6F0062696E642C72656D6F756E742C6E6F6174696D65002F70726F632F32303133362F66642F3139002F646174612F6B756265726E657465732F6B7562656C65742F706F64732F38386134663230322D333065382D343364312D383561622D3831346139333639
type=PATH msg=audit(1749042029.736:831): item=0 name="/data/kubernetes/kubelet/pods/88a4f202-30e8-43d1-85ab-814a9369e8f9/volume-subpaths/test-config/test/0" inode=3283288 dev=08:11 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1749042029.736:831): cwd="/"
type=SYSCALL msg=audit(1749042029.736:831): arch=c000003e syscall=165 success=yes exit=0 a0=59c7ea4eba50 a1=59c7ea4eba70 a2=59c7ea4ebae0 a3=1420 items=1 ppid=20136 pid=27631 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="mount" exe="/usr/bin/mount" subj=unconfined key="mount_operations"
# When runc operates on the path, the inode is 3283365 again (the original empty file, as before the bind mount)
time->Wed Jun 4 13:00:30 2025
type=PROCTITLE msg=audit(1749042030.001:873): proctitle=72756E6300696E6974
type=PATH msg=audit(1749042030.001:873): item=1 name="/data/kubernetes/kubelet/pods/88a4f202-30e8-43d1-85ab-814a9369e8f9/volume-subpaths/test-config/test/0" inode=3283365 dev=08:11 mode=0100640 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1749042030.001:873): item=0 name="/proc/thread-self/fd/8" inode=3283395 dev=00:3c mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1749042030.001:873): cwd="/run/containerd/io.containerd.runtime.v2.task/k8s.io/121b99326ba68e3578e29f419e3f996518996faecf7eec227ba5211cafd97493/rootfs"
type=SYSCALL msg=audit(1749042030.001:873): arch=c000003e syscall=165 success=yes exit=0 a0=c00009a310 a1=c0000c95d8 a2=c000192214 a3=5001 items=2 ppid=27639 pid=27651 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="runc:[2:INIT]" exe="/runc" subj=unconfined key="mount_operations"
----
time->Wed Jun 4 13:00:30 2025
type=PROCTITLE msg=audit(1749042030.001:875): proctitle=72756E6300696E6974
type=PATH msg=audit(1749042030.001:875): item=0 name="/proc/thread-self/fd/8" inode=3283365 dev=08:11 mode=0100640 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1749042030.001:875): cwd="/run/containerd/io.containerd.runtime.v2.task/k8s.io/121b99326ba68e3578e29f419e3f996518996faecf7eec227ba5211cafd97493/rootfs"
type=SYSCALL msg=audit(1749042030.001:875): arch=c000003e syscall=165 success=yes exit=0 a0=c00019224c a1=c0000c96b0 a2=c00019224d a3=44000 items=1 ppid=27639 pid=27651 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="runc:[2:INIT]" exe="/runc" subj=unconfined key="mount_operations"
----
time->Wed Jun 4 13:00:30 2025
type=PROCTITLE msg=audit(1749042030.001:877): proctitle=72756E6300696E6974
type=PATH msg=audit(1749042030.001:877): item=0 name="/proc/thread-self/fd/8" inode=3283365 dev=08:11 mode=0100640 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1749042030.001:877): cwd="/run/containerd/io.containerd.runtime.v2.task/k8s.io/121b99326ba68e3578e29f419e3f996518996faecf7eec227ba5211cafd97493/rootfs"
type=SYSCALL msg=audit(1749042030.001:877): arch=c000003e syscall=165 success=yes exit=0 a0=c00019227c a1=c0000c9788 a2=c00019227d a3=5021 items=1 ppid=27639 pid=27651 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="runc:[2:INIT]" exe="/runc" subj=unconfined key="mount_operations"
----
time->Wed Jun 4 13:29:02 2025
type=PROCTITLE msg=audit(1749043742.915:1126): proctitle=6D6F756E74002D2D6E6F2D63616E6F6E6963616C697A65002D6F0062696E64002F70726F632F32303133362F66642F3139002F646174612F6B756265726E657465732F6B7562656C65742F706F64732F39386263316361662D613465392D343337612D613962362D3835653835656465346363632F766F6C756D652D73756270
type=PATH msg=audit(1749043742.915:1126): item=1 name="/proc/20136/fd/19" inode=3283288 dev=08:11 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1749043742.915:1126): item=0 name="/data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volume-subpaths/test-config/test/0" inode=3283365 dev=08:11 mode=0100640 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1749043742.915:1126): cwd="/"
type=SYSCALL msg=audit(1749043742.915:1126): arch=c000003e syscall=165 success=yes exit=0 a0=5de043404990 a1=5de0434049b0 a2=5de043404a20 a3=1000 items=2 ppid=20136 pid=28700 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="mount" exe="/usr/bin/mount" subj=unconfined key="mount_operations"
time->Wed Jun 4 13:29:03 2025
type=PROCTITLE msg=audit(1749043743.038:1169): proctitle=72756E6300696E6974
type=PATH msg=audit(1749043743.038:1169): item=1 name="/data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volume-subpaths/test-config/test/0" inode=3283365 dev=08:11 mode=0100640 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1749043743.038:1169): item=0 name="/proc/thread-self/fd/8" inode=3283395 dev=00:3c mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1749043743.038:1169): cwd="/run/containerd/io.containerd.runtime.v2.task/k8s.io/fda9387959e560161bb71dc777fc9844ba9fece65bda3fe9ee5f97007b121bd0/rootfs"
type=SYSCALL msg=audit(1749043743.038:1169): arch=c000003e syscall=165 success=yes exit=0 a0=c0001022a0 a1=c0001373c8 a2=c0000f3ed4 a3=5001 items=2 ppid=28709 pid=28721 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="runc:[2:INIT]" exe="/runc" subj=unconfined key="mount_operations"
time->Wed Jun 4 13:29:03 2025
type=PROCTITLE msg=audit(1749043743.038:1171): proctitle=72756E6300696E6974
type=PATH msg=audit(1749043743.038:1171): item=0 name="/proc/thread-self/fd/8" inode=3283365 dev=08:11 mode=0100640 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1749043743.038:1171): cwd="/run/containerd/io.containerd.runtime.v2.task/k8s.io/fda9387959e560161bb71dc777fc9844ba9fece65bda3fe9ee5f97007b121bd0/rootfs"
type=SYSCALL msg=audit(1749043743.038:1171): arch=c000003e syscall=165 success=yes exit=0 a0=c0000f3f0c a1=c0001374a0 a2=c0000f3f0d a3=44000 items=1 ppid=28709 pid=28721 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="runc:[2:INIT]" exe="/runc" subj=unconfined key="mount_operations"
----
time->Wed Jun 4 13:29:03 2025
type=PROCTITLE msg=audit(1749043743.039:1173): proctitle=72756E6300696E6974
type=PATH msg=audit(1749043743.039:1173): item=0 name="/proc/thread-self/fd/8" inode=3283365 dev=08:11 mode=0100640 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1749043743.039:1173): cwd="/run/containerd/io.containerd.runtime.v2.task/k8s.io/fda9387959e560161bb71dc777fc9844ba9fece65bda3fe9ee5f97007b121bd0/rootfs"
type=SYSCALL msg=audit(1749043743.039:1173): arch=c000003e syscall=165 success=yes exit=0 a0=c0000f3f3c a1=c000137578 a2=c0000f3f3d a3=5021 items=1 ppid=28709 pid=28721 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="runc:[2:INIT]" exe="/runc" subj=unconfined key="mount_operations"
ppid 20136 is the kubelet process:
ps aux |grep 20136
root 20136 3.5 1.7 2140944 70756 ? Ssl 10:38 6:41 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///run/containerd/containerd.sock --pod-infra-container-image=registry.k8s.io/pause:3.10 --root-dir=/data/kubernetes/kubelet --v=5
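The proctitle fields in the audit records above are the process's command line, hex-encoded with NUL bytes between arguments. A minimal Python sketch for decoding them follows; decode_proctitle is a hypothetical helper name, and the input strings come from the records above.

```python
def decode_proctitle(hex_value):
    """Decode an audit PROCTITLE hex string into a list of arguments."""
    raw = bytes.fromhex(hex_value)
    # Arguments are separated by NUL bytes; drop empty trailing pieces.
    parts = [p for p in raw.split(b"\x00") if p]
    return [p.decode("utf-8", errors="replace") for p in parts]


print(decode_proctitle("72756E6300696E6974"))
# ['runc', 'init']
```

Decoding the longer records gives the mount --no-canonicalize -o bind ... command lines issued by kubelet. Those records are cut off in the middle of the target path, so the last argument comes out incomplete.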
Check the file inodes
# ll -ih /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volume-subpaths/test-config/test/0
3283365 -rw-r----- 1 root root 0 Jun 4 13:29 /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volume-subpaths/test-config/test/0
# ll -ih /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volumes/kubernetes.io~configmap/test-config/..2025_06_04_13_29_02.3910074484/config.conf
3283288 -rw-r--r-- 1 root root 1.4K Jun 4 13:29 /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volumes/kubernetes.io~configmap/test-config/..2025_06_04_13_29_02.3910074484/config.conf
Check the mounts
The mount is not visible in the host mount namespace, but it is visible in the kubelet process's mount namespace.
# findmnt /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volume-subpaths/test-config/test/0
# Look up the mount in the kubelet process's mount namespace
# findmnt -N 20136 -o SOURCE,TARGET,PROPAGATION,OPTIONS,VFS-OPTIONS /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volume-subpaths/test-config/test/0
SOURCE TARGET PROPAGATION OPTIONS VFS-OPTIONS
/dev/sdb1[/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volumes/kubernetes.io~configmap/test-config/..2025_06_04_13_29_02.3910074484/config.conf] /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volume-subpaths/test-config/test/0 shared,slave rw,noat rw,noatime
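3 Why
The analysis above shows that the subPath mount is visible from inside the kubelet process but not from a shell process.
This is a question of mount namespaces and mount propagation.
The root cause is that kubelet runs in a different mount namespace from the host, and the subPath mount point in kubelet's namespace has propagation shared,slave. As a slave, it does not propagate mount events to its master peer group, while mounts made in the master group do propagate into kubelet's namespace. containerd and the shell live in the host mount namespace, so they never see the subPath bind mount, and the container ends up with the empty placeholder file.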
# ll /proc/20136/ns/mnt # mount namespace of the kubelet process
lrwxrwxrwx 1 root root 0 Jun 5 02:56 /proc/20136/ns/mnt -> 'mnt:[4026532279]'
# ll /proc/1/ns/mnt
lrwxrwxrwx 1 root root 0 Jun 5 02:55 /proc/1/ns/mnt -> 'mnt:[4026531841]'
# ll /proc/self/ns/mnt
lrwxrwxrwx 1 root root 0 Jun 5 02:57 /proc/self/ns/mnt -> 'mnt:[4026531841]'
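The same comparison can be scripted. The sketch below reads the /proc/<pid>/ns/mnt symlinks shown above; mount_namespace and same_mount_namespace are hypothetical helper names, and reading another process's namespace link generally needs root.

```python
import os


def mount_namespace(pid="self"):
    """Return the mount namespace identifier of a process, e.g. 'mnt:[4026531841]'."""
    return os.readlink(f"/proc/{pid}/ns/mnt")


def same_mount_namespace(pid_a, pid_b):
    """True when both processes share one mount namespace."""
    return mount_namespace(pid_a) == mount_namespace(pid_b)
```

On the node in this article, same_mount_namespace(1, 20136) would be False, because PID 1 and kubelet point at different mnt:[...] identifiers.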
#cat /proc/1/mountinfo |grep "159 "
62 25 8:17 / /data rw,noatime shared:159 - ext4 /dev/sdb1 rw
#cat /proc/20136/mountinfo |grep "159 "
867 805 8:17 / /data rw,noatime shared:492 master:159 - ext4 /dev/sdb1 rw
478 867 8:17 /kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volumes/kubernetes.io~configmap/test-config/..2025_06_04_13_29_02.3910074484/config.conf /data/kubernetes/kubelet/pods/98bc1caf-a4e9-437a-a9b6-85e85ede4ccc/volume-subpaths/test-config/test/0 rw,noatime shared:492 master:159 - ext4 /dev/sdb1 rw
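The shared:N and master:N tags in these mountinfo lines carry the propagation relationship. As a rough sketch, the following Python parses them; parse_propagation and receives_from are hypothetical helpers, and the field layout assumed here is the one documented for /proc/<pid>/mountinfo (optional fields sit between the mount options and the "-" separator).

```python
def parse_propagation(line):
    """Extract mount point and peer-group tags from one mountinfo line."""
    fields = line.split()
    sep = fields.index("-")          # optional fields end at the "-" separator
    info = {"mount_point": fields[4], "shared": None, "master": None}
    for tag in fields[6:sep]:
        key, _, value = tag.partition(":")
        if key in ("shared", "master"):
            info[key] = int(value)
    return info


def receives_from(child, parent):
    """True when child is a slave of parent's peer group.

    Mount events then flow from parent to child, but not back.
    """
    return child["master"] is not None and child["master"] == parent["shared"]


host = parse_propagation("62 25 8:17 / /data rw,noatime shared:159 - ext4 /dev/sdb1 rw")
kubelet = parse_propagation(
    "867 805 8:17 / /data rw,noatime shared:492 master:159 - ext4 /dev/sdb1 rw")
print(receives_from(kubelet, host), receives_from(host, kubelet))
# True False
```

The result matches the observation: kubelet's /data is a slave of the host's peer group 159, so host mounts reach kubelet, while kubelet's subPath bind mount never reaches the host namespace where containerd and runc run.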
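3.1 Why is kubelet in a different mount namespace?
Under systemd, kubelet's log output goes to the journal by default, and the journal forwards to syslog by default. Every log line is therefore stored twice, which wastes disk space. To save disk, I defined a journal log namespace that does not forward to syslog and configured kubelet to use it.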
[Service]
LogNamespace=no-syslog-wall
With LogNamespace enabled, systemd runs the service in a new mount namespace with propagation set to slave, and mounts the journal's logging sockets for that log namespace into it:
#cat /proc/20136/mountinfo
908 846 0:28 /systemd/journal.no-syslog-wall /run/systemd/journal ro,nosuid,nodev shared:426 master:12 - tmpfs tmpfs rw,size=801376k,nr_inodes=819200,mode=755,inode64
The systemd documentation explains this:
Internally, journal namespaces are implemented through Linux mount namespacing and over-mounting the directory that contains the relevant AF_UNIX sockets used for logging in the unit’s mount namespace. Since mount namespaces are used this setting disconnects propagation of mounts from the unit’s processes to the host, similarly to how ReadOnlyPaths= and similar settings describe above work. Journal namespaces may hence not be used for services that need to establish mount points on the host.
https://www.freedesktop.org/software/systemd/man/255/systemd.exec.html#LogNamespace=
3.2 Solution
Do not use LogNamespace. kubelet and containerd then both run in the host mount namespace, and containerd can see the mounts kubelet creates.
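When auditing other nodes for the same misconfiguration, a quick check of the unit file or drop-in text can help. The sketch below is a simplified line scanner, not a full systemd unit parser; find_log_namespace is a hypothetical helper, and treating an empty assignment as "unset" is an assumption made here.

```python
from typing import Optional


def find_log_namespace(unit_text: str) -> Optional[str]:
    """Return the last LogNamespace= value found in a [Service] section."""
    section = None
    value = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Service" and "=" in line:
            key, _, val = line.partition("=")
            if key.strip() == "LogNamespace":
                # Assumption: an empty assignment means "not set".
                value = val.strip() or None
    return value


dropin = "[Service]\nLogNamespace=no-syslog-wall\n"
print(find_log_namespace(dropin))
# no-syslog-wall
```

Any non-None result for the kubelet or container runtime unit is a hint that the service may be running in its own mount namespace.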
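4 Summary
Container-related services managed by systemd perform mount operations on host directories, and the container runtime must be able to see those operations. Conversely, cadvisor inside kubelet depends on mounts made by the container runtime.
So kubelet and the container runtime must be able to see each other's mounts: either the propagation is shared (and every parent mount point is shared as well), or both run in the same mount namespace.
Others have reported a similar problem when using LogNamespace with docker; see issues/41879.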
5 Reference
https://github.com/systemd/systemd/issues/16638
https://lwn.net/Articles/689856/