Troubleshooting missing container metrics in kubelet
Background: two Kubernetes versions, 1.18.20 and 1.21.8, with kubelet started by systemd using identical startup parameters. The v1.21 cluster has no container-level monitoring data, while v1.18 does. The operating system is CentOS 7.
Symptom: in the 1.21 cluster, the Grafana dashboards for pods show no container CPU, memory, etc. data.
1 Troubleshooting process
I queried Prometheus for container-related metrics (for example container_threads) and got no data at all. Checking the kubelet log turned up these errors:
E0120 05:16:28.045934 660625 cadvisor_stats_provider.go:415] "Partial failure issuing cadvisor.ContainerInfoV2" err="partial failures: [\"/system.slice/kubelet.service\": RecentStats:unable to find data in memory cache]"
E0120 05:16:28.046002 660625 summary_sys_containers.go:82] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/kubelet.service\": failed to get container info for \"/system.slice/kubelet.service\": partial failures: [\"/system.slice/kubelet.service\": RecentStats: unable to find data in memory cache]" containerName="/system.slice/kubelet.service"
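For reference, the Prometheus query can also be scripted against the Prometheus HTTP API; here is a minimal sketch (the Prometheus address is a placeholder, adjust it for your environment):
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// Placeholder address; point it at your Prometheus instance.
	base := "http://prometheus.example:9090"
	q := url.Values{"query": {"container_threads"}}
	resp, err := http.Get(base + "/api/v1/query?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	var body struct {
		Data struct {
			Result []json.RawMessage `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		panic(err)
	}
	// Zero series here matches the empty Grafana panels.
	fmt.Printf("container_threads returned %d series\n", len(body.Data.Result))
}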
Removing --kubelet-cgroups from the systemd startup parameters made no difference: the same errors appeared and there was still no container monitoring data.
The original unit file:
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
Wants=docker.service containerd.service
After=docker.service

[Service]
WorkingDirectory=/data/kubernetes/kubelet
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice/kubelet.service
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/hugetlb/system.slice/kubelet.service
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/memory/system.slice/kubelet.service
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/pids/system.slice/kubelet.service
ExecStartPre=/bin/rm -rf /data/kubernetes/kubelet/cpu_manager_state
#TasksAccounting=true
#CPUAccounting=true
#MemoryAccounting=true
ExecStart=/usr/local/bin/kubelet \
  --node-status-update-frequency=10s \
  --anonymous-auth=false \
  --authentication-token-webhook \
  --authorization-mode=Webhook \
  --client-ca-file=/etc/kubernetes/ssl/ca.pem \
  --cluster-dns=10.61.128.2 \
  --cluster-domain=cluster.local \
  --cpu-manager-policy=static \
  --network-plugin=kubenet \
  --cloud-provider=external \
  --non-masquerade-cidr=0.0.0.0/0 \
  --hairpin-mode hairpin-veth \
  --cni-bin-dir=/opt/cni/bin \
  --cni-conf-dir=/etc/cni/net.d \
  --hostname-override=10.60.64.49 \
  --kubeconfig=/etc/kubernetes/kubelet.kubeconfig \
  --max-pods=125 \
  --pod-infra-container-image=xxx.com/k8s-google-containers/pause-amd64:3.2 \
  --register-node=true \
  --register-with-taints=install=unfinished:NoSchedule \
  --root-dir=/data/kubernetes/kubelet \
  --tls-cert-file=/etc/kubernetes/ssl/kubelet.pem \
  --tls-private-key-file=/etc/kubernetes/ssl/kubelet-key.pem \
  --cgroups-per-qos=true \
  --cgroup-driver=systemd \
  --enforce-node-allocatable=pods \
  --kubelet-cgroups=/system.slice/kubelet.service \
  --kube-reserved-cgroup=/system.slice \
  --system-reserved-cgroup=/system.slice \
  --kube-reserved=cpu=200m,memory=1Gi,ephemeral-storage=1Gi \
  --system-reserved=cpu=200m,memory=1Gi,ephemeral-storage=1Gi \
  --eviction-hard=nodefs.available<10%,nodefs.inodesFree<2%,memory.available<1Gi \
  --v=2 \
  --eviction-soft=nodefs.available<15%,nodefs.inodesFree<5%,memory.available<2Gi \
  --eviction-soft-grace-period=memory.available=1m,nodefs.available=1m,nodefs.inodesFree=1m \
  --eviction-max-pod-grace-period=120 \
  --serialize-image-pulls=false \
  --log-dir=/data/log/kubernetes/ \
  --logtostderr=false
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
Searching Google turned up a related issue; I had actually hit this problem before (the same problem, twice 😂). Back then, following the workaround in the issue, I set --kubelet-cgroups=/systemd/system.slice/kubelet.service. Later, after reading the source code, I learned that if --kubelet-cgroups is set, kubelet moves its own process into that cgroup. Worried that a cgroup path different from the one systemd creates for the kubelet service (/system.slice/kubelet.service) might cause problems, I changed --kubelet-cgroups back in the install template, and running with that parameter on 1.18 worked fine.
1.1 The workaround at the time
Set --kubelet-cgroups to /fix-kubelet-no-cadvisor-monitor.service and restart kubelet.
1.2 Reflection after the fix
Previously I had only fixed the symptom; although I read some of the related code afterwards, I never dug into the details or understood the underlying mechanism. So I later spent some time investigating why this problem occurs:
- searched the logs again carefully for related entries
- read the source code pointed to by the error messages
- reproduced the problem with the log level raised to 6
This turned up the following clues:
The kubelet log contains this warning:
W0120 04:32:48.944092 660625 container.go:586] Failed to update stats for container "/system.slice/kubelet.service": /sys/fs/cgroup/cpuset/system.slice/kubelet.service/cpuset.cpus found to be empty, continuing to push stats
On the 1.18 machine, /sys/fs/cgroup/cpuset/system.slice/kubelet.service/cpuset.cpus is empty as well. On both the 1.18 and 1.21 machines, the /sys/fs/cgroup/cpuset/system.slice/ directory contains only a kubelet.service directory; there is no docker.service directory.
2 Three questions
- Why does 1.21 log this warning while 1.18 does not?
- Why is there no container monitoring data?
- Why does the /sys/fs/cgroup/cpuset/system.slice/ directory contain only the kubelet service's cgroup?
2.1 Why the warning is logged
Some background first: kubelet embeds the cAdvisor code to collect container monitoring data. cAdvisor only monitors the directory hierarchy under the cgroup root /sys/fs/cgroup/{subsystem}, and there are four factories ("containerd", "systemd", "raw", "docker") that can handle these cgroups. The raw factory handles "/" plus every path under /system.slice/kubelet.service, /system.slice/docker.service and /kubepods.slice (paths are relative to the cgroup subsystem root). For each such directory, cAdvisor starts a goroutine that periodically updates its monitoring data (housekeeping).
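To make the housekeeping model concrete, here is a minimal standalone sketch (not cAdvisor's actual code): one goroutine per cgroup directory, each periodically re-reading a stats file on a fixed interval:
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// housekeep periodically re-reads one file under a cgroup directory,
// roughly mirroring cAdvisor's per-container housekeeping loop.
func housekeep(dir string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		data, err := os.ReadFile(filepath.Join(dir, "cpuset.cpus"))
		if err != nil {
			fmt.Printf("%s: %v\n", dir, err)
			continue
		}
		fmt.Printf("%s: cpuset.cpus=%q\n", dir, string(data))
	}
}

func main() {
	dirs := []string{
		"/sys/fs/cgroup/cpuset",
		"/sys/fs/cgroup/cpuset/system.slice/kubelet.service",
	}
	for _, d := range dirs {
		go housekeep(d, 10*time.Second)
	}
	select {} // run until interrupted
}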
Logs from cgroup handling:
I0120 04:32:48.926910 660625 factory.go:220] Factory "containerd" was unable to handle container "/kubepods.slice/kubepods-besteffort.slice"
I0120 04:32:48.926917 660625 factory.go:45] /kubepods.slice/kubepods-besteffort.slice not handled by systemd handler
I0120 04:32:48.926919 660625 factory.go:220] Factory "systemd" was unable to handle container "/kubepods.slice/kubepods-besteffort.slice"
I0120 04:32:48.926923 660625 factory.go:220] Factory "docker" was unable to handle container "/kubepods.slice/kubepods-besteffort.slice"
I0120 04:32:48.926927 660625 factory.go:216] Using factory "raw" for container "/kubepods.slice/kubepods-besteffort.slice"
I0120 04:32:48.926989 660625 container.go:527] Start housekeeping for container "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod993e5f25_9257_43f8_80da_db8185e7338c.slice/docker-3328875e5f0215a060e04e9ecd5dc33c88b21713cde00a4cf7899562125d4e48.scope"
I0120 04:32:48.927134 660625 manager.go:987] Added container: "/kubepods.slice/kubepods-besteffort.slice" (aliases: [], namespace: "")
I0120 04:32:48.927293 660625 handler.go:325] Added event &{/kubepods.slice/kubepods-besteffort.slice 2021-12-28 09:48:34.799412935 +0000 UTC containerCreation {<nil>}}
I0120 04:32:48.927341 660625 container.go:527] Start housekeeping for container "/kubepods.slice/kubepods-besteffort.slice"
I0120 04:32:48.943313 660625 factory.go:220] Factory "containerd" was unable to handle container "/system.slice/kubelet.service"
I0120 04:32:48.943318 660625 factory.go:45] /system.slice/kubelet.service not handled by systemd handler
I0120 04:32:48.943320 660625 factory.go:220] Factory "systemd" was unable to handle container "/system.slice/kubelet.service"
I0120 04:32:48.943324 660625 factory.go:220] Factory "docker" was unable to handle container "/system.slice/kubelet.service"
I0120 04:32:48.943329 660625 factory.go:216] Using factory "raw" for container "/system.slice/kubelet.service"
I0120 04:32:48.943639 660625 manager.go:987] Added container: "/system.slice/kubelet.service" (aliases: [], namespace: "")
I0120 04:32:48.943886 660625 handler.go:325] Added event &{/system.slice/kubelet.service 2021-12-28 09:48:34.672413802 +0000 UTC containerCreation {<nil>}}
vendor\github.com\google\cadvisor\manager\manager.go
func (m *manager) createContainerLocked(containerName string, watchSource watcher.ContainerWatchSource) error {
namespacedName := namespacedContainerName{
Name: containerName,
}
// Check that the container didn't already exist.
if _, ok := m.containers[namespacedName]; ok {
return nil
}
handler, accept, err := container.NewContainerHandler(containerName, watchSource, m.inHostNamespace)
if err != nil {
return err
}
if !accept {
// ignoring this container.
klog.V(4).Infof("ignoring container %q", containerName)
return nil
}
collectorManager, err := collector.NewCollectorManager()
if err != nil {
return err
}
logUsage := *logCadvisorUsage && containerName == m.cadvisorContainer
cont, err := newContainerData(containerName, m.memoryCache, handler, logUsage, collectorManager, m.maxHousekeepingInterval, m.allowDynamicHousekeeping, clock.RealClock{})
if err != nil {
return err
}
....
// Start the container's housekeeping.
return cont.Start()
vendor\github.com\google\cadvisor\container\factory.go
// Create a new ContainerHandler for the specified container.
func NewContainerHandler(name string, watchType watcher.ContainerWatchSource, inHostNamespace bool) (ContainerHandler, bool, error) {
factoriesLock.RLock()
defer factoriesLock.RUnlock()
// Create the ContainerHandler with the first factory that supports it.
for _, factory := range factories[watchType] {
canHandle, canAccept, err := factory.CanHandleAndAccept(name)
if err != nil {
klog.V(4).Infof("Error trying to work out if we can handle %s: %v", name, err)
}
if canHandle {
if !canAccept {
klog.V(3).Infof("Factory %q can handle container %q, but ignoring.", factory, name)
return nil, false, nil
}
klog.V(3).Infof("Using factory %q for container %q", factory, name)
handle, err := factory.NewContainerHandler(name, inHostNamespace)
return handle, canAccept, err
}
klog.V(4).Infof("Factory %q was unable to handle container %q", factory, name)
}
return nil, false, fmt.Errorf("no known factory can handle creation of container")
}
The raw factory's decision logic:
vendor\github.com\google\cadvisor\container\raw\factory.go
// The raw factory can handle any container. If --docker_only is set to true, non-docker containers are ignored except for "/" and those whitelisted by raw_cgroup_prefix_whitelist flag.
func (f *rawFactory) CanHandleAndAccept(name string) (bool, bool, error) {
if name == "/" {
return true, true, nil
}
if *dockerOnly && f.rawPrefixWhiteList[0] == "" {
return true, false, nil
}
for _, prefix := range f.rawPrefixWhiteList { // here f.rawPrefixWhiteList is []string{"/kubepods.slice", "/system.slice/kubelet.service", "/system.slice/docker.service"}
if strings.HasPrefix(name, prefix) {
return true, true, nil
}
}
return true, false, nil
}
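To see which cgroup paths the raw factory actually accepts, the prefix check above can be exercised on its own; a small sketch that reuses the same whitelist (the sample paths are only illustrations):
package main

import (
	"fmt"
	"strings"
)

// acceptedByRaw mimics rawFactory.CanHandleAndAccept for the non-docker-only case:
// "/" and anything under the whitelist prefixes is accepted, everything else is
// handled but ignored.
func acceptedByRaw(name string) bool {
	whitelist := []string{"/kubepods.slice", "/system.slice/kubelet.service", "/system.slice/docker.service"}
	if name == "/" {
		return true
	}
	for _, prefix := range whitelist {
		if strings.HasPrefix(name, prefix) {
			return true
		}
	}
	return false
}

func main() {
	for _, name := range []string{
		"/",
		"/system.slice/kubelet.service",
		"/system.slice/sshd.service",
		"/kubepods.slice/kubepods-besteffort.slice",
	} {
		fmt.Printf("%-45s accepted=%v\n", name, acceptedByRaw(name))
	}
}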
2.1.1 What the log line means
This warning means that during a housekeeping cycle for /system.slice/kubelet.service, collecting the stats of its cpuset cgroup subsystem failed.
The log line
W0120 04:32:48.944092 660625 container.go:586] Failed to update stats for container "/system.slice/kubelet.service": /sys/fs/cgroup/cpuset/system.slice/kubelet.service/cpuset.cpus found to be empty, continuing to push stats
corresponds to this code in vendor\github.com\google\cadvisor\manager\container.go:
func (cd *containerData) housekeepingTick(timer <-chan time.Time, longHousekeeping time.Duration) bool {
select {
case <-cd.stop:
// Stop housekeeping when signaled.
return false
case finishedChan := <-cd.onDemandChan:
// notify the calling function once housekeeping has completed
defer close(finishedChan)
case <-timer:
}
start := cd.clock.Now()
err := cd.updateStats()
if err != nil {
if cd.allowErrorLogging() {
klog.Warningf("Failed to update stats for container \"%s\": %s", cd.info.Name, err)
}
}
cd.updateStats() calls the handler's GetStats(); here we only care about the handler for /system.slice/kubelet.service, which is rawContainerHandler. GetStats() does the following:
- collects the container's CPU, memory, hugetlb, pids and blkio stats, NIC send/receive stats, the number of processes, the total number of FDs across those processes, the number of sockets among those FDs, the thread count and thread limit, and filesystem state (disk size, free space, inode count, free inodes)
- adds the container stats to the InMemoryCache
- updates the summaryReader aggregates, producing per-minute, per-hour and per-day aggregates
The warning means cd.handler.GetStats() returned an error. In that case updateStats returns early, so the later code that would store the stats in memory never runs and there is no data for /system.slice/kubelet.service in the memory cache.
vendor\github.com\google\cadvisor\manager\container.go
func (cd *containerData) updateStats() error {
stats, statsErr := cd.handler.GetStats()
if statsErr != nil {
// Ignore errors if the container is dead.
if !cd.handler.Exists() {
return nil
}
// Stats may be partially populated, push those before we return an error.
statsErr = fmt.Errorf("%v, continuing to push stats", statsErr)
}
if stats == nil {
return statsErr
}
The error ultimately comes from (s *CpusetGroup) GetStats:
vendor\github.com\opencontainers\runc\libcontainer\cgroups\fs\cpuset.go
func (s *CpusetGroup) GetStats(path string, stats *cgroups.Stats) error {
var err error
stats.CPUSetStats.CPUs, err = getCpusetStat(path, "cpuset.cpus")
if err != nil && !errors.Is(err, os.ErrNotExist) {
return err
}
.....
}
func getCpusetStat(path string, filename string) ([]uint16, error) {
var extracted []uint16
fileContent, err := fscommon.GetCgroupParamString(path, filename)
if err != nil {
return extracted, err
}
if len(fileContent) == 0 {
return extracted, fmt.Errorf("%s found to be empty", filepath.Join(path, filename))
}
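The check that getCpusetStat performs can be reproduced by hand to confirm the root cause on a node; a minimal diagnostic sketch:
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// The path matches the container name from the warning log.
	dir := "/sys/fs/cgroup/cpuset/system.slice/kubelet.service"
	for _, f := range []string{"cpuset.cpus", "cpuset.mems"} {
		data, err := os.ReadFile(filepath.Join(dir, f))
		if err != nil {
			fmt.Printf("%s: %v\n", f, err)
			continue
		}
		content := strings.TrimSpace(string(data))
		if content == "" {
			// This is exactly the condition that makes getCpusetStat return
			// "... found to be empty" in cadvisor >= v0.39.0.
			fmt.Printf("%s is empty\n", f)
		} else {
			fmt.Printf("%s = %q\n", f, content)
		}
	}
}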
2.1.2 Why doesn't 1.18 log this warning?
Because in 1.18 the cpuset stats collector does not read anything at all:
vendor\github.com\opencontainers\runc\libcontainer\cgroups\fs\cpuset.go
func (s *CpusetGroup) GetStats(path string, stats *cgroups.Stats) error {
return nil
}
This behavior change was introduced in 1.21 by the commit that upgraded the vendored cAdvisor to v0.39.0.
2.2 Why is there no container monitoring data?
kubelet's /metrics/cadvisor endpoint serves the container metrics. In the collector, when c.infoProvider.GetRequestedContainersInfo("/", c.opts) returns an error, the function returns immediately and never emits the Prometheus metrics below it, so /metrics/cadvisor responds without any container metrics.
The related log lines:
W0120 04:33:03.423637 660625 prometheus.go:1856] Couldn't get containers: partial failures: ["/system.slice/kubelet.service": containerDataToContainerInfo: unable to find data in memory cache]
I0120 04:33:03.423895 660625 httplog.go:94] "HTTP" verb="GET" URI="/metrics/cadvisor" latency="24.970131ms" userAgent="Prometheus/2.27.1" srcIP="10.61.1.10:35224" resp=200
The related code:
vendor\github.com\google\cadvisor\metrics\prometheus.go
func (c *PrometheusCollector) collectContainersInfo(ch chan<- prometheus.Metric) {
containers, err := c.infoProvider.GetRequestedContainersInfo("/", c.opts)
if err != nil {
c.errors.Set(1)
klog.Warningf("Couldn't get containers: %s", err)
return
}
rawLabels := map[string]struct{}{}
for _, container := range containers {
for l := range c.containerLabelsFunc(container) {
rawLabels[l] = struct{}{}
}
}
....
// Container spec
desc := prometheus.NewDesc("container_start_time_seconds", "Start time of the container since unix epoch in seconds.", labels, nil)
ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, float64(cont.Spec.CreationTime.Unix()), values...)
if cont.Spec.HasCpu {
desc = prometheus.NewDesc("container_spec_cpu_period", "CPU period of the container.", labels, nil)
ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, float64(cont.Spec.Cpu.Period), values...)
if cont.Spec.Cpu.Quota != 0 {
desc = prometheus.NewDesc("container_spec_cpu_quota", "CPU quota of the container.", labels, nil)
ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, float64(cont.Spec.Cpu.Quota), values...)
}
desc := prometheus.NewDesc("container_spec_cpu_shares", "CPU share of the container.", labels, nil)
ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, float64(cont.Spec.Cpu.Limit), values...)
}
....
The error is ultimately raised in:
vendor\github.com\google\cadvisor\manager\manager.go
func (m *manager) GetRequestedContainersInfo(containerName string, options v2.RequestOptions) (map[string]*info.ContainerInfo, error) {
containers, err := m.getRequestedContainers(containerName, options)
if err != nil {
return nil, err
}
var errs partialFailure
containersMap := make(map[string]*info.ContainerInfo)
query := info.ContainerInfoRequest{
NumStats: options.Count,
}
for name, data := range containers {
info, err := m.containerDataToContainerInfo(data, &query)
if err != nil {
errs.append(name, "containerDataToContainerInfo", err)
}
containersMap[name] = info
}
return containersMap, errs.OrNil()
}
func (m *manager) containerDataToContainerInfo(cont *containerData, query *info.ContainerInfoRequest) (*info.ContainerInfo, error) {
// Get the info from the container.
cinfo, err := cont.GetInfo(true)
if err != nil {
return nil, err
}
stats, err := m.memoryCache.RecentStats(cinfo.Name, query.Start, query.End, query.NumStats)
if err != nil {
return nil, err
}
// Make a copy of the info for the user.
ret := &info.ContainerInfo{
ContainerReference: cinfo.ContainerReference,
Subcontainers: cinfo.Subcontainers,
Spec: m.getAdjustedSpec(cinfo),
Stats: stats,
}
return ret, nil
}
Because /sys/fs/cgroup/cpuset/system.slice/kubelet.service/cpuset.cpus is empty and triggers an error, there is no monitoring data for /system.slice/kubelet.service in memory, so RecentStats finds nothing:
// ErrDataNotFound is the error resulting if failed to find a container in memory cache.
var ErrDataNotFound = errors.New("unable to find data in memory cache")
func (c *InMemoryCache) RecentStats(name string, start, end time.Time, maxStats int) ([]*info.ContainerStats, error) {
var cstore *containerCache
var ok bool
err := func() error {
c.lock.RLock()
defer c.lock.RUnlock()
if cstore, ok = c.containerCacheMap[name]; !ok {
return ErrDataNotFound
}
return nil
}()
if err != nil {
return nil, err
}
return cstore.RecentStats(start, end, maxStats)
}
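To verify the symptom end to end, you can scrape /metrics/cadvisor yourself and count the container_ series. A minimal sketch, assuming the node address from the unit file above and the standard in-cluster service-account token path (both placeholders for your own values), and skipping TLS verification for brevity:
package main

import (
	"bufio"
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"strings"
)

func main() {
	// Placeholders: adjust the node address and token location for your cluster.
	url := "https://10.60.64.49:10250/metrics/cadvisor"
	token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		panic(err)
	}

	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	req, _ := http.NewRequest("GET", url, nil)
	req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(string(token)))

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	count := 0
	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "container_") {
			count++
		}
	}
	// On the affected 1.21 nodes this is expected to print 0 container_ lines
	// even though the response is HTTP 200 (matching the httplog line above).
	fmt.Printf("status=%s container_ lines=%d\n", resp.Status, count)
}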
2.2.1 Why does /sys/fs/cgroup/cpuset/system.slice/ contain only the kubelet service's cgroup?
At first I suspected that some special option in the kubelet.service unit caused the kubelet.service directory to appear under /sys/fs/cgroup/cpuset/system.slice. But CentOS 7 ships systemd 219, which does not support the cpuset cgroup controller, and after comparing with the docker.service unit it became clear that ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice/kubelet.service is the line that creates the kubelet.service directory under /sys/fs/cgroup/cpuset/system.slice.
After commenting out that line on the reproduction machine and restarting, the problem was gone.
With the ExecStartPre line removed, I ran the following tests:
| Configuration | 1.18 | 1.21 |
|---|---|---|
| --kubelet-cgroups not set | no /system.slice/kubelet.service under the cpuset hierarchy | no /system.slice/kubelet.service under the cpuset hierarchy |
| --kubelet-cgroups=/system.slice/kubelet.service | no /system.slice/kubelet.service under the cpuset hierarchy | no /system.slice/kubelet.service under the cpuset hierarchy |
| --kubelet-cgroups=/fix-kubelet-no-cadvisor-monitor.service | /fix-kubelet-no-cadvisor-monitor.service exists under the cpuset hierarchy and its cpuset.cpus is not empty | /fix-kubelet-no-cadvisor-monitor.service exists under the cpuset hierarchy and its cpuset.cpus is not empty |
When the path configured via --kubelet-cgroups matches the kubelet process's current cgroup path (the value in /proc/self/cgroup), kubelet takes no action. When they differ, kubelet creates the corresponding directory under each cgroup subsystem and adds the kubelet pid to that cgroup; in that case cpuset.cpus and cpuset.mems are inherited from /sys/fs/cgroup/cpuset/cpuset.cpus and cpuset.mems (see the sketches after the code below).
The relevant code (from the 1.18 branch):
vendor\github.com\opencontainers\runc\libcontainer\cgroups\fs\cpuset.go
// Recursively creates dir's parent directories.
// If "cpuset.cpus" or "cpuset.mems" along that path is empty, it is set to the value of the corresponding file in its parent directory.
// Creates the dir directory itself.
// If the cgroup config has CpusetCpus/CpusetMems set, they are written to "cpuset.cpus"/"cpuset.mems" under dir.
// If "cpuset.cpus" or "cpuset.mems" under the path is empty, it is set to the parent directory's value.
// Writes (overwrites) pid into "cgroup.procs" under dir, retrying up to 5 times.
func (s *CpusetGroup) ApplyDir(dir string, cgroup *configs.Cgroup, pid int) error {
// This might happen if we have no cpuset cgroup mounted.
// Just do nothing and don't fail.
if dir == "" {
return nil
}
mountInfo, err := ioutil.ReadFile("/proc/self/mountinfo")
if err != nil {
return err
}
// take the parent directory of the longest mount point in mountinfo that is an ancestor of dir
root := filepath.Dir(cgroups.GetClosestMountpointAncestor(dir, string(mountInfo)))
// 'ensureParent' start with parent because we don't want to
// explicitly inherit from parent, it could conflict with
// 'cpuset.cpu_exclusive'.
// recursively create dir's parent directories
// if "cpuset.cpus" or "cpuset.mems" along that path is empty, set it to the parent's value
if err := s.ensureParent(filepath.Dir(dir), root); err != nil {
return err
}
if err := os.MkdirAll(dir, 0755); err != nil {
return err
}
// We didn't inherit cpuset configs from parent, but we have
// to ensure cpuset configs are set before moving task into the
// cgroup.
// The logic is, if user specified cpuset configs, use these
// specified configs, otherwise, inherit from parent. This makes
// cpuset configs work correctly with 'cpuset.cpu_exclusive', and
// keep backward compatibility.
// if the cgroup config sets CpusetCpus/CpusetMems, write them to "cpuset.cpus"/"cpuset.mems" under dir
// if "cpuset.cpus" or "cpuset.mems" under the path is empty, set it to the parent directory's value
if err := s.ensureCpusAndMems(dir, cgroup); err != nil {
return err
}
// because we are not using d.join we need to place the pid into the procs file
// unlike the other subsystems
// write (overwrite) pid into "cgroup.procs" under dir, retrying up to 5 times
// if "cpuset.cpus" or "cpuset.mems" under dir is empty, writing "cgroup.procs" fails with "write error: No space left on device"
return cgroups.WriteCgroupProc(dir, pid)
}
func (s *CpusetGroup) Apply(d *cgroupData) error {
// returns the absolute cpuset cgroup path, e.g. "/sys/fs/cgroup/cpuset/system.slice/kubelet.service"
dir, err := d.path("cpuset")
if err != nil && !cgroups.IsNotFound(err) {
return err
}
return s.ApplyDir(dir, d.config, d.pid)
}
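The step that ApplyDir performs and that a bare mkdir -p skips is the cpuset inheritance. Below is a simplified sketch of that idea (not runc's actual code, and it needs root to write under /sys/fs/cgroup): walk from the cpuset mount root down to the target directory and, at each level, copy cpuset.cpus/cpuset.mems from the parent when the child's file is empty.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// inheritIfEmpty copies one cpuset file from the parent cgroup directory
// when the child's copy is empty, which is the step a plain mkdir never does.
func inheritIfEmpty(dir, file string) error {
	child := filepath.Join(dir, file)
	data, err := os.ReadFile(child)
	if err != nil {
		return err
	}
	if strings.TrimSpace(string(data)) != "" {
		return nil // already populated
	}
	parent, err := os.ReadFile(filepath.Join(filepath.Dir(dir), file))
	if err != nil {
		return err
	}
	return os.WriteFile(child, parent, 0644)
}

// ensureCpuset fills cpuset.cpus/cpuset.mems top-down from root to dir,
// a simplified stand-in for runc's ensureParent + ensureCpusAndMems.
func ensureCpuset(root, dir string) error {
	if dir == root || !strings.HasPrefix(dir, root) {
		return nil
	}
	if err := ensureCpuset(root, filepath.Dir(dir)); err != nil {
		return err
	}
	for _, f := range []string{"cpuset.cpus", "cpuset.mems"} {
		if err := inheritIfEmpty(dir, f); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	root := "/sys/fs/cgroup/cpuset"
	// The directory created by the ExecStartPre line in the unit file.
	dir := filepath.Join(root, "system.slice", "kubelet.service")
	if err := ensureCpuset(root, dir); err != nil {
		fmt.Println(err)
	}
}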
pkg\kubelet\cm\container_manager_linux.go
// If manager is not nil, move the kubelet process into the cgroup named by manager.Cgroups.Name, otherwise skip;
// then set kubelet's oom_score_adj to -999.
// If manager is not nil, move the dockerd process into the cgroup named by manager.Cgroups.Name, otherwise skip;
// then set dockerd's oom_score_adj to -999.
func ensureProcessInContainerWithOOMScore(pid int, oomScoreAdj int, manager *fs.Manager) error {
if runningInHost, err := isProcessRunningInHost(pid); err != nil {
// Err on the side of caution. Avoid moving the docker daemon unless we are able to identify its context.
return err
} else if !runningInHost {
// Process is running inside a container. Don't touch that.
klog.V(2).Infof("pid %d is not running in the host namespaces", pid)
return nil
}
var errs []error
if manager != nil {
// get the kubelet process's cgroup path, e.g. /system.slice/kubelet.service
cont, err := getContainer(pid)
if err != nil {
errs = append(errs, fmt.Errorf("failed to find container of PID %d: %v", pid, err))
}
// if the kubelet process's cgroup path differs from manager.Cgroups.Name (i.e. cm.KubeletCgroupsName, set via --kubelet-cgroups or the KubeletCgroups config field), create the specified cgroup path and set each subsystem's attributes
if cont != manager.Cgroups.Name {
// create the directory under each cgroup subsystem and add pid to that cgroup
// cpuset, cpu and memory get special handling; see vendor\github.com\opencontainers\runc\libcontainer\cgroups\fs\apply_raw.go
err = manager.Apply(pid)
if err != nil {
errs = append(errs, fmt.Errorf("failed to move PID %d (in %q) to %q: %v", pid, cont, manager.Cgroups.Name, err))
}
}
}
// Also apply oom-score-adj to processes
oomAdjuster := oom.NewOOMAdjuster()
klog.V(5).Infof("attempting to apply oom_score_adj of %d to pid %d", oomScoreAdj, pid)
// write oomScoreAdj to /proc/<pid>/oom_score_adj
if err := oomAdjuster.ApplyOOMScoreAdj(pid, oomScoreAdj); err != nil {
klog.V(3).Infof("Failed to apply oom_score_adj %d for pid %d: %v", oomScoreAdj, pid, err)
errs = append(errs, fmt.Errorf("failed to apply oom score %d to PID %d: %v", oomScoreAdj, pid, err))
}
return utilerrors.NewAggregate(errs)
}
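getContainer(pid) above derives the process's current cgroup path from /proc/<pid>/cgroup; here is a minimal sketch (cgroup v1 format) that prints those entries, i.e. the values that --kubelet-cgroups gets compared against:
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Each line of /proc/self/cgroup (cgroup v1) looks like:
	//   4:memory:/system.slice/kubelet.service
	f, err := os.Open("/proc/self/cgroup")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		parts := strings.SplitN(scanner.Text(), ":", 3)
		if len(parts) != 3 {
			continue
		}
		controllers, path := parts[1], parts[2]
		fmt.Printf("%-20s %s\n", controllers, path)
	}
	// When run from the kubelet unit, the paths printed here are what get
	// compared against --kubelet-cgroups; if they already match, kubelet
	// leaves the cgroups (including the empty cpuset directory) untouched.
}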
3 Summary
The kubelet systemd unit uses ExecStartPre to create the kubelet.service cpuset cgroup directory, and a manually created cpuset cgroup directory has empty "cpuset.cpus" and "cpuset.mems" files. Because the path configured via --kubelet-cgroups matches the kubelet process's actual cgroup path (the value in /proc/self/cgroup), kubelet takes no action on the cpuset directory.
The cAdvisor library vendored in 1.21 reads "cpuset.cpus" from the cpuset cgroup subsystem and returns an error when it is empty, so there is no monitoring data for /system.slice/kubelet.service in memory.
The /metrics/cadvisor endpoint needs the in-memory data for /system.slice/kubelet.service; when that data is missing it returns an error immediately, so no container metrics are exposed.