Troubleshooting missing kubelet container metrics

Background: Two clusters run Kubernetes 1.18.20 and 1.21.8 on CentOS 7, with the kubelet started by systemd using identical startup parameters in both. The v1.21 cluster has no container-related monitoring data, while v1.18 does.

Observation: In the v1.21 cluster, the Grafana monitoring graphs for pods do not display any data related to containers, such as CPU and memory usage.

  1. Querying container-related metrics (e.g., container_threads) in Prometheus returns no data.

  2. Checking the kubelet logs reveals these errors:

    E0120 05:16:28.045934  660625 cadvisor_stats_provider.go:415] "Partial failure issuing cadvisor.ContainerInfoV2" err="partial failures: [\"/system.slice/kubelet.service\": RecentStats:unable to find data in memory cache]"
    E0120 05:16:28.046002  660625 summary_sys_containers.go:82] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/kubelet.service\": failed to get container info for \"/system.slice/kubelet.service\": partial failures: [\"/system.slice/kubelet.service\": RecentStats: unable to find data in memory cache]" containerName="/system.slice/kubelet.service"
    
  3. Removing --kubelet-cgroups from the systemd startup parameters does not help: the error persists and container monitoring data is still missing.

    Original parameters:

    [Unit]
    Description=Kubernetes Kubelet
    Documentation=https://github.com/GoogleCloudPlatform/kubernetes
    Wants=docker.service containerd.service
    After=docker.service
    
    [Service]
    WorkingDirectory=/data/kubernetes/kubelet
    ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice/kubelet.service
    ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/hugetlb/system.slice/kubelet.service
    ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/memory/system.slice/kubelet.service
    ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/pids/system.slice/kubelet.service
    ExecStartPre=/bin/rm -rf /data/kubernetes/kubelet/cpu_manager_state
    #TasksAccounting=true
    #CPUAccounting=true
    #MemoryAccounting=true
    ExecStart=/usr/local/bin/kubelet \
      --node-status-update-frequency=10s \
      --anonymous-auth=false \
      --authentication-token-webhook \
      --authorization-mode=Webhook \
      --client-ca-file=/etc/kubernetes/ssl/ca.pem \
      --cluster-dns=10.61.128.2 \
      --cluster-domain=cluster.local \
      --cpu-manager-policy=static \
      --network-plugin=kubenet \
      --cloud-provider=external \
      --non-masquerade-cidr=0.0.0.0/0 \
      --hairpin-mode hairpin-veth \
      --cni-bin-dir=/opt/cni/bin \
      --cni-conf-dir=/etc/cni/net.d \
      --hostname-override=10.60.64.49 \
      --kubeconfig=/etc/kubernetes/kubelet.kubeconfig \
      --max-pods=125 \
      --pod-infra-container-image=xxx.com/k8s-google-containers/pause-amd64:3.2 \
      --register-node=true \
      --register-with-taints=install=unfinished:NoSchedule \
      --root-dir=/data/kubernetes/kubelet \
      --tls-cert-file=/etc/kubernetes/ssl/kubelet.pem \
      --tls-private-key-file=/etc/kubernetes/ssl/kubelet-key.pem \
      --cgroups-per-qos=true \
      --cgroup-driver=systemd \
      --enforce-node-allocatable=pods \
      --kubelet-cgroups=/system.slice/kubelet.service \
      --kube-reserved-cgroup=/system.slice \
      --system-reserved-cgroup=/system.slice \
      --kube-reserved=cpu=200m,memory=1Gi,ephemeral-storage=1Gi \
      --system-reserved=cpu=200m,memory=1Gi,ephemeral-storage=1Gi \
      --eviction-hard=nodefs.available<10%,nodefs.inodesFree<2%,memory.available<1Gi \
      --v=2 \
      --eviction-soft=nodefs.available<15%,nodefs.inodesFree<5%,memory.available<2Gi \
      --eviction-soft-grace-period=memory.available=1m,nodefs.available=1m,nodefs.inodesFree=1m \
      --eviction-max-pod-grace-period=120 \
      --serialize-image-pulls=false \
      --log-dir=/data/log/kubernetes/ \
      --logtostderr=false
    Restart=always
    RestartSec=5
    
    [Install]
    WantedBy=multi-user.target
    
  4. I googled the issue and found a related GitHub issue, one I had encountered before (this was actually the second time I ran into this problem 😂).

    At the time, I followed the solution mentioned in the issue and set --kubelet-cgroups=/systemd/system.slice/kubelet.service. Later, while reading the source code, I realized that when --kubelet-cgroups is set, the kubelet moves its own process into that cgroup. I was concerned that this path might not match the cgroup that systemd creates for the kubelet service, /system.slice/kubelet.service, and that the inconsistency could lead to abnormal behavior. So I reverted --kubelet-cgroups to its original value, and it ran without problems on version 1.18.

This time, I set --kubelet-cgroups to /fix-kubelet-no-cadvisor-monitor.service and restarted the kubelet, which worked around the problem.
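
Concretely, the workaround is a one-line change to the unit file shown above, followed by a reload and restart (a sketch; the new path name itself is arbitrary, it only needs to differ from the systemd-managed /system.slice/kubelet.service):

# kubelet.service, in ExecStart: replace
#   --kubelet-cgroups=/system.slice/kubelet.service \
# with
#   --kubelet-cgroups=/fix-kubelet-no-cadvisor-monitor.service \
systemctl daemon-reload
systemctl restart kubelet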

Previously, I had focused only on making the problem go away. Although I later read the related code, I didn’t delve into the details or fully understand the underlying principles. So I later spent some time investigating why this issue occurs.

  1. Carefully searched the logs again to look for any related entries.
  2. Examined the relevant source code based on the log errors.
  3. Reproduced the problem and set the log level to 6.

I discovered these clues:

  1. There was a warning log line in the kubelet logs:

    W0120 04:32:48.944092  660625 container.go:586] Failed to update stats for container "/system.slice/kubelet.service": /sys/fs/cgroup/cpuset/system.slice/kubelet.service/cpuset.cpus found to be empty, continuing to push stats
    
  2. On the 1.18 node, the content of /sys/fs/cgroup/cpuset/system.slice/kubelet.service/cpuset.cpus was empty as well.

  3. On both the 1.18 and 1.21 machines, the /sys/fs/cgroup/cpuset/system.slice/ directory contained only the kubelet.service directory and no docker.service directory (the commands below reproduce these checks).
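
These observations are easy to reproduce directly on a node (a quick sketch using the standard cgroup v1 mount points on CentOS 7):

# Clue 2: the kubelet's cpuset.cpus is empty
cat /sys/fs/cgroup/cpuset/system.slice/kubelet.service/cpuset.cpus
# (no output)

# Clue 3: only kubelet.service exists under the cpuset hierarchy's system.slice
ls /sys/fs/cgroup/cpuset/system.slice/
# kubelet.service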

This raised three questions:

  1. Why does this warning appear in version 1.21 but not in version 1.18?
  2. Why is there no container monitoring data?
  3. Why does the /sys/fs/cgroup/cpuset/system.slice/ directory contain only the kubelet service’s cgroup?

First, some background: the kubelet embeds cadvisor code to collect container monitoring data. cadvisor watches the hierarchy of directories under each cgroup subsystem root /sys/fs/cgroup/{subsystem}, and four factories (“containerd,” “systemd,” “raw,” “docker”) can handle these cgroups. The raw factory handles the paths /, /system.slice/kubelet.service, /system.slice/docker.service, and /kubepods.slice (paths are relative to each cgroup subsystem). For every watched directory, cadvisor starts a goroutine that periodically updates its monitoring data.
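
For orientation, these are ordinary directories on the node (a quick sketch; the exact subsystem list varies with kernel and mount configuration):

# cgroup v1 subsystem roots that cadvisor walks
ls /sys/fs/cgroup/

# One hierarchy as an example: a top-level path the raw factory whitelists
ls -d /sys/fs/cgroup/cpu/kubepods.slice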

Logs showing how cgroups are matched to factories:

I0120 04:32:48.926910  660625 factory.go:220] Factory "containerd" was unable to handle container "/kubepods.slice/kubepods-besteffort.slice"
I0120 04:32:48.926917  660625 factory.go:45] /kubepods.slice/kubepods-besteffort.slice not handled by systemd handler
I0120 04:32:48.926919  660625 factory.go:220] Factory "systemd" was unable to handle container "/kubepods.slice/kubepods-besteffort.slice"
I0120 04:32:48.926923  660625 factory.go:220] Factory "docker" was unable to handle container "/kubepods.slice/kubepods-besteffort.slice"
I0120 04:32:48.926927  660625 factory.go:216] Using factory "raw" for container "/kubepods.slice/kubepods-besteffort.slice"
I0120 04:32:48.926989  660625 container.go:527] Start housekeeping for container "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod993e5f25_9257_43f8_80da_db8185e7338c.slice/docker-3328875e5f0215a060e04e9ecd5dc33c88b21713cde00a4cf7899562125d4e48.scope"
I0120 04:32:48.927134  660625 manager.go:987] Added container: "/kubepods.slice/kubepods-besteffort.slice" (aliases: [], namespace: "")
I0120 04:32:48.927293  660625 handler.go:325] Added event &{/kubepods.slice/kubepods-besteffort.slice 2021-12-28 09:48:34.799412935 +0000 UTC containerCreation {<nil>}}
I0120 04:32:48.927341  660625 container.go:527] Start housekeeping for container "/kubepods.slice/kubepods-besteffort.slice"
I0120 04:32:48.943313  660625 factory.go:220] Factory "containerd" was unable to handle container "/system.slice/kubelet.service"
I0120 04:32:48.943318  660625 factory.go:45] /system.slice/kubelet.service not handled by systemd handler
I0120 04:32:48.943320  660625 factory.go:220] Factory "systemd" was unable to handle container "/system.slice/kubelet.service"
I0120 04:32:48.943324  660625 factory.go:220] Factory "docker" was unable to handle container "/system.slice/kubelet.service"
I0120 04:32:48.943329  660625 factory.go:216] Using factory "raw" for container "/system.slice/kubelet.service"
I0120 04:32:48.943639  660625 manager.go:987] Added container: "/system.slice/kubelet.service" (aliases: [], namespace: "")
I0120 04:32:48.943886  660625 handler.go:325] Added event &{/system.slice/kubelet.service 2021-12-28 09:48:34.672413802 +0000 UTC containerCreation {<nil>}}

vendor\github.com\google\cadvisor\manager\manager.go

func (m *manager) createContainerLocked(containerName string, watchSource watcher.ContainerWatchSource) error {
	namespacedName := namespacedContainerName{
		Name: containerName,
	}

	// Check that the container didn't already exist.
	if _, ok := m.containers[namespacedName]; ok {
		return nil
	}

	handler, accept, err := container.NewContainerHandler(containerName, watchSource, m.inHostNamespace)
	if err != nil {
		return err
	}
	if !accept {
		// ignoring this container.
		klog.V(4).Infof("ignoring container %q", containerName)
		return nil
	}
	collectorManager, err := collector.NewCollectorManager()
	if err != nil {
		return err
	}

	logUsage := *logCadvisorUsage && containerName == m.cadvisorContainer
	cont, err := newContainerData(containerName, m.memoryCache, handler, logUsage, collectorManager, m.maxHousekeepingInterval, m.allowDynamicHousekeeping, clock.RealClock{})
	if err != nil {
		return err
	}
	....
	// Start the container's housekeeping.
	return cont.Start()
}

vendor\github.com\google\cadvisor\container\factory.go

// Create a new ContainerHandler for the specified container.
func NewContainerHandler(name string, watchType watcher.ContainerWatchSource, inHostNamespace bool) (ContainerHandler, bool, error) {
	factoriesLock.RLock()
	defer factoriesLock.RUnlock()

	// Create the ContainerHandler with the first factory that supports it.
	for _, factory := range factories[watchType] {
		canHandle, canAccept, err := factory.CanHandleAndAccept(name)
		if err != nil {
			klog.V(4).Infof("Error trying to work out if we can handle %s: %v", name, err)
		}
		if canHandle {
			if !canAccept {
				klog.V(3).Infof("Factory %q can handle container %q, but ignoring.", factory, name)
				return nil, false, nil
			}
			klog.V(3).Infof("Using factory %q for container %q", factory, name)
			handle, err := factory.NewContainerHandler(name, inHostNamespace)
			return handle, canAccept, err
		}
		klog.V(4).Infof("Factory %q was unable to handle container %q", factory, name)
	}

	return nil, false, fmt.Errorf("no known factory can handle creation of container")
}

Raw Factory Evaluation Code

The code that decides whether the raw factory can handle a container lives in vendor\github.com\google\cadvisor\container\raw\factory.go: if the container’s name matches certain prefixes, the factory handles and accepts it. Here is the snippet:

// The raw factory can handle any container. If --docker_only is set to true, non-docker containers are ignored except for "/" and those whitelisted by raw_cgroup_prefix_whitelist flag.
func (f *rawFactory) CanHandleAndAccept(name string) (bool, bool, error) {
	if name == "/" {
		return true, true, nil
	}
	if *dockerOnly && f.rawPrefixWhiteList[0] == "" {
		return true, false, nil
	}
	for _, prefix := range f.rawPrefixWhiteList { // here f.rawPrefixWhiteList is []string{"/kubepods.slice", "/system.slice/kubelet.service", "/system.slice/docker.service"}
		if strings.HasPrefix(name, prefix) {
			return true, true, nil
		}
	}
	return true, false, nil
}

The warning log above indicates that, during the housekeeping cycle for /system.slice/kubelet.service, an error occurred while collecting monitoring data for its cpuset cgroup subsystem.

The log, W0120 04:32:48.944092 660625 container.go:586] Failed to update stats for container "/system.slice/kubelet.service": /sys/fs/cgroup/cpuset/system.slice/kubelet.service/cpuset.cpus found to be empty, continuing to push stats, corresponds to the code in vendor\github.com\google\cadvisor\manager\container.go.

func (cd *containerData) housekeepingTick(timer <-chan time.Time, longHousekeeping time.Duration) bool {
	select {
	case <-cd.stop:
		// Stop housekeeping when signaled.
		return false
	case finishedChan := <-cd.onDemandChan:
		// notify the calling function once housekeeping has completed
		defer close(finishedChan)
	case <-timer:
	}
	start := cd.clock.Now()
	err := cd.updateStats()
	if err != nil {
		if cd.allowErrorLogging() {
			klog.Warningf("Failed to update stats for container \"%s\": %s", cd.info.Name, err)
		}
	}

The cd.updateStats() function calls the GetStats() method of the appropriate handler. Here, we are discussing the handler for /system.slice/kubelet.service, which is the rawContainerHandler. GetStats() performs the following tasks:

  1. It retrieves various container stats, including CPU, memory, hugetlb, pids, blkio, network stats (sent and received), total process count, total FD count, total socket FD count, thread count, thread limit, and file system stats (e.g., disk size, available space, inode count, available inode count).
  2. It adds the container’s status data to the InMemoryCache.
  3. It updates aggregate data in the summaryReader to generate minute, hour, and day aggregation data.

The warning log indicates that cd.handler.GetStats() returned an error; since stats is nil in that case, updateStats returns without pushing anything into the memory cache.

vendor\github.com\google\cadvisor\manager\container.go

func (cd *containerData) updateStats() error {
	stats, statsErr := cd.handler.GetStats()
	if statsErr != nil {
		// Ignore errors if the container is dead.
		if !cd.handler.Exists() {
			return nil
		}

		// Stats may be partially populated, push those before we return an error.
		statsErr = fmt.Errorf("%v, continuing to push stats", statsErr)
	}
	if stats == nil {
		return statsErr
	}

The error ultimately comes from (s *CpusetGroup) GetStats:

vendor\github.com\opencontainers\runc\libcontainer\cgroups\fs\cpuset.go

func (s *CpusetGroup) GetStats(path string, stats *cgroups.Stats) error {
	var err error

	stats.CPUSetStats.CPUs, err = getCpusetStat(path, "cpuset.cpus")
	if err != nil && !errors.Is(err, os.ErrNotExist) {
		return err
	}
	.....
}

func getCpusetStat(path string, filename string) ([]uint16, error) {
	var extracted []uint16
	fileContent, err := fscommon.GetCgroupParamString(path, filename)
	if err != nil {
		return extracted, err
	}
	if len(fileContent) == 0 {
		return extracted, fmt.Errorf("%s found to be empty", filepath.Join(path, filename))
	}
	.....
}

This warning is absent in version 1.18 because that version read no data at all from the cpuset subsystem; its GetStats was a no-op:

vendor\github.com\opencontainers\runc\libcontainer\cgroups\fs\cpuset.go

func (s *CpusetGroup) GetStats(path string, stats *cgroups.Stats) error {
	return nil
}

This behavior changed in version 1.21, when the vendored cadvisor was upgraded to v0.39.0.

The kubelet’s /metrics/cadvisor endpoint serves the container-related monitoring data. When c.infoProvider.GetRequestedContainersInfo("/", c.opts) returns an error, the collector returns early without registering any Prometheus metrics, so /metrics/cadvisor responds without any container-related monitoring data.
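
This is easy to confirm against the kubelet directly (a sketch; it assumes a token authorized to read the kubelet’s secure port, since the kubelet runs with --authorization-mode=Webhook):

# On a healthy node this prints a large count of container_* series; on an affected node, zero.
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  https://10.60.64.49:10250/metrics/cadvisor | grep -c '^container_'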

Here are some relevant log entries:

W0120 04:33:03.423637  660625 prometheus.go:1856] Couldn't get containers: partial failures: ["/system.slice/kubelet.service": containerDataToContainerInfo: unable to find data in memory cache]
I0120 04:33:03.423895  660625 httplog.go:94] "HTTP" verb="GET" URI="/metrics/cadvisor" latency="24.970131ms" userAgent="Prometheus/2.27.1" srcIP="10.61.1.10:35224" resp=200

Related code:

vendor\github.com\google\cadvisor\metrics\prometheus.go

func (c *PrometheusCollector) collectContainersInfo(ch chan<- prometheus.Metric) {
	containers, err := c.infoProvider.GetRequestedContainersInfo("/", c.opts)
	if err != nil {
		c.errors.Set(1)
		klog.Warningf("Couldn't get containers: %s", err)
		return
	}
	rawLabels := map[string]struct{}{}
	for _, container := range containers {
		for l := range c.containerLabelsFunc(container) {
			rawLabels[l] = struct{}{}
		}
	}
	....
	// Container spec
		desc := prometheus.NewDesc("container_start_time_seconds", "Start time of the container since unix epoch in seconds.", labels, nil)
		ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, float64(cont.Spec.CreationTime.Unix()), values...)

		if cont.Spec.HasCpu {
			desc = prometheus.NewDesc("container_spec_cpu_period", "CPU period of the container.", labels, nil)
			ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, float64(cont.Spec.Cpu.Period), values...)
			if cont.Spec.Cpu.Quota != 0 {
				desc = prometheus.NewDesc("container_spec_cpu_quota", "CPU quota of the container.", labels, nil)
				ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, float64(cont.Spec.Cpu.Quota), values...)
			}
			desc := prometheus.NewDesc("container_spec_cpu_shares", "CPU share of the container.", labels, nil)
			ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, float64(cont.Spec.Cpu.Limit), values...)

		}
	....
}

The error itself originates here:

vendor\github.com\google\cadvisor\manager\manager.go

func (m *manager) GetRequestedContainersInfo(containerName string, options v2.RequestOptions) (map[string]*info.ContainerInfo, error) {
	containers, err := m.getRequestedContainers(containerName, options)
	if err != nil {
		return nil, err
	}
	var errs partialFailure
	containersMap := make(map[string]*info.ContainerInfo)
	query := info.ContainerInfoRequest{
		NumStats: options.Count,
	}
	for name, data := range containers {
		info, err := m.containerDataToContainerInfo(data, &query)
		if err != nil {
			errs.append(name, "containerDataToContainerInfo", err)
		}
		containersMap[name] = info
	}
	return containersMap, errs.OrNil()
}

func (m *manager) containerDataToContainerInfo(cont *containerData, query *info.ContainerInfoRequest) (*info.ContainerInfo, error) {
	// Get the info from the container.
	cinfo, err := cont.GetInfo(true)
	if err != nil {
		return nil, err
	}

	stats, err := m.memoryCache.RecentStats(cinfo.Name, query.Start, query.End, query.NumStats)
	if err != nil {
		return nil, err
	}

	// Make a copy of the info for the user.
	ret := &info.ContainerInfo{
		ContainerReference: cinfo.ContainerReference,
		Subcontainers:      cinfo.Subcontainers,
		Spec:               m.getAdjustedSpec(cinfo),
		Stats:              stats,
	}
	return ret, nil
}

Because /sys/fs/cgroup/cpuset/system.slice/kubelet.service/cpuset.cpus is empty, GetStats fails, so /system.slice/kubelet.service never gets monitoring data into the memory cache, and RecentStats consequently cannot retrieve any:

// ErrDataNotFound is the error resulting if failed to find a container in memory cache.
var ErrDataNotFound = errors.New("unable to find data in memory cache")

func (c *InMemoryCache) RecentStats(name string, start, end time.Time, maxStats int) ([]*info.ContainerStats, error) {
	var cstore *containerCache
	var ok bool
	err := func() error {
		c.lock.RLock()
		defer c.lock.RUnlock()
		if cstore, ok = c.containerCacheMap[name]; !ok {
			return ErrDataNotFound
		}
		return nil
	}()
	if err != nil {
		return nil, err
	}

	return cstore.RecentStats(start, end, maxStats)
}

Initially, I suspected that the kubelet.service unit file was configured with some special parameter that caused the kubelet.service directory to appear in /sys/fs/cgroup/cpuset/system.slice.

However, CentOS 7’s systemd (version 219) does not manage the cpuset cgroup controller at all, which also explains why docker.service has no directory there. That pointed back to the unit file itself: the line ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice/kubelet.service is what creates the kubelet.service directory in /sys/fs/cgroup/cpuset/system.slice.

After commenting out this line and restarting, the issue was resolved.
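
In unit-file terms the fix looks like this (a sketch; if the directory was already created by a previous start, it may also need to be removed by hand, which works because no tasks live in it):

# kubelet.service: comment out the manual cpuset cgroup creation
# ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice/kubelet.service
rmdir /sys/fs/cgroup/cpuset/system.slice/kubelet.service   # only if it already exists
systemctl daemon-reload
systemctl restart kubelet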

With the ExecStartPre line removed, I ran the following tests:

| Configuration | Version 1.18 | Version 1.21 |
| --- | --- | --- |
| No kubelet-cgroups configured | cpuset directory /system.slice/kubelet.service is absent | cpuset directory /system.slice/kubelet.service is absent |
| kubelet-cgroups set to /system.slice/kubelet.service | cpuset directory /system.slice/kubelet.service is absent | cpuset directory /system.slice/kubelet.service is absent |
| kubelet-cgroups set to /fix-kubelet-no-cadvisor-monitor.service | cpuset directory /fix-kubelet-no-cadvisor-monitor.service exists and cpuset.cpus is not empty | cpuset directory /fix-kubelet-no-cadvisor-monitor.service exists and cpuset.cpus is not empty |

When the --kubelet-cgroups path matches the cgroup path the kubelet process is already in (the value in /proc/self/cgroup), the kubelet takes no action.

If they do not match, the kubelet creates directories in the respective cgroup subsystems and writes its own PID into that cgroup; the cpuset.cpus and cpuset.mems values are inherited from the parent directory (ultimately from /sys/fs/cgroup/cpuset/cpuset.cpus and cpuset.mems).
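
To check which case applies on a node, inspect the kubelet’s own cgroup memberships (a sketch):

# Each line is hierarchy-id:subsystems:path; compare the path against --kubelet-cgroups
cat /proc/$(pidof kubelet)/cgroup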

The relevant code in the 1.18 codebase:

vendor\github.com\opencontainers\runc\libcontainer\cgroups\fs\cpuset.go

// Recursively create dir's parent directories.
// While walking the path, if a directory's "cpuset.cpus" or "cpuset.mems" file is empty, populate it from the parent directory's corresponding file.
// Create the dir directory itself.
// If the cgroup config sets CpusetCpus or CpusetMems, write them into the "cpuset.cpus" and "cpuset.mems" files under dir.
// If the "cpuset.cpus" or "cpuset.mems" files under dir are empty, populate them from the parent directory.
// Write (overwrite) the pid into the "cgroup.procs" file under dir, retrying up to 5 times.
func (s *CpusetGroup) ApplyDir(dir string, cgroup *configs.Cgroup, pid int) error {
	// This might happen if we have no cpuset cgroup mounted.
	// Just do nothing and don't fail.
	if dir == "" {
		return nil
	}
	mountInfo, err := ioutil.ReadFile("/proc/self/mountinfo")
	if err != nil {
		return err
	}
	// From mountinfo, take the parent directory of the longest mountpoint that is an ancestor of dir
	root := filepath.Dir(cgroups.GetClosestMountpointAncestor(dir, string(mountInfo)))
	// 'ensureParent' start with parent because we don't want to
	// explicitly inherit from parent, it could conflict with
	// 'cpuset.cpu_exclusive'.
	// Recursively create dir's parent directories.
	// If the "cpuset.cpus" or "cpuset.mems" files along the path are empty, populate them from the parent directory.
	if err := s.ensureParent(filepath.Dir(dir), root); err != nil {
		return err
	}
	if err := os.MkdirAll(dir, 0755); err != nil {
		return err
	}
	// We didn't inherit cpuset configs from parent, but we have
	// to ensure cpuset configs are set before moving task into the
	// cgroup.
	// The logic is, if user specified cpuset configs, use these
	// specified configs, otherwise, inherit from parent. This makes
	// cpuset configs work correctly with 'cpuset.cpu_exclusive', and
	// keep backward compatibility.
	// If the cgroup config sets CpusetCpus or CpusetMems, write them into "cpuset.cpus" and "cpuset.mems" under dir.
	// If the "cpuset.cpus" or "cpuset.mems" files under dir are empty, populate them from the parent directory.
	if err := s.ensureCpusAndMems(dir, cgroup); err != nil {
		return err
	}

	// because we are not using d.join we need to place the pid into the procs file
	// unlike the other subsystems
	// Write (overwrite) the pid into the "cgroup.procs" file under dir, retrying up to 5 times.
	// If "cpuset.cpus" or "cpuset.mems" under dir is empty, writing "cgroup.procs" fails with "write error: No space left on device".
	return cgroups.WriteCgroupProc(dir, pid)
}

func (s *CpusetGroup) Apply(d *cgroupData) error {
	// Returns the absolute cpuset cgroup path, e.g. "/sys/fs/cgroup/cpuset/system.slice/kubelet.service"
	dir, err := d.path("cpuset")
	if err != nil && !cgroups.IsNotFound(err) {
		return err
	}
	return s.ApplyDir(dir, d.config, d.pid)
}

pkg\kubelet\cm\container_manager_linux.go

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
// If manager is non-nil, move the kubelet process into the cgroup named by manager.Cgroups.Name; otherwise skip.
// Then set the kubelet's oom_score_adj to -999.
// The same function is used for dockerd: if manager is non-nil, move the dockerd process into that cgroup,
// then set dockerd's oom_score_adj to -999.
func ensureProcessInContainerWithOOMScore(pid int, oomScoreAdj int, manager *fs.Manager) error {
	if runningInHost, err := isProcessRunningInHost(pid); err != nil {
		// Err on the side of caution. Avoid moving the docker daemon unless we are able to identify its context.
		return err
	} else if !runningInHost {
		// Process is running inside a container. Don't touch that.
		klog.V(2).Infof("pid %d is not running in the host namespaces", pid)
		return nil
	}

	var errs []error
	if manager != nil {
		// Get the cgroup path of the process, e.g. /system.slice/kubelet.service for the kubelet
		cont, err := getContainer(pid)
		if err != nil {
			errs = append(errs, fmt.Errorf("failed to find container of PID %d: %v", pid, err))
		}

		// If the process's cgroup path differs from manager.Cgroups.Name (i.e. cm.KubeletCgroupsName, set via --kubelet-cgroups or the KubeletCgroups config field), create the specified cgroup path and set each subsystem's attributes
		if cont != manager.Cgroups.Name {
			// Create each cgroup subsystem directory and add the pid to that cgroup.
			// cpuset, cpu and memory get special handling; see vendor\github.com\opencontainers\runc\libcontainer\cgroups\fs\apply_raw.go
			err = manager.Apply(pid)
			if err != nil {
				errs = append(errs, fmt.Errorf("failed to move PID %d (in %q) to %q: %v", pid, cont, manager.Cgroups.Name, err))
			}
		}
	}

	// Also apply oom-score-adj to processes
	oomAdjuster := oom.NewOOMAdjuster()
	klog.V(5).Infof("attempting to apply oom_score_adj of %d to pid %d", oomScoreAdj, pid)
	// Write the oomScoreAdj value to /proc/<pid>/oom_score_adj
	if err := oomAdjuster.ApplyOOMScoreAdj(pid, oomScoreAdj); err != nil {
		klog.V(3).Infof("Failed to apply oom_score_adj %d for pid %d: %v", oomScoreAdj, pid, err)
		errs = append(errs, fmt.Errorf("failed to apply oom score %d to PID %d: %v", oomScoreAdj, pid, err))
	}
	return utilerrors.NewAggregate(errs)
}

Root cause: the kubelet’s systemd unit used ExecStartPre to create the kubelet.service cpuset cgroup directory by hand, and manually created cpuset cgroup directories have empty “cpuset.cpus” and “cpuset.mems” files. Because the --kubelet-cgroups path matched the cgroup path the kubelet process was already in (per /proc/self/cgroup), the kubelet took no action on the cpuset directory, so those files stayed empty.

In the 1.21 version of the cadvisor library, GetStats reads “cpuset.cpus” from the cpuset cgroup subsystem and returns an error when it is empty. That error left /system.slice/kubelet.service with no monitoring data in the memory cache.

The /metrics/cadvisor endpoint, in turn, needs to read the in-memory monitoring data for /system.slice/kubelet.service; without it, GetRequestedContainersInfo returns an error and the endpoint returns no container monitoring data at all.
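
After the fix, the check from the very first step can be repeated (a sketch; it assumes Prometheus is reachable at ${PROM}):

# The query from step 1 should now return samples
curl -s "${PROM}/api/v1/query?query=container_threads"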
