Troubleshooting missing container metrics in kubelet

Background: there are two Kubernetes versions in play, 1.18.20 and 1.21.8. kubelet is started by systemd with identical startup parameters on both. The v1.21 cluster has no container-related monitoring data, while v1.18 does. The operating system is CentOS 7.

Symptom: in the 1.21 cluster, the Grafana dashboards for pods show no data for container CPU, memory, and so on.

  1. Querying Prometheus for container-related metrics (for example container_threads) returns no data at all.

  2. The kubelet log contains the following errors:

    E0120 05:16:28.045934  660625 cadvisor_stats_provider.go:415] "Partial failure issuing cadvisor.ContainerInfoV2" err="partial failures: [\"/system.slice/kubelet.service\": RecentStats:unable to find data in memory cache]"
    E0120 05:16:28.046002  660625 summary_sys_containers.go:82] "Failed to get system container stats" err="failed to get cgroup stats for \"/system.slice/kubelet.service\": failed to get container info for \"/system.slice/kubelet.service\": partial failures: [\"/system.slice/kubelet.service\": RecentStats: unable to find data in memory cache]" containerName="/system.slice/kubelet.service"
    
  3. Removing --kubelet-cgroups from the systemd startup parameters made no difference: the same errors appeared and there were still no container metrics.

    The original unit file:

    [Unit]
    Description=Kubernetes Kubelet
    Documentation=https://github.com/GoogleCloudPlatform/kubernetes
    Wants=docker.service containerd.service
    After=docker.service
    
    [Service]
    WorkingDirectory=/data/kubernetes/kubelet
    ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice/kubelet.service
    ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/hugetlb/system.slice/kubelet.service
    ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/memory/system.slice/kubelet.service
    ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/pids/system.slice/kubelet.service
    ExecStartPre=/bin/rm -rf /data/kubernetes/kubelet/cpu_manager_state
    #TasksAccounting=true
    #CPUAccounting=true
    #MemoryAccounting=true
    ExecStart=/usr/local/bin/kubelet \
      --node-status-update-frequency=10s \
      --anonymous-auth=false \
      --authentication-token-webhook \
      --authorization-mode=Webhook \
      --client-ca-file=/etc/kubernetes/ssl/ca.pem \
      --cluster-dns=10.61.128.2 \
      --cluster-domain=cluster.local \
      --cpu-manager-policy=static \
      --network-plugin=kubenet \
      --cloud-provider=external \
      --non-masquerade-cidr=0.0.0.0/0 \
      --hairpin-mode hairpin-veth \
      --cni-bin-dir=/opt/cni/bin \
      --cni-conf-dir=/etc/cni/net.d \
      --hostname-override=10.60.64.49 \
      --kubeconfig=/etc/kubernetes/kubelet.kubeconfig \
      --max-pods=125 \
      --pod-infra-container-image=xxx.com/k8s-google-containers/pause-amd64:3.2 \
      --register-node=true \
      --register-with-taints=install=unfinished:NoSchedule \
      --root-dir=/data/kubernetes/kubelet \
      --tls-cert-file=/etc/kubernetes/ssl/kubelet.pem \
      --tls-private-key-file=/etc/kubernetes/ssl/kubelet-key.pem \
      --cgroups-per-qos=true \
      --cgroup-driver=systemd \
      --enforce-node-allocatable=pods \
      --kubelet-cgroups=/system.slice/kubelet.service \
      --kube-reserved-cgroup=/system.slice \
      --system-reserved-cgroup=/system.slice \
      --kube-reserved=cpu=200m,memory=1Gi,ephemeral-storage=1Gi \
      --system-reserved=cpu=200m,memory=1Gi,ephemeral-storage=1Gi \
      --eviction-hard=nodefs.available<10%,nodefs.inodesFree<2%,memory.available<1Gi \
      --v=2 \
      --eviction-soft=nodefs.available<15%,nodefs.inodesFree<5%,memory.available<2Gi \
      --eviction-soft-grace-period=memory.available=1m,nodefs.available=1m,nodefs.inodesFree=1m \
      --eviction-max-pod-grace-period=120 \
      --serialize-image-pulls=false \
      --log-dir=/data/log/kubernetes/ \
      --logtostderr=false
    Restart=always
    RestartSec=5
    
    [Install]
    WantedBy=multi-user.target
    
  4. Searching Google turned up a related issue; I had in fact run into this problem before (the same problem, twice 😂).

    Back then, following the workaround in the issue, I set --kubelet-cgroups=/systemd/system.slice/kubelet.service. Later, reading the source, I saw that when --kubelet-cgroups is set, kubelet moves its own process into that cgroup. Worried that having this cgroup path differ from /system.slice/kubelet.service, the path systemd creates for the kubelet service, might cause problems, I changed --kubelet-cgroups back in the install template, and running with that parameter on 1.18 was fine.

This time the workaround was to set --kubelet-cgroups to /fix-kubelet-no-cadvisor-monitor.service and restart kubelet.

Previously I only wanted the problem gone; although I read the related code afterwards, I never dug into the details or really understood the mechanism. So this time I spent some time investigating why this happens:

  1. Searched the logs again carefully for related entries
  2. Read the source code pointed to by the error messages
  3. Reproduced the problem with the log level raised to 6

This turned up the following clues:

  1. There is a warning line in the kubelet log:

    W0120 04:32:48.944092  660625 container.go:586] Failed to update stats for container "/system.slice/kubelet.service": /sys/fs/cgroup/cpuset/system.slice/kubelet.service/cpuset.cpus found to be empty, continuing to push stats
    
  2. On the 1.18 machines, /sys/fs/cgroup/cpuset/system.slice/kubelet.service/cpuset.cpus is empty as well.

  3. On both the 1.18 and the 1.21 machines, the /sys/fs/cgroup/cpuset/system.slice/ directory contains only a kubelet.service directory and no docker.service directory.

These clues raise three questions:

  1. Why does 1.21 log this warning while 1.18 does not?
  2. Why are there no container metrics?
  3. Why does /sys/fs/cgroup/cpuset/system.slice/ contain only the kubelet service's cgroup?

Some background first. kubelet embeds the cadvisor code to collect container monitoring data. cadvisor only watches the directory hierarchy under each cgroup root /sys/fs/cgroup/{subsystem}, and there are four factories ("containerd", "systemd", "raw", "docker") that can handle these cgroups. The raw factory handles every path under /, /system.slice/kubelet.service, /system.slice/docker.service and /kubepods.slice (paths are relative to each cgroup subsystem). For every directory level, cadvisor starts a housekeeping goroutine that periodically updates that container's monitoring data.
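
To make this concrete, below is a minimal sketch of the two ideas just described: the raw factory's prefix whitelist and the per-directory housekeeping loop. This is my own illustration, not the cadvisor implementation; the names and the 10s interval are made up for the example.

package main

import (
	"fmt"
	"strings"
	"time"
)

// Mirrors the raw factory whitelist mentioned above.
var rawPrefixWhiteList = []string{
	"/kubepods.slice",
	"/system.slice/kubelet.service",
	"/system.slice/docker.service",
}

// accepted reproduces the prefix check the raw factory applies to a cgroup path.
func accepted(path string) bool {
	if path == "/" {
		return true
	}
	for _, prefix := range rawPrefixWhiteList {
		if strings.HasPrefix(path, prefix) {
			return true
		}
	}
	return false
}

func main() {
	paths := []string{"/", "/kubepods.slice", "/system.slice/kubelet.service", "/system.slice/sshd.service"}
	for _, p := range paths {
		if !accepted(p) {
			fmt.Printf("%s: ignored\n", p)
			continue
		}
		// One housekeeping goroutine per accepted cgroup directory.
		go func(path string) {
			for range time.Tick(10 * time.Second) {
				// In cadvisor this is where GetStats() runs and the result is pushed
				// into the in-memory cache; an error here means no data for this path.
				fmt.Printf("housekeeping tick for %s\n", path)
			}
		}(p)
	}
	time.Sleep(25 * time.Second) // let a couple of ticks fire in this sketch
}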

Logs from cgroup handling:

I0120 04:32:48.926910  660625 factory.go:220] Factory "containerd" was unable to handle container "/kubepods.slice/kubepods-besteffort.slice"
I0120 04:32:48.926917  660625 factory.go:45] /kubepods.slice/kubepods-besteffort.slice not handled by systemd handler
I0120 04:32:48.926919  660625 factory.go:220] Factory "systemd" was unable to handle container "/kubepods.slice/kubepods-besteffort.slice"
I0120 04:32:48.926923  660625 factory.go:220] Factory "docker" was unable to handle container "/kubepods.slice/kubepods-besteffort.slice"
I0120 04:32:48.926927  660625 factory.go:216] Using factory "raw" for container "/kubepods.slice/kubepods-besteffort.slice"
I0120 04:32:48.926989  660625 container.go:527] Start housekeeping for container "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod993e5f25_9257_43f8_80da_db8185e7338c.slice/docker-3328875e5f0215a060e04e9ecd5dc33c88b21713cde00a4cf7899562125d4e48.scope"
I0120 04:32:48.927134  660625 manager.go:987] Added container: "/kubepods.slice/kubepods-besteffort.slice" (aliases: [], namespace: "")
I0120 04:32:48.927293  660625 handler.go:325] Added event &{/kubepods.slice/kubepods-besteffort.slice 2021-12-28 09:48:34.799412935 +0000 UTC containerCreation {<nil>}}
I0120 04:32:48.927341  660625 container.go:527] Start housekeeping for container "/kubepods.slice/kubepods-besteffort.slice"
I0120 04:32:48.943313  660625 factory.go:220] Factory "containerd" was unable to handle container "/system.slice/kubelet.service"
I0120 04:32:48.943318  660625 factory.go:45] /system.slice/kubelet.service not handled by systemd handler
I0120 04:32:48.943320  660625 factory.go:220] Factory "systemd" was unable to handle container "/system.slice/kubelet.service"
I0120 04:32:48.943324  660625 factory.go:220] Factory "docker" was unable to handle container "/system.slice/kubelet.service"
I0120 04:32:48.943329  660625 factory.go:216] Using factory "raw" for container "/system.slice/kubelet.service"
I0120 04:32:48.943639  660625 manager.go:987] Added container: "/system.slice/kubelet.service" (aliases: [], namespace: "")
I0120 04:32:48.943886  660625 handler.go:325] Added event &{/system.slice/kubelet.service 2021-12-28 09:48:34.672413802 +0000 UTC containerCreation {<nil>}}

vendor\github.com\google\cadvisor\manager\manager.go

func (m *manager) createContainerLocked(containerName string, watchSource watcher.ContainerWatchSource) error {
	namespacedName := namespacedContainerName{
		Name: containerName,
	}

	// Check that the container didn't already exist.
	if _, ok := m.containers[namespacedName]; ok {
		return nil
	}

	handler, accept, err := container.NewContainerHandler(containerName, watchSource, m.inHostNamespace)
	if err != nil {
		return err
	}
	if !accept {
		// ignoring this container.
		klog.V(4).Infof("ignoring container %q", containerName)
		return nil
	}
	collectorManager, err := collector.NewCollectorManager()
	if err != nil {
		return err
	}

	logUsage := *logCadvisorUsage && containerName == m.cadvisorContainer
	cont, err := newContainerData(containerName, m.memoryCache, handler, logUsage, collectorManager, m.maxHousekeepingInterval, m.allowDynamicHousekeeping, clock.RealClock{})
	if err != nil {
		return err
	}
	....
	// Start the container's housekeeping.
	return cont.Start()

vendor\github.com\google\cadvisor\container\factory.go

// Create a new ContainerHandler for the specified container.
func NewContainerHandler(name string, watchType watcher.ContainerWatchSource, inHostNamespace bool) (ContainerHandler, bool, error) {
	factoriesLock.RLock()
	defer factoriesLock.RUnlock()

	// Create the ContainerHandler with the first factory that supports it.
	for _, factory := range factories[watchType] {
		canHandle, canAccept, err := factory.CanHandleAndAccept(name)
		if err != nil {
			klog.V(4).Infof("Error trying to work out if we can handle %s: %v", name, err)
		}
		if canHandle {
			if !canAccept {
				klog.V(3).Infof("Factory %q can handle container %q, but ignoring.", factory, name)
				return nil, false, nil
			}
			klog.V(3).Infof("Using factory %q for container %q", factory, name)
			handle, err := factory.NewContainerHandler(name, inHostNamespace)
			return handle, canAccept, err
		}
		klog.V(4).Infof("Factory %q was unable to handle container %q", factory, name)
	}

	return nil, false, fmt.Errorf("no known factory can handle creation of container")
}

The raw factory's decision logic:

vendor\github.com\google\cadvisor\container\raw\factory.go

// The raw factory can handle any container. If --docker_only is set to true, non-docker containers are ignored except for "/" and those whitelisted by raw_cgroup_prefix_whitelist flag.
func (f *rawFactory) CanHandleAndAccept(name string) (bool, bool, error) {
	if name == "/" {
		return true, true, nil
	}
	if *dockerOnly && f.rawPrefixWhiteList[0] == "" {
		return true, false, nil
	}
	for _, prefix := range f.rawPrefixWhiteList { // here f.rawPrefixWhiteList is []string{"/kubepods.slice", "/system.slice/kubelet.service", "/system.slice/docker.service"}
		if strings.HasPrefix(name, prefix) {
			return true, true, nil
		}
	}
	return true, false, nil
}

This warning says that, during the housekeeping cycle for /system.slice/kubelet.service, collecting the monitoring data of its cpuset cgroup subsystem failed.

The log line W0120 04:32:48.944092 660625 container.go:586] Failed to update stats for container "/system.slice/kubelet.service": /sys/fs/cgroup/cpuset/system.slice/kubelet.service/cpuset.cpus found to be empty, continuing to push stats corresponds to this code in vendor\github.com\google\cadvisor\manager\container.go:

func (cd *containerData) housekeepingTick(timer <-chan time.Time, longHousekeeping time.Duration) bool {
	select {
	case <-cd.stop:
		// Stop housekeeping when signaled.
		return false
	case finishedChan := <-cd.onDemandChan:
		// notify the calling function once housekeeping has completed
		defer close(finishedChan)
	case <-timer:
	}
	start := cd.clock.Now()
	err := cd.updateStats()
	if err != nil {
		if cd.allowErrorLogging() {
			klog.Warningf("Failed to update stats for container \"%s\": %s", cd.info.Name, err)
		}
	}

cd.updateStats() calls the handler's GetStats(). Here we only look at the handler for /system.slice/kubelet.service, the rawContainerHandler. GetStats() does the following:

  1. Collects the container's CPU, memory, hugetlb, pids and blkio stats, network send/receive counters, the number of processes, the total number of FDs across those processes, the number of sockets among those FDs, the thread count and thread limit, and filesystem status (such as disk size, free space, inode count and free inodes).

  2. Adds the container stats to the InMemoryCache.

  3. Updates the aggregated data in summaryReader, producing per-minute, per-hour and per-day aggregates.

The warning in the log means cd.handler.GetStats() returned an error. updateStats() then returns right away, so there is no monitoring data for /system.slice/kubelet.service in memory (the code that would save the stats into the memory cache never runs).

vendor\github.com\google\cadvisor\manager\container.go

func (cd *containerData) updateStats() error {
	stats, statsErr := cd.handler.GetStats()
	if statsErr != nil {
		// Ignore errors if the container is dead.
		if !cd.handler.Exists() {
			return nil
		}

		// Stats may be partially populated, push those before we return an error.
		statsErr = fmt.Errorf("%v, continuing to push stats", statsErr)
	}
	if stats == nil {
		return statsErr
	}

The error ultimately comes from (s *CpusetGroup) GetStats:

vendor\github.com\opencontainers\runc\libcontainer\cgroups\fs\cpuset.go

func (s *CpusetGroup) GetStats(path string, stats *cgroups.Stats) error {
	var err error

	stats.CPUSetStats.CPUs, err = getCpusetStat(path, "cpuset.cpus")
	if err != nil && !errors.Is(err, os.ErrNotExist) {
		return err
	}
	.....
}

func getCpusetStat(path string, filename string) ([]uint16, error) {
	var extracted []uint16
	fileContent, err := fscommon.GetCgroupParamString(path, filename)
	if err != nil {
		return extracted, err
	}
	if len(fileContent) == 0 {
		return extracted, fmt.Errorf("%s found to be empty", filepath.Join(path, filename))
	}

In 1.18, by contrast, nothing is read from the cpuset subsystem at all:

vendor\github.com\opencontainers\runc\libcontainer\cgroups\fs\cpuset.go

func (s *CpusetGroup) GetStats(path string, stats *cgroups.Stats) error {
	return nil
}

This change was introduced in 1.21 by the commit that upgraded cadvisor to v0.39.0.

kubelet's /metrics/cadvisor endpoint exposes the container monitoring data. When c.infoProvider.GetRequestedContainersInfo("/", c.opts) returns an error, the code returns immediately and never reaches the Prometheus metric registration below it, so /metrics/cadvisor returns no container metrics.

Related logs:

W0120 04:33:03.423637  660625 prometheus.go:1856] Couldn't get containers: partial failures: ["/system.slice/kubelet.service": containerDataToContainerInfo: unable to find data in memory cache]
I0120 04:33:03.423895  660625 httplog.go:94] "HTTP" verb="GET" URI="/metrics/cadvisor" latency="24.970131ms" userAgent="Prometheus/2.27.1" srcIP="10.61.1.10:35224" resp=200

Related code:

vendor\github.com\google\cadvisor\metrics\prometheus.go

func (c *PrometheusCollector) collectContainersInfo(ch chan<- prometheus.Metric) {
	containers, err := c.infoProvider.GetRequestedContainersInfo("/", c.opts)
	if err != nil {
		c.errors.Set(1)
		klog.Warningf("Couldn't get containers: %s", err)
		return
	}
	rawLabels := map[string]struct{}{}
	for _, container := range containers {
		for l := range c.containerLabelsFunc(container) {
			rawLabels[l] = struct{}{}
		}
	}
    ....
    // Container spec
		desc := prometheus.NewDesc("container_start_time_seconds", "Start time of the container since unix epoch in seconds.", labels, nil)
		ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, float64(cont.Spec.CreationTime.Unix()), values...)

		if cont.Spec.HasCpu {
			desc = prometheus.NewDesc("container_spec_cpu_period", "CPU period of the container.", labels, nil)
			ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, float64(cont.Spec.Cpu.Period), values...)
			if cont.Spec.Cpu.Quota != 0 {
				desc = prometheus.NewDesc("container_spec_cpu_quota", "CPU quota of the container.", labels, nil)
				ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, float64(cont.Spec.Cpu.Quota), values...)
			}
			desc := prometheus.NewDesc("container_spec_cpu_shares", "CPU share of the container.", labels, nil)
			ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, float64(cont.Spec.Cpu.Limit), values...)

		}
    ....

The error originates here:

vendor\github.com\google\cadvisor\manager\manager.go

func (m *manager) GetRequestedContainersInfo(containerName string, options v2.RequestOptions) (map[string]*info.ContainerInfo, error) {
	containers, err := m.getRequestedContainers(containerName, options)
	if err != nil {
		return nil, err
	}
	var errs partialFailure
	containersMap := make(map[string]*info.ContainerInfo)
	query := info.ContainerInfoRequest{
		NumStats: options.Count,
	}
	for name, data := range containers {
		info, err := m.containerDataToContainerInfo(data, &query)
		if err != nil {
			errs.append(name, "containerDataToContainerInfo", err)
		}
		containersMap[name] = info
	}
	return containersMap, errs.OrNil()
}

func (m *manager) containerDataToContainerInfo(cont *containerData, query *info.ContainerInfoRequest) (*info.ContainerInfo, error) {
	// Get the info from the container.
	cinfo, err := cont.GetInfo(true)
	if err != nil {
		return nil, err
	}

	stats, err := m.memoryCache.RecentStats(cinfo.Name, query.Start, query.End, query.NumStats)
	if err != nil {
		return nil, err
	}

	// Make a copy of the info for the user.
	ret := &info.ContainerInfo{
		ContainerReference: cinfo.ContainerReference,
		Subcontainers:      cinfo.Subcontainers,
		Spec:               m.getAdjustedSpec(cinfo),
		Stats:              stats,
	}
	return ret, nil
}

Because the empty /sys/fs/cgroup/cpuset/system.slice/kubelet.service/cpuset.cpus causes an error, there is no monitoring data for /system.slice/kubelet.service in memory, so RecentStats finds nothing:

// ErrDataNotFound is the error resulting if failed to find a container in memory cache.
var ErrDataNotFound = errors.New("unable to find data in memory cache")

func (c *InMemoryCache) RecentStats(name string, start, end time.Time, maxStats int) ([]*info.ContainerStats, error) {
	var cstore *containerCache
	var ok bool
	err := func() error {
		c.lock.RLock()
		defer c.lock.RUnlock()
		if cstore, ok = c.containerCacheMap[name]; !ok {
			return ErrDataNotFound
		}
		return nil
	}()
	if err != nil {
		return nil, err
	}

	return cstore.RecentStats(start, end, maxStats)
}

At first I suspected that some special setting in the kubelet.service unit file was causing the /sys/fs/cgroup/cpuset/system.slice/kubelet.service directory to be created.

But CentOS 7 ships systemd 219, which does not support the cpuset cgroup controller, and after comparing with the docker.service unit file I realized it was the ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice/kubelet.service line that created the /sys/fs/cgroup/cpuset/system.slice/kubelet.service directory.

Commenting out this line on the machine used to reproduce the problem and restarting kubelet made the problem go away.
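
For a quick check before and after such a change, the small program below (my own helper, not from kubelet or cadvisor) reproduces the test that the cadvisor v0.39.0 cpuset handler trips over: it reads cpuset.cpus for the kubelet cgroup and reports whether it is empty. The path assumes cgroup v1, as on the CentOS 7 nodes above.

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	const f = "/sys/fs/cgroup/cpuset/system.slice/kubelet.service/cpuset.cpus"
	data, err := os.ReadFile(f)
	if err != nil {
		// If the directory no longer exists (ExecStartPre removed and the node rebooted), there is nothing to check.
		fmt.Println("read error:", err)
		return
	}
	if strings.TrimSpace(string(data)) == "" {
		fmt.Println(f, "is empty: cadvisor >= v0.39.0 will fail to collect stats for this cgroup")
	} else {
		fmt.Println(f, "=", strings.TrimSpace(string(data)))
	}
}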

With ExecStartPre removed, I ran the following tests:

| Configuration | 1.18 | 1.21 |
| --- | --- | --- |
| --kubelet-cgroups not set | no /system.slice/kubelet.service under the cpuset hierarchy | no /system.slice/kubelet.service under the cpuset hierarchy |
| --kubelet-cgroups=/system.slice/kubelet.service | no /system.slice/kubelet.service under the cpuset hierarchy | no /system.slice/kubelet.service under the cpuset hierarchy |
| --kubelet-cgroups=/fix-kubelet-no-cadvisor-monitor.service | /fix-kubelet-no-cadvisor-monitor.service exists under the cpuset hierarchy and cpuset.cpus is not empty | /fix-kubelet-no-cadvisor-monitor.service exists under the cpuset hierarchy and cpuset.cpus is not empty |

When the path configured via --kubelet-cgroups matches the kubelet process's own cgroup path (the value in /proc/self/cgroup), kubelet takes no action.

When they differ, kubelet creates the corresponding directory under each cgroup subsystem and adds its own pid to that cgroup; in that case the cpuset.cpus and cpuset.mems values are inherited from the parent directories, ultimately from /sys/fs/cgroup/cpuset/cpuset.cpus.
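
As an aside, the "kubelet process's cgroup path" compared above can be inspected via /proc/self/cgroup. The sketch below is illustrative only (kubelet uses its own getContainer helper for this) and assumes cgroup v1 formatting.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/self/cgroup")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// cgroup v1 lines look like: "7:cpuset:/system.slice/kubelet.service"
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		parts := strings.SplitN(scanner.Text(), ":", 3)
		if len(parts) == 3 {
			fmt.Printf("subsystem=%-20s path=%s\n", parts[1], parts[2])
		}
	}
}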

The relevant code:

This is the 1.18 code:

vendor\github.com\opencontainers\runc\libcontainer\cgroups\fs\cpuset.go

// Recursively create dir's parent directories.
// If the "cpuset.cpus" or "cpuset.mems" file along that path is empty, fill it with the value from its parent directory.
// Create the dir directory itself.
// If CpusetCpus and CpusetMems are set in the cgroup config, write them to "cpuset.cpus" and "cpuset.mems" under dir.
// If the "cpuset.cpus" or "cpuset.mems" file under the path is empty, fill it with the parent directory's value.
// Write (overwrite) pid into the "cgroup.procs" file under dir, retrying up to 5 times.
func (s *CpusetGroup) ApplyDir(dir string, cgroup *configs.Cgroup, pid int) error {
	// This might happen if we have no cpuset cgroup mounted.
	// Just do nothing and don't fail.
	if dir == "" {
		return nil
	}
	mountInfo, err := ioutil.ReadFile("/proc/self/mountinfo")
	if err != nil {
		return err
	}
	// Find the parent directory of the longest mount point in mountinfo that matches dir
	root := filepath.Dir(cgroups.GetClosestMountpointAncestor(dir, string(mountInfo)))
	// 'ensureParent' start with parent because we don't want to
	// explicitly inherit from parent, it could conflict with
	// 'cpuset.cpu_exclusive'.
	// Recursively create dir's parent directories.
	// If the "cpuset.cpus" or "cpuset.mems" file along that path is empty, fill it with the value from its parent directory.
	if err := s.ensureParent(filepath.Dir(dir), root); err != nil {
		return err
	}
	if err := os.MkdirAll(dir, 0755); err != nil {
		return err
	}
	// We didn't inherit cpuset configs from parent, but we have
	// to ensure cpuset configs are set before moving task into the
	// cgroup.
	// The logic is, if user specified cpuset configs, use these
	// specified configs, otherwise, inherit from parent. This makes
	// cpuset configs work correctly with 'cpuset.cpu_exclusive', and
	// keep backward compatibility.
	// If CpusetCpus and CpusetMems are set in the cgroup config, write them to "cpuset.cpus" and "cpuset.mems" under dir.
	// If the "cpuset.cpus" or "cpuset.mems" file under the path is empty, fill it with the parent directory's value.
	if err := s.ensureCpusAndMems(dir, cgroup); err != nil {
		return err
	}

	// because we are not using d.join we need to place the pid into the procs file
	// unlike the other subsystems
	// Write (overwrite) pid into the "cgroup.procs" file under dir, retrying up to 5 times.
	// If "cpuset.cpus" or "cpuset.mems" under dir is empty, writing to "cgroup.procs" fails with "write error: No space left on device".
	return cgroups.WriteCgroupProc(dir, pid)
}

func (s *CpusetGroup) Apply(d *cgroupData) error {
	// Returns the absolute cpuset cgroup path, e.g. "/sys/fs/cgroup/cpuset/system.slice/kubelet.service"
	dir, err := d.path("cpuset")
	if err != nil && !cgroups.IsNotFound(err) {
		return err
	}
	return s.ApplyDir(dir, d.config, d.pid)
}

pkg\kubelet\cm\container_manager_linux.go

// If manager is not nil, move the kubelet process into the cgroup named by manager.Cgroups.Name; otherwise skip that step.
// Then set kubelet's oom_score_adj to -999.
// The same function is used for dockerd: if manager is not nil, move the dockerd process into manager.Cgroups.Name,
// then set dockerd's oom_score_adj to -999.
func ensureProcessInContainerWithOOMScore(pid int, oomScoreAdj int, manager *fs.Manager) error {
	if runningInHost, err := isProcessRunningInHost(pid); err != nil {
		// Err on the side of caution. Avoid moving the docker daemon unless we are able to identify its context.
		return err
	} else if !runningInHost {
		// Process is running inside a container. Don't touch that.
		klog.V(2).Infof("pid %d is not running in the host namespaces", pid)
		return nil
	}

	var errs []error
	if manager != nil {
		// Get the kubelet process's cgroup path, e.g. /system.slice/kubelet.service
		cont, err := getContainer(pid)
		if err != nil {
			errs = append(errs, fmt.Errorf("failed to find container of PID %d: %v", pid, err))
		}

		// If the kubelet process's cgroup path differs from manager.Cgroups.Name (i.e. cm.KubeletCgroupsName, set via --kubelet-cgroups or the KubeletCgroups config field), create the specified cgroup path and set each subsystem's attributes
		if cont != manager.Cgroups.Name {
			// Create the directory under each cgroup subsystem and add the pid to that cgroup.
			// cpuset, cpu and memory get special handling; see vendor\github.com\opencontainers\runc\libcontainer\cgroups\fs\apply_raw.go
			err = manager.Apply(pid)
			if err != nil {
				errs = append(errs, fmt.Errorf("failed to move PID %d (in %q) to %q: %v", pid, cont, manager.Cgroups.Name, err))
			}
		}
	}

	// Also apply oom-score-adj to processes
	oomAdjuster := oom.NewOOMAdjuster()
	klog.V(5).Infof("attempting to apply oom_score_adj of %d to pid %d", oomScoreAdj, pid)
	// Write oomScoreAdj to /proc/<pid>/oom_score_adj
	if err := oomAdjuster.ApplyOOMScoreAdj(pid, oomScoreAdj); err != nil {
		klog.V(3).Infof("Failed to apply oom_score_adj %d for pid %d: %v", oomScoreAdj, pid, err)
		errs = append(errs, fmt.Errorf("failed to apply oom score %d to PID %d: %v", oomScoreAdj, pid, err))
	}
	return utilerrors.NewAggregate(errs)
}

To summarize: the kubelet systemd unit uses ExecStartPre to create kubelet.service's cpuset cgroup directory, and a cpuset cgroup directory created by hand has empty "cpuset.cpus" and "cpuset.mems" files. Moreover, since the path configured via --kubelet-cgroups matches the kubelet process's cgroup path (the value in /proc/self/cgroup), kubelet does nothing to the cpuset directory.

The cadvisor library vendored into 1.21 reads "cpuset.cpus" from the cpuset cgroup subsystem and returns an error when it is empty, so there is no monitoring data for /system.slice/kubelet.service in memory.

The /metrics/cadvisor endpoint needs that in-memory data for /system.slice/kubelet.service; since it is missing, the collection errors out and returns early, and no container monitoring data is exposed.
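
To confirm the fix, one can scrape /metrics/cadvisor directly and count container_* series. The sketch below is my own verification helper, not part of the original troubleshooting; the node address, the port (10250 is the kubelet default) and the service-account token path are assumptions to adapt to your environment.

package main

import (
	"bufio"
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"strings"
)

func main() {
	// Assumed token source when running inside a pod that is allowed to access the kubelet API.
	token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		panic(err)
	}

	// kubelet serves /metrics/cadvisor on its authenticated HTTPS port (10250 by default).
	req, err := http.NewRequest("GET", "https://127.0.0.1:10250/metrics/cadvisor", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(string(token)))

	client := &http.Client{Transport: &http.Transport{
		// Skipping certificate verification is acceptable for a one-off local check.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	count := 0
	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "container_") {
			count++
		}
	}
	fmt.Printf("status=%d container_* samples=%d\n", resp.StatusCode, count)
}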
