Occasionally, especially under high load, I have noticed Kubernetes nodes go offline. They exhibit very strange behavior, such as terminal sessions freezing when issuing simple commands like ls or more involved ones like top. For a while I have suspected the underlying issue was an improper cgroup configuration. I have tried a few times to configure kubelet to work properly with cgroups, and things have gotten better, but I felt like I was still missing a link. Those changes fixed things like ssh sessions failing, but not the commands issued within them. At best, I had a plausible theory but nothing concrete to deal with the resource starvation.

Recently I stumbled across "Prevent resource starvation of critical System and Kubernetes Services", which at least speaks to proper configuration. Effectively, the article's essence is:

  1. Cgroups are the kernel mechanism for resource allocation, and kubelet uses them to manage pod resource usage.
  2. A properly configured Kubernetes node will have the following cgroup slices:
    • /system.slice – Created by systemd itself.
    • /podruntime.slice – Should contain kubelet and the container runtime, which is containerd for me. This must be created by the user.
    • /kubepods.slice – Where workloads are actually scheduled; it is controlled by kubelet.
  3. Kubelet should be configured with the following, as hinted at by the design document:
    • systemReservedCgroup should be /system.slice
    • kubeletCgroups should be /podruntime.slice
    • runtimeCgroups should be /podruntime.slice, since that is where the container runtime lives (/kubepods.slice is for the pods themselves)
  4. The following additional configuration needs to be in place on the host to make this viable:
    • /etc/systemd/system/podruntime.slice needs to contain the following:
      [Unit]
      Description=Limited resources slice for Kubernetes services
      Documentation=man:systemd.special(7)
      DefaultDependencies=no
      Before=slices.target
      Requires=-.slice
      After=-.slice
      
    • /etc/systemd/system/kubelet.service.d/10-cgroup.conf needs to contain the following:
      [Service]
      CPUAccounting=true
      MemoryAccounting=true
      Slice=podruntime.slice
    • /etc/systemd/system/containerd.service.d/10-cgroup.conf needs to contain the following:
      [Service]
      Slice=podruntime.slice
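The host-side files from step 4 can be sketched as a small script. This is only a rough sketch, not something taken from the article: the function name install_podruntime_units and its root-prefix argument are my own invention, added so the files can be written to a scratch directory and inspected before touching the real /etc.

```shell
#!/bin/sh
# Hypothetical helper: write the slice unit and both drop-ins under a
# root prefix. Pass "" as the prefix to target the real /etc (needs root).
install_podruntime_units() {
    root=$1
    unit_dir="$root/etc/systemd/system"
    mkdir -p "$unit_dir/kubelet.service.d" "$unit_dir/containerd.service.d"

    # The slice that will hold kubelet and containerd.
    cat > "$unit_dir/podruntime.slice" <<'EOF'
[Unit]
Description=Limited resources slice for Kubernetes services
Documentation=man:systemd.special(7)
DefaultDependencies=no
Before=slices.target
Requires=-.slice
After=-.slice
EOF

    # Drop-in moving kubelet into the slice, with accounting enabled.
    cat > "$unit_dir/kubelet.service.d/10-cgroup.conf" <<'EOF'
[Service]
CPUAccounting=true
MemoryAccounting=true
Slice=podruntime.slice
EOF

    # Drop-in moving containerd into the same slice.
    cat > "$unit_dir/containerd.service.d/10-cgroup.conf" <<'EOF'
[Service]
Slice=podruntime.slice
EOF
}

# On a real node, after running install_podruntime_units "":
#   systemctl daemon-reload
#   systemctl restart containerd kubelet
#   systemctl show -p Slice kubelet.service containerd.service
```

Restarting containerd and kubelet is disruptive, so I would try this on a drained node first. The final systemctl show should report Slice=podruntime.slice for both services if the drop-ins were picked up.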
      

Very helpful were the debugging tools, in particular systemd-cgls to list the cgroup hierarchy and the path /sys/fs/cgroup. With these I was able to verify kubelet was running under /system.slice, which is definitely not the intended slice. The files within /sys/fs/cgroup appear unit-less in most places. I stumbled across a Facebook post with a helpful reference that summarizes information from the Linux kernel. All memory units are in bytes and CPU time units are in microseconds (1000 microseconds to 1 millisecond).
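To make that check and the unit conversions repeatable, here is a minimal sketch of a few shell helpers. The names top_slice, usec_to_ms and bytes_to_mib are hypothetical, and the usage comments assume a cgroup v2 node where /proc/<pid>/cgroup holds a single line of the form 0::/path.

```shell
#!/bin/sh
# Print the top-level slice from one line of /proc/<pid>/cgroup,
# e.g. "0::/system.slice/kubelet.service" gives "system.slice".
top_slice() {
    path=${1##*:}    # drop the "id:controllers:" prefix
    path=${path#/}   # drop the leading slash
    printf '%s\n' "${path%%/*}"
}

# cgroup CPU times are in microseconds; convert to whole milliseconds.
usec_to_ms() {
    echo $(( $1 / 1000 ))
}

# cgroup memory values are in bytes; convert to whole MiB.
bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}

# Usage on a live node (assumes cgroup v2 and a single kubelet process):
#   top_slice "$(grep '^0::' "/proc/$(pidof kubelet)/cgroup")"
#   bytes_to_mib "$(cat /sys/fs/cgroup/system.slice/memory.current)"
#   systemd-cgls --no-pager
```

If top_slice prints system.slice for kubelet, the service is still in the default slice rather than /podruntime.slice.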

Next steps

It looks like I need to move the pod runtime, both kubelet and containerd, to run under /podruntime.slice and increase the reserved resources for the base system daemons. I am also considering the ramifications of providing reasonable minimums for /system.slice, so I can still log into the box when kubelet dies off.