CGroups, Kubernetes and Reliable Nodes
• Mark Eschbach
Occasionally, especially under high load, I have noticed Kubernetes nodes go offline. They exhibit very strange
behavior, such as terminal sessions freezing when issuing simple commands like `ls` or even more complicated ones.
For a while I suspected the underlying issue was related to improper cgroup configuration. I have tried a few
times to configure `kubelet` to work properly with cgroups, and things have gotten better, but I felt like I was still
missing a link. Those attempts resolved things like `ssh` sessions failing, but not the commands issued within them
freezing. At best, I have a cogent argument but nothing better to deal with the resource starvation.
Recently I stumbled across "Prevent resource starvation of critical System and Kubernetes Services", which at least speaks to proper configuration. In essence, the article says:
- CGroups are the kernel mechanism for resource allocation, which `kubelet` uses to manage pod resource usage.
- A properly configured Kubernetes node will have the following cgroup slices:
  - `/system.slice` – Created by systemd itself.
  - `/podruntime.slice` – Should contain `kubelet` and the container runtime environment, which is `containerd` for me. This must be created by the user.
  - `/kubepods.slice` – Where workloads are actually scheduled; controlled by `kubelet`.
- Kubelet should be configured with the following as hinted at by the design document
- The following additional configuration needs to occur on your host to make this viable:
  - `/etc/systemd/system/podruntime.slice` needs to contain the following:
    ```toml
    [Unit]
    Description=Limited resources slice for Kubernetes services
    Documentation=man:systemd.special(7)
    DefaultDependencies=no
    Before=slices.target
    Requires=-.slice
    After=-.slice
    ```
  - `/etc/systemd/system/kubelet.service.d/10-cgroup.conf` needs to contain the following:
    ```toml
    [Service]
    CPUAccounting=true
    MemoryAccounting=true
    Slice=podruntime.slice
    ```
  - `/etc/systemd/system/containerd.service.d/10-cgroup.conf` needs to contain the following:
    ```toml
    [Service]
    Slice=podruntime.slice
    ```
The debugging tools were very helpful, in particular `systemd-cgls`, which lists the cgroup hierarchy, and the path
`/sys/fs/cgroup`. Here I was able to verify `kubelet` was running under `/system.slice`, which is definitely not the
intended slice. The files within `/sys/fs/cgroup` appear unit-less in most places. I stumbled across a Facebook
post with a helpful reference which is a summary of information from the Linux Kernel.
All memory units are in bytes and CPU units are in microseconds (1000 microseconds to 1 millisecond).
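
Since the raw files carry no units, a small helper can make them easier to read. The following is a minimal sketch, assuming a cgroup v2 host where `memory.max` holds a byte count (or the word `max`) and `cpu.max` holds a quota and a period in microseconds. The function names are my own for illustration and are not part of any tool mentioned above.

```python
from typing import Optional


def parse_memory_max(text: str) -> Optional[int]:
    """Return the memory limit in bytes, or None when the file says 'max'."""
    value = text.strip()
    return None if value == "max" else int(value)


def parse_cpu_max(text: str) -> Optional[float]:
    """Return the CPU limit as a number of cores, or None when unlimited.

    Assumes the cgroup v2 format '<quota> <period>', both in microseconds.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def format_bytes(count: int) -> str:
    """Render a byte count using binary units up to GiB."""
    units = ["B", "KiB", "MiB", "GiB"]
    size = float(count)
    index = 0
    while size >= 1024 and index < len(units) - 1:
        size /= 1024
        index += 1
    return f"{size:.1f} {units[index]}"
```

For example, passing the contents of `/sys/fs/cgroup/system.slice/memory.max` to `parse_memory_max` and then to `format_bytes` should give a readable limit, provided that file exists on your host.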
It looks like I need to move the Pod runtime, including `containerd`, to work under the pod runtime slice described
above, and increase the reserved resources for the base system daemons. I am also considering the ramifications of
providing reasonable minimums for `/system.slice`, so I can still log into the box when `kubelet` dies off.
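
To confirm a move like this took effect, one option is to read `/proc/<pid>/cgroup` for the process in question. Below is a minimal sketch, assuming the usual `hierarchy-ID:controller-list:cgroup-path` line format of that file. The helper names are hypothetical, and the PID has to come from elsewhere (for instance `pidof kubelet`).

```python
from pathlib import Path
from typing import List


def cgroup_paths(proc_cgroup_text: str) -> List[str]:
    """Extract the distinct cgroup paths from the text of /proc/<pid>/cgroup."""
    paths: List[str] = []
    for line in proc_cgroup_text.splitlines():
        # Each line looks like 'hierarchy-ID:controller-list:cgroup-path'.
        parts = line.split(":", 2)
        if len(parts) == 3 and parts[2] not in paths:
            paths.append(parts[2])
    return paths


def top_level_slice(cgroup_path: str) -> str:
    """Return the first path component, e.g. 'system.slice'."""
    components = [part for part in cgroup_path.split("/") if part]
    return components[0] if components else "/"


def slices_of_pid(pid: int) -> List[str]:
    """Report the top-level slice(s) a running process belongs to."""
    text = Path(f"/proc/{pid}/cgroup").read_text()
    return [top_level_slice(path) for path in cgroup_paths(text)]
```

If `slices_of_pid` still reports `system.slice` for the `kubelet` process after a restart, the drop-in file was probably not picked up.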