k8s: Fixing my broken cluster
• Mark Eschbach
At home my Kubernetes cluster runs on bare metal. Tools have definitely come a long way since 2015, when you had to bootstrap a cluster using an etcd HTTP service in the cloud. However, there is still room for improvement and for mishaps.
Prometheus & Alertmanager
One area I had struggled with was getting the Prometheus Operator running. The cluster itself was running well enough, however there was a slew of errors which I was not experienced enough to handle, nor did I have the time available. I was excited last night when, after several months, a few Kubernetes upgrades, and a refreshed pair of eyes, I got it going. Also helping was the need for thermal management of my office: as the weather has been turning we no longer wish to generate excess heat, and the constantly restarting pods kept the CPU load between 20% and 60%.
A system which generates alerts is great. Having those alerts go somewhere actionable is even better. After reviewing and toying around with several examples I settled on this configuration:
global:
  resolve_timeout: 5m
receivers:
  - name: "null"
  - name: "slack"
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/<identifier>'
        icon_url: https://avatars3.githubusercontent.com/u/3380462
        send_resolved: true
        channel: '#monitoring'
        title: '[:] Monitoring Event Notification'
        text: >-
          *Alert:* - ``
          *Description:*
          *Graph:* <|:chart_with_upwards_trend:> *Runbook:* <|:spiral_note_pad:>
          *Details:*
          • *:* ``
route:
  group_by:
    - job
  group_interval: 5m
  group_wait: 30s
  receiver: "null"
  repeat_interval: 12h
  routes:
    - match:
        alertname: Watchdog
      receiver: "null"
    - match:
      receiver: 'slack'
      continue: true
Definitely still room for improvement. The template language seems to be evolving and several examples just did not work. For now I am both excited by the alerts and sad there is still more work before it is fully awesome. The cluster is definitely stable, however there are plenty of things to make it better.
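One way to wire the configuration above into the cluster, if the Prometheus stack was installed from the stable/prometheus-operator Helm chart, is the chart's alertmanager.config value, which the chart renders into the secret mounted by the Alertmanager pod. A minimal sketch, assuming a Helm 3 release named prometheus-operator in the platform-prometheus namespace (the names that appear elsewhere in this post):

# values.yaml (sketch) - the chart renders alertmanager.config into the
# secret consumed by the Alertmanager pod.
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    receivers:
      - name: "null"
      - name: "slack"
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/<identifier>'
            channel: '#monitoring'
    route:
      receiver: "null"
      # ... remaining routes as shown above

Applying it is then something like helm upgrade prometheus-operator stable/prometheus-operator --namespace platform-prometheus -f values.yaml.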
CoreDNS
One of the first intelligible alerts I received was the following:
• alertname: KubeDeploymentReplicasMismatch
• deployment: coredns
• endpoint: http
• instance: 172.31.0.72:8080
• job: kube-state-metrics
• namespace: kube-system
• pod: prometheus-operator-kube-state-metrics-8467c5bdbc-fwczn
• prometheus: platform-prometheus/prometheus-operator-prometheus
• service: prometheus-operator-kube-state-metrics
• severity: critical
This alert indicates a deployment is not spinning up to the desired capacity for some reason. kubectl tree is an awesome plugin to kubectl which allows one to display the tree of resources owned by a specific resource.
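If the plugin is not installed yet, it is distributed through krew, so (assuming krew itself is already set up) installation is a one-liner:

:> kubectl krew install tree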
:> kubectl tree deployment coredns -n kube-system
NAMESPACE NAME READY REASON AGE
kube-system Deployment/coredns - 182d
kube-system ├─ReplicaSet/coredns-5644d7b6d9 - 182d
kube-system ├─ReplicaSet/coredns-66bff467f8 - 3h57m
kube-system │ ├─Pod/coredns-66bff467f8-fzjz2 True 3h57m
kube-system │ └─Pod/coredns-66bff467f8-k2g6n False ContainersNotReady 3h57m
kube-system └─ReplicaSet/coredns-6955765f44 - 129d
Much better than running multiple kubectl commands to trace all of the resources. As shown above, the primary problem is that one of the containers is not ready. The logs shown by kubectl logs coredns-66bff467f8-k2g6n -n kube-system point out a loop in the DNS configuration.
CoreDNS-1.6.7
linux/amd64, go1.13.6, da7f65b
[FATAL] plugin/loop: Loop (127.0.0.1:39386 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 2267741465582667364.4598883878041902408."
One of the machines in my cluster also operates as a DNS server for the network. On that machine the DNS clients are configured to query loopback. The referenced page offers several methods to fix this issue. I would like local clients to continue using that service, so this leaves the option of figuring out how to change the Kubernetes configuration instead.
The directive referenced in the possible solutions may also use multiple DNS servers, which is easy enough to fix. Many answers on StackOverflow seem to be incorrect, suggesting you disable the loop detection mechanism entirely! However there is a gem within the rough: kubectl edit cm coredns -n kube-system. Afterwards you will need to kill the malfunctioning pod or roll out a new deployment to pick up the changes.
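For illustration, inside that ConfigMap the change boils down to pointing the forward plugin at an explicit upstream instead of the node's /etc/resolv.conf, which on this machine resolves to the loopback DNS server. A trimmed sketch of what the Corefile ends up looking like, with 8.8.8.8 standing in as a hypothetical upstream resolver:

.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    # The kubeadm default is "forward . /etc/resolv.conf"; on a node whose
    # resolv.conf points at 127.0.0.1 that creates the loop detected above,
    # so forward to an explicit upstream instead.
    forward . 8.8.8.8
    cache 30
    loop
    reload
    loadbalance
}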
Prometheus being reborn
Despite the many steps forward, after several hours the CPU load has climbed on the other machine in the cluster, with messages similar to the original being spewed everywhere. For some reason some Prometheus pods are being recreated often. There are no log messages which indicate why these pods are being restarted or at what point they fail, however they are being restarted often.
Apr 24 16:07:53 kal kubelet[27916]: W0424 16:07:53.954532 27916 volume_linux.go:49] Setting volume ownership for /var/lib/kubelet/pods/ad9a0ed5-f5b6-420b-9ae7-595757c5e23f/volumes/kubernetes.io~secret/prometheus-operator-kube-state-metrics-token-6h8x2 and fsGroup set. If the
volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
After a bit of digging through the logs I was not really able to find any details as to why a set of containers was repeatedly exiting without cause or reason. Issuing kubectl get pods -w -n platform-prometheus shows a steady state of the pods with no terminations. When scrolling down to the bottom of the stable/prometheus-operator README, in the frustration of thinking I might need to configure additional persistent volumes, I noticed the KubeProxy instructions. Perhaps this is what I missed?
A quick kubectl -n kube-system edit configmap kube-proxy to check shows I had overlooked this step. Easy enough to restart the proxies: a kubectl get pods -n kube-system to find them and a kubectl delete pod kube-proxy-{5xk94,m5jmg} -n kube-system. This definitely reduced the CPU load. Perhaps the issue is processes dying off in a way that is not entirely visible? Or dying quicker than can be logged to the clients?
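For reference, the step I had overlooked: the chart's README asks for kube-proxy's metrics endpoint to be reachable, and kubeadm binds it to loopback by default, so Prometheus can never scrape it. The edit lives in the config.conf key of the kube-proxy ConfigMap and, assuming the standard kubeadm layout, the relevant portion looks roughly like this:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# kubeadm defaults this to 127.0.0.1:10249, which is unreachable from the
# Prometheus pod; binding on all interfaces lets the scrape succeed.
metricsBindAddress: 0.0.0.0:10249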
Now there are fewer entries in the logs about file system changes. Additionally, kubelet CPU usage is significantly down, only spiking up to 30%.
metrics-server
After searching around for a bit I noticed the Nodes dashboard did not load. Despite the claims at kube-state-metrics, I was hoping maybe there was some sort of dependency? kubeadm does not install and configure metrics-server out of the box, which in many deployment scenarios is quite reasonable I would imagine.
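A quick way to confirm the Metrics API is missing is to ask for node metrics; the exact error text varies by kubectl version, but without a provider for metrics.k8s.io the command fails:

:> kubectl top nodes
# fails until metrics-server (or another Metrics API provider) is installed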
So what does it aggregate? The Official Document is not very helpful beyond cluster information. metrics-server extends the API server and extracts data from each kubelet, centralizing it in a single metrics-server pod. Maybe everything will just work if I deploy it?
Taking a look at the repository I am not sure how to deploy this. Yay for a helm chart. The metrics-server pod crashes :-(.
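A sketch of the chart invocation, assuming Helm 3 and the stable/metrics-server chart that was current at the time (release name and namespace are arbitrary choices):

:> helm install metrics-server stable/metrics-server --namespace kube-system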
The crashes are a result of attempting to connect to the nodes via their short names. The names do have mappings on my network, however I am not entirely sure the FQDNs are properly set up for the system. Given the cluster has an additional DNS suffix, which FQDN the system is trying to resolve is a bit confusing.
Others seem to have a similar problem. metrics-server supports five different possible methods to locate a node:
- Hostname
- InternalDNS
- InternalIP
- ExternalDNS
- ExternalIP
Hostname is the default and does not work. InternalDNS & ExternalDNS seem to exhibit the same problem. InternalIP and ExternalIP produce complaints regarding the SAN not containing the IP address of the target machine. The certificate is another Yak to Shave for the future. To get around the SAN issue for now one may use the --kubelet-insecure-tls flag along with InternalIP.
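If installing through the chart, these land in its args value; a minimal sketch of a values override, assuming the stable/metrics-server chart's args list (the flags themselves are metrics-server's own):

# metrics-server-values.yaml (sketch)
args:
  # Prefer each node's InternalIP over its Hostname when talking to kubelets.
  - --kubelet-preferred-address-types=InternalIP
  # Skip kubelet serving-certificate verification until the SAN issue is sorted.
  - --kubelet-insecure-tls

Rolled out with something like helm upgrade metrics-server stable/metrics-server --namespace kube-system -f metrics-server-values.yaml.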
More to do
The cluster is still not a fully functional battle station, however there is much less churn and the platform is more stable now than it was. I am glad for the tour through the internals of the metrics pipelines and the various components within the cluster. Producing less heat in my office as the weather turns is the ultimate goal though!