k8s: Fixing my broken cluster
• Mark Eschbach
At home my Kubernetes cluster runs on bare metal. Tools have definitely come a long way since 2015, when you had to bootstrap a cluster using an etcd HTTP service in the cloud. However, there is still room for improvement and for mishaps.
Prometheus & Alertmanager
One area I had struggled with was getting the Prometheus Operator running. The cluster itself was running well enough, however there was a slew of errors which I was not experienced enough to handle, nor did I have the time available. I was excited last night when, after several months, a few Kubernetes upgrades, and a refreshed pair of eyes, I got it going. Also helping was the need for thermal management of my office: as the weather has been turning we no longer wish to generate excess heat, and the constantly restarting pods kept the CPU load between 20% and 60%.
A system which generates alerts is great. Having those alerts go somewhere actionable is even better. After reviewing and toying around with several examples I settled on this configuration:
global:
  resolve_timeout: 5m
receivers:
  - name: "null"
  - name: "slack"
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/<identifier>'
        icon_url: https://avatars3.githubusercontent.com/u/3380462
        send_resolved: true
        channel: '#monitoring'
        title: '[:] Monitoring Event Notification'
        text: >-
          *Alert:* - ``
          *Description:*
          *Graph:* <|:chart_with_upwards_trend:> *Runbook:* <|:spiral_note_pad:>
          *Details:*
          • *:* ``
route:
  group_by:
    - job
  group_interval: 5m
  group_wait: 30s
  receiver: "null"
  repeat_interval: 12h
  routes:
    - match:
        alertname: Watchdog
      receiver: "null"
    - match:
      receiver: 'slack'
      continue: true
Definitely still room for improvement. The template language seems to be evolving and several examples just did not work. For now I am both excited by the alerts and sad there is still more work before it is fully awesome. The cluster is definitely stable, however there are plenty of things to make it better.
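One way to wire the configuration above into the cluster, if the Prometheus stack was installed from the stable/prometheus-operator Helm chart, is the chart's alertmanager.config value, which the chart renders into the secret mounted by the Alertmanager pod. A minimal sketch, assuming a Helm 3 release named prometheus-operator in the platform-prometheus namespace (the names that appear elsewhere in this post):

# values.yaml (sketch) - the chart renders alertmanager.config into the
# secret consumed by the Alertmanager pod.
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    receivers:
      - name: "null"
      - name: "slack"
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/<identifier>'
            channel: '#monitoring'
    route:
      receiver: "null"
      # ... remaining routes as shown above

Applying it is then something like helm upgrade prometheus-operator stable/prometheus-operator --namespace platform-prometheus -f values.yaml.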
CoreDNS
One of the first intelligible alerts I received was the following:
• alertname: KubeDeploymentReplicasMismatch
• deployment: coredns
• endpoint: http
• instance: 172.31.0.72:8080
• job: kube-state-metrics
• namespace: kube-system
• pod: prometheus-operator-kube-state-metrics-8467c5bdbc-fwczn
• prometheus: platform-prometheus/prometheus-operator-prometheus
• service: prometheus-operator-kube-state-metrics
• severity: critical
This alert indicates a deployment is not spinning up to the desired capacity for some reason. kubectl tree is an awesome plugin to kubectl which allows one to display the tree of resources owned by a specific resource.
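If the plugin is not installed yet, it is distributed through krew, so (assuming krew itself is already set up) installation is a one-liner:

:> kubectl krew install tree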
:> kubectl tree deployment coredns -n kube-system
NAMESPACE NAME READY REASON AGE
kube-system Deployment/coredns - 182d
kube-system ├─ReplicaSet/coredns-5644d7b6d9 - 182d
kube-system ├─ReplicaSet/coredns-66bff467f8 - 3h57m
kube-system │ ├─Pod/coredns-66bff467f8-fzjz2 True 3h57m
kube-system │ └─Pod/coredns-66bff467f8-k2g6n False ContainersNotReady 3h57m
kube-system └─ReplicaSet/coredns-6955765f44 - 129d
Much better than running multiple kubectl commands to trace all of the resources. As shown above, the primary problem is that one of the containers is not ready. The logs shown by kubectl logs coredns-66bff467f8-k2g6n -n kube-system point out a loop in the DNS configuration.
CoreDNS-1.6.7
linux/amd64, go1.13.6, da7f65b
[FATAL] plugin/loop: Loop (127.0.0.1:39386 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 2267741465582667364.4598883878041902408."
One of the machines in my cluster also operates as a DNS server for the network. On that machine the DNS clients are configured to query loopback. The referenced page offers several methods to fix this issue. I would like local clients to continue using that service, so this leaves the option of figuring out how to change the Kubernetes configuration instead.
The directive referenced in the possible solutions may also use multiple DNS servers, which is easy enough to fix. Many answers on StackOverflow seem to be incorrect, suggesting you disable the loop detection mechanism entirely! However there is a gem within the rough: kubectl edit cm coredns -n kube-system. Afterwards you will need to kill the malfunctioning pod or roll out a new deployment to pick up the changes.
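For illustration, inside that ConfigMap the change boils down to pointing the forward plugin at an explicit upstream instead of the node's /etc/resolv.conf, which on this machine resolves to the loopback DNS server. A trimmed sketch of what the Corefile ends up looking like, with 8.8.8.8 standing in as a hypothetical upstream resolver:

.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    # The kubeadm default is "forward . /etc/resolv.conf"; on a node whose
    # resolv.conf points at 127.0.0.1 that creates the loop detected above,
    # so forward to an explicit upstream instead.
    forward . 8.8.8.8
    cache 30
    loop
    reload
    loadbalance
}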
Prometheus being reborn
Despite the many steps forward, after several hours the CPU load has climbed on the other machine in the cluster, with messages similar to the original being spewed everywhere. For some reason some Prometheus pods are being recreated often. There are no log messages which indicate why these pods are being restarted or at what point they fail, however they are being restarted often.
Apr 24 16:07:53 kal kubelet[27916]: W0424 16:07:53.954532 27916 volume_linux.go:49] Setting volume ownership for /var/lib/kubelet/pods/ad9a0ed5-f5b6-420b-9ae7-595757c5e23f/volumes/kubernetes.io~secret/prometheus-operator-kube-state-metrics-token-6h8x2 and fsGroup set. If the
volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699
After a bit of digging through the logs I was not really able to find any details as to why a set of containers was repeatedly exiting without cause or reason. Issuing kubectl get pods -w -n platform-prometheus shows a steady state of the pods with no terminations. When scrolling down to the bottom of the stable/prometheus-operator README, in the frustration of thinking I might need to configure additional persistent volumes, I noticed the KubeProxy instructions. Perhaps this is what I missed?
A quick kubectl -n kube-system edit configmap kube-proxy to check shows I had overlooked this step. Easy enough to restart the proxies: a kubectl get pods -n kube-system to find them and a kubectl delete pod kube-proxy-{5xk94,m5jmg} -n kube-system. This definitely reduced the CPU load. Perhaps the issue is processes dying off in a way that is not entirely visible? Or dying quicker than can be logged to the clients?
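For reference, the step I had overlooked: the chart's README asks for kube-proxy's metrics endpoint to be reachable, and kubeadm binds it to loopback by default, so Prometheus can never scrape it. The edit lives in the config.conf key of the kube-proxy ConfigMap and, assuming the standard kubeadm layout, the relevant portion looks roughly like this:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# kubeadm defaults this to 127.0.0.1:10249, which is unreachable from the
# Prometheus pod; binding on all interfaces lets the scrape succeed.
metricsBindAddress: 0.0.0.0:10249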
Now there are fewer entries in the logs about file system changes. Additionally, kubelet CPU usage is significantly down, only spiking up to 30%.
metrics-server
After searching around for a bit I noticed the Nodes dashboard did not load. Despite the claims at kube-state-metrics, I was hoping maybe there was some sort of dependency? kubeadm does not install and configure metrics-server out of the box, which in many deployment scenarios is quite reasonable I would imagine.
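A quick way to confirm the Metrics API is missing is to ask for node metrics; the exact error text varies by kubectl version, but without a provider for metrics.k8s.io the command fails:

:> kubectl top nodes
# fails until metrics-server (or another Metrics API provider) is installed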
So what does it aggregate? The Official Document is not very helpful beyond cluster information. metrics-server extends the API server and extracts data from each kubelet, centralizing it in a single metrics-server pod. Maybe everything will just work if I deploy it?
Taking a look at the repository I am not sure how to deploy this. Yay for a helm chart. The metrics-server pod crashes :-(.
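A sketch of the chart invocation, assuming Helm 3 and the stable/metrics-server chart that was current at the time (release name and namespace are arbitrary choices):

:> helm install metrics-server stable/metrics-server --namespace kube-system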
The crashes are a result of attempting to connect to the nodes via their short names. The names do have mappings on my network, however I am not entirely sure the FQDNs are properly set up for the system. Given the cluster has an additional DNS suffix, which FQDN the system is trying to resolve is a bit confusing.
Others seem to have a similar problem. metrics-server supports five different possible methods to locate a node:
- Hostname
- InternalDNS
- InternalIP
- ExternalDNS
- ExternalIP
Hostname is the default and does not work. InternalDNS & ExternalDNS seem to exhibit the same problem. InternalIP and ExternalIP produce complaints regarding the SAN not containing the IP address of the target machine. The certificate is another Yak to Shave for the future. To get around the SAN issue for now one may use the --kubelet-insecure-tls flag along with InternalIP.
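If installing through the chart, these land in its args value; a minimal sketch of a values override, assuming the stable/metrics-server chart's args list (the flags themselves are metrics-server's own):

# metrics-server-values.yaml (sketch)
args:
  # Prefer each node's InternalIP over its Hostname when talking to kubelets.
  - --kubelet-preferred-address-types=InternalIP
  # Skip kubelet serving-certificate verification until the SAN issue is sorted.
  - --kubelet-insecure-tls

Rolled out with something like helm upgrade metrics-server stable/metrics-server --namespace kube-system -f metrics-server-values.yaml.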
More to do
The cluster is still not a fully functional battle station, however there is much less churn and the platform is more stable now than it was. I am glad for the tour through the internals of the metrics pipelines and the various components within the cluster. Producing less heat in my office as the weather turns is the ultimate goal though!