prometheus-operator uses a subchart for kube-state-metrics, which contains the knobs I would like to set. The question at hand is how to pass parameters to a subchart. From this issue it looks like values nested under the subchart's name should work, something like kube-state-metrics.collectors.verticalpodautoscalers: false in the values file. In an ideal world kube-state-metrics would not repeatedly crash when a collector is unable to scrape data, but let's give this a try.

kube-state-metrics:
  collectors:
    verticalpodautoscalers: false
    horizontalpodautoscalers: false

The YAML above appears to have done the trick: the autoscaler complaints are gone. I preemptively disabled the horizontal pod autoscaler collector as well, since HPA has not been set up yet; perhaps a future fix.

etcd configuration

Next up is removing the etcd probes. Ideally etcd would be fully monitored eventually; for now, the following changes to the Prometheus configuration resolved the related errors.

kubeEtcd:
  enabled: false
defaultRules:
  rules:
    etcd: false

*v1.PartialObjectMetadata

E0426 22:40:03.792142       1 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource

A Rancher issue seems to point towards a bad operator configuration, specifically kubevirt-operator as installed by k3os. A quick review of the kube-system and platform-prometheus namespaces did not show any particular problem. Right, I have come across this before, and a fix is still in progress. Unfortunately there does not appear to be a real solution yet, other than not removing operators in the first place.

Conclusion

Overall, the primary master node, where kubelet was churning between 10%-30% CPU, is now down to 5%-11% for that workload. Influx shows the system as a whole churning between 5%-40% now. There are still a few issues in the logs I would love to chase down, but the thermals have returned to something reasonable and a mostly functioning monitoring & alerting system is in place.

The following items seemed effective:

  • Patching CoreDNS to forward to the correct resolvers instead of looping back.
  • Installing metrics-server onto the cluster, configuring it to scrape kubelets via the InternalIP and to skip verifying certificates without IPs in the SAN (see the first sketch after this list).
  • Disabling the kube-state-metrics collectors for the VerticalPodAutoscaler and HorizontalPodAutoscaler.
  • Disabling the metrics and rules for etcd.
  • Enabling the AlertManager to send messages via Slack (see the second sketch after this list).
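
For reference, the metrics-server change boils down to two container arguments. This is only a sketch of the relevant fragment of the metrics-server Deployment (spec.template.spec.containers[0]); the surrounding manifest is omitted and the exact layout depends on how metrics-server was installed.

args:
  - --kubelet-preferred-address-types=InternalIP  # contact kubelets via the node InternalIP
  - --kubelet-insecure-tls                        # skip verifying kubelet serving certs that lack IP SANs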
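
The Slack wiring for AlertManager went through the chart values as well. A minimal sketch is below, assuming the stable/prometheus-operator chart's alertmanager.config key (layout may differ between chart versions); the webhook URL and channel are placeholders.

alertmanager:
  config:
    global:
      slack_api_url: https://hooks.slack.com/services/REPLACE/ME
    route:
      receiver: slack-notifications
    receivers:
      - name: slack-notifications
        slack_configs:
          - channel: '#alerts'
            send_resolved: true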