k8s: Fixing My Broken Cluster III
• Mark Eschbach
prometheus-operator uses a subchart for kube-state-metrics, which contains the knobs I would like to set. The question at hand is how one passes parameters to a subchart. From this issue it looks like I might be able to use something like kube-state-metrics.collectors.verticalpodautoscalers: false in the values. In an ideal world kube-state-metrics would not repeatedly crash if a collector was unable to scrape data, but let’s give this a try.
kube-state-metrics:
  collectors:
    verticalpodautoscalers: false
    horizontalpodautoscalers: false
The YAML above appears to have done the trick in getting rid of the autoscaler complaints. I preemptively disabled the horizontal autoscaler collector as well, since autoscaling has not been set up yet; perhaps a future fix.
Next up is to remove the etcd probes. This is definitely something which should be fully monitored. The following changes to the Prometheus configuration resolved the errors related to this.
kubeEtcd:
  enabled: false
defaultRules:
  rules:
    etcd: false
E0426 22:40:03.792142 1 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
A Rancher issue seems to point towards a bad operator configuration: kubevirt-operator, as installed by k3os. A quick review of the namespaces did not show any particular issue. Right, I’ve come across this before, with a solution in progress.
Unfortunately there does not really appear to be a solution besides getting rid of operators.
Overall, the primary master node, where kubelet was churning between 10% and 30% of CPU, is now down to 5-11% for that workload. Influx shows the system as a whole churning between 5% and 40% now. There are still a few issues in the log I would love to chase down, but the thermal returns have hit something reasonable and a mostly functioning monitoring & alerting system is in place.
The following items seemed effective:
- Patching CoreDNS to forward to the correct resolvers instead of looping back.
- Installing the metrics-server onto the cluster, configuring the service to scrape using the InternalIP and ignore the certificates without IPs in the
- Disable the metrics collectors for the
- Disable the metrics for
- Enable the AlertManager to send messages via Slack.
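For reference, the first two items might look roughly like the sketch below. This is not the exact configuration applied here. The resolver addresses are placeholders from the documentation range, the Corefile is a trimmed-down default that should be merged with whatever the existing ConfigMap already contains, and I am assuming the certificate problem was handled with the --kubelet-insecure-tls flag, which skips verification of the kubelet serving certificates entirely.

```yaml
# Sketch only: merge with the existing coredns ConfigMap rather than replacing it.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns          # assumed name; check `kubectl -n kube-system get configmap`
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        # Forward to explicit upstream resolvers instead of /etc/resolv.conf,
        # which on the node may point back at CoreDNS and create a loop.
        # 192.0.2.53 and 192.0.2.54 are placeholder addresses.
        forward . 192.0.2.53 192.0.2.54
        cache 30
        loop
        reload
    }
---
# Fragment of the metrics-server Deployment (container args only),
# written in the shape of a strategic merge patch.
spec:
  template:
    spec:
      containers:
        - name: metrics-server
          args:
            # Reach each kubelet via the node's InternalIP address.
            - --kubelet-preferred-address-types=InternalIP
            # Assumption: skip kubelet certificate verification altogether.
            - --kubelet-insecure-tls
```

Skipping TLS verification is a trade-off that is tolerable on a home or lab cluster; issuing kubelet serving certificates that include the node IPs would be the stricter fix.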