k8s: Fixing My Broken Cluster III
• Mark Eschbach
prometheus-operator uses a subchart for kube-state-metrics, which contains the knobs I would like to set. The question at hand is: how does one pass parameters to a subchart? From this issue it looks like I might be able to use something like kube-state-metrics.collectors.verticalpodautoscalers: false in the values. In an ideal world kube-state-metrics would not repeatedly crash when a collector is unable to scrape data, but let's give this a try.
kube-state-metrics:
  collectors:
    verticalpodautoscalers: false
    horizontalpodautoscalers: false
The YAML above appears to have done the trick: the autoscaler complaints are gone. I am preemptively disabling the horizontal autoscaler collector as well, since that has not been set up yet; perhaps a future fix.
EtcD configuration
Next up is to remove the EtcD probes. This is definitely something which should be fully monitored eventually. The following changes to the Prometheus configuration resolved the related errors.
kubeEtcd:
  enabled: false
defaultRules:
  rules:
    etcd: false
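For when EtcD monitoring does come back, here is a rough sketch of what re-enabling it might look like, based on my reading of the chart's kubeEtcd.serviceMonitor keys; the endpoint address and certificate paths are placeholders, not values from my cluster, and the etcd client certificates would also need to be mounted into the Prometheus pods.

# Hypothetical future values: re-enable etcd scraping over TLS.
# Endpoint address and certificate paths are placeholders.
kubeEtcd:
  enabled: true
  endpoints:
    - 192.168.1.10          # master node running etcd (placeholder)
  serviceMonitor:
    scheme: https
    caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
    certFile: /etc/prometheus/secrets/etcd-certs/client.crt
    keyFile: /etc/prometheus/secrets/etcd-certs/client.key
defaultRules:
  rules:
    etcd: true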
*v1.PartialObjectMetadata
E0426 22:40:03.792142 1 reflector.go:178] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
A Rancher issue seems to point towards a bad operator configuration, specifically kubevirt-operator as installed by k3os. A quick review of the kube-system and platform-prometheus namespaces did not show any particular issue. Right, I've come across this before, with a solution in progress. Unfortunately there does not appear to be a real solution besides not getting rid of operators in the first place.
Conclusion
Overall, the primary master node, which was churning between 10% and 30% CPU for kubelet, is now down to 5-11% for that workload. Influx shows the system now churning between 5% and 40% overall. There are still a few issues in the logs I would love to chase down, but the thermal returns have hit something reasonable and a mostly functioning monitoring & alerting system is in place.
The following items seemed effective:
- Patching CoreDNS to forward to the correct resolvers instead of looping back.
- Installing the metrics-server onto the cluster, configuring the service to scrape using the InternalIP and to ignore certificates without IPs in the SAN.
- Disabling the metrics collectors for the VerticalPodAutoscaler.
- Disabling the metrics for EtcD.
- Enabling the AlertManager to send messages via Slack.
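For reference, here is a consolidated sketch of the prometheus-operator values touched above, plus a rough outline of the Slack wiring from the earlier part of this series. The webhook URL and channel are placeholders, and the Alertmanager receiver layout is a minimal assumed setup rather than my exact configuration.

# Consolidated sketch of the values discussed in this series.
# Slack webhook URL and channel below are placeholders.
kube-state-metrics:
  collectors:
    verticalpodautoscalers: false
    horizontalpodautoscalers: false
kubeEtcd:
  enabled: false
defaultRules:
  rules:
    etcd: false
alertmanager:
  config:
    global:
      slack_api_url: https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK
    route:
      receiver: slack-notifications
    receivers:
      - name: slack-notifications
        slack_configs:
          - channel: '#alerts'
            send_resolved: true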