Verifying Prometheus
• Mark Eschbach
Prometheus is installed into the cluster and the pods look like they are running. Time to check it out, starting with the alerts. A simple kubectl port-forward service/prometheus-operator-prometheus --namespace platform-monitoring 9090 with a browser pointed at http://localhost:9090, then going to Status on the top navbar, then Targets, yields a whole bunch of randomly unhealthy targets. After some investigation I found I had forgotten to list the outbound rules allowing all traffic to transit to the other worker nodes.
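For reference, the missing piece is an egress rule letting the worker node security group reach itself. A rough sketch in AWS CLI terms (the group ID is a placeholder and the real change belongs in whatever tool provisions the cluster):

# Placeholder ID: sg-0123456789abcdef0 stands in for the worker node security group.
# Allow all outbound traffic from worker nodes to other members of the same group.
aws ec2 authorize-security-group-egress \
    --group-id sg-0123456789abcdef0 \
    --ip-permissions 'IpProtocol=-1,UserIdGroupPairs=[{GroupId=sg-0123456789abcdef0}]'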
The alertmanager for Prometheus is still unhealthy. At this time I am uncertain how to force it to restart; it did not gracefully recover after several minutes. helm delete prometheus-operator && helm install stable/prometheus-operator --name prometheus-operator --namespace platform-monitoring fails due to various leftover CustomResourceDefinitions.
The following removed the conflicting set of items:
# Delete the CRDs left behind by the previous prometheus-operator release
leftover_crds="alertmanagers.monitoring.coreos.com podmonitors.monitoring.coreos.com prometheuses.monitoring.coreos.com prometheusrules.monitoring.coreos.com servicemonitors.monitoring.coreos.com"
for name in $leftover_crds
do
    kubectl delete crd "$name"
done
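If the leftover set differs between chart versions, something like the following should list whatever is actually present before deleting anything:

# Quick check: which CoreOS monitoring CRDs are still hanging around?
kubectl get crd -o name | grep monitoring.coreos.com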
Alright! Now the helm install works. The only item left in the unhealthy targets is platform-monitoring/prometheus-operator-kube-proxy/0.
This target set is attempting to communicate with software running on the actual hosts. Looks like port 10249 is the host metrics data, and it is probably important we have that information. Adding the metrics-port option should fix the issue, however I am interested in why EKS does not automatically add that option. There was an EKS guide which specifically talks about configuration options to pass to the Helm chart for Prometheus, specifically --set alertmanager.persistentVolume.storageClass="gp2",server.persistentVolume.storageClass="gp2", which looks reasonable. gp2 is a general purpose SSD.
Well, no luck in finding details there. Time to continue to move forward.
Aug 15 17:36:28 ip-10-1-33-125.us-west-2.compute.internal kubelet[9685]: F0815 17:36:28.070008 9685 server.go:147] unknown flag: --metrics-port
Well, a step in the correct direction! Odd that kubelet does not show as a failed unit within systemd :-/. Looking at the output there is no reference to metrics or port 10249, which is a bit annoying. The option metrics-bind-address also does not work. I have been assuming that kubelet, or at least some other component in Kubernetes, is responsible for exporting that service. I might have a fundamental misunderstanding of what that service is.
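For reference, the checks on the node look roughly like this (assuming the stock EKS AMI, where kubelet runs as a systemd unit):

# kubelet does not show up as a failed unit...
systemctl status kubelet
# ...and its log output has no mention of metrics or port 10249
journalctl -u kubelet --no-pager | grep -Ei 'metrics|10249'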
So kube-proxy is a service to manage network forwarding. I have verified there is a kube-proxy container running on a target host. According to docker inspect the configuration is stored at /var/lib/kube-proxy-config/config. It looks like it is controlled by a configuration mapping which is bound into the container. cat-ing the file I can confirm it is bound on 10249, however I think I see the problem: the metrics binding address is loopback, 127.0.0.1. The documentation says this is the default so I am wondering if I missed a step somewhere along the way.
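Reconstructing the node-side poking, it went roughly like this (the exact commands are approximate; the container name and mount path are what I saw on this AMI and may differ elsewhere):

# Find the kube-proxy container running on the node
docker ps --filter name=kube-proxy
# Check where its configuration is mounted from
docker inspect "$(docker ps -q --filter name=kube-proxy | head -n 1)" | grep -i kube-proxy-config
# Confirm the metrics bind address in the mounted file
grep -i metricsBindAddress /var/lib/kube-proxy-config/config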
Turns out a template-like thing exists in kube-system under the configmaps as kube-proxy-config. I am wondering what changes should be made here, perhaps just patching the metricsBindAddress. Turns out I did miss a step buried all the way at the bottom of the long README.
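Before patching anything, a quick look at the current value confirms the loopback binding (a one-line check of my own, assuming kubectl is pointed at the cluster):

kubectl get configmap kube-proxy-config --namespace kube-system -o yaml | grep -i metricsBindAddress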
So the best I came up with to update the configmap for kube-proxy-config was the following:
# Patch to listen on all interfaces
kubectl patch configmap/kube-proxy-config \
    -n kube-system \
    --type merge \
    --patch "$(kubectl get -o yaml configmap kube-proxy-config --namespace kube-system | sed 's/127.0.0.1/0.0.0.0/g')"
# Force kube-proxy to be reloaded
kubectl get pods --namespace=kube-system | awk '{print $1}' | grep "kube-proxy" | xargs -n 1 kubectl delete --namespace "kube-system" pod
The first pipeline replaces the references to exporting data on the localhost interface. However, at least in K8S 1.13, an update to this configmap does not cause the kube-proxy services to restart. The second pipeline retrieves the system pods, grabs only the names, further refines the results to kube-proxy, then finally deletes each pod one at a time. If I was doing this often I would probably search for a way to have xargs do this in parallel (see the sketch below), however that may be a future improvement. It took about 15 minutes for the pods to be removed and bounce back.
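For the curious, the parallel variation would probably look like the following; xargs -P runs the deletions concurrently, though I have not tried this against the cluster:

# Untested sketch: delete up to four kube-proxy pods at a time
kubectl get pods --namespace=kube-system | awk '{print $1}' | grep "kube-proxy" \
    | xargs -P 4 -n 1 kubectl delete --namespace "kube-system" pod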
Now all indicators are good in Prometheus. alertmanager shows some complaints regarding cluster issues due to my inexperience with the system. For what it is worth, comparing the time stamps and converting from UTC to local shows these occurred earlier, before I made a number of changes.
Next up: do new nodes join in a healthy state? After terminating everything in the group, the nodes come back correctly. Terminating a single node also generates reasonable results. Looking good so far! Tearing down and recreating the Prometheus installation results in a reasonable installation.
Time to push it!