Verify an EKS Kubernetes Cluster
• Mark Eschbach
As part of a cloud migration project I am migrating a set of services into AWS's EKS. The EKS cluster has been built but has not yet been verified to work as expected. In keeping with my usual hesitation to claim something is done, I would like to verify the cluster can perform some basic operations. The following are the minimum requirements:
- Be able to push a deployment
- Have data persist between deployments for database like systems
Overall these are fairly basic requirements. Future tasks will include things like instrumenting the system, probably using Prometheus and InfluxDB.
Deployment Test
I have an irrational fear of YAML, probably because of my aversion to negative space. Plus, Terraform allows for cool things with references, calculations, etc. Really I am just trying to justify using Terraform here instead of templating with something like Helm. So here are the base Terraform resources.
resource "kubernetes_namespace" "test" {
metadata {
name = "cluster-test"
}
}
resource "kubernetes_deployment" "test" {
metadata {
name = "test"
namespace = "cluster-test"
}
spec {
replicas = 2
selector {
match_labels = {
project = "cluster-test"
}
}
template {
metadata {
labels = {
project = "cluster-test"
}
}
spec {
container {
image = "nginx:1.7.8"
name = "proxy"
resources {
limits {
cpu = "0.5"
memory = "512Mi"
}
requests {
cpu = "250m"
memory = "50Mi"
}
}
liveness_probe {
http_get {
path = "/nginx_status"
port = 80
http_header {
name = "X-Custom-Header"
value = "Awesome"
}
}
initial_delay_seconds = 3
period_seconds = 3
}
}
}
}
}
}
In theory this should create two nginx containers and verify their health. However this is not the case!
$ kubectl get deployments --namespace=cluster-test
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
test   0/2     2            0           16m
To investigate I need to drill down into the deployment. My kubectl fu is not strong enough to list only the pods belonging to the deployment, which would be pretty cool. I've heard it's possible using a label selector like -l project=test, however my brief attempts did not yield positive results.
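For what it is worth, since the pod template labels every pod with project = "cluster-test", I would expect the selector-based listing to look something like the following (a sketch I did not get working at the time):

$ kubectl get pods --namespace=cluster-test -l project=cluster-test

Failing that, let's do it the hard way: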
$ kubectl get pods --namespace=cluster-test
NAME                   READY   STATUS             RESTARTS   AGE
test-bc568c794-tsdwr   0/1     CrashLoopBackOff   9          14m
test-bc568c794-xvzdn   0/1     CrashLoopBackOff   9          14m
With the names of the pods we can get more information using the describe subcommand. This produces a lot of output; most of it isn't actually relevant right now, however it will be in the future!
$ kubectl describe pod --namespace=cluster-test test-bc568c794-tsdwr
Name:               test-bc568c794-tsdwr
Namespace:          cluster-test
Priority:           0
Node:               ip-10-1-32-78.us-west-2.compute.internal/10.1.35.178
Start Time:         Tue, 13 Aug 2019 10:01:37 -0700
Labels:             pod-template-hash=bc568c794
                    project=cluster-test
Annotations:        kubernetes.io/psp: eks.privileged
Status:             Running
IP:                 10.1.35.58
Controlled By:      ReplicaSet/test-bc568c794
Containers:
  proxy:
    Container ID:   docker://3d6a1ca6fd1c0fe9d244ee699d0c4b6c34b1b592e8ac3425b9ebe095387e90ff
    Image:          nginx:1.7.8
    Image ID:       docker-pullable://nginx@sha256:2c390758c6a4660d93467ce5e70e8d08d6e401f748bffba7885ce160ca7e481d
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 13 Aug 2019 10:18:53 -0700
      Finished:     Tue, 13 Aug 2019 10:19:02 -0700
    Ready:          False
    Restart Count:  11
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:     250m
      memory:  50Mi
    Liveness:     http-get http://:80/nginx_status delay=3s timeout=1s period=3s #success=1 #failure=3
    Environment:  <none>
    Mounts:       <none>
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:            <none>
QoS Class:          Burstable
Node-Selectors:     <none>
Tolerations:        node.kubernetes.io/not-ready:NoExecute for 300s
                    node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                    From                                                Message
  ----     ------             ----                   ----                                                -------
  Normal   Scheduled          19m                    default-scheduler                                   Successfully assigned cluster-test/test-bc568c794-tsdwr to ip-10-1-32-78.us-west-2.compute.internal
  Normal   Pulling            19m                    kubelet, ip-10-1-32-78.us-west-2.compute.internal   pulling image "nginx:1.7.8"
  Normal   Pulled             19m                    kubelet, ip-10-1-32-78.us-west-2.compute.internal   Successfully pulled image "nginx:1.7.8"
  Normal   Started            19m (x2 over 19m)      kubelet, ip-10-1-32-78.us-west-2.compute.internal   Started container
  Warning  MissingClusterDNS  19m (x8 over 19m)      kubelet, ip-10-1-32-78.us-west-2.compute.internal   pod: "test-bc568c794-tsdwr_cluster-test(0355202d-bdec-11e9-9277-064095f01a98)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
  Normal   Created            19m (x3 over 19m)      kubelet, ip-10-1-32-78.us-west-2.compute.internal   Created container
  Warning  Unhealthy          19m (x6 over 19m)      kubelet, ip-10-1-32-78.us-west-2.compute.internal   Liveness probe failed: HTTP probe failed with statuscode: 404
  Normal   Killing            19m (x2 over 19m)      kubelet, ip-10-1-32-78.us-west-2.compute.internal   Killing container with id docker://proxy:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Pulled             19m (x2 over 19m)      kubelet, ip-10-1-32-78.us-west-2.compute.internal   Container image "nginx:1.7.8" already present on machine
  Warning  BackOff            4m35s (x70 over 19m)   kubelet, ip-10-1-32-78.us-west-2.compute.internal   Back-off restarting failed container
Well, that is fairly straightforward and to the point! It looks like the primary issue is that the liveness check is returning a 404, which is treated as a failure. That makes sense for stock nginx, since the instance is not configured in any way and has no /nginx_status endpoint. Time to remove the liveness check.
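Concretely, that just means deleting the liveness_probe block from the container definition above, leaving something like this:

container {
  image = "nginx:1.7.8"
  name  = "proxy"

  resources {
    limits {
      cpu    = "0.5"
      memory = "512Mi"
    }
    requests {
      cpu    = "250m"
      memory = "50Mi"
    }
  }

  # liveness_probe removed until nginx is actually configured to serve a status endpoint
}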
According to the cluster, the containers are now live and working as expected. Time to verify that exposing them works as expected.
$ kubectl get deployments --namespace=cluster-test
NAME   READY   UP-TO-DATE   AVAILABLE   AGE
test   2/2     2            2           28m
Proxied Connection
Next up is accessing the containers through a proxied connection. I do not want to expose these on the public internet, and in fact they sit in a protected VPC, so a simple kubectl port-forward will fail when it attempts to reach out to the private IP. EKS thankfully requires authentication to access the master node, which leaves us with forwarding the master API to our local machine and using that API to forward further. This is easily done with kubectl proxy --port=8090, which exposes the API on http://localhost:8090 with proper authentication semantics. If you are running a local instance of Kubernetes you will need to use a port other than 8080. The service can now be looked up at http://localhost:8090/api/v1/namespaces/cluster-test/services/test/proxy.
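In other words, the round trip looks roughly like this, using the port and path from above (kubectl proxy stays in the foreground, so run the curl from a second terminal):

$ kubectl proxy --port=8090
$ curl http://localhost:8090/api/v1/namespaces/cluster-test/services/test/proxy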
This currently fails with:
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "no endpoints available for service \"test\"",
  "reason": "ServiceUnavailable",
  "code": 503
}
Time to dig in and find out why it cannot locate its targets! The service depends on endpoints, which actually resolve to the target pod addresses. In this case I had some incorrect settings for where those targets live. The endpoints may be inspected via:
$ kubectl describe endpoints test --namespace=cluster-test
Name:         test
Namespace:    cluster-test
Labels:       <none>
Annotations:  <none>
Subsets:
  Addresses:          10.1.35.217,10.1.36.174
  NotReadyAddresses:  <none>
  Ports:
    Name  Port  Protocol
    ----  ----  --------
    http  8080  TCP
Events:  <none>
In this case the port is wrong. I looked at various things and did a lot of trial and error, but in the end I was not able to get the service to come through. However, I did find the nginx container is exposed on tcp:8080, not tcp:80, which is definitely a step in the correct direction; the service wiring that produces that endpoint port is sketched below.
The following message is rather concerning, pointing to an improperly configured node. It leads me to believe I may have additional settings wrong, so I am heading back to the EKS Worker Node user_data script after lunch to figure out what is going on.
Warning MissingClusterDNS 9s (x9 over 6m53s) kubelet, ip-10-1-39-134.us-west-2.compute.internal pod: "test-5846f8cdb6-srs48_cluster-test(29b430c5-bdfc-11e9-9277-064095f01a98)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
Proper User Data for EKS
Given the error above, I was concerned my nodes were not properly configured. Strangely, the nodes had registered into the cluster properly, which is why I had assumed the configuration was normal.
Turns out EKS has greatly simplified the deployment of worker nodes. You are no longer required to build a complicated set of in-place sed operations; instead you just run a script. The node should have the following as part of the user_data field:
#!/bin/bash -xe
yum update -y
/etc/eks/bootstrap.sh \
  --kubelet-extra-args '--node-labels=aws.region=${var.aws_region},node.type=baseload' \
  ${aws_eks_cluster.cluster.name}
In theory the yum update command should not be needed, however I was seeing failures related to it at the time, and it is probably a good idea to update the node on boot anyway. Effectively the node will auto-configure, which is great. A sketch of how this wires into the node's launch configuration follows.
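For context, this is roughly how the user data attaches to the worker nodes in Terraform. It is a minimal sketch under my assumptions: the AMI variable, instance type, instance profile, and security group names are placeholders, not values from this cluster.

resource "aws_launch_configuration" "worker" {
  name_prefix          = "eks-worker-"
  image_id             = var.eks_worker_ami_id              # assumed: an EKS-optimized AMI
  instance_type        = "m5.large"                         # assumed instance size
  iam_instance_profile = aws_iam_instance_profile.worker.name
  security_groups      = [aws_security_group.worker.id]

  # The bootstrap script from above; the node self-configures from this on boot.
  user_data = <<-USERDATA
    #!/bin/bash -xe
    yum update -y
    /etc/eks/bootstrap.sh \
      --kubelet-extra-args '--node-labels=aws.region=${var.aws_region},node.type=baseload' \
      ${aws_eks_cluster.cluster.name}
  USERDATA

  lifecycle {
    create_before_destroy = true
  }
}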
Endpoint shows resources, service proxy fails
I had an endpoint which successfully showed resources with kubectl describe endpoints test --namespace=cluster-test, however no matter how many different methods I tried, I was unable to properly proxy the service to my local machine. Many times the request would just hang; most of the time I would receive the following:
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "no endpoints available for service \"test\"",
  "reason": "ServiceUnavailable",
  "code": 503
}
Unfortunately the user data fix didn't resolve the underlying issue with attempting to proxy into the cluster. It turns out the security group on the master ENIs needs to allow outgoing traffic to the worker nodes on the target ports. Once this security group rule was changed, the service proxy worked as expected.
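In Terraform that fix looks roughly like the following. This is a sketch under my assumptions: the security group names (cluster and worker) are placeholders, and the port should be whatever the service actually forwards to (8080 in this walkthrough).

# Allow the control plane (master ENIs) to reach the worker nodes on the service's target port.
resource "aws_security_group_rule" "cluster_to_worker_proxy" {
  type                     = "egress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.cluster.id  # assumed: SG attached to the EKS master ENIs
  source_security_group_id = aws_security_group.worker.id   # assumed: SG attached to the worker nodes
}

A matching ingress rule on the worker node security group is typically needed as well if that group restricts inbound traffic.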