As part of a cloud migration project I am migrating a set of services into AWS’s EKS. The EKS cluster has been built but not yet verified to work as expected. In keeping with my usual hesitation to claim something is done, I would like to verify the cluster can perform some basic operations. The following are the minimum requirements:

  1. Be able to push a deployment
  2. Have data persist between deployments for database-like systems

Overall these are fairly basic requirements. Future tasks will include things like instrumenting the system, probably using Prometheus and InfluxDB.

Deployment Test

I have an irrational fear of YAML, probably because of my aversion to negative space. Plus, Terraform allows for cool things with references, calculations, etc. Really I am just trying to justify using Terraform here instead of templating with something like Helm. So here are the base Terraform resources.

resource "kubernetes_namespace" "test" {
  metadata {
    name = "cluster-test"
  }
}

resource "kubernetes_deployment" "test" {
  metadata {
    name      = "test"
    namespace = "cluster-test"
  }

  spec {
    replicas = 2

    selector {
      match_labels = {
        project = "cluster-test"
      }
    }

    template {
      metadata {
        labels = {
          project = "cluster-test"
        }
      }

      spec {
        container {
          image = "nginx:1.7.8"
          name  = "proxy"

          resources {
            limits {
              cpu    = "0.5"
              memory = "512Mi"
            }

            requests {
              cpu    = "250m"
              memory = "50Mi"
            }
          }

          liveness_probe {
            http_get {
              path = "/nginx_status"
              port = 80

              http_header {
                name  = "X-Custom-Header"
                value = "Awesome"
              }
            }

            initial_delay_seconds = 3
            period_seconds        = 3
          }
        }
      }
    }
  }
}

In theory this should create two nginx containers and verify their health. However, this is not the case!

$ kubectl get deployments --namespace=cluster-test
test   0/2     2            0           16m

To investigate I need to drill down into the deployment. My kubectl-fu is not strong enough to list the pods belonging to the deployment directly, which would be pretty cool. I’ve heard it’s possible using a label selector such as -l project=test, however my brief attempts did not yield positive results (possibly because the label defined above is project=cluster-test, not project=test). Let’s do it the hard way:

$ kubectl get pods --namespace=cluster-test
NAME                   READY   STATUS             RESTARTS   AGE
test-bc568c794-tsdwr   0/1     CrashLoopBackOff   9          14m
test-bc568c794-xvzdn   0/1     CrashLoopBackOff   9          14m
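
For the record, the label-selector route would look roughly like the sketch below. It assumes the pods carry the project = "cluster-test" label from the template in the Terraform above; pods_by_label is a hypothetical helper name, not something kubectl provides.

```shell
# List the pods behind a deployment by label instead of by name.
# $1 = namespace, $2 = label selector (assumed to match the pod template labels).
# pods_by_label is a hypothetical helper name used only for illustration.
pods_by_label() {
  kubectl get pods --namespace="$1" -l "$2"
}

# Example invocation:
#   pods_by_label cluster-test project=cluster-test
```

If the selector returns nothing, the labels on the running pods are worth checking with kubectl get pods --show-labels before assuming the selector syntax is at fault.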

With the names of the pods we can get some information using the describe subcommand. This will produce a lot of output. Most of it isn’t relevant right now, but it will be in the future!

$ kubectl describe pod --namespace=cluster-test test-bc568c794-tsdwr
Name:           test-bc568c794-tsdwr
Namespace:      cluster-test
Priority:       0
Start Time:     Tue, 13 Aug 2019 10:01:37 -0700
Labels:         pod-template-hash=bc568c794
Annotations: eks.privileged
Status:         Running
Controlled By:  ReplicaSet/test-bc568c794
    Container ID:   docker://3d6a1ca6fd1c0fe9d244ee699d0c4b6c34b1b592e8ac3425b9ebe095387e90ff
    Image:          nginx:1.7.8
    Image ID:       docker-pullable://nginx@sha256:2c390758c6a4660d93467ce5e70e8d08d6e401f748bffba7885ce160ca7e481d
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 13 Aug 2019 10:18:53 -0700
      Finished:     Tue, 13 Aug 2019 10:19:02 -0700
    Ready:          False
    Restart Count:  11
      cpu:     500m
      memory:  512Mi
      cpu:        250m
      memory:     50Mi
    Liveness:     http-get http://:80/nginx_status delay=3s timeout=1s period=3s #success=1 #failure=3
    Environment:  <none>
    Mounts:       <none>
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:            <none>
QoS Class:          Burstable
Node-Selectors:     <none>
Tolerations: for 300s
           for 300s
  Type     Reason             Age                   From                                               Message
  ----     ------             ----                  ----                                               -------
  Normal   Scheduled          19m                   default-scheduler                                  Successfully assigned cluster-test/test-bc568c794-tsdwr to
  Normal   Pulling            19m                   kubelet,  pulling image "nginx:1.7.8"
  Normal   Pulled             19m                   kubelet,  Successfully pulled image "nginx:1.7.8"
  Normal   Started            19m (x2 over 19m)     kubelet,  Started container
  Warning  MissingClusterDNS  19m (x8 over 19m)     kubelet,  pod: "test-bc568c794-tsdwr_cluster-test(0355202d-bdec-11e9-9277-064095f01a98)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
  Normal   Created            19m (x3 over 19m)     kubelet,  Created container
  Warning  Unhealthy          19m (x6 over 19m)     kubelet,  Liveness probe failed: HTTP probe failed with statuscode: 404
  Normal   Killing            19m (x2 over 19m)     kubelet,  Killing container with id docker://proxy:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Pulled             19m (x2 over 19m)     kubelet,  Container image "nginx:1.7.8" already present on machine
  Warning  BackOff            4m35s (x70 over 19m)  kubelet,  Back-off restarting failed container

Well, that is fairly straightforward and to the point! Looks like the primary issue is that the liveness check is returning a 404, which is treated as a failure. This makes sense for a stock nginx, since the instance is not configured in any way and serves nothing at /nginx_status. Time to remove the liveness check.

According to the cluster, the containers are now live and working as expected. Time to verify that exposing them works as expected too.

$ kubectl get deployments --namespace=cluster-test
test   2/2     2            2           28m

Proxied Connection

Next up is to access the containers through a proxied connection. I do not want to expose these on the public internet, and in fact they sit in a protected VPC, so a simple kubectl port-forward fails when it tries to reach the private IP. Thankfully, EKS requires authentication to access the master node, which leaves us with forwarding the master API to our local machine and using the API there to forward further.

This is easily done with kubectl proxy --port=8090, which exposes the API on http://localhost:8090 with proper authentication semantics. If you are running a local instance of Kubernetes, you will need to run the proxy on a port other than 8080. The service can now be looked up at http://localhost:8090/api/v1/namespaces/cluster-test/services/test/proxy. This currently fails with:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "no endpoints available for service \"test\"",
  "reason": "ServiceUnavailable",
  "code": 503
}

Time to dig in and find out why it cannot locate its targets! The service depends on endpoints, which track the targets the service actually routes to. In this case I had some incorrect settings for locating those targets. This may be inspected via:

$ kubectl describe endpoints test --namespace=cluster-test
Name:         test
Namespace:    cluster-test
Labels:       <none>
Annotations:  <none>
  NotReadyAddresses:  <none>
    Name  Port  Protocol
    ----  ----  --------
    http  8080  TCP

Events:  <none>
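
A terser check than reading the whole describe output is to ask only for the ready addresses. The following is a minimal sketch, assuming the same service name and namespace as above; ready_endpoint_ips is a hypothetical helper name. With no ready pod addresses behind the service it should print nothing, which lines up with the "no endpoints available" error.

```shell
# Print the ready pod IPs recorded in a service's Endpoints object.
# $1 = namespace, $2 = service name.
# ready_endpoint_ips is a hypothetical helper name used only for illustration.
ready_endpoint_ips() {
  kubectl get endpoints "$2" --namespace="$1" \
    -o jsonpath='{.subsets[*].addresses[*].ip}'
}

# Example invocation:
#   ready_endpoint_ips cluster-test test
```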

In this case the port is wrong. I looked at various things and did a lot of trial and error. In the end I was not able to get the service to come through. However, I did find that the nginx container exposes tcp:8080, not tcp:80, which is definitely a step in the right direction. The following message is rather concerning, since it points to an improperly configured node. This leads me to believe I may have additional settings wrong, so I am heading back to the EKS worker node user_data script after lunch to figure out what is going on.

Warning  MissingClusterDNS  9s (x9 over 6m53s)  kubelet,  pod: "test-5846f8cdb6-srs48_cluster-test(29b430c5-bdfc-11e9-9277-064095f01a98)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.

Proper User Data for EKS

Given the error above, I was concerned my nodes were not properly configured. Strangely, the nodes had registered with the cluster properly, which is why I had assumed the configuration was fine.

Turns out EKS has greatly simplified the deployment of EKS nodes. You are no longer required to build a complicated set of in-place sed operations; instead you just run a script. The node should have the following as part of its userdata field:

#!/bin/bash -xe

yum update -y
/etc/eks/ \
  --kubelet-extra-args '--node-labels=aws.region=${var.aws_region},node.type=baseload' \

In theory the yum update command should not be needed, however I was seeing failures related to that at the time. It is probably a good idea to update the node on boot anyway. Effectively, the node auto-configures itself, which is great.

Endpoint shows resources, service proxy fails

I had an endpoint which successfully showed resources with kubectl describe endpoints test --namespace=cluster-test, however no matter how many different methods I tried, I was unable to proxy the service to my local machine. Many times the request would just hang. Most of the time I would receive the following:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "no endpoints available for service \"test\"",
  "reason": "ServiceUnavailable",
  "code": 503
}

Unfortunately, the user data fix didn’t resolve the underlying issue with attempting to proxy into the cluster. Turns out the security group on the master ENIs needs to allow outgoing traffic to the worker nodes on the specified ports. Once this security group rule was changed, the service worked as expected.
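
For anyone hitting the same wall, a rule of that shape could be added with the AWS CLI roughly as sketched below. This is an illustration under assumptions, not a record of the exact change made here: the security group IDs and the port are placeholders to fill in, and allow_control_plane_egress is a hypothetical helper name.

```shell
# Allow the control plane (master ENI) security group to send TCP traffic
# to the worker node security group on a single port.
# $1 = control plane security group ID (placeholder)
# $2 = worker node security group ID (placeholder)
# $3 = TCP port the service proxy needs to reach on the workers
# allow_control_plane_egress is a hypothetical helper name.
allow_control_plane_egress() {
  aws ec2 authorize-security-group-egress \
    --group-id "$1" \
    --ip-permissions "IpProtocol=tcp,FromPort=$3,ToPort=$3,UserIdGroupPairs=[{GroupId=$2}]"
}

# Example invocation (IDs are made up):
#   allow_control_plane_egress sg-0123controlplane sg-0456workers 8080
```

The worker node security group also needs a matching inbound rule from the control plane group; if the rest of the infrastructure lives in Terraform, expressing both rules there instead keeps the state consistent.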