Over the holiday weekend an ECS task locked up and became unresponsive. From what I could tell there wasn’t any outstanding work or any obvious problem when the event occurred. It almost felt like the task had been reaped by the OOM reaper. However, one big problem I ran into was tracking down the correlation between the AWS ECS task UUID and the actual container running on the nodes. Sure, this probably wouldn’t be a big problem with a small number of containers, but under a real production load we’ve got a bunch of a specific type of container running, which gives us fine-grained control over processor and memory usage.

ECS Task Metadata

I decided the best way was to work backwards from ECS into the container. I’m not sure if this is the best path, but I first discovered the issue by auditing the ECS Service event log, which gives me the task UUID. Unfortunately, AWS had already purged the stopped task by the time I got to auditing the logs. This is where the problem originated.

The ECS system defines a set of resources on a link-local endpoint accessible via HTTP. The ECS agent exposes these to the containers, though I’m unsure whether this is done on a per-container or a per-system basis. The origin is well known for each container: http://169.254.170.2. It would make sense if these services were container specific, as I’ve seen references to running multiple ECS clusters on the same nodes; however, I don’t have practical experience with doing this.

The origin of this metadata service, prepended to the token in the environment variable AWS_CONTAINER_CREDENTIALS_RELATIVE_URI, is how applications obtain their IAM credentials using implicit provisioning. Unfortunately the ECS implementers stopped short of exposing additional units of information, such as the full container identifier and task UUID.
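To make the lookup concrete, here is a minimal sketch of how that URL is put together. The function names ecs_credentials_url and ecs_credentials_fields are hypothetical, and the sketch assumes curl and jq are installed in the container. It prints only the field names of the response, so the secrets themselves never land in a terminal or a log.

```shell
# Hypothetical helper: build the credentials URL from the link-local origin
# and the relative URI found in the container environment.
ecs_credentials_url() {
    # Fail when the variable is missing, e.g. outside an ECS task.
    [ -n "${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI:-}" ] || return 1
    printf 'http://169.254.170.2%s\n' "$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"
}

# Hypothetical helper: list only the top-level field names of the response.
ecs_credentials_fields() {
    url=$(ecs_credentials_url) || return 1
    # --fail turns HTTP errors into a non-zero exit; --max-time avoids hanging.
    curl --silent --fail --max-time 2 "$url" | jq -r 'keys[]'
}
```

Calling ecs_credentials_fields from inside a task should list the keys of the credentials document; outside ECS it simply returns non-zero.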

Based on the discussion about going from a Task ARN to a Container ARN to a Docker Container ID, going in the other direction to figure out the Task ARN should be relatively easy. At least according to this comment, it should be as straightforward as requesting GET http://172.17.0.1:51678/v1/tasks?dockerid=${HOSTNAME} from within the container. I’ll have to validate this under higher bandwidth.
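As a rough sketch of that request, untested against a live agent: the helper names ecs_task_url and ecs_task_arn and the ECS_AGENT_HOST override are hypothetical, and the host, port and .Arn field are taken on faith from the comment. It assumes curl and jq are available in the container.

```shell
# Hypothetical helper: build the introspection URL for a Docker ID.
# ECS_AGENT_HOST is an assumed override; the default is the address quoted above.
ecs_task_url() {
    printf 'http://%s:51678/v1/tasks?dockerid=%s\n' "${ECS_AGENT_HOST:-172.17.0.1}" "$1"
}

# Hypothetical helper: print the Task ARN for a Docker ID.
ecs_task_arn() {
    # '.Arn // empty' prints nothing when the field is absent or null,
    # instead of the literal string "null".
    curl --silent --fail --max-time 2 "$(ecs_task_url "$1")" | jq -r '.Arn // empty'
}
```

From inside the container this would be called as ecs_task_arn "${HOSTNAME}", which relies on the hostname still being the truncated Container ID.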

Obtaining the Container ID

The easiest way to obtain a Docker Container ID in vanilla operation is to just grab the host name. This gives the truncated Container ID. I feel like I would be painting myself into a corner by depending on the hostname, as there are a lot of practical reasons for a host name to change. I would rather have a deterministic, immutable method of obtaining the Container ID.

Turns out this is exposed via the cpuset, which configures which CPU nodes an application may run on. Docker uses this mechanism on a per-container basis to allow fine-grained control of CPU scheduling restrictions. It’s deterministic and immutable. I feel like I’m depending on a magic value, but at this time it’s the best I’ve found. According to this gist by nmarley we should be able to just cat /proc/1/cpuset.

As a general rule, if you are attempting to get the Container ID as PID 1 this should work well. However, this seems a bit janky to me. Although some applications may modify the cpuset, you’ll generally know about this. I feel like a better resolution is to cat /proc/self/cpuset, since the process at PID 1 may have terminated by the time the interested program has begun. Although one could argue PID 1 should be the only program truly aware it’s running in a container.

So cat /proc/self/cpuset produces a cgroup path in the usual *nix style. The last component of the path provides the full Container ID, so that should resolve that issue. Perhaps exposing the value as an environment variable like DOCKER_CONTAINER_ID would make this a bit easier. I’m slightly sad there isn’t a better mechanism to figure out the full name of the container, but meh.
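Since this leans on a magic value, a small guard seems worthwhile. The sketch below takes the last path component and exports it only if it looks like a full Docker Container ID, which is 64 lowercase hex characters. The helper names container_id_from_path and is_full_container_id are hypothetical.

```shell
# Hypothetical helper: print the last component of a cgroup path.
container_id_from_path() {
    # ${1##*/} strips everything up to and including the final slash.
    printf '%s\n' "${1##*/}"
}

# Hypothetical check: succeed only for 64 lowercase hex characters.
is_full_container_id() {
    case "$1" in
        ''|*[!0-9a-f]*) return 1 ;;
    esac
    [ "${#1}" -eq 64 ]
}

# Export the ID only when the cpuset path ends in something ID-shaped.
if [ -r /proc/self/cpuset ]; then
    candidate=$(container_id_from_path "$(cat /proc/self/cpuset)")
    if is_full_container_id "$candidate"; then
        export DOCKER_CONTAINER_ID="$candidate"
    fi
fi
```

If the path ends in anything else, for example when the script runs outside a container, DOCKER_CONTAINER_ID is simply left unset rather than holding a bogus value.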

Pulling it together

Based on the information I could gather, I want to test out the following approach to obtaining the Task ARN for easier debugging in the future.

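# The last component of the cpuset path is the full Container ID.
export DOCKER_CONTAINER_ID=$(basename "$(cat /proc/self/cpuset)")
# Ask the ECS agent introspection endpoint for the matching task; -r strips the JSON quotes.
export ECS_TASK_ARN=$(curl --silent "http://172.17.0.1:51678/v1/tasks?dockerid=${DOCKER_CONTAINER_ID}" | jq -r '.Arn')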

Tomorrow I will put this theory to the test of practice and see what the results are. Hopefully I’ll be able to land the feature and make debugging easier for all of us.