Alrighty, I’ve got a basic system set up using Terraform and AWS to orchestrate the construction of an ECS system. Originally I was testing with the classic ELB setup; now I’d like to finish the task out with ALBs. The ALBs live in subnets that are allowed to communicate with the containers, and the containers are set up to register with a target group. A target group is the set of nodes the load balancer forwards requests to.
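For concreteness, the wiring looks roughly like this. This is a sketch, not my actual config: the `aws_alb.front`, `aws_vpc.main`, cluster and task resource names, and the "app" container with its port are all placeholders.

```terraform
# Sketch only -- resource and container names here are assumptions.
resource "aws_alb_target_group" "apps" {
  name     = "apps"
  port     = 80
  protocol = "HTTP"
  vpc_id   = "${aws_vpc.main.id}"
}

resource "aws_alb_listener" "front" {
  load_balancer_arn = "${aws_alb.front.arn}"
  port              = 80
  protocol          = "HTTP"

  default_action {
    target_group_arn = "${aws_alb_target_group.apps.arn}"
    type             = "forward"
  }
}

# The service registers its containers as targets in the group.
resource "aws_ecs_service" "apps" {
  name            = "apps"
  cluster         = "${aws_ecs_cluster.main.id}"
  task_definition = "${aws_ecs_task_definition.apps.arn}"
  desired_count   = 2
  iam_role        = "${aws_iam_role.ecs-apps-service.arn}"

  load_balancer {
    target_group_arn = "${aws_alb_target_group.apps.arn}"
    container_name   = "app"
    container_port   = 8080
  }
}
```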

In the ECS console (Clusters → the service → the Events tab), the containers report they are unable to register with the target group because of AccessDenied. Sometimes I wish the error reported the specific permission causing the failure. To the documentation!

After searching high, then searching low, then doing the hokey-pokey: I’ve ventured across the permissions documentation! Interesting: the past examples I’ve encountered included elements like elasticloadbalancing:RegisterInstancesWithLoadBalancer and elasticloadbalancing:DeregisterInstancesFromLoadBalancer. I couldn’t find those among the resources there. Target group related permissions did exist though! Exactly what I needed.

RegisterTargets will hopefully resolve the registration problem, and DeregisterTargets will hopefully be the reciprocal. Bingo! The test container policy looks like the following:

resource "aws_iam_role_policy" "ecs-apps-service-policy" {
    name = "ecs-service-role-policy.${data.template_file.pod_domain.rendered}"
    role = "${aws_iam_role.ecs-apps-service.id}"
    policy = <<CFG
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticloadbalancing:Describe*",
        "elasticloadbalancing:RegisterTargets",
        "elasticloadbalancing:DeregisterTargets",
        "ec2:Describe*",
        "ec2:AuthorizeSecurityGroupIngress"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
CFG
}

Next up on the perfection queue: we should be able to schedule multiple containers per host. Port conflicts are where the problems start, since I’ve had to specify the host port for each instance of the container. From the manual, specifying 0 or omitting the hostPort should work. While updating the service specs through Terraform I got the hosts stuck; they wouldn’t even respond to pings, although they were passing the EC2 health checks (sigh). Well, that was slightly terrifying: I couldn’t get the hosts to come back online. I’m wondering where I went wrong here, because I’m really praying this doesn’t happen when these systems move out to production.
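For reference, the dynamic-port version looks roughly like this. The family, container name, image, and memory value are placeholders, not my real task definition:

```terraform
# Sketch -- "hostPort": 0 lets Docker pick an ephemeral host port,
# so multiple copies of the container can land on one host.
resource "aws_ecs_task_definition" "apps" {
  family                = "apps"
  container_definitions = <<CFG
[
  {
    "name": "app",
    "image": "apps/web:latest",
    "memory": 256,
    "portMappings": [
      {
        "containerPort": 8080,
        "hostPort": 0
      }
    ]
  }
]
CFG
}
```

The ALB target group then tracks whichever ephemeral port each task was assigned.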

Turns out I am just an idiot and the root of the problem is actually missing roles. The discovery process is logging in and tailing the /var/log/ecs/ecs-agent* file (whichever is the latest). Instead of providing a standard AWS role for an ECS host, they annoyingly want you to create one in each of your accounts :-/. Lame. RTFM yourself if ya don’t believe me. From the manual, you need the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:CreateCluster",
        "ecs:DeregisterContainerInstance",
        "ecs:DiscoverPollEndpoint",
        "ecs:Poll",
        "ecs:RegisterContainerInstance",
        "ecs:StartTelemetrySession",
        "ecs:Submit*",
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
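To actually hang that policy off the hosts, the role needs an EC2 trust relationship and an instance profile. A sketch with assumed names (the policy body is the JSON above):

```terraform
# Assumed resource names -- adjust to taste.
resource "aws_iam_role" "ecs-instance" {
  name               = "ecs-instance-role"
  assume_role_policy = <<CFG
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
CFG
}

# The instance profile is what the EC2 launch configuration references.
resource "aws_iam_instance_profile" "ecs-instance" {
  name  = "ecs-instance-profile"
  roles = ["${aws_iam_role.ecs-instance.name}"]
}
```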

Well, that worked. I still occasionally see strange errors I’m not sure how to fix, especially regarding CPU metrics and such. I guess we’ll see how that works going forward.

Considering the autoscaling

Alrighty, so we have scaling groups. However, these are manually configured scaling groups, and there are so many choices for scaling. The particular pool I’m concerned with needs high network and CPU throughput. Until we get some applications in place I’m assuming it’s better to avoid the more complex scaling options. My predisposition about the particular system I’m working with tells me it will require memory-based scaling.
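If memory does turn out to be the constraint, the rough shape would be a CloudWatch alarm on the cluster’s MemoryReservation metric driving a scaling policy. A sketch, assuming a cluster named "apps" and an ASG resource named aws_autoscaling_group.apps (neither exists in my config yet):

```terraform
# Scale the host pool up by one when memory reservation runs high.
resource "aws_autoscaling_policy" "scale-up" {
  name                   = "apps-scale-up"
  autoscaling_group_name = "${aws_autoscaling_group.apps.name}"
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 1
  cooldown               = 300
}

resource "aws_cloudwatch_metric_alarm" "memory-high" {
  alarm_name          = "apps-memory-reservation-high"
  namespace           = "AWS/ECS"
  metric_name         = "MemoryReservation"
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 75
  period              = 60
  evaluation_periods  = 2
  alarm_actions       = ["${aws_autoscaling_policy.scale-up.arn}"]

  dimensions {
    ClusterName = "apps"
  }
}
```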

Deploying the code

The next major challenge is figuring out how to get the software, in whatever version, into the ECS cluster. AWS CodeDeploy targets EC2 instances directly and doesn’t support ECS. I was really hoping that would be a turnkey solution. Taking a few minutes with CodePipeline: it doesn’t integrate with ECS either. Docker Datacenter looks interesting but is obscenely expensive; at that price we could have some fairly beefy instances serving our computation needs.

Back to ECR for me. I was hoping there was something simpler.
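For now the flow will presumably look like the standard build/tag/push dance. The account ID, region, and repository name below are placeholders:

```shell
# Rough ECR push flow -- requires AWS credentials; identifiers are made up.
$(aws ecr get-login --region us-east-1)   # prints and runs a docker login command
docker build -t apps/web .
docker tag apps/web:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/apps/web:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/apps/web:latest
```

Then an ECS service update pulls the new image tag on the next deployment.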