So my goal is to provide a simple script which will connect to an initialized and unsealed Vault cluster and then set up the application. I’ve played around with terraform output, which, even with the -module option, doesn’t extract values unless they’re explicitly declared in output stanzas. Makes sense, but I was hoping for an easy win. Next up is extracting the values from the tfstate file using jq or the like. Turns out tfstate files use a path entry containing an array for the path to the module object. I don’t entirely understand the design decision, but I’m a little bummed, as this turns the jq query into something of a monster that’s hard to script. terraform show would require some interesting stream manipulation to extract the data.
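To show why the query gets ugly, here’s a sketch against a trimmed-down sample of the (pre-0.12) tfstate layout; the module path, resource name, and IP are made up for illustration:

```shell
# Hypothetical slice of a tfstate file: modules are addressed by a
# "path" array rather than a simple key.
cat > sample.tfstate <<'EOF'
{
  "modules": [
    {
      "path": ["root"],
      "outputs": {}
    },
    {
      "path": ["root", "vault"],
      "resources": {
        "aws_instance.vault": {
          "primary": {
            "attributes": { "private_ip": "10.0.1.15" }
          }
        }
      }
    }
  ]
}
EOF

# Select the module by matching its path array, then dig through the
# resource's attribute map -- hence the monster query:
jq -r '.modules[]
       | select(.path == ["root", "vault"])
       | .resources["aws_instance.vault"].primary.attributes.private_ip' \
  sample.tfstate
```

Matching on an array instead of a plain object key is what makes these queries so awkward to template into a script.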

I think I’ll invoke the mechanical turk, the human applier, for this one. I’ll come back and try again after I have the other aspects of the process complete. In the meantime I’ll use a JSON file to represent the extracted state, giving me a clean separation of concerns.

So, on to scripting Vault! I’m going to stick with jq for extraction; the queries tend to be really simple, like .orc.intake and such. That turned out to be fairly painless and straightforward to get done.
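For a sense of what that looks like, here’s a sketch: state.json stands in for the intermediate file of extracted state, and the field values are invented (only the .orc.intake query comes from my actual script):

```shell
# Hypothetical intermediate state file; the shape mirrors what the
# Vault setup script expects to read.
cat > state.json <<'EOF'
{ "orc": { "intake": "10.0.2.21", "api": "10.0.2.22" } }
EOF

# -r strips the JSON quotes so the value drops straight into a
# shell variable.
INTAKE_ADDR=$(jq -r '.orc.intake' state.json)
echo "$INTAKE_ADDR"
```

Compared with the tfstate path-matching above, this is about as simple as jq gets.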

Back to the Vault issue: I’ve resorted to using the output subcommand, which also requires the information to be declared in HCL. I really dislike the duplication of information, but I don’t have much of a choice right now. Running the script to set up Vault with the infrastructure information worked like a charm though. Easy as pie now.
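The duplication in question looks roughly like this; the module and attribute names are hypothetical:

```hcl
# outputs.tf -- re-declaring a value the module already knows, just so
# `terraform output vault_storage_bucket` can see it.
output "vault_storage_bucket" {
  value = "${module.vault.storage_bucket}"
}
```

Every value the setup script needs means another stanza like this, which is the duplication I’m grumbling about.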

In theory, with Terraform, taking the configuration from the test lab into production should be a simple and easy process. Not when you have to manually muck with things, apparently. I ran out of Network ACL rules, a problem we knew was a risk as the number of interconnections between systems increases. We chose the approach of having a rule per availability zone because it was easier to implement. Turns out ease of implementation doesn’t quite land the way I was hoping.

I pulled the rules from the live account and the test account into a spreadsheet to compare, and noticed a rather funny thing: terraform was refusing to add the 20th rule. There are 19 on the account now, so it should allow for that last rule entry. After looking a little closer at the spreadsheet I realized the test-1 account actually has 21 rules including the default rule, while live-0 has the minimum required. Well, that makes this a little easier then.
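The spreadsheet step could be scripted too. In practice the input would come from `aws ec2 describe-network-acls`; here a trimmed-down sample (fake ACL ID, fake rule numbers) stands in so the counting is visible:

```shell
# Minimal stand-in for `aws ec2 describe-network-acls` output.
cat > nacls.json <<'EOF'
{
  "NetworkAcls": [
    {
      "NetworkAclId": "acl-12345678",
      "Entries": [
        { "RuleNumber": 100,   "Egress": false },
        { "RuleNumber": 200,   "Egress": false },
        { "RuleNumber": 32767, "Egress": false }
      ]
    }
  ]
}
EOF

# One line per NACL: its ID and how many rule entries it carries
# (note the default 32767 rule counts toward the total).
jq -r '.NetworkAcls[] | "\(.NetworkAclId) \(.Entries | length)"' nacls.json
```

Run against both accounts, this would have surfaced the 21-vs-19 discrepancy without any spreadsheet work.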

With this new lesson, I definitely think we should have spent more time thinking about our network architecture. I’m sure this will haunt us for a while. Effectively, we should have chosen a more concise way to express the network. Going with a more specific solution, I would say 8 AZs would be the maximum (2^3). Aligning to the byte boundary, that would leave 5 bits to identify the service. You could also split those 5 bits into 2^2 for the host class and 2^3 for the service. Host classes could include the following: DMZ (bastions, LBs, etc.), application services, databases, and a reserved class. The 2^3 would give 8 services within each class, such as Postgres, CouchDB, Vault, etc. I’m not sure if this is the best approach, but I think I’ve reached the end of the theory portion.
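As a quick sanity check of the bit layout, here’s the split packed into the third octet of a 10.0.0.0/16; the AZ, class, and service values are arbitrary examples:

```shell
# Octet layout: [ AZ: 3 bits ][ host class: 2 bits ][ service: 3 bits ]
az=2        # third AZ (0-7)
class=1     # application services (0-3)
service=3   # e.g. Vault's slot within the class (0-7)

octet=$(( (az << 5) | (class << 3) | service ))
echo "10.0.${octet}.0/24"
```

Because class and service occupy contiguous low bits, a single CIDR like 10.0.64.0/19 covers everything in one AZ, which is what would have kept the per-AZ rule count down.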

Interestingly, the AWS Startup Blog recommends flipping the service and AZ bits. Sounds like one would be very likely to run into the ACL problem I’ve encountered; but then again, their approach to NACLs sounds like set and forget. Disappointing!

I scratched my head for a while trying to figure out why a set of services couldn’t egress from the VPC. Turns out I forgot the NAT gateway rules for my subnet. I knew I should have tested against another account before applying it to production! #lessons.
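For the record, the missing pieces were rules along these lines; the rule numbers, resource names, and ports are placeholders, not my actual config:

```hcl
# Sketch of the forgotten NACL rules for the private subnet: allow
# egress out through the NAT gateway, and allow the ephemeral-port
# return traffic back in.
resource "aws_network_acl_rule" "private_egress_https" {
  network_acl_id = "${aws_network_acl.private.id}"
  rule_number    = 150
  egress         = true
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "0.0.0.0/0"
  from_port      = 443
  to_port        = 443
}

resource "aws_network_acl_rule" "private_ingress_return" {
  network_acl_id = "${aws_network_acl.private.id}"
  rule_number    = 151
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "0.0.0.0/0"
  from_port      = 1024
  to_port        = 65535
}
```

NACLs being stateless, forgetting the return-traffic half is an easy way to silently break egress, which is exactly what bit me.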