DevOps: DoD AWS Ref L2
• Mark Eschbach
Onto the fancy diagrams for the reference architecture! They cover various recovery time objectives, ranging from greater than a day to less than an hour. Base operational cost scales with the recovery time objective under the structure of the AWS DoD SRG reference. Some of these should probably also scale as a function of request volume.
For those of you poor saps like me who are familiar with Cisco’s networking symbols but not AWS’s icons, I did find their sample icon set. It’s definitely not an exhaustive resource on the subject, but it does help with digesting their diagrams. Unfortunately, although they recommend labeling your diagrams, they are guilty in many places of not following through on this themselves. I first went with the PowerPoint version as it was the smallest of the set. The SVG + EPS set includes PNG renderings but still lacks a description of the lines.
Recovery Time Objective (RTO) >= 1 day
This is a rather simple system which looks like a standard three-tier stack (user client, application service, data store) with some augmentation. The entire system is housed within a single VPC, with a number of subnets dividing the hosts by their position in the stack. AWS recommends using their CloudWatch, CloudWatch Alarms, and CloudTrail products to monitor the logs of the nodes within the system. They recommend S3 for static assets and for storing snapshots of the hosts. The entire system runs within a single availability zone. The only tier allowed to directly access the internet is the edge tier; all other tiers are prevented from reaching it.
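As a concrete illustration of that layout, here’s a minimal boto3 sketch carving one VPC into per-tier subnets in a single availability zone. The region, CIDR blocks, and tier names are my own placeholders, not anything the reference prescribes.

```python
# Hypothetical sketch: one VPC, one subnet per tier, single availability zone.
# Region, CIDRs, and names are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-gov-west-1")

vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]

# One subnet per tier, all pinned to the same availability zone.
tiers = {"dmz": "10.0.1.0/24", "app": "10.0.2.0/24", "db": "10.0.3.0/24"}
subnet_ids = {
    name: ec2.create_subnet(
        VpcId=vpc_id, CidrBlock=cidr, AvailabilityZone="us-gov-west-1a"
    )["Subnet"]["SubnetId"]
    for name, cidr in tiers.items()
}

# The internet gateway gets routed only from the DMZ subnet's route table;
# the app and db tiers never receive a route to it, so they stay internal.
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)
```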
The edge tier is labeled ‘DMZ’ within their diagram. At the edge tier we have two nodes: a bastion host and a reverse proxy. The bastion host has a security group restricting access to the port allowing entry to the VPC. The lines running from the bastion host, I’m assuming, indicate connections from the bastion to the hosts in the other tiers. The reverse proxy is set up in an autoscaling group and intended to provide access to the application tier. I’m not entirely sure of the reasoning behind the reverse proxy over an ELB; I could see a reverse proxy being used to direct specific subresources to different applications, or perhaps it’s a cost-saving effort through caching.
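To make the bastion’s security group concrete, here is a minimal boto3 sketch restricting it to a single administrative port. The VPC id and admin CIDR are placeholders I made up, and SSH on port 22 is my assumption about which port they mean.

```python
# Hypothetical sketch: a security group locking the bastion down to SSH
# from one administrative network. VPC id and CIDR are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-gov-west-1")

sg_id = ec2.create_security_group(
    GroupName="bastion-sg",
    Description="Bastion: SSH from the admin network only",
    VpcId="vpc-0123456789abcdef0",
)["GroupId"]

# Inbound is deny-by-default; open just port 22 from the admin CIDR.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": "203.0.113.0/24", "Description": "admin network"}],
    }],
)
```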
The app servers are in an autoscaling group as well as their own security group. There is no recommendation against intercommunication between nodes, which I have seen prohibited at many data centers where my software has run. Beyond that, there isn’t much said about this tier in the diagram.
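For the autoscaling side, a rough boto3 sketch might look like the following; the launch template, sizes, and subnet id are placeholders rather than anything from the diagram.

```python
# Hypothetical sketch: the app servers as an autoscaling group pinned to the
# app subnet in the single availability zone. Ids and sizes are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-gov-west-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="app-tier-asg",
    LaunchTemplate={"LaunchTemplateId": "lt-0123456789abcdef0", "Version": "$Latest"},
    MinSize=2,
    MaxSize=6,
    DesiredCapacity=2,
    VPCZoneIdentifier="subnet-0123456789abcdef0",  # the app tier's subnet
)
```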
The database tier is rather interesting. I wouldn’t expect an autoscaling group behind it, since most databases don’t play well with the loss of a member, trying to stay consistent and all. Their recommendation is to use a volume for the stored data; I’m assuming they would recommend Amazon’s KMS for encrypting the volume. I would also assume they would recommend EBS, since they were touting its benefits.
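Under those assumptions, the volume setup might look something like this boto3 sketch: an EBS volume encrypted with a KMS key and attached to the database host. The key alias, instance id, size, and device name are all placeholders.

```python
# Hypothetical sketch: a KMS-encrypted EBS volume holding the database's data,
# attached to the database host. Key alias, ids, size, and device are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-gov-west-1")

volume = ec2.create_volume(
    AvailabilityZone="us-gov-west-1a",
    Size=500,                   # GiB
    VolumeType="gp3",
    Encrypted=True,             # encrypted at rest...
    KmsKeyId="alias/db-data",   # ...under a customer-managed KMS key
)

# Wait until the volume is ready, then attach it. Because the data lives on
# the volume, a replacement host can reattach it after an instance failure.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/xvdf",
)
```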
They don’t offer a whole lot on the benefits or the intents of the nodes, kind of fail at documenting the strengths and weaknesses of what they are.
Recovery Time Objective >= 1 hour
I kind of feel like they cheated a little here. The stack is the same as the 1-day RTO, but duplicated between two availability zones, with some elastic load balancers and a router between the two availability zones added. It’s important to note that bastion hosts may only access nodes within their own availability zone, not nodes in any other availability zone.
In front of the edge tier there is an ELB which balances traffic between the reverse proxies of the two availability zones. The reverse proxies sit in the same autoscaling groups. The reverse proxies then hit another set of elastic load balancers, outside the availability zones and VPCs, which target the individual application services. Interestingly, all applications are set up to target the primary database in the first availability zone, completely ignoring the standby database. Although database nodes are unlikely to fail, that would be an interesting recovery case, since all application services would need to be cycled.
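The front ELB is straightforward to sketch with boto3’s classic ELB client. The subnet ids below are placeholders for the DMZ subnets of the two availability zones, and a real deployment would terminate TLS rather than listen on plain HTTP.

```python
# Hypothetical sketch: a classic ELB spanning the DMZ subnets of both
# availability zones, fronting the reverse proxies. Subnet ids are placeholders,
# and plain HTTP stands in for a properly TLS-terminated listener.
import boto3

elb = boto3.client("elb", region_name="us-gov-west-1")

elb.create_load_balancer(
    LoadBalancerName="edge-elb",
    Listeners=[{
        "Protocol": "HTTP",
        "LoadBalancerPort": 80,
        "InstanceProtocol": "HTTP",
        "InstancePort": 80,
    }],
    # One DMZ subnet per availability zone spreads traffic across both.
    Subnets=["subnet-0aaaaaaaaaaaaaaaa", "subnet-0bbbbbbbbbbbbbbbb"],
)
```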
Listed in their key attributes section, they finally touch on the point of having a reverse proxy: to mitigate application-level attacks targeting flaws in the underlying frameworks. I’m interested in researching how the reverse proxy would differ from using the ELB for this, especially since the ELB would be hardened and maintained by the AWS folks. They emphasize that AWS backups and snapshots shouldn’t be publicly accessible from the internet, which totally makes sense.
Oddly enough, they recommend the bastion hosts stay in a powered-off state until there needs to be some administrative activity against a particular availability zone.
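That on-demand bastion pattern is easy to script. A minimal boto3 sketch, assuming a single bastion instance id (a placeholder here), might look like:

```python
# Hypothetical sketch: keep the bastion stopped except during admin windows.
# The instance id is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-gov-west-1")
BASTION = "i-0123456789abcdef0"

def open_admin_window():
    """Boot the bastion and wait until it is actually running."""
    ec2.start_instances(InstanceIds=[BASTION])
    ec2.get_waiter("instance_running").wait(InstanceIds=[BASTION])

def close_admin_window():
    """Power the bastion back off once the work is done."""
    ec2.stop_instances(InstanceIds=[BASTION])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[BASTION])
```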
Recovery Time Objective <= 1 hour
There is an awful lot of hand-waving about deploying in a high-availability capacity. The funny part here is that the applications are still only connected to the primary database system and not the read-only hot standby replica.
That’s all, folks
I’m not going to touch the levels greater than two. There is an awful lot of work peering with specific DoD networks and ensuring traffic is properly routed. It looks like a lot of fun that I don’t have a good reason to deal with right now.