SFO AWS Summit
• Mark Eschbach
I was my team's designated sacrifice for the AWS 2017 Summit.
I was later than I wanted due to an engine failure on Capitol Corridor 527, arriving at about 10:05.
AWS Batch
Concentrating on retry and general architecture.
Targets really large batch jobs.
On Prem
On-prem for batch computing: procure a cluster, probably homogeneous, in a standard data center.
The batch scheduler delegates from a work queue to the available resources based on the constraints of the systems.
“If you are fortunate, or unfortunate, enough to have some strange people arrive at your job, they might want to use GPUs or FPGAs”
What it looks like on the AWS Cloud
The job queue could be any of a number of schedulers, including AWS Batch; their schedulers launch appropriate AMIs.
Not meant to be revolutionary, just something you shouldn't have to implement yourself.
Details
Deals with provisioning and scaling out of the box (managed), integrates with AWS (security, S3, etc.), and provisions resources cost-effectively.
The primitives should just be handled so the focus can stay on the application itself. Built to provide a Docker hosting environment.
Roles: batch dev, batch user, and batch admin (infra focused). The batch dev is a domain expert constructing the software. Batch users are generally concerned with the results.
Schedules tasks on the ECS cluster matching the workload targets. Can handle dependency graph management, including failures, sequencing, fan-in, fan-out, job submission, etc.
They shied away from calling them clusters to avoid the typical stigma in batch computing of constantly pushing 100% utilization; they call them compute environments. The scheduler can bring online systems which match the constraints provided by the job and the family, and aggregates jobs onto the fewest instances possible. A goal was to make Spot usage as simple as possible: you set the maximum number of on-demand instances to use, and after those are exhausted it will find the closest instance type matching the Spot price.
Offers an unmanaged compute environment which uses an ECS cluster on custom designed hosts.
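Setting up the managed, Spot-backed environment described above comes down to a single API call. A minimal sketch with boto3; the environment name, subnets, and IAM roles here are hypothetical:

```python
import boto3

batch = boto3.client("batch")

# Hypothetical managed compute environment backed by Spot instances.
# Batch picks instance types matching job constraints, bidding up to
# the given percentage of the on-demand price.
batch.create_compute_environment(
    computeEnvironmentName="example-spot-env",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "SPOT",
        "minvCpus": 0,
        "maxvCpus": 256,
        "desiredvCpus": 0,
        "instanceTypes": ["optimal"],  # let Batch choose matching types
        "bidPercentage": 40,           # max % of on-demand price to bid
        "subnets": ["subnet-12345678"],
        "securityGroupIds": ["sg-12345678"],
        "instanceRole": "ecsInstanceRole",
        "spotIamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
)
```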
Exit code 0 is the only one considered a success; all others are treated as failures within the system.
There are no additional costs beyond the cost of the resources themselves.
Only in us-east-1, thinking about rolling out to more regions. New features: job retries and a retry strategy.
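The retry strategy and the dependency graph both surface directly in job submission. A sketch with boto3; the queue, job definition, and job names are made up:

```python
import boto3

batch = boto3.client("batch")

# Up to 3 attempts; any non-zero exit code counts as a failure
# and triggers a retry.
parent = batch.submit_job(
    jobName="extract",
    jobQueue="example-queue",
    jobDefinition="example-job-def",
    retryStrategy={"attempts": 3},
)

# Sequencing: this job starts only after the parent job succeeds.
batch.submit_job(
    jobName="transform",
    jobQueue="example-queue",
    jobDefinition="example-job-def",
    dependsOn=[{"jobId": parent["jobId"]}],
)
```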
MXNet uses Docker and host-specific volumes, apparently to access the NVIDIA drivers on the host.
Voice UI Best Practices
Speaker: Liz Myers.
Focusing on skill best practices. They were concentrating on building skills, going on about all of the services you could use around them. Promising much better stuff…let's see. The top result for Alexa skills today was Oprah's magazine. Interesting that you can have Alexa read a magazine for you.
They are pushing the following: regular updates, vocal variety, the ability to maintain context (resuming, replaying, etc.), and related content. By vocal variety she means an assortment of voices, sound clips, etc.
For regular updates she recommends the following: update content regularly, use sound effects and signature clips to represent transitions, and keep a brisk pace with short sentences and prompts. The Jeopardy! content will only allow one question a day, requiring users to build rapport with the brand. Alexa is moving towards multi-turn experiences for an interactive conversation.
Pushing designs towards broad usage using one-shot models. The speaker authored the Fortune Cookie skill, which has 5 stars in Germany. One-shot model: wake word (Alexa), start word (ask), invocation name (fortune cookie), utterance (the remainder). The variable portions of utterances are known as slots.
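For a Lambda-backed skill, the one-shot model boils down to a small handler. A sketch of what the Fortune Cookie skill's backend might look like; the intent name and fortunes are invented:

```python
import random

FORTUNES = [
    "A pleasant surprise is waiting for you.",
    "Your hard work is about to pay off.",
]

def lambda_handler(event, context):
    """One-shot invocation: 'Alexa, ask fortune cookie for my fortune.'"""
    request = event["request"]
    if request["type"] == "LaunchRequest" or (
        request["type"] == "IntentRequest"
        and request["intent"]["name"] == "GetFortuneIntent"  # hypothetical
    ):
        text = random.choice(FORTUNES)
    else:
        text = "Sorry, I didn't catch that."
    # Standard Alexa response envelope; ending the session keeps it one-shot.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }
```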
Recommends “Designing Voice User Interfaces: Principles of Conversational Experiences”
Lunch
From 11am until 12:15pm they'd been telling us lunch was on the first floor. I had to walk through a ton of vendors, then found a single table with a small set of sandwiches. I stopped a security specialist; asking them what they do, I got the response “we analyze your requirements.” I couldn't get any details out of them and they didn't have any sales personnel to talk with.
I found the food on the second floor and a quiet table on the third to relax and listen to the talk on security problems. Most of the talk was the typical “don't leave vulnerable doors open”.
Data-driven Postmortems
Presented by Jason Yee from Datadog
“The problems we work on at Datadog are hard and often don't have obvious, clean-cut solutions, so it's useful to cultivate your troubleshooting skills no matter what role you work in.” “The only real mistake is one from which we learn nothing”
DevOps - Culture, Automation, Metrics, and Sharing
- Over focused on automation
- Focus should be more on Culture and Sharing
The focus in postmortems should be on culture and sharing.
“You're either building a learning organization or you will be losing to someone who is.” - Andrew Clay Shafer
“Collecting data is cheap; not having it when you need it can be expensive”
Average cost of downtime is $21K per hour; $12K for 1 GB/year replicated 3 times.
4 qualities of good metrics
- must be well understood
- Sufficiently granular
- Tagged and filterable
- (don't know if filterable and tagging were separable…but a fourth wasn't mentioned)
Datadog pitch:
- Stores in seconds
- Data format: [name] [quantity] [when] [tags..] (see the sketch after this list)
- 15 months without aggregation
- Notebooks - take screenshots and record notes.
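That format maps directly onto their client libraries. A sketch using the datadog Python package's DogStatsD client; the metric names and tags are invented:

```python
from datadog import initialize, statsd

# Assumes a local DogStatsD agent listening on the default port.
initialize(statsd_host="localhost", statsd_port=8125)

# name=orders.completed, quantity=12, when=now (implicit), tags for filtering.
statsd.increment("orders.completed", 12, tags=["region:us-west", "tier:web"])

# A resource metric, tagged the same way so it can be filtered alongside.
statsd.gauge("queue.depth", 42, tags=["region:us-west"])
```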
AWS notes:
- 15 months of storage: 1-minute granularity up to 15 days, 5-minute up to 63 days, 1-hour granularity up to 15 months
Metrics
As imagined through a donut factory
- High-level business metrics - how many sellable donuts are delivered
- Resource metrics - Utilization, saturation, errors, availability
Work metric (website down) -> Resources -> Events [Resources are recursive]
There is no singular “root cause” - there are a lot of reasons why things go wrong. Contributing factors aren't typically technical; generally there's a human element. Postmortems should typically be done after an incident, not during the incident.
Roles of users to collect data from
- Responders
- Identifiers
- Affected Users
Collect their perspectives on what they did, what they thought, and why they took their actions. Use open-ended questions and get people to write them down. By writing it down they must clarify their thoughts and create some pictures. Human memory drops significantly after 20 minutes and levels out in the 2-day to 2-month time period. Human data corruption factors: stress, sleep deprivation, burnout, blame, fear of punitive action. The goal of a postmortem is to allow people to be transparent. Recommended reading: Blameless PostMortems and The Human Side of Postmortems.
Biases to be aware of: anchoring, hindsight, outcome, availability, and the bandwagon effect. Availability (aka recency) biases us toward something which recently occurred; the same conclusion shouldn't always be drawn.
Datadog postmortems are e-mailed company-wide, encouraging collaboration and increasing company-wide visibility into what occurred. Recurring postmortem meetings regularly force them to review the state of the system. They have 4 sections within their summary: customer impact, severity of the outage, components affected, and what resolved the outage. Next question in their template: how was the outage detected? What metrics showed it failed, was there a monitor on the metric, and how long before it was detected? How did we respond: who was the incident owner and who was involved, Slack archives and a timeline of events, what went well, what didn't go so well? “Why did it happen?” is a narrative of the causes. The last question is “how do we prevent it again?” which should be answered in terms of: now, next (next sprint), later, and follow-up notes.
He was stressing ChatOps for archiving. It forces people to journal the events and what is going on.
Talk slides. Datadog also offers some advice.
Deep Dive on CI/CD and Docker
John Pignata, Startup Solutions Architect
Other periods of waste within our delivery: waiting for someone to test something in the pipeline. The objective is to quickly and reliably deliver good ideas to our customers. Quickly is defined as close to no wait states; reliably means the deploys occur without failures.
Tenets for pursuing CD:
- Frequency reduces difficulty
- Latency between check-in and production is waste
- Consistency improves confidence
- Automation over toil – Time is wasted when we could just automate it away
- Empowered developers make happier teams
- Smaller batch sizes are easier to debug
- Faster delivery improves software development practices
Phases:
- Source (Work effort output): Version control, Branching, Code Review
- Build Phase: Compilation, unit tests, static analysis, packaging
- Test: System integration tests, Load tests, Security tests, acceptance tests
- Production: deployment, monitoring, measuring, validation
Continuous integration covers source and build, guarded by things like branches and feature toggles. Continuous delivery goes all the way into production, with a gate to ship. An important feature is constant feedback throughout the process.
Best Practices
- Docker images should be reproducible
- Only install runtime deps for prod
- Minimize changes in each layer to maximize cacheability
- Maintain a .dockerignore file to exclude unneeded files from the image
Building docker images
- Recommends tagging artifacts with a source version (see the sketch after this list)
- Avoid using “latest” and “production” tags
- Optimize for build speed
- Colocate the build process – run the registry in the same region you are running in
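Tagging with a source version instead of a floating tag is mechanical. A sketch using the docker Python SDK; the image name and registry URL are placeholders:

```python
import subprocess

import docker

# Resolve the current commit so the image is traceable to its source.
sha = subprocess.check_output(
    ["git", "rev-parse", "--short", "HEAD"], text=True
).strip()

client = docker.from_env()

# Build once, tagging with the commit instead of something like "latest".
image, _logs = client.images.build(path=".", tag=f"example-app:{sha}")

# Push to a registry colocated with the compute (placeholder account/region).
registry = "123456789012.dkr.ecr.us-west-2.amazonaws.com/example-app"
image.tag(registry, tag=sha)
client.images.push(registry, tag=sha)
```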
Running Docker images with many instances
- Plug for ECS
- ECS supports deploying in place (100% minimum healthy, 200% maximum)
- ECS will deploy in place with rolling (50% minimum healthy, 100% maximum); see the sketch after this list
- Canary deploys - create a new target group for the canary group, register canary instances, monitor, then roll out
- Blue-green - an identical clone allows the new system to rise up and be tested, then the working setup is rolled out via DNS
- Ensure your application can function with both versions
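Those in-place percentages correspond to the service's deployment configuration. A boto3 sketch with made-up cluster, service, and task definition names:

```python
import boto3

ecs = boto3.client("ecs")

# Roll the service to a new task definition in place. With 50% minimum
# healthy and 100% maximum, ECS stops half the old tasks, starts new
# ones in the freed capacity, then cycles the remainder.
ecs.update_service(
    cluster="example-cluster",
    service="example-service",
    taskDefinition="example-app:42",  # new revision
    deploymentConfiguration={
        "minimumHealthyPercent": 50,
        "maximumPercent": 100,
    },
)
```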
Building a deployment pipeline
- Cool graphics of pipes connected to show the automation.
- Now they've jumped into using AWS-specific tooling which only fits a very narrow definition of what you can do.
Reflection
I came to the conference today hoping it would be equal to or better than the MongoDB conference I attended last year. It was on the same level, not much better or worse; definitely enjoyable. I'm on the fence about whether I'll join the conference tomorrow or not. Many of the sessions were hyper-focused on AWS and not the solutions they provide. I'll give them this: the speakers seemed very knowledgeable about both their offerings and the general understanding of the communities. Something like the user groups I attend is closer to what I was hoping for: a place where people may collaborate and engage in discussions on a specific topic. Next time I think I might want to go to a conference with a buddy.