TLDR: The gang experiences an AWS outage; customers report SHOWSTOPPERs; the gang triages the issues and attempts to implement an elaborate disaster recovery plan
It is 11:03 AM on a Thursday, US-EAST-1, Pennsylvania. You've been triple-booked just right, and everyone on each call thinks you're on one of the others. In this rare moment of freedom, you are adding custom emojis to Slack.
A message from a developer:
This is unfortunate: it is your week on the support rotation, so the other calls can't get you out of this one.
You've crafted your triage process over many incidents and honed your cutting diagnostic inquisitions. The goal here is efficiency and the mitigation of risk to lunchtime, with the usual secondary objective: exposing the truth about how the PR comment you left was ignored and best practices spitefully neglected.
1. Estimate priority: Did a customer report it? Did we find it ourselves?
2. Assess impact: How many have failed?
3. Establish precise language: A few is not a number, earlier is not a timestamp. Did you check the AWS status page?
4. Identify patterns: Are they all failing the same way? What is the error signature?
5. Analyze chain of events: What did we change last?
The nuance in this last question is that a lot happens outside of the standard change management process. Some of the off-the-record/alternative-record things covered in triage include:
These three, plus the normal sprint work, are at best recorded across only four different systems of record. Each of the systems is almost certainly the best tool for its job, and each probably even has excellent integration capabilities. Unfortunately, the four are integrated manually in Excel by four different people producing out-of-sync CSV exports.
6. Initiate holy crusade: Some of the worst atrocities committed against software get to production under the guise of hot-fixes. The perpetrators must be identified.
This continues as more participants join the call until, even though the AWS status page is still showing ✅✅✅, someone announces:
If you get nothing else out of this article, please walk away with this: if you are having doubts about the status of AWS, check Twitter.
This is good news.
While numerous system failure notifications signal that shit is hitting the fan, the mood on the call briefly improves as everyone is absolved of any potential guilt: it's no one's fault.
This calm is short-lived: the customers have been notified. They've escalated their pissed-ness from CRITICAL to SHOWSTOPPER (from the resting baseline of URGENT). They are asking for an ETA on the resolution and hourly updates on the status.
Large organizations have Recovery Time Objective (RTO) and Recovery Point Objective (RPO) standards for disaster/outage scenarios. These can be internal standards or, sometimes, compliance requirements. What RTO and RPO actually mean is best illustrated with an example:
If you're playing a video game, a crash is a disaster. The time it takes to boot the game back up is the recovery time. When the game starts back up, you've lost some progress and have to start from an earlier autosave; that's the recovery point.
The goal is to have systems in place to maintain business continuity by reducing data loss and minimizing downtime for end users. This is done by deploying redundant infrastructure and data backups in alternate locations since physical infrastructure is geographically vulnerable.
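Put in concrete terms: RTO is measured from the moment of failure to the moment service is restored, and RPO from the last recoverable point (the autosave) to the moment of failure. Here is a minimal sketch of that arithmetic, with hypothetical timestamps and targets:

```typescript
// Hypothetical incident timeline -- the timestamps are made up for illustration.
const lastGoodBackup = new Date("2021-12-07T10:45:00Z"); // last "autosave"
const failureDetected = new Date("2021-12-07T11:03:00Z"); // the outage begins
const serviceRestored = new Date("2021-12-07T12:30:00Z"); // users can play again

const minutes = (a: Date, b: Date) => (b.getTime() - a.getTime()) / 60_000;

// RPO: how much data/progress was lost (failure minus last recoverable point).
const actualRpoMinutes = minutes(lastGoodBackup, failureDetected); // 18
// RTO: how long users were down (restoration minus failure).
const actualRtoMinutes = minutes(failureDetected, serviceRestored); // 87

// The "objective" part is the target you commit to ahead of time.
const targets = { rpoMinutes: 15, rtoMinutes: 60 };
console.log(actualRpoMinutes <= targets.rpoMinutes); // false -- missed the RPO
console.log(actualRtoMinutes <= targets.rtoMinutes); // false -- missed the RTO
```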
The org's AWS technical account manager tries to reassure everyone by confirming that the issue is only regional. Having seen an amazing disaster recovery demo from the tech team six months ago, an executive on the call, feeling like they're about to save the day, asks:
This is where we discover that DR capability is different from a DR demo proving that failover is possible. Software changes over time. If there is an expectation that the application has recovery capability, each change must be made with consideration for how it will behave in failure scenarios. To be clear: not try/except wrapping some new code for some feature, but how the system with this new code and feature will behave when the infrastructure around it starts to break. Adopting DR is a culture change that goes beyond the developer. Maintaining availability requires continuous overhead in time and effort across development, management, and architecture.
The answer to this question is, as usual, not a confident one.
The contributing factors to this lack of confidence come from those same nuances of the "What did we change last?" question.
When the devops lead is finally located, they remember that something has changed since the DR demo: the recovery process is different now, and someone has to write an additional script. They explain the functionality briefly, but unfortunately they can't do it themselves. They have "a hard stop" and have "to drop" to watch their pets and children on their day off. They will then hand off to you.
Thirty-five highly paid managers on the call, normally skeptical of pair programming, become an engaged operating-theater audience for a couple of developers hacking together a script that will never make it to version control.
When the script is finally written, a decision must be made.
The story has had a number of twists so far. The original recovery plan turned out to be out of date and had to be duct-taped back together. The plan's original creators have given their warnings, and someone new must own the success or failure of the recovery. The decision should be made with caution, of course, and so the risks need to be discussed.
QA will list the environments they would like the script to be tested in, and the business execs will bring up the need to coordinate with other organizational units. A few more people will join the call, bringing the final per-hour burn rate to something approaching a developer's monthly salary.
AWS will inevitably fix us-east-1, and very probably before the discussion concludes. Someone will interrupt to announce:
Everyone congratulates everyone on their hard work. The tech team sees a bright and hopeful future. They vow to bring this up in the post-mortem as an indisputable argument to convince product to finally pay down the tech debt.
There are multiple well-documented strategies and architecture patterns for achieving recovery objectives. While there are a number of these, they can be reduced to just two: active-active, where every region serves traffic all the time, and active-passive (pilot light, warm standby, and friends), where a standby region must be promoted when the primary fails.
please share all of your great successes with manually orchestrating active-passive/pilot light recoveries in the comments
There is nothing fundamentally wrong with the passive approaches to recovery. The real problem is that unless the organization has the capacity to continuously pay the overhead of maintaining and practicing its recovery strategy, the recovery plan will always be out of date, like in the story above. In an active-active configuration, applications naturally have to become region-agnostic, and shifting traffic from one region to another naturally has to be automatic.
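What "region-agnostic" looks like in practice is mostly discipline: nothing in the code assumes a home region, and the region comes from whatever environment the code happens to be running in. A minimal sketch, assuming the AWS SDK v3 DynamoDB client and a hypothetical TABLE_NAME environment variable pointing at a global table replicated to every deployed region:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";

// No hard-coded region: the client picks up AWS_REGION from the runtime,
// so the same artifact can be deployed to us-east-1, us-west-2, eu-west-1...
const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// TABLE_NAME is a hypothetical env var pointing at a global table replica
// that exists in every region this code is deployed to.
export async function getOrder(orderId: string) {
  const result = await client.send(
    new GetCommand({
      TableName: process.env.TABLE_NAME,
      Key: { pk: `ORDER#${orderId}` },
    })
  );
  return result.Item;
}
```

When every region can serve traffic, "failing over" stops being a promotion ceremony and becomes the routing layer simply not sending requests to the broken region.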
It is true that for many systems, especially those built on always-on infrastructure with a per-hour cost, active-active is really expensive. For every compute instance you have serving traffic, you need another one turned on in a different region.
On-demand pricing is common for most products calling themselves serverless. With that, redundancy doesn't have to cost double, and with much of serverless being heavily config-driven, the complexity is also reduced.
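As a taste of the config-driven part (covered properly in the next post), here is a rough sketch of DNS failover between two regional endpoints using the AWS SDK v3 Route 53 client; the hosted zone ID, health check IDs, and IP addresses below are placeholders:

```typescript
import {
  Route53Client,
  ChangeResourceRecordSetsCommand,
} from "@aws-sdk/client-route-53";

const route53 = new Route53Client({});

// Hypothetical values -- substitute your own hosted zone, health checks,
// and regional endpoint addresses.
const record = (
  failover: "PRIMARY" | "SECONDARY",
  ip: string,
  healthCheckId: string
) => ({
  Action: "UPSERT" as const,
  ResourceRecordSet: {
    Name: "api.example.com",
    Type: "A" as const,
    SetIdentifier: `api-${failover.toLowerCase()}`,
    Failover: failover,
    TTL: 60,
    ResourceRecords: [{ Value: ip }],
    HealthCheckId: healthCheckId, // Route 53 only answers with healthy records
  },
});

// While the PRIMARY health check passes, all traffic goes to us-east-1;
// when it fails, Route 53 automatically starts answering with the SECONDARY.
await route53.send(
  new ChangeResourceRecordSetsCommand({
    HostedZoneId: "Z0000000000EXAMPLE",
    ChangeBatch: {
      Changes: [
        record("PRIMARY", "203.0.113.10", "11111111-aaaa-bbbb-cccc-placeholder"),
        record("SECONDARY", "203.0.113.20", "22222222-aaaa-bbbb-cccc-placeholder"),
      ],
    },
  })
);
```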
We will be posting a series on building global applications with serverless. Check out the first part in the series >> Routing and Health Checks