Fire drills for software teams

Teams involved in the post-mortem will come up with remediation work, aimed at preventing this kind of failure from happening again. Everyone who’s had their fingers burned will be highly motivated to define this work, and make it happen. The team wants to end this period with a renewed sense of safety.

But there’s always the nagging question, the same kind of concern that slinks out of the security realm and nestles on the shoulders of engineers: What are we missing? Where’s the next failure going to happen? What are we doing — or not doing — right now that’s going to bite us in six months?

Post-mortems should be enlightening events, but they’re also by definition in reaction to a particular failure. It’s hard to find the boundaries of the post-mortem analysis, so that you can point beyond and ask, “But what about that thing we’re not thinking about yet?” We need an additional practice, one that can produce more failures — which is to say, more learning — in a controlled environment.

DoSomething had a file-loss disaster of its own in the summer of 2016. The general summary is all too familiar, especially after reading about that GitLab incident: one of us restarted a file transfer process in production, late at night. The process started with an incorrect configuration (tread carefully with any tool that has a “destructive mode”), and diligently deleted hundreds of thousands of files before we were able to identify and stop it. Much worse, we discovered that our core backup process had been silently failing for some time. We eventually managed to recover 94% of the files from old backups and other sources, but the pain of the recovery effort, not to mention the unrecoverable 6%, has stayed with us.
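One concrete lesson: a backup job that runs is not the same as a backup that exists. Here is a minimal sketch of the kind of independent freshness check that would catch a silently failing backup, assuming backups land in an S3 bucket (the bucket name, prefix, and threshold are hypothetical, not our actual setup):

```python
# Hypothetical freshness check for backups stored in S3.
# Bucket, prefix, and age threshold are illustrative only.
from datetime import datetime, timedelta, timezone

import boto3


def latest_backup_is_fresh(bucket="example-backups", prefix="nightly/", max_age_hours=24):
    """Return True if the newest object under the prefix is recent and non-empty."""
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    objects = response.get("Contents", [])
    if not objects:
        return False  # No backups at all: alert loudly.
    newest = max(objects, key=lambda obj: obj["LastModified"])
    age = datetime.now(timezone.utc) - newest["LastModified"]
    return newest["Size"] > 0 and age < timedelta(hours=max_age_hours)


if __name__ == "__main__":
    if not latest_backup_is_fresh():
        raise SystemExit("Backup check failed: page someone.")
```

Something like this should run on its own schedule, separate from the backup job itself, so the check can’t fail silently along with it.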

The nature of fire drills will of course vary among teams and industries. For us, fire drills started with two main concerns: lost or corrupted data sets, and lost deployed systems (servers and networks). Fire drills crucially include communication, documentation, and other human factors inside their scope, whereas Chaos Monkey-style failure engineering restricts its scope to software design.

Our infrastructure team set a goal to run two to three fire drills per quarter. To determine where to start, we picked a system that was critical, but fairly self-contained. We wanted to concentrate first on the mechanics and logistics of running a fire drill for the first time. We needed to get through some basic questions: What does it look like? Who’s involved? Who’s in charge? How long does it take? How do you define the disaster? How do you record what happens? What do you do when it’s over? And so on.

Our team has run enough of these to define some requirements and practices:

We’re going for a certain level of verisimilitude, with a cast of characters and clear boundaries around the action. So we started with a kind of film treatment: a short plot summary of the disaster scenario.

Note that the treatment also defines what’s in scope, and what’s not: in this case, AWS itself is out of scope for the exercise.
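The treatment itself is prose, but the same information can be pinned down as a small structured record alongside the drill doc. A sketch, with hypothetical scenario details that are illustrative rather than our actual treatment:

```python
# Hypothetical drill scenario record; the details below are illustrative only.
drill_scenario = {
    "name": "Primary database host is lost",
    "premise": "The production database instance is terminated without warning.",
    "in_scope": [
        "restoring data from backups",
        "rebuilding the host",
        "team communication and documentation",
    ],
    "out_of_scope": ["AWS itself failing", "application code changes"],
    "start_signal": "Facilitator announces the outage in the team channel.",
}
```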

For a cast of characters, we wanted to specify not just who was involved, but what roles they should play. This serves as part of our institutional memory. It also lets us play with changing some of the parameters, or expanding the scope of the problem, in subsequent drills against the same system.

In a fire drill for an org our size (which is to say, a staff of 60), it made sense to us to have a single person making the big decisions (that’s the Accountable, the A in RACI). More than one person could be doing the work (the Responsible, or R). Others need to give feedback and provide institutional memory, especially if we’re trying to recreate a system with incomplete documentation (the Consulted, or Cs). And some others just need to know when the drill starts, how it’s going, and when it’s over (the Informed, or Is).

In practice, this can be simple: four R, A, C, I bullet points at the top of a doc. But deciding who is involved, and how, is crucial for providing clarity in a confusing situation. One of the most costly problems in an emergency is not knowing who needs to make critical judgment calls.
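As a sketch of how little this needs to be, the RACI block at the top of the doc could amount to the following, where the names and roles are placeholders rather than our actual assignments:

```python
# Placeholder RACI block for a drill doc; the roles listed are hypothetical.
raci = {
    "responsible": ["engineer restoring the data", "engineer rebuilding the host"],
    "accountable": "infrastructure lead",  # the single decision-maker
    "consulted": ["original system authors", "product owner"],
    "informed": ["rest of engineering", "support team"],
}
```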

How will we know when we’ve brought the system back? Are we integrating this system into a larger deployment to prove that it works normally, or running a test suite against it? Like an agile team’s “definition of done,” this step tells the team what success looks like, so they can end the fire drill.
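As an illustration of the test-suite flavor of acceptance criteria, here is a minimal smoke-test sketch, assuming the restored system exposes an HTTP health endpoint and a record count we can compare against a pre-disaster baseline (the URL and numbers are hypothetical):

```python
# Minimal smoke test for a restored system; the URL and expected count are hypothetical.
import requests

BASE_URL = "https://restored-service.example.internal"
EXPECTED_MIN_RECORDS = 450_000  # rough count from before the simulated disaster


def test_health_endpoint():
    response = requests.get(f"{BASE_URL}/health", timeout=5)
    assert response.status_code == 200


def test_data_was_restored():
    response = requests.get(f"{BASE_URL}/api/records/count", timeout=5)
    assert response.json()["count"] >= EXPECTED_MIN_RECORDS
```

When a suite like this passes against the rebuilt system, the drill is done.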

We use the artifice of a fire drill to make some bets about how we’ll react. How do we think we’ll restore systems? What are we worried about? We can compare this analysis to what actually happens, and extract important information from the deviations between plan and reality.

Not every team will take this step: some want the test to include how well the team analyzes the situation and creates a resolution path in the heat of the moment.

Rayhill emphasizes the “authenticity of the drill(s)…. Being able to simulate a real drill with real actors and real consequences is the only way to challenge your organization to be as ready as they can be.”

Let’s say a system goes down, and the backups are unavailable — in other words, the actual incident that DoSomething experienced in July 2016. How long should the fire drill go on? Does the team keep banging its head against the wall for twenty-four hours, or call it quits after eight? It’s especially important to set a time limit for when drills go wrong, so this doesn’t become a demoralizing practice.

In the case of an absolute failure to restore, end the fire drill, fix the fatal problems, and reschedule the drill.

In the heat of the moment, we may make a series of failed attempts, or consider approaches that we soon abandon. We want to write these down as a timeline: the series of events we can examine later to find holes in our decisions, systems, and documentation.

At DoSomething, we’re likely coordinating the response as usual in Slack, but we’ve typically kept response notes in a shared Google Doc, so that everyone can contribute and review. As institutional memory, these notes are gold: save them and keep them organized.
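If you want the timeline to be easy to reconstruct afterwards, even a few lines of tooling help. A sketch that appends timestamped notes to a local file, which can later be pasted into the shared doc (the file name and example entries are hypothetical):

```python
# Append timestamped drill notes to a local file; file name and entries are hypothetical.
from datetime import datetime, timezone


def log_event(note, path="fire-drill-timeline.log"):
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    with open(path, "a") as f:
        f.write(f"{timestamp}  {note}\n")


# Example entries during a drill:
log_event("Facilitator declares the database host lost.")
log_event("Tried restoring from Tuesday's snapshot; checksum mismatch, abandoned.")
log_event("Restore from Monday's snapshot succeeded; starting smoke tests.")
```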

After the event, analyze as you would an actual incident. Look at the timeline and all of the artifacts, including the rebuilt system. Did we get it back up and running? Where were the interesting gaps? If we didn’t get the system to pass our acceptance criteria, what are the most crucial pieces to get in place before we rerun the fire drill?

We’re shooting for a simple rule: we must address crucial gaps uncovered by one fire drill before we start another. This helps prevent us from making the same mistakes in two subsequent fire drills. It also keeps the team from getting overwhelmed by its backlog.

Other teams may decide not to enforce this serializing of remedies and subsequent drills. Rayhill allows that “the remedy for a certain issue might be costly to the organization.” If her team determined the cost was too high, they would still “document our reasoning so that it was available to the auditors.”

The primary principle of fire drills is to use a lab environment to create as much information as possible about failures. Use drills to discover weak spots, and subsequently to pound away at those weak spots until they become stronger. Your weaknesses might be around communication and decision-making, freshness and availability of backups, alerts and monitoring, documentation, unnecessary complexity, or automation.

Most likely, teams that run fire drills will discover failures across a number of categories. Each of these discoveries should cause a celebration, because the team that finds a problem in the lab can fix it before it bites them in the wild.
