You Might Not Have A Disaster Recovery Strategy

by Timothy Carioscio

Though our infrastructure today is much more robust than it's ever been, failures will still occur. It's an unfortunate truth in technology, and it keeps many IT directors and CTOs up at night. Every company whose operations rely on its tech stack staying up should have a disaster recovery strategy. In the unlikely event of a hardware failure, system outage or even a hack, companies should have a plan to recover their system(s) within their pre-defined time objectives.

You may be saying to yourself, "Phew, I was drawn in by that click-bait-y title, but I have all those things. I'm in the clear." Not so fast. Having that disaster recovery plan is a great first step, but if you and your team haven't tested it out, you can't be sure that it'll work as designed. That recovery time objective (RTO) and recovery point objective (RPO) may be right, but without putting your recovery plan through its paces, it's hard to be sure. There may even be a flaw in the design of your DR strategy that will prevent you from recovering your system at all! Wouldn't you rather find that flaw during a test and not during an actual outage? If you have a disaster recovery strategy, but it hasn't been tested, you don't have a disaster recovery strategy.
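To make "putting your recovery plan through its paces" a bit more concrete, here is a minimal sketch of how a team might turn the timestamps from a DR drill into achieved RTO and RPO figures and compare them against their targets. The class name, field names, timestamps and targets below are all hypothetical and chosen purely for illustration; they are not taken from any particular tool or from the drill described in this post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Timestamps recorded during a (hypothetical) DR drill."""
    outage_start: datetime      # when the simulated outage began
    last_recoverable: datetime  # newest data point present after recovery
    service_restored: datetime  # when the system was usable again

    @property
    def achieved_rto(self) -> timedelta:
        # Recovery time: how long the system was unavailable.
        return self.service_restored - self.outage_start

    @property
    def achieved_rpo(self) -> timedelta:
        # Recovery point: how much recent data was lost.
        return self.outage_start - self.last_recoverable


def meets_objectives(result: DrillResult,
                     rto_target: timedelta,
                     rpo_target: timedelta) -> bool:
    """True only if both measured values are within their targets."""
    return (result.achieved_rto <= rto_target
            and result.achieved_rpo <= rpo_target)


# Made-up example values, not real drill data.
drill = DrillResult(
    outage_start=datetime(2024, 1, 15, 10, 0),
    last_recoverable=datetime(2024, 1, 15, 9, 45),
    service_restored=datetime(2024, 1, 15, 13, 30),
)

print(drill.achieved_rto)  # 3:30:00
print(drill.achieved_rpo)  # 0:15:00
print(meets_objectives(drill, timedelta(hours=4), timedelta(minutes=30)))  # True
```

The point of the sketch is simply that the RTO and RPO you quote should come from numbers you measured during a test, not from numbers you wrote down in a plan.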

Equinox Gold Corp., one of our most innovative and forward-thinking clients, agrees with the thesis of this post. They are in the process of evaluating a new disaster recovery strategy for their S/4HANA workloads. Before trusting their productive environment to a new DR strategy, they decided to put it through its paces. They scheduled a simulated outage of their sandbox S/4HANA instance during their work hours, and gave the team who would be performing the recovery advance notice. Ross and Aaron from the CONTAX basis and infrastructure team were invited to participate.

The team practiced recovering the system per the new DR strategy. It was a success! They were able to recover the system and update the DR documentation to include the hiccups that they ran into during the DR exercise. Equinox has proven that their disaster recovery strategy is feasible, and the team learned quite a bit from their first recovery attempt.

To build on their success, they're going to schedule a second DR trial run to confirm and validate the new assumptions that have been made. Additionally, insofar as it's possible, the second trial will be performed by resources who didn't participate in the first. This will give more folks exposure to the DR process, which will help them build their comfort level with it. They will also be more likely to identify flaws or omissions in the DR documentation, since they are less familiar with the process. For those two reasons, routinely performing DR trials with a wide variety of resources is important even if the strategy hasn't been changed.

If Equinox ever has a production outage, they will be confident in their thoroughly tested DR strategy. All of the members of the team who are recovering the instance will know their roles and will have already performed a recovery in the low-stress environment of a DR trial run. They will have detailed documentation, and their muscle memory ought to kick in. They will be able to confidently quote the RPO and RTO to their business users, because they have recovered the system before.

Disaster recovery is always stressful. SAP workloads are business-critical, and unplanned outages can cost thousands or even millions of dollars. Don't let an untested DR strategy add to your stress or, worse, be the reason that you're unable to recover at all.

Title photo by Hush Naidoo on Unsplash

About the author: Timothy Carioscio

Tim is an AWS evangelist. Rather than having his head in the clouds, he lives with the Cloud in his head.