Modern SaaS applications are complicated. They are sprawling tangles of technology that can be hard to reason about, even on a good day. Planning for when things jump off the happy path is even harder. At The Scale Factory we help a lot of organizations validate their disaster recovery plans. Here are some of the issues we often see, even when they have all of their runbooks in place and their recovery objectives defined.
When seconds count, the restore is only minutes away
On a chilly January afternoon in 2009, a US Airways flight ditched in the Hudson River after a flock of geese struck both engines, leaving them inoperable. In the investigations that followed, flight simulations showed that the aeroplane could have safely made it to nearby LaGuardia airport without performing a dangerous emergency landing on a river. Despite this, the flight crew are lauded as heroes and the event is now known as “The Miracle on the Hudson”. Why? Because the simulations were unrealistic: they involved a fully briefed crew making the turn back to LaGuardia immediately after engine power was lost. When the simulation was re-run with extra time added for the crew to diagnose the issue, the plane crashed, likely killing everyone on board.
The point here is that during a real disaster recovery situation you cannot assume it will be self-evident that a restore will be needed and that the recovery process will be initiated immediately. We often see recovery plans that only budget for the bare minimum of recovery time. If you have a recovery time objective of 1 hour, and you know you can do your restore procedure in 50 minutes, that leaves only 10 minutes for you to work out what’s actually going wrong. Budget for a realistic amount of debugging time during the disaster to determine the needed course of action. For the crew of US Airways flight 1549, there was only a 35-second difference between safe landing and complete disaster.
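To make that budgeting concrete, here’s a minimal sketch with purely illustrative numbers (none of these phases or durations come from a real system) showing how quickly a one-hour objective gets eaten once you account for detection and decision-making, not just the restore itself:

```python
# Purely illustrative numbers: break a one-hour recovery time objective
# into phases to see how little slack is left for working out what's wrong.
RTO_MINUTES = 60

phases = {
    "detect and triage the incident": 15,        # hypothetical estimates
    "decide that a restore is required": 10,
    "run the documented restore procedure": 50,
    "reconfigure and validate the application": 10,
}

total = sum(phases.values())
for name, minutes in phases.items():
    print(f"{minutes:>3} min  {name}")
print(f"{total:>3} min total, against an RTO of {RTO_MINUTES} min "
      f"({total - RTO_MINUTES} min over)")
```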
When dealing with cloud backups you also have to think about the limitations of your tooling and how they might affect restore times. In AWS, most services that use snapshots as their native backup mechanism (e.g. RDS, EBS) require you to restore by creating a whole new resource from the snapshot. You will not be restoring to the existing resource, which means you have to account not only for the time it takes to create a new resource from the snapshot, but also for the time it takes to reconfigure your system to use it. For a database restore, this may include reconfiguring applications with new credentials, importing the new instance into your infrastructure-as-code configuration, or flushing caches. It’s crucial to understand how long the entire process will take end to end, not just the data restore time for a single resource.
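To illustrate what “end to end” means in practice, here’s a rough boto3 sketch of a database restore; the identifiers and the SSM parameter are placeholders, and a real runbook would also need to handle security groups, parameter groups, and credential rotation:

```python
import boto3

rds = boto3.client("rds")
ssm = boto3.client("ssm")

# Restoring a snapshot always creates a *new* DB instance; you cannot
# restore in place over the existing one.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="orders-db-restored",              # placeholder
    DBSnapshotIdentifier="orders-db-snapshot-2024-01-01",   # placeholder
    DBInstanceClass="db.r6g.large",
)

# Waiting for the new instance to become available can take a long time
# for large databases, and all of it counts against your RTO.
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier="orders-db-restored"
)

# The restored instance has a new endpoint, so the application (or its
# configuration store) has to be repointed at it before recovery is done.
endpoint = rds.describe_db_instances(
    DBInstanceIdentifier="orders-db-restored"
)["DBInstances"][0]["Endpoint"]["Address"]

ssm.put_parameter(
    Name="/orders/db/endpoint",   # hypothetical application config key
    Value=endpoint,
    Type="String",
    Overwrite=True,
)
```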
Sink or sync
In DevOps circles you just need to mention the name “Knight Capital” to get knowing nods from all assembled. Their story is now the fable told to junior engineers to illustrate what not to do when releasing software. In 2012 Knight Capital, the largest trader of equities on the two largest stock exchanges in the world, were at the top of their game. However, on August 1st disaster struck. An update to an order routing system was applied to 7 of the 8 running servers. Because the update re-used an existing flag, the eighth, unpatched server started to process orders incorrectly. Over the next 45 minutes, Knight Capital lost $440 million. If this is such a commonly known story, why am I retelling it now? Because we so often see this kind of error baked into disaster recovery strategies.
Ultimately, Knight Capital’s issues came down to a simple desynchronization in their system: two versions of the same application running simultaneously. Generally speaking, the industry has learnt from this: deployments are automated, and servers are treated like cattle, not pets. However, companies often don’t apply the same learnings to their data sources. Frequently we see recovery strategies that almost guarantee two versions of the data will be live after a recovery. This happens when organizations don’t take into account the relations between their data sources. A simple example would be a relational database which stores references to files in object storage. Too often organizations treat these two data stores as completely separate systems with their own recovery procedures and objectives, but of course they aren’t. If these systems are restored independently, they run the risk of getting out of sync. This probably won’t result in the same scale of disaster that befell Knight Capital, but this kind of data corruption can be difficult, if not impossible, to clean up after the fact.
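One way to catch that kind of drift is to reconcile related stores against each other after a restore. The sketch below assumes a hypothetical schema, where a documents table holds S3 keys, and simply checks that every object the database references still exists in the bucket:

```python
import sqlite3  # stand-in for whatever relational database you run

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "example-uploads"   # placeholder bucket name

# Hypothetical schema: a 'documents' table whose 's3_key' column points
# at objects stored in the bucket.
conn = sqlite3.connect("app.db")
rows = conn.execute("SELECT id, s3_key FROM documents")

missing = []
for doc_id, key in rows:
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
    except ClientError:
        # The database references a file the restored bucket doesn't have:
        # the two stores came back from different points in time.
        missing.append((doc_id, key))

print(f"{len(missing)} database rows reference objects missing from S3")
```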
Modern cloud SaaS applications often have many different data stores that only become a useful whole when combined. Organizations need to be aware of these dependencies between their data stores and plan around them. Often these data stores will have different backup and restore mechanisms, for example snapshots versus object versioning, and processes will need to be put in place to ensure those methods can be synchronized during a real recovery operation. It’s useful to assemble a centralized data catalogue of all your data stores; from there you can compare related data stores’ recovery point objectives to ensure they are in line with each other.
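As a sketch of what comparing those objectives could look like (the stores, mechanisms and numbers here are invented for illustration), even a simple script over a data catalogue can flag related stores whose recovery point objectives don’t line up:

```python
# Invented example of a minimal data catalogue: each entry records the
# store's backup mechanism, its recovery point objective in hours, and
# the other stores whose data it depends on.
catalogue = {
    "orders-db": {
        "mechanism": "RDS snapshots", "rpo_hours": 1,
        "related": ["uploads-bucket"],
    },
    "uploads-bucket": {
        "mechanism": "S3 versioning", "rpo_hours": 24,
        "related": ["orders-db"],
    },
}

# Related stores with different RPOs can diverge by up to the larger of
# the two objectives after a restore.
for name, store in catalogue.items():
    for other in store["related"]:
        if catalogue[other]["rpo_hours"] != store["rpo_hours"]:
            print(f"{name} ({store['rpo_hours']}h RPO) depends on {other} "
                  f"({catalogue[other]['rpo_hours']}h RPO): a restore could "
                  "leave them out of sync")
```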
The best laid schemes of mice and CISOs…
There is a single question you can ask to determine whether a recovery plan is up to scratch: “Have you tried it?” Too often we hear the response: “no”. GitLab learned this the hard way in 2017 when an engineer accidentally wiped their primary production database. When they went to restore from backup, they found that none of the backup processes had been working as expected. Regular dumps of the database to an S3 bucket had been failing, but the notifications were being black-holed. They immediately had to go off script, which turned the incident from a standard runbook execution into an improvised scramble to restore order. By sheer luck, disk snapshots had been taken six hours before the outage to test a new system in their staging environment. From these they were able to get GitLab back up and running, but it took 18 hours and came with a significant amount of data loss.
Having a disaster recovery plan written down is not good enough. Software development is fast-paced: systems change, features break, and something that worked once is not guaranteed to work forever. The only way to check that your restore procedures are working is by actually performing them. Tests should be run regularly to find problems caused by changes, on systems that look as close to production as possible, with the same sizes of data and the same access controls. When these tests fail, that’s a golden opportunity to update your plans to match reality in a safe environment. By testing regularly you are also training engineers to perform the steps under pressure, reducing errors when a real issue strikes. Contingencies should also be planned for and tested: if key personnel are not available, is there a break-glass procedure so others can perform the restore steps? Test that too.
You can simplify disaster recovery in the cloud by using native backup solutions. For example, AWS Backup lets you use the same backup and restore tooling across many different AWS services, simplifying runbooks and thereby reducing the burden of testing your restore procedures. In addition, AWS Backup has recently released automated restore testing, which can further increase confidence in your restore tooling.
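To give a flavour of what a scripted restore test might look like with AWS Backup (the vault name and IAM role below are placeholders, and the restore Metadata keys depend on the resource type, so check get_recovery_point_restore_metadata for the real set), something along these lines picks the latest recovery point in a vault and restores it into a throwaway resource:

```python
import boto3

backup = boto3.client("backup")

VAULT = "production-vault"   # placeholder vault name
ROLE_ARN = "arn:aws:iam::123456789012:role/backup-restore-role"  # placeholder

# Find the most recent recovery point in the vault.
points = backup.list_recovery_points_by_backup_vault(
    BackupVaultName=VAULT
)["RecoveryPoints"]
latest = max(points, key=lambda p: p["CreationDate"])

# Restore it into a throwaway resource. The required Metadata keys vary
# by resource type; DBInstanceIdentifier is assumed here for an RDS
# recovery point.
job = backup.start_restore_job(
    RecoveryPointArn=latest["RecoveryPointArn"],
    IamRoleArn=ROLE_ARN,
    Metadata={"DBInstanceIdentifier": "restore-test-orders-db"},
)
print("Started restore test job", job["RestoreJobId"])
```

The managed restore testing feature effectively runs this kind of job for you on a schedule and cleans up the restored resources afterwards, so you get regular evidence that your recovery points actually restore.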
We can help
We’ve seen it all at The Scale Factory, and our expert consultants have helped guide numerous companies through their backup and disaster recovery journeys. Whether you need help setting up baseline tooling like AWS Backup, or want a holistic assessment of your disaster recovery procedures to ensure they are fit for purpose, we can help. Book in your free disaster recovery healthcheck today.