It’s World Backup Day. I bet that’s a date on your calendar you were looking forward to for the whole 12-month run-up. It should be a public holiday. OK, maybe not.
(Also, I’ve never tried backing up the world, but I expect you need quite a few tapes to fit it all in.)
For SaaS companies, your platform is your business, and problems with the tech can have devastating consequences, both human and financial. Investing in ways to recover from disasters is a way to protect against that. Backups are a mitigation against loss of data, whether that’s the result of component problems or user error.
Solutions for backups and disaster recovery need a different approach from the one you use for the main bit of your workload. That especially shows when it comes to testing: with a typical SaaS workload, you can try things out with users, either real end users or colleagues who are doing final testing. At some point your code meets the customer and you get to find out if they’re happy.
As you can imagine, I don’t recommend trying that with backups. I’m all for a bit of chaos engineering—in moderation, and perhaps even in production—but not to the extent of losing actual data to find out what happens.
The hard bit, then, is actually not the bit where you take the backup, and it’s not even the part where you restore the data. Those are important, I’ll grant, but the real value comes from having the means to assure you that backups and restores will work when you want them. Every time.
If you’re reading this, I assume you’re already running in the cloud (maybe even on AWS, the platform we specialise in at The Scale Factory). You’ll know how much easier it is to run a DR rehearsal or a test restore when you can bring up a copy of your environment within minutes, verify that things work, and manage all of that via automation.
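To make "verify that things work" concrete, here’s a minimal sketch in Python of a restore rehearsal harness: restore into a scratch environment, check it, tear it down, and time the whole thing against your target recovery time. The `restore`, `verify` and `teardown` hooks are hypothetical placeholders for your own automation (say, bringing up a copy of your environment from the latest backup and running smoke tests against it); nothing here is an AWS API.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DrillResult:
    passed: bool            # did the restored copy pass verification?
    elapsed_seconds: float  # time from starting the restore to a verified copy
    within_rto: bool        # passed, and fast enough for the target


def run_restore_drill(
    restore: Callable[[], object],
    verify: Callable[[object], bool],
    rto_seconds: float,
    teardown: Optional[Callable[[object], None]] = None,
    clock: Callable[[], float] = time.monotonic,
) -> DrillResult:
    """Restore into a scratch environment, check it, and time the whole thing."""
    started = clock()
    restored = restore()  # hypothetical hook: e.g. bring up a copy from the latest backup
    try:
        passed = bool(verify(restored))  # hypothetical hook: e.g. smoke tests, row counts
        elapsed = clock() - started
    finally:
        if teardown is not None:
            teardown(restored)  # don't leave the rehearsal copy running (or billing)
    return DrillResult(passed, elapsed, passed and elapsed <= rto_seconds)
```

The injectable `clock` is only there so the timing can be tested; in normal use you’d leave it alone. Running something like this on a schedule, rather than once before an audit, is what turns "we have backups" into evidence that restores work.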
Scripts and snapshots
Another great benefit of the cloud is having APIs to take a system snapshot. Cloud services make this a single operation, or maybe two (one to trigger it, and another to check on progress), and they also shift the responsibility for carrying out that operation away from you, the customer.
With a bit of code, you can trigger a snapshot, make a copy, set some metadata and report status to your monitoring stack.
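Here’s a minimal sketch of that flow in Python, assuming boto3 and an EBS volume as the thing being snapshotted. The EC2 methods used (`create_snapshot`, `describe_snapshots`, `copy_snapshot`) are real boto3 client calls; the function name, tags and polling limits are illustrative assumptions, and copying to a second region is just one way to "make a copy".

```python
import time
from typing import Any


def snapshot_and_copy(
    ec2_source: Any,
    ec2_dest: Any,
    volume_id: str,
    source_region: str,
    tags: dict[str, str],
    poll_seconds: float = 15.0,
    max_polls: int = 240,
) -> dict[str, str]:
    """Snapshot an EBS volume, wait for it to finish, then copy it to another region.

    ec2_source / ec2_dest are boto3 EC2 clients, e.g.
    boto3.client("ec2", region_name=...). The returned dict is the status
    you would forward to your monitoring stack (left to the caller).
    """
    tag_list = [{"Key": k, "Value": v} for k, v in tags.items()]

    # Operation 1: trigger the snapshot, setting metadata (tags) as it is created.
    snap = ec2_source.create_snapshot(
        VolumeId=volume_id,
        Description=f"Backup of {volume_id}",
        TagSpecifications=[{"ResourceType": "snapshot", "Tags": tag_list}],
    )
    snapshot_id = snap["SnapshotId"]

    # Operation 2: check on it until EC2 reports a final state (or we give up).
    state = snap.get("State", "pending")
    for _ in range(max_polls):
        if state in ("completed", "error"):
            break
        time.sleep(poll_seconds)
        state = ec2_source.describe_snapshots(SnapshotIds=[snapshot_id])["Snapshots"][0]["State"]
    if state != "completed":
        return {"snapshot_id": snapshot_id, "status": f"snapshot {state}"}

    # Make a copy: copy_snapshot is called on a client in the *destination* region.
    copy = ec2_dest.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"Copy of {snapshot_id}",
        TagSpecifications=[{"ResourceType": "snapshot", "Tags": tag_list}],
    )
    return {"snapshot_id": snapshot_id, "copy_id": copy["SnapshotId"], "status": "copy started"}
```

The polling loop is written out to mirror the trigger-then-check pattern; boto3 also offers an EC2 `snapshot_completed` waiter (`ec2.get_waiter("snapshot_completed")`) that does the same job. Encrypted snapshots copied across regions may also need a `KmsKeyId`, which this sketch doesn’t handle.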
Beyond the basics
Trouble is, S3 is not a backup. With S3 (or Glacier, for that matter), AWS takes care of the has-my-drive-failed side of durability. That’s nice, but as Corey Quinn explains in the article, you still need to worry about all the other ways you can lose your data.
It’s usually worth keeping the ability to wipe your backups by accident apart from the ability to delete or corrupt the originals. For some extra isolation, you can set up a separate AWS account and copy the snapshot data there; that’s usually a sound choice. You can go further and copy the data outside of AWS, to a different cloud provider or onto on-premises equipment, but most firms wouldn’t have a strong case for that much isolation. If the backup data are binary blobs (or you can export them), you can copy those into S3 or Glacier and then use the managed locking features (S3 Object Lock, or Vault Lock for Glacier vaults) to ensure they don’t get deleted before a retention period you set has passed.
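If you do go down the export route, here’s a hedged sketch of uploading a backup blob to S3 with an Object Lock retention period, using boto3’s `put_object`. It assumes the bucket was created with Object Lock enabled and, ideally, lives in a separate backup account; the function names and the retention length are illustrative.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Any, Optional


def retain_until(days: int, now: Optional[datetime] = None) -> datetime:
    """Retention date for Object Lock: `days` from now, in UTC."""
    now = now or datetime.now(timezone.utc)
    return now + timedelta(days=days)


def upload_locked_backup(s3: Any, bucket: str, key: str, path: Path, days: int) -> dict:
    """Upload an exported backup blob under an Object Lock retention period.

    `s3` is a boto3 S3 client, ideally using credentials for a separate
    backup account. The bucket must already have Object Lock enabled.
    """
    with path.open("rb") as body:
        return s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=body,
            # COMPLIANCE: nobody, not even the account root user, can delete
            # or overwrite this object version until the date passes.
            ObjectLockMode="COMPLIANCE",
            ObjectLockRetainUntilDate=retain_until(days),
            # Uploads with an Object Lock retention period need an integrity checksum.
            ChecksumAlgorithm="SHA256",
        )
```

Compliance mode is deliberately unforgiving: set a year by mistake and you’re storing (and paying for) that object for a year. Governance mode (`ObjectLockMode="GOVERNANCE"`) lets principals with the `s3:BypassGovernanceRetention` permission override the lock, which can be a gentler place to start.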
Backup management as a service
Unless you happen to be in the backups business yourself, though, all that effort on automation is some distance from the value you’re providing to customers.
Handing that work to a managed service is super useful if compliance, and evidence of compliance, are a big part of how you’re doing backups. It’s one thing to do them well and another thing again to prove that. You probably know if you need to be compliant, because your customers or prospects are sending you spreadsheets and asking about standards. Or they’re not, and you can mostly focus on making sure you’re happy with your own plans.
Any backup management tooling ought to have a reporting feature built in, so you can see that it’s working and how well. The more hands-off the thing is, the more important that reporting becomes.
Unusually for AWS, the AWS Backup service has a pretty straightforward name, and it kind of does what it says. Rather than dealing with Glacier vaults or S3 objects in your own logic, you define the things you want to back up, the schedule to back them up on, and the lifecycle you want the backups to have.
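As a rough illustration, here’s how those three ingredients (what, when, how long) map onto the request shapes for boto3’s AWS Backup client, `create_backup_plan` and `create_backup_selection`. The helper names, plan names, tag and role ARN are hypothetical; the 90-day check reflects AWS Backup’s rule that backups moved to cold storage must stay there for at least 90 days.

```python
from typing import Any, Optional


def build_backup_plan(
    name: str,
    vault: str,
    cron: str,
    delete_after_days: int,
    cold_after_days: Optional[int] = None,
) -> dict[str, Any]:
    """The "when" (schedule) and "how long" (lifecycle) parts of a backup plan."""
    lifecycle: dict[str, int] = {"DeleteAfterDays": delete_after_days}
    if cold_after_days is not None:
        # Cold storage has a 90-day minimum, so retention must exceed the
        # transition point by at least 90 days.
        if delete_after_days < cold_after_days + 90:
            raise ValueError("DeleteAfterDays must be at least 90 days after MoveToColdStorageAfterDays")
        lifecycle["MoveToColdStorageAfterDays"] = cold_after_days
    return {
        "BackupPlanName": name,
        "Rules": [{
            "RuleName": f"{name}-rule",
            "TargetBackupVaultName": vault,
            "ScheduleExpression": cron,  # AWS cron syntax, e.g. "cron(0 5 ? * * *)"
            "Lifecycle": lifecycle,
        }],
    }


def build_selection(name: str, role_arn: str, tag_key: str, tag_value: str) -> dict[str, Any]:
    """The "what": every supported resource carrying a given tag."""
    return {
        "SelectionName": name,
        "IamRoleArn": role_arn,  # role AWS Backup assumes to take the backups
        "ListOfTags": [{"ConditionType": "STRINGEQUALS", "ConditionKey": tag_key, "ConditionValue": tag_value}],
    }
```

With `backup = boto3.client("backup")`, you’d pass these as `plan_id = backup.create_backup_plan(BackupPlan=build_backup_plan(...))["BackupPlanId"]` and then `backup.create_backup_selection(BackupPlanId=plan_id, BackupSelection=build_selection(...))`. Only some resource types support cold storage, so leave `cold_after_days` unset unless you know yours do.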
Bear in mind that AWS Backup is for data, not infrastructure: if you have an EC2 instance, back it up, and then terminate the instance, restoring from AWS Backup gets you a similar instance populated with the same data. You don’t get back the exact same instance, and other details (such as tags) won’t be the same either.
The other strong point about AWS Backup is that it’s a single point to manage, and a single report, across a slew of AWS services. It’s far from the full set, but the most common places to persist data are covered:
- Amazon DocumentDB
- Amazon DynamoDB
- Amazon EBS
- Amazon EC2
- Amazon EFS
- Amazon FSx (except the NetApp ONTAP or OpenZFS flavours)
- Amazon Neptune
- Amazon RDS
- Amazon S3
and you can even back up on-premises storage, via the support for
- AWS Storage Gateway
(that’s the list at the time of writing; AWS are adding coverage for more services every quarter)
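Because that list keeps growing, it’s safer to ask the service than to hard-code it. A minimal sketch, assuming boto3: `get_supported_resource_types` returns the types AWS Backup handles, and `describe_region_settings` shows which of those you’ve opted in to in the current Region. The set of types you pass in is your own inventory, written using AWS Backup’s type names (such as `"EBS"` or `"DynamoDB"`); `coverage_gaps` is a hypothetical helper name.

```python
from typing import Any


def coverage_gaps(backup: Any, types_in_use: set[str]) -> dict[str, set[str]]:
    """Compare the resource types you use against what AWS Backup reports.

    `backup` is a boto3 AWS Backup client, e.g. boto3.client("backup").
    """
    # Ask the service for its current list, so this stays right as coverage grows.
    supported = set(backup.get_supported_resource_types()["ResourceTypes"])
    # Per-Region opt-in: a supported type is only protected if it's enabled here.
    opted_in = backup.describe_region_settings().get("ResourceTypeOptInPreference", {})
    return {
        "unsupported": types_in_use - supported,
        "not_opted_in": {t for t in types_in_use & supported if not opted_in.get(t, False)},
    }
```

Anything in `unsupported` needs its own backup approach; anything in `not_opted_in` is a setting away from being covered.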
If you’ve used a few AWS services, you’ll know that AWS loves a complicated service name, and AWS Backup Audit Manager is a fine example. Still, it’s another thing that does what it says. You define a report that answers a question like “is everything getting backed up?”, plus how often you want to get an answer. The magic of the cloud slings that answer into S3, and it covers all your resources across the different accounts in your AWS organization.
With a bit of scripting – potentially, the only place you’d need to get coding – you can get the report you need, when you need it, backed by the confidence that the data it’s based on come from a managed service.
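For example, a short script can pull the last day’s backup jobs and boil them down to a summary. This sketch assumes boto3’s AWS Backup client and its `list_backup_jobs` paginator; the helper names are hypothetical, and treating `PARTIAL` and `EXPIRED` as needing attention is a judgement call rather than an AWS definition.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Any, Iterable

# Job states that mean a resource did not end up with a good backup.
BAD_STATES = {"FAILED", "ABORTED", "EXPIRED", "PARTIAL"}


def fetch_recent_jobs(backup: Any, hours: int = 24) -> list[dict[str, Any]]:
    """Page through AWS Backup jobs created in the last `hours` hours."""
    since = datetime.now(timezone.utc) - timedelta(hours=hours)
    jobs: list[dict[str, Any]] = []
    for page in backup.get_paginator("list_backup_jobs").paginate(ByCreatedAfter=since):
        jobs.extend(page["BackupJobs"])
    return jobs


def summarise_jobs(jobs: Iterable[dict[str, Any]]) -> dict[str, Any]:
    """Count jobs by state and list the resources that need attention."""
    jobs = list(jobs)
    return {
        "total": len(jobs),
        "by_state": dict(Counter(job["State"] for job in jobs)),
        "needs_attention": sorted({job["ResourceArn"] for job in jobs if job["State"] in BAD_STATES}),
    }
```

Keeping the fetching and the summarising separate means the summary logic can be tested without touching AWS, and reused on job data from another source.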
If you do need more than the basics, the usual AWS integrations are there too, including EventBridge notifications and CloudTrail logs.
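As an example of the EventBridge side, here’s a sketch that creates a rule matching AWS Backup jobs that ended badly, using boto3’s `put_rule`. The `aws.backup` source and `Backup Job State Change` detail type are what AWS Backup emits at the time of writing, but check the AWS Backup EventBridge documentation before relying on them; the function and rule names are hypothetical.

```python
import json
from typing import Any

# Match AWS Backup jobs that ended without a usable backup.
FAILED_BACKUP_PATTERN = {
    "source": ["aws.backup"],
    "detail-type": ["Backup Job State Change"],
    "detail": {"state": ["FAILED", "ABORTED"]},
}


def create_failed_backup_rule(events: Any, rule_name: str) -> str:
    """Create (or update) an EventBridge rule for failed backup jobs; returns its ARN.

    `events` is a boto3 EventBridge client, e.g. boto3.client("events").
    """
    response = events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(FAILED_BACKUP_PATTERN),
        State="ENABLED",
        Description="Alert when an AWS Backup job fails or is aborted",
    )
    return response["RuleArn"]
```

A rule on its own doesn’t notify anyone: you’d attach a target such as an SNS topic with `events.put_targets(Rule=rule_name, Targets=[{"Id": "notify", "Arn": topic_arn}])`, and the topic’s access policy needs to allow EventBridge to publish to it.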
There are a few key things to consider when designing backups:
- Think about why you want backups and how you’ll restore them.
- Understand the information you hold, why you store it and how long you need to protect it. It can be handy to define a recovery point objective (RPO) and recovery time objective (RTO), but understanding your actual needs is much more important than using the jargon.
- Make sure you’re familiar with any compliance requirements that apply to you, and how you’re meeting them (or not).
- Ensure you’re controlling access: both how the backup system can make copies (ideally, with minimum necessary privileges), and who or what can access the backup data.
- Decide what level of evidence you want about your ability to restore if needed.
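The RPO and RTO point lends itself to a quick sanity check. Here’s a minimal sketch that compares a backup schedule against both objectives; the class, field and function names are hypothetical, and it assumes each backup captures data as of the moment the job starts. The worst case for data loss is a failure just before the next backup finishes, so the newest usable copy is one interval plus one job duration old.

```python
from dataclasses import dataclass


@dataclass
class BackupSchedule:
    interval_hours: float  # time between scheduled backups
    max_job_hours: float   # longest a backup job may take (e.g. its completion window)
    restore_hours: float   # measured restore time, e.g. from your last rehearsal


def check_objectives(s: BackupSchedule, rpo_hours: float, rto_hours: float) -> list[str]:
    """Return a list of problems; an empty list means both objectives are met."""
    problems = []
    # Worst case: the next backup hasn't finished yet, so the newest usable
    # copy was taken one full interval plus one job duration ago.
    worst_data_loss = s.interval_hours + s.max_job_hours
    if worst_data_loss > rpo_hours:
        problems.append(f"RPO: could lose up to {worst_data_loss}h of data (target {rpo_hours}h)")
    if s.restore_hours > rto_hours:
        problems.append(f"RTO: restore takes {s.restore_hours}h (target {rto_hours}h)")
    return problems
```

For instance, daily backups that can take up to two hours give a worst case of 26 hours of lost data, which misses a 24-hour RPO even though the backups are "daily".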
When Apple announced Time Machine in 2006, their selling point was to make backing up so simple that people would actually do it. I’m going to be honest: AWS Backup is a bit more complex, but it’s still way more straightforward than rolling your own thing. If you have data that matter (and who doesn’t), and those data are on AWS, it’s definitely worth a look.
Not sure if your backups are up to scratch? Book a health check with one of our AWS experts.
This blog is written exclusively by The Scale Factory team. We do not accept external contributions.