Does your SaaS offering provide a service to sets of end users? That’s especially common for business apps. For example, we use Slack at The Scale Factory for team communications, with a company workspace.
If you do group end users like that, you probably think about those groups as your customers, and as tenants of your platform. Your customers then have end users that log in and actually use the service.
In the cloud there are three common models for isolating different tenants (or not). You can duplicate all your infrastructure each time (“silo”), you can do the opposite and share all the infrastructure (“pooling”), or you can find a middle ground between the two.
Whenever you’re doing less than full isolation on dedicated hardware, there’s a risk that high traffic for one tenant means low performance for someone else.
What does good look like? The reason I’ve asked that is to set the scene before talking about treatment. Maybe you’re already seeing isolation issues, maybe not. Either way, start by identifying the patterns of traffic from a typical tenant - is it constant, or spiky? If there are spikes, do different tenants tend to have them at the same time (maybe 9AM on a Monday)?
Along with an initial understanding of how end users are accessing your SaaS solution, the other starting point is an initial idea of what counts as too much. You’d probably frame this as an impact on other tenants. Maybe you have a formal SLA you want to meet, or maybe you’ve got error budgets and the work is around ensuring that you’re in budget.
Money matters. At large scales, it matters a lot. The prices and the way you charge for your services can have a big impact on your tenants’ choices and on end user behaviour. I’ve mentioned this first because it’s also something that’s much easier to change early on. The more customers you have on an existing payment model that they’re happy with, the harder it becomes to switch.
The two leading models for billing out SaaS are time-based subscriptions and metering by use. Metering by time includes charging by user, or charging a single fee to a tenant no matter how many end users they have. As a customer, you probably subscribe to something on this basis already. For example, AWS’ managed Kubernetes service charges a flat, per-hour fee for a highly available control plane. And, because it’s never that easy, you also pay per-use metering for the compute that your cluster runs. In that case, the more resources you assign, the more it costs you.
You can link your pricing to your underlying costs. This is a good fit where you have lots of tenants and you need a low-overhead link between your revenue and outgoings, for each tenant. The downside is that your customers now need to do their own capacity management - they might not be ready or able to do that.
A simpler option is to assign a baseline usage level for a customer and announce either rate limits or overage fees if your tenant exceeds it. The more focused your market, the more opportunity you have to manage that customer relationship and offer a grace period or negotiate a new contract before those measures actually kick in.
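To make the baseline-plus-overage idea concrete, here is a minimal sketch. The `Plan` fields, the prices, and the choice to bill each started block of 1,000 requests are all assumptions for illustration, not a recommendation for how you should price.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical plan; amounts are in cents to avoid float rounding."""
    base_fee_cents: int           # flat monthly fee
    included_requests: int        # baseline usage covered by the flat fee
    overage_cents_per_1000: int   # price per started block of 1,000 extra requests


def monthly_charge_cents(plan: Plan, requests_used: int) -> int:
    """Return the flat fee plus overage for usage above the baseline."""
    extra = max(0, requests_used - plan.included_requests)
    blocks = -(-extra // 1000)  # ceiling division: bill each started block
    return plan.base_fee_cents + blocks * plan.overage_cents_per_1000


plan = Plan(base_fee_cents=10_000, included_requests=100_000,
            overage_cents_per_1000=50)
within_baseline = monthly_charge_cents(plan, 80_000)
over_baseline = monthly_charge_cents(plan, 102_500)
```

A grace period could be modelled by skipping the overage for the first month a tenant exceeds the baseline; that policy is left out here to keep the sketch short.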
These economic incentives tie in to tiering. A simple model for that is a free trial, maybe limited to a couple of users, then a basic paid tier which pools all those tenants onto shared infrastructure, and an advanced tier where you deploy the same workload but using “silo” isolation: a new set of infrastructure dedicated to each tenant.
Accepting the load
Whether you charge more for it or not, your default option should be to accept the extra load. For your customers, SaaS often means not having to think about capacity and so your platform needs to make that possible. You can run a load test against your service to find the bottleneck (or, if your architecture is based on microservices: help each service team find its bottleneck, and prioritise the most important ones).
At The Scale Factory, we’re big believers in only making tech changes where that’s justified. Load tests help you find which component saturates first. Try scaling out the constrained part of your existing design before planning any work to reimplement it.
Well-laid capacity plans still fall foul of traffic spikes and other issues. Pooled infrastructure saves on direct costs and reduces operational toil, but that pooling also amplifies the impact of a single tenant facing high load. The simplest option is direct throttling: rate-limit the work linked to that tenant. API design plays a big part here; for example, if all your internal HTTP traffic includes a tenant ID, it’s easier for components to measure impact and apply restrictions.
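As a rough illustration of direct throttling keyed on a tenant ID, the sketch below counts requests per tenant in fixed time windows. The class name, the limit, and the in-memory storage are assumptions; a real platform would more likely keep these counters in a shared store so that every component sees the same numbers.

```python
import time


class TenantRateLimiter:
    """Fixed-window request counter, tracked separately for each tenant."""

    def __init__(self, limit, window_seconds=1.0, clock=time.monotonic):
        self.limit = limit              # max requests per tenant per window
        self.window = window_seconds
        self.clock = clock              # injectable so tests can fake time
        self._state = {}                # tenant_id -> (window_index, count)

    def allow(self, tenant_id):
        """Return True if this tenant may make another request right now."""
        index = int(self.clock() // self.window)
        last_index, count = self._state.get(tenant_id, (index, 0))
        if last_index != index:
            count = 0                   # a new window has started
        if count >= self.limit:
            self._state[tenant_id] = (index, count)
            return False
        self._state[tenant_id] = (index, count + 1)
        return True


# Usage with a fake clock, so the behaviour is deterministic.
now = [0.0]
limiter = TenantRateLimiter(limit=2, window_seconds=1.0, clock=lambda: now[0])
busy_tenant = [limiter.allow("tenant-a") for _ in range(3)]
quiet_tenant = limiter.allow("tenant-b")
```

Because the counters are per tenant, the busy tenant hits its limit without affecting the quiet one.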
There is a suite of architecture patterns that you can use instead of an outright block; for example, if you serve a web app, you can degrade it to a simpler version for the affected tenant, and protect your platform overall.
The simplest isolation model doesn’t really have noisy neighbours: the silo pattern. It’s often expensive to duplicate your whole stack for every tenant, but it works. For all the other cases, partitioning is your friend. Forget for now about different tiers; instead, just imagine that you split all your tenants into sets of 10, and each set of 10 tenants runs on a different shard (the same infrastructure, cloned lots of times). You have fewer replicas to manage compared to silo isolation, and a much smaller blast radius if one tenant does have an impact on their shard. At a large enough scale, techniques such as shuffle sharding are even more effective.
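Here is a small sketch of both ideas: fixed shards of 10 tenants, and a simple form of shuffle sharding where each tenant gets its own pseudo-random subset of workers. The function names, the worker names, and the use of a SHA-256 hash as the seed are assumptions for illustration; production implementations differ in how they pick and rebalance shards.

```python
import hashlib
import random


def fixed_shard(tenant_index: int, tenants_per_shard: int = 10) -> int:
    """Tenants 0-9 land on shard 0, tenants 10-19 on shard 1, and so on."""
    return tenant_index // tenants_per_shard


def shuffle_shard(tenant_id: str, workers: list, shard_size: int) -> list:
    """Pick a stable, tenant-specific subset of workers.

    The tenant ID seeds the random choice, so the same tenant always maps
    to the same workers while different tenants get different combinations.
    """
    if shard_size > len(workers):
        raise ValueError("shard_size cannot exceed the number of workers")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(workers, shard_size))


workers = [f"worker-{i}" for i in range(8)]   # hypothetical worker names
shard_a = shuffle_shard("tenant-a", workers, shard_size=2)
shard_b = shuffle_shard("tenant-b", workers, shard_size=2)
```

With 8 workers and a shard size of 2 there are 28 possible pairs, so two tenants only occasionally share both of their workers. A noisy tenant can then degrade its own pair without taking out everything another tenant depends on.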
If all your tenants have similar traffic patterns, you can stop there. However, when they don’t, you need to think about how you pack different tenants onto uniformly sized shards. What happens when that shard’s tenants grow their use enough to have an impact? (There’s no one right answer, but there definitely are good solutions.)
Low level limits
Whenever different tenants’ workloads coexist, there’s potential for impact. You could decide to share compute between customers in aggregate, and still make sure that work for each tenant gets dedicated resources; for example, Kubernetes has a node-level CPU management option that lets you make sure that processing for one tenant never runs on the same CPU core as work for a different tenant. That can be a good fit if you’re doing something really sensitive to latency, such as live video.
If you’re serverless, one tenant’s heavy service use can use up all of your service quota for concurrency. Don’t forget that you can configure a different, higher service quota. Once you’ve taken care of that, you can go further and specify separate concurrency reservations for each tenant (or for sets of tenants). The extra effort to manage - and automate - these reservations pays off with a stronger guarantee that you can invoke the functions you need to, when you need to.
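Automating those reservations starts with deciding how to split the quota. The sketch below is one possible planning step, not any provider's API: the function name, the quota figures, and the idea of keeping an unreserved floor for everything else are assumptions. If tenants ask for more than is available, it scales every request down proportionally.

```python
def plan_reservations(account_quota, unreserved_floor, requested):
    """Split a concurrency quota into per-tenant reservations.

    account_quota:    total concurrent executions the account allows
    unreserved_floor: concurrency deliberately left unreserved (assumed policy)
    requested:        dict of tenant_id -> desired reserved concurrency
    """
    budget = account_quota - unreserved_floor
    if budget < 0:
        raise ValueError("unreserved_floor is larger than the account quota")
    total = sum(requested.values())
    if total <= budget:
        return dict(requested)          # everything fits; grant as requested
    # Oversubscribed: scale each request down, rounding down so the sum
    # of reservations never exceeds the budget.
    return {tenant: (want * budget) // total
            for tenant, want in requested.items()}


fits = plan_reservations(1000, 100, {"tenant-a": 300, "tenant-b": 200})
squeezed = plan_reservations(1000, 100, {"tenant-a": 600, "tenant-b": 600})
```

The resulting numbers would then be applied through whatever mechanism your serverless platform offers for reserving concurrency.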
Humans vs. robots
A story we’ve heard a few times in our consulting work revolves around human users seeing an impact from noninteractive API use. One person is clicking through your website, another person is using the mobile app, and someone else is running a script on their PC to upload the data for the last quarter.
I hope you’re not surprised that there’s no single right answer, but here are a couple that often work:
- provide two API endpoints. One endpoint is for interactive users (webapp, mobile, maybe desktop) with a low rate limit, maybe 2 requests per second. You can add an algorithm to allow more use in short bursts. The second endpoint is for API access; the API can be exactly the same, but you send the traffic to a different backend that you scale out separately, and the backoff mechanisms are appropriate for bulk access.
- request rate pricing. If many requests to your API translate to end user impact, use cost as a means to discourage it. You could also exclude bulk API access from your lowest pricing tiers.
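For the interactive endpoint, one common algorithm that allows short bursts is a token bucket. This is a minimal sketch that assumes the 2 requests per second mentioned above and a hypothetical burst size of 5; the class name and the injectable clock are illustrative.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate                # tokens added per second
        self.burst = burst              # maximum tokens the bucket can hold
        self.clock = clock              # injectable so tests can fake time
        self.tokens = float(burst)      # start full, so a burst is allowed
        self.updated = clock()

    def allow(self):
        """Spend one token if available; refill based on elapsed time."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage with a fake clock: a burst of 5 passes, the 6th request is refused.
now = [0.0]
bucket = TokenBucket(rate=2, burst=5, clock=lambda: now[0])
burst_results = [bucket.allow() for _ in range(6)]
```

You would typically keep one bucket per end user or per tenant on the interactive endpoint, and use a larger, separately tuned limit on the bulk endpoint.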
Humans vs. humans
Sometimes, it really is bad behaviour. It’s probably not your actual customer misbehaving; maybe they didn’t protect a credential as well as they should have, and crooks are misusing that access. Because you’re potentially dealing with actual adversaries here, the best approach depends a lot on your existing controls, and on the wider context.
Remember that as well as keeping different tenants apart, you might need to protect different end users from each other when they work at the same business. All cross-tenant controls should enhance, and not replace, security isolation between different end users.
Worried about noisy neighbours affecting the other customers for your SaaS business? Start by booking a free healthcheck with one of our AWS experts.
This blog is written exclusively by The Scale Factory team. We do not accept external contributions.