Multi-cloud architecture is an infrastructure configuration in which one organisation uses the services of more than one cloud provider. Historically, the term also covered the situation where cloud provider services were used in parallel with a company’s own data centre, but as technologies and architectures have developed, that setup is now usually called “hybrid cloud”.
Using services from multiple cloud providers requires connecting them together. Anyone who has worked with a big public cloud knows that even one cloud can be complicated enough to demand a lot of specialisation. For example, each big public cloud has its own skills certification path. Furthermore, as the products, web consoles and tools differ significantly across cloud providers, such a setup can quickly start resembling spaghetti, or at least Tetris. So why would someone do this?
Usually, diversifying a set of service providers or suppliers follows the “don’t put all your eggs in one basket” approach: minimising risk. This may be the case when diversifying a portfolio on the stock market or, from a macroeconomic perspective, energy suppliers. In the first scenario, the price fluctuations of a single product are offset by other products’ price patterns; in the second, political risks are minimised. But how does this apply to multi-cloud? Historically, it was meant to cut the risk to critical systems in the case that one of the cloud providers went down. In the early days of cloud computing there was a lot of uncertainty about reliability, including how often a provider would go offline. Nowadays, when public cloud is a mature market product with enough historical data to estimate the probability of a provider going down, it is safe to say this is no longer the case.
In fact, any big cloud provider, including AWS, GCP and Azure, can experience occasional interruptions. An example is the recent flood in a Google data centre in Paris (europe-west9-a), which led to some region-wide failures. Nevertheless, to reduce the risk of system failure, cloud providers place data centres in multiple locations worldwide. The biggest failure domain in a cloud is called a region: a separate geographical area consisting of availability zones.
We nearly always recommend deploying SaaS workloads with an architecture running across several availability zones. You can then design your application so that when one instance fails, traffic is redirected to an instance in another zone. Ideally, availability zones are isolated and a failure in one zone doesn’t affect the others; the Google incident was a notable exception. If zone-level resilience is still not enough and you want to insure against the failure of an entire region (say, in the case of a war or an earthquake), you can replicate your infrastructure in different regions while staying with the same provider.
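As a rough illustration of that zone-failover idea, here is a minimal Python sketch. The zone names, endpoint URLs and health-check results are hypothetical placeholders; in a real deployment this logic would live in a load balancer or service mesh that queries actual health checks, rather than a static dictionary.

```python
# Minimal sketch of zone-aware failover routing. Zone names and
# endpoint URLs are hypothetical; in production you would query a
# real load balancer or health-check API instead of a static dict.
from typing import Optional

ZONE_ENDPOINTS = {
    "europe-west9-a": "https://app-a.example.com",
    "europe-west9-b": "https://app-b.example.com",
    "europe-west9-c": "https://app-c.example.com",
}

def pick_endpoint(healthy: dict[str, bool], preferred: str) -> Optional[str]:
    """Route to the preferred zone if it is healthy; otherwise fail
    over to any other healthy zone. Returns None if every zone is down."""
    if healthy.get(preferred):
        return ZONE_ENDPOINTS[preferred]
    for zone, is_healthy in healthy.items():
        if is_healthy and zone in ZONE_ENDPOINTS:
            return ZONE_ENDPOINTS[zone]
    return None

# Example: the preferred zone is down, so traffic fails over to zone b.
status = {"europe-west9-a": False, "europe-west9-b": True, "europe-west9-c": True}
print(pick_endpoint(status, "europe-west9-a"))  # → https://app-b.example.com
```

The same pattern scales up one level: swap zone names for region names and you have the cross-region replication strategy described above.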
By the way, having a disaster recovery plan that uses a different region doesn’t imply doubling the cost, as there are several disaster recovery strategies to choose from, ranging from simple backup and restore, through pilot light and warm standby, to fully active-active setups.
Three reasons not to use multi-cloud…
As in the end every architectural decision is a matter of cost-benefit analysis, even if implicitly, let’s look at the costs associated with multi-cloud. Or, simply put, three reasons not to use it.
1. Financial cost
When your system is distributed across different cloud providers, data transferred between the components needs to go through the public internet. That is not the case with a single cloud provider, where traffic between components usually stays on the provider’s internal network at a lower cost. Furthermore, multi-cloud obviously means a greater variety of services in use, so cost optimisations are more difficult to plan and achieve. Cloud cost management is complex enough with one cloud provider. Using multiple cloud providers in parallel requires expertise not only in each provider’s services, but also in the interactions between the systems.
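To make the egress-cost point concrete, here is a back-of-the-envelope calculation in Python. The per-GB prices are invented placeholders for illustration only, not real provider pricing; the point is the ratio between internal-network and public-internet rates, which you should check against your provider’s current price list.

```python
# Illustrative comparison of monthly data-transfer cost. The per-GB
# prices below are HYPOTHETICAL placeholders, not real provider
# pricing -- always check your provider's current price list.
GB_PER_MONTH = 10_000  # assumed inter-component traffic: 10 TB/month

INTERNAL_PRICE_PER_GB = 0.01  # assumed intra-provider network rate ($/GB)
INTERNET_PRICE_PER_GB = 0.09  # assumed public internet egress rate ($/GB)

def monthly_transfer_cost(gb: float, price_per_gb: float) -> float:
    """Simple linear cost model: traffic volume times unit price."""
    return gb * price_per_gb

single_cloud = monthly_transfer_cost(GB_PER_MONTH, INTERNAL_PRICE_PER_GB)
multi_cloud = monthly_transfer_cost(GB_PER_MONTH, INTERNET_PRICE_PER_GB)
print(f"single-cloud: ${single_cloud:,.0f}/month, multi-cloud: ${multi_cloud:,.0f}/month")
```

Even with made-up numbers, the shape of the result holds: the same traffic volume can cost several times more once it crosses provider boundaries over the public internet.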
Another cost factor is building a multi-cloud team. Getting to know one cloud provider’s services takes time. For example, AWS offers over ten certifications to prove knowledge in different AWS areas. As multi-cloud is not a common architectural pattern, employees usually have little incentive to specialise in several cloud providers’ services. Therefore, people who can do multi-cloud well are in short supply. Even if you succeed in recruiting or upskilling the team, there is still a cost associated with coordinating the work of those specialised teams.
2. Reliability

Maintaining several interconnected cloud architectures increases the likelihood of bugs being introduced. Additional layers of complexity make the system harder to understand. With multi-cloud, we have not only several systems but also the dependencies between them, and due to the shortage of multi-cloud experts it may be hard to predict how these will behave. Bugs can lower the reliability of the system, which, paradoxically, is the opposite of what multi-cloud is trying to achieve. For example, it is not easy to distribute load across different cloud providers, and communication between components in different clouds can significantly increase latency. The extra complexity also increases the risk of security problems, and a single security expert might not be enough in a multi-cloud environment, complicating things further.
3. Delivery timelines
The more complexity, the less likely estimates are to come out right. As mentioned, you will probably also need a larger engineering team, or even multiple teams. In such a setting, the cost of integrating the work of cloud experts and developers is also reflected in the timelines. Each change or rework now needs to be assessed in the context of several systems and their dependencies, and requires a lot of coordination. Going multi-cloud also lengthens the time needed to onboard new hires to the project.
…and one reason why you might
Let us imagine that your SaaS company operates successfully on GCP and has customers in Europe, but wants to enter new markets, such as China. The problem is, your current cloud provider doesn’t have a data centre in that geographical region. That doesn’t necessarily mean you need to abandon the market expansion, or switch completely to a different cloud provider that does have a China region (such as AWS). You can keep your current infrastructure on GCP, while using AWS to deploy only to the new region.
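That split can be as simple as a per-market routing table in your deployment tooling. The sketch below is hypothetical: the region names are real provider regions (GCP’s europe-west9, AWS’s cn-north-1), but the mapping, defaults and function are illustrative assumptions, not a prescribed design.

```python
# Hypothetical sketch: map each customer market to the cloud provider
# that actually has a nearby region, keeping the existing GCP setup
# for Europe and adding AWS only for China. Mapping is illustrative.
DEPLOYMENT_TARGETS = {
    "europe": ("gcp", "europe-west9"),  # existing infrastructure
    "china": ("aws", "cn-north-1"),     # new market, second provider
}

def target_for(market: str) -> tuple[str, str]:
    """Return (provider, region) for a market, defaulting to the
    existing GCP deployment for any market not explicitly mapped."""
    return DEPLOYMENT_TARGETS.get(market, ("gcp", "europe-west9"))

print(target_for("china"))   # → ('aws', 'cn-north-1')
print(target_for("europe"))  # → ('gcp', 'europe-west9')
```

The design choice here is deliberate: the second provider is confined to one market, so the multi-cloud surface area, and most of the costs described above, stays as small as possible.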
Regardless of your choice, it is always good practice to make well-informed architectural decisions based on a cost-benefit analysis, and to understand the reasons as well as the implications, whatever the outcome. So if you do end up with that option, make sure that you and your team understand what you are doing and why. Not understanding the rationale behind complex and legacy systems can lower team morale. Just watch out for situations where you’re doing, or considering, multi-cloud primarily due to sunk costs and path dependence, that is, where historical reasons are the main driver for the solution (you may already have encountered the term “legacy” in the context of IT architecture, haven’t you?).
This blog is written exclusively by The Scale Factory team. We do not accept external contributions.