Linius contacted us ahead of a major sporting event they were supporting. With a substantial increase in traffic anticipated, we had four weeks to prepare.
The Challenge
Linius and one of their partner firms had secured a deal that would launch for a major event in the 2019 sports calendar. With the event only weeks away, Linius needed to be sure their API platform was resilient against likely risks and could scale to handle the anticipated traffic.
Linius and their partner planned to provide a custom experience for each website visitor, making use of Linius’ video personalisation technology. Once launched, the site would let visitors search for events and watch a video automatically assembled from the results, with advertising delivered in the same stream.
Although some monitoring was in place, the available visualisations did not provide a clear picture of the platform’s overall performance, nor of the impact that high-traffic tenants could have on the service.
Linius knew that their code worked well at its existing scale: tens of concurrent users. With early predictions putting event traffic well above the load from even their busiest tenant, Linius wanted assurance that the platform as a whole could handle the demand and remain within defined operational tolerances.
The AWS Infrastructure
The Linius platform is deployed into a single AWS region and makes use of load-balanced, autoscaled EC2 instances, Amazon Elasticsearch Service, Amazon SQS, and Amazon RDS for PostgreSQL.
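To illustrate how an autoscaled fleet like this is kept sized to demand, here’s a minimal sketch of a target-tracking scaling policy applied with boto3. The group name and target utilisation are illustrative assumptions rather than details of the Linius platform.

```python
# A minimal sketch (not the Linius configuration) of a target-tracking
# scaling policy, which grows and shrinks an EC2 Auto Scaling group to
# hold average CPU utilisation near a chosen target.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="api-workers",        # hypothetical group name
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 60.0,                   # illustrative target, not a Linius figure
    },
)
```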
Why The Scale Factory?
Linius were reassured by our scientific approach to load testing and platform improvement, and by the fact that our team had recently delivered a similar piece of platform scaling work.
The Solution
We proposed a scheme of work to give better visibility of the platform’s live performance and to establish its capabilities and limits. Where those limits put the event outcome at risk, we would implement infrastructure improvements or advise on code changes.
Starting from the existing load tests, we reviewed actual API traffic for the event website and developed a representative sequence of requests to simulate load from browser clients. After making sure that services were configured for fault tolerance and sized appropriately for load testing, we began testing in earnest to find the points at which the platform saturated or failed. Analysing those saturation points gave Linius the information they needed to prioritise code changes; notably, a release that eliminated one expensive database query entirely.
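As an illustration of what such a browser-client sequence can look like, here’s a minimal sketch using Locust, a Python load-testing tool. The endpoints, weights and wait times are hypothetical placeholders, not Linius’ actual API.

```python
# A minimal Locust sketch of a representative visitor journey: search for
# an event, then fetch the personalised video assembled from the results.
# The paths, weights and wait times are hypothetical placeholders.
from locust import HttpUser, task, between


class EventVisitor(HttpUser):
    # Pause between actions to mimic a person browsing, not a tight loop.
    wait_time = between(1, 5)

    @task(3)
    def search_events(self):
        # Hypothetical search endpoint; weighted higher because visitors
        # search more often than they start playback.
        self.client.get("/api/search", params={"q": "final"})

    @task(1)
    def watch_personalised_video(self):
        # Hypothetical endpoint returning the assembled video stream.
        self.client.get("/api/video/personalised")
```

Ramped up gradually against a staging copy of the platform, a scenario like this makes it straightforward to find the request rate at which latency or error rates breach the agreed tolerances.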
Linius’ existing platform was deployed into one AWS region across multiple availability zones. Even so, our risk analysis identified potential points of failure. For example, one component proved vulnerable to a degraded cluster state in which it could fail to form consensus even though its nodes remained in service.
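As a purely illustrative example of the kind of mitigation this class of risk calls for, rather than a statement of which component was affected, the sketch below uses boto3 to give a managed Elasticsearch domain three dedicated master nodes spread across availability zones, so that a quorum can still form if a node or zone drops out.

```python
# Illustrative only: configure dedicated master nodes and zone awareness
# on an Amazon Elasticsearch Service domain so the cluster keeps quorum
# when a node or availability zone is lost. Names and sizes are hypothetical.
import boto3

es = boto3.client("es")

es.update_elasticsearch_domain_config(
    DomainName="search-domain",                           # hypothetical domain name
    ElasticsearchClusterConfig={
        "DedicatedMasterEnabled": True,
        "DedicatedMasterType": "m5.large.elasticsearch",  # illustrative size
        "DedicatedMasterCount": 3,                        # odd count avoids tied votes
        "ZoneAwarenessEnabled": True,
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
    },
)
```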
The Results
Armed with the improved monitoring and dashboards, Linius could be confident that their systems would provide reliable service throughout the event, comfortably inside the defined service levels.
Between the improvements we made directly and the code changes Linius made following our recommendations, the platform had ample capacity for the actual load generated by the event.
Next Steps
We’ve left Linius with a prioritised backlog of further work to improve platform performance and resilience.
They’re also looking at adopting a container-based deployment pipeline, along with a shift towards serverless compute and supporting services.