You’ve built a brand new product and you’re going live soon, or you’re about to do a big marketing push and drive more traffic to your app. How do you know you’re prepared for these events?
In advance of your event, we’ll work with you to understand your Service Level Objectives (SLOs) for performance, availability, and latency. We’ll also agree an operational timeline, so that we know when traffic is expected to increase and by how much.
We’ll review your architecture and provide a detailed assessment of any SLO-affecting risks you’re carrying, scored by likelihood and impact. Where we need more evidence to understand your risk profile, we’ll run load tests and simulate failure modes to see how your platform responds under those conditions.
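To give a feel for what scoring by likelihood and impact can look like, here is a minimal sketch. The 1 to 5 scales, the multiplication of the two values, and the example risks are all assumptions made for illustration; the real assessment is tailored to your architecture and SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    """A single SLO-affecting risk (hypothetical structure)."""
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (SLO breach)

    @property
    def score(self) -> int:
        # Assumed scoring rule: likelihood multiplied by impact.
        return self.likelihood * self.impact


def prioritise(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Invented example risks, not findings from a real review.
risks = [
    Risk("Single database primary, no failover", likelihood=2, impact=5),
    Risk("Third-party payment API rate limit", likelihood=4, impact=4),
    Risk("No CDN cache for static assets", likelihood=3, impact=2),
]

for risk in prioritise(risks):
    print(f"{risk.score:>2}  {risk.name}")
```

Sorting by score gives a first-pass order for the work; in practice, cost and lead time before the event also influence what gets fixed first.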
In order of priority, we’ll advise on, or make, changes to your infrastructure and code to improve performance and mitigate those risks.
We’ll make sure your monitoring is set up properly, and that telemetry is made available to all relevant parties. We’ll help you understand how to interpret the graphs and metrics we collect.
We’ll also add alerting on key metrics, making sure we only sound the alarm when service levels are in danger of being breached and there’s something for a human operator to attend to.
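One common way to alert only when service levels are genuinely at risk is to watch how fast the error budget is being spent. The sketch below assumes a 99.9% availability SLO and a paging threshold of 14.4 times the sustainable burn rate, checked over a long and a short window; these figures, the function names, and the request counts are illustrative assumptions, not the values we would necessarily choose for your service.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    1.0 means errors arrive at exactly the rate the SLO allows;
    higher values mean the budget will run out early.
    """
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def should_page(long_window, short_window,
                slo_target: float = 0.999,   # assumed SLO
                threshold: float = 14.4):    # assumed paging threshold
    """Page only if both windows are burning budget too fast.

    Each window is an (errors, requests) pair. The long window shows
    the problem is significant; the short window shows it is still
    happening, so nobody is paged for an incident that has already ended.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)


# Invented sample counts: (errors, requests) per window.
print(should_page((200, 10_000), (20, 1_000)))  # sustained 2% errors
print(should_page((200, 10_000), (0, 1_000)))   # errors have stopped
print(should_page((5, 10_000), (1, 1_000)))     # within budget
```

Requiring both windows to exceed the threshold is what keeps the alert tied to something a human can act on right now.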
We’ll work with you to put together runbooks and playbooks for incident response. These will take into account your own systems, as well as those of any third parties you rely on to deliver your service.
We’ll help you put together an on-call rota, and rehearse on-call processes with participants. We’ll be available for you 24x7 during your event to help keep everything running smoothly.