Plan high availability and disaster recovery
A gateway that sits in front of every AI request is a shared dependency for everything behind it. When the gateway is unavailable, no application that routes through it can reach a model, and the failure is visible to every team at once. Planning for high availability (HA) and disaster recovery (DR) is therefore a requirement for a production deployment of Tetrate Agent Router. It determines whether a regional outage degrades the service or takes the whole organisation's AI traffic down with it.
The platform is built to be operated this way. The data plane (the customer-managed Controller and Proxy) scales horizontally rather than vertically, so capacity and redundancy are added by adding replicas. Several regional data planes can be run active-active behind regional load balancing, so the loss of one region degrades the service rather than stopping it. Routing is health-aware, so unhealthy backends and unhealthy regions are taken out of rotation without operator intervention. What the platform does not do automatically is decide how much loss is acceptable, prove that a restore actually works, or confirm that the deployment meets its agreed throughput targets. Those are planning and validation tasks, and they are the subject of this guide.
This guide covers why HA and DR planning matters for a production gateway, the building blocks the platform provides, how to set recovery point objective (RPO) and recovery time objective (RTO) targets, how to write and test a DR runbook with at least one restore, and how to load-test the deployment against agreed throughput and concurrency targets while recording p95 latency and error rate.
Persona: Platform operator or site reliability engineer (SRE) responsible for the production deployment.
Estimated time: Half a day to draft the targets and the runbook; a separate scheduled window for the load test and the restore test, since both involve live infrastructure.
When this guide applies
This guide is relevant whenever the deployment is heading toward, or already in, production use:
| Situation | What it covers |
|---|---|
| A production deployment is being designed and resilience targets have not yet been agreed | Defining RPO and RTO and choosing a regional topology |
| The deployment is single-region and a regional failure would stop all AI traffic | The multi-region active-active option and health-aware failover |
| Capacity has been sized on paper but never validated under load | Load testing against agreed throughput and concurrency targets |
| A DR plan exists on paper but no restore has ever been performed | Writing and testing a DR runbook |
| An evaluation criterion requires documented and tested RPO/RTO and load-test evidence | The full sequence below, which produces that evidence |
For a short-lived proof-of-concept that will never carry production traffic, the planning here can be scoped down: the method stays the same and only the targets change.
Outcomes
By the end of this guide:
- The building blocks the platform provides for HA (horizontal scaling, multi-region active-active operation, and health-aware routing) are understood, along with the guides that configure them.
- RPO and RTO targets have been agreed with the field team and written down.
- A DR runbook covering configuration and state backup, failover, and restore has been drafted.
- At least one restore has been performed against the runbook, and the result has been recorded.
- A load test has been run against agreed throughput and concurrency targets, and p95 latency and error rate have been captured.
Prerequisites
- Administrator access to the Admin Dashboard for the deployment, and the Kubernetes context for each data-plane region.
- An understanding of the deployment topology described in Architecture Overview: which components are customer-managed and which are Tetrate-hosted.
- The capacity-planning method in Sizing and Scale, which supplies the throughput and concurrency figures the load test validates.
- Agreement from the field and subject-matter-expert (SME) team on the throughput, concurrency, RPO, and RTO targets; these are engagement-specific and cannot be assumed.
- A GitOps or manifest workflow for the data plane, as described in Kubernetes Resources for GitOps, so that configuration can be restored from source rather than reconstructed by hand.
Step 1: understand the building blocks the platform provides
Three platform capabilities underpin every HA and DR plan. The plan combines them; it does not invent resilience the platform does not already offer.
- Horizontal scaling of the data plane. Both the Controller and the gateway proxy add capacity by adding replicas behind a load balancer rather than by enlarging a single instance. Running more than one replica of each is what makes the data plane survive the loss of an individual pod or node. The replica-count method (measure per-replica throughput, divide peak concurrency by it, then add headroom for failover and rolling upgrades) is described in Sizing and Scale.
- Multi-region active-active operation. Several regional data planes can serve the same traffic at once, each sized for its own share of peak concurrency plus headroom to absorb a failed peer. Running regions active-active rather than active-passive means there is no cold standby to spin up during an incident; the surviving regions are already serving traffic. Coordinating configuration across regional instances is covered in Run Multiple Platform Instances.
- Health-aware routing and failover. The platform selects backends and regional deployments per request using live health signals, taking unhealthy endpoints out of rotation automatically. Pooling regional deployments of one model and observing how traffic redistributes when a region is drained is covered in Load-Balance Across Regional Deployments.
A resilient deployment uses all three: enough replicas per region to survive node loss, enough regions to survive a regional outage, and health-aware routing to make the failover automatic rather than manual.
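The replica and region arithmetic above can be sketched as follows. This is a minimal illustration, not the method from Sizing and Scale: the function names, the 25% headroom, and the example figures are all assumptions, and per-replica capacity is assumed to be measured in the same unit as the peak (here, concurrent in-flight requests).

```python
import math


def per_region_load(total_peak: float, regions: int) -> float:
    """Load each surviving region must carry if one region is lost."""
    if regions < 2:
        raise ValueError("active-active operation needs at least two regions")
    return total_peak / (regions - 1)


def replicas_for(peak_load: float, per_replica_capacity: float,
                 headroom: float = 0.25) -> int:
    """Replicas needed for peak_load plus fractional headroom.

    Never returns fewer than two, so the loss of a single pod or node
    does not remove the only replica.
    """
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    needed = math.ceil(peak_load * (1 + headroom) / per_replica_capacity)
    return max(needed, 2)


# Hypothetical figures: 1200 concurrent requests at peak, three regions,
# and a measured capacity of 100 concurrent requests per replica.
load = per_region_load(total_peak=1200, regions=3)      # 600.0 per surviving region
print(replicas_for(load, per_replica_capacity=100))     # prints 8
```

Sizing each region for the load it carries after a peer fails, rather than for its everyday share, is what lets the surviving regions absorb a regional outage without first scaling up.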
Step 2: define RPO and RTO targets
Recovery point objective (RPO) and recovery time objective (RTO) are the two numbers that turn "the platform should be resilient" into a testable requirement.
- RPO is the maximum acceptable amount of data loss, measured as a span of time. An RPO of one hour means that, after a failure, the restored state may be at most one hour behind the state at the moment of failure, so configuration and state must be backed up at least that often. For a gateway, the state at risk is the routing rules, policies, API keys, and user configuration, not the in-flight requests themselves.
- RTO is the maximum acceptable time to restore service after a failure. An RTO of fifteen minutes means service must be back within fifteen minutes of the failure being detected.
The platform's architecture shapes what these targets attach to. The management plane is Tetrate-hosted and stores the routing rules, policies, and user configuration; the data plane is customer-managed and processes traffic. A regional data-plane failure in an active-active topology is absorbed by the surviving regions, so its effective RTO is bounded by how quickly health-aware routing sheds the failed region rather than by any manual restore. A loss of customer-managed configuration, by contrast, is governed by how recently that configuration was backed up (which is the RPO) and how quickly it can be reapplied (which is the RTO).
Targets are agreed with the field team, not chosen unilaterally, because they trade directly against cost: a tighter RPO means more frequent backups, and a tighter RTO usually means more standing capacity. Record the agreed numbers alongside the deployment's other non-functional requirements so the rest of the plan can be measured against them.
Step 3: back up configuration and state
A restore is only possible if there is something to restore from. Two categories of state are backed up.
- Capture the data-plane configuration from its source of truth. When the data plane is managed through GitOps as described in Kubernetes Resources for GitOps, the manifests in version control are themselves the backup; the cluster can be rebuilt from the repository. Confirm that every manifest the running deployment depends on is committed, and that no configuration has been applied out of band that would be lost in a rebuild.
- Capture the platform configuration that lives in the management plane: routing rules, policies, API keys, and user records. Export or snapshot this on a schedule no longer than the agreed RPO, so that a restore lands within the data-loss budget.
The backup cadence is driven by the RPO from Step 2: if the RPO is one hour, the most recent backup must never be more than one hour old. Store backups outside the region they protect, so that a regional incident does not take the backup with it.
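A freshness check along these lines can confirm that the backup schedule is actually keeping up with the RPO. It is a sketch only: how backup timestamps are obtained depends on where the exports or snapshots are stored, so the list of timestamps here is a hypothetical input rather than something read from a platform API.

```python
from datetime import datetime, timedelta, timezone


def newest_backup_age(backup_times: list[datetime], now: datetime) -> timedelta:
    """Age of the most recent backup at the moment `now`."""
    if not backup_times:
        raise ValueError("no backups found")
    return now - max(backup_times)


def within_rpo(backup_times: list[datetime], rpo: timedelta, now: datetime) -> bool:
    """True if a failure at `now` would lose no more than `rpo` of changes."""
    return newest_backup_age(backup_times, now) <= rpo


# Hypothetical timestamps for illustration.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
backups = [now - timedelta(hours=3), now - timedelta(minutes=40)]
print(within_rpo(backups, rpo=timedelta(hours=1), now=now))
```

Running a check like this on a schedule, and alerting when it fails, turns the RPO from a documented number into a monitored one.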
Step 4: write the DR runbook
A DR runbook is the document an on-call operator follows during an incident, when there is no time to work things out. It covers three procedures.
- Failover. The steps to shift traffic away from a failed region or component. In an active-active topology, health-aware routing performs most of this automatically; the runbook records what the operator confirms (that traffic has shifted, that the surviving regions are within capacity) and any manual action needed if automatic failover does not fully cover the failure.
- Restore. The steps to rebuild a failed region or reapply lost configuration from the backups taken in Step 3: reapplying the GitOps manifests, restoring the management-plane configuration snapshot, and verifying the rebuilt deployment serves traffic correctly.
- Verification. The checks that confirm service is genuinely healthy after failover or restore, not merely reachable: a representative request succeeds end to end, error rate has returned to baseline, and the affected region is back in rotation.
Write each procedure as numbered, ordered steps that name the exact commands, surfaces, and expected results. The test of a good runbook is whether an operator who did not write it can follow it under pressure.
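The verification procedure in particular benefits from being expressed as explicit pass/fail checks. The sketch below assumes the operator has already gathered the three observations named above by whatever means the runbook specifies; the `Observation` type and its field names are hypothetical and not part of the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    request_succeeded: bool      # a representative request succeeded end to end
    error_rate: float            # current error rate, as a fraction
    baseline_error_rate: float   # error rate before the incident
    region_in_rotation: bool     # the affected region is serving traffic again


def failed_checks(obs: Observation, tolerance: float = 0.0) -> list[str]:
    """Return the verification checks that did not pass (empty means healthy)."""
    failures = []
    if not obs.request_succeeded:
        failures.append("representative request failed")
    if obs.error_rate > obs.baseline_error_rate + tolerance:
        failures.append("error rate above baseline")
    if not obs.region_in_rotation:
        failures.append("affected region not back in rotation")
    return failures


obs = Observation(True, error_rate=0.02, baseline_error_rate=0.01,
                  region_in_rotation=True)
print(failed_checks(obs, tolerance=0.005))
```

Returning the list of failed checks, rather than a single boolean, tells the operator which part of the recovery is incomplete.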
Step 5: test the runbook with at least one restore
A runbook that has never been exercised is unproven. Restore tests routinely surface a missing backup, a manifest that was never committed, or a step that assumed access the on-call operator does not have. A documented RPO and RTO are only credible once at least one restore has actually been performed against them.
- Schedule a restore test in a non-production environment, or in a maintenance window where a controlled failure is acceptable. Coordinating disruptive tests against the right instance is covered in Run Multiple Platform Instances.
- Simulate the failure the runbook is written for, for example by tearing down a region's data plane or by starting from an empty cluster.
- Follow the restore procedure from Step 4 exactly as written, without improvising. Where a step does not work as documented, correct the runbook rather than working around it.
- Time the restore from simulated failure to verified-healthy service, and compare it against the RTO. Confirm that the restored state is within the RPO: only configuration changed after the last backup should be missing, and that backup should be no older than the RPO at the time of the simulated failure.
- Record the outcome: the date, the scenario, the measured restore time, whether RPO and RTO were met, and any runbook corrections made. This record is the evidence that the DR plan has been tested.
Repeating the restore test on a schedule, and after any significant change to the deployment, keeps the runbook from drifting out of date as the platform evolves.
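The outcome of each restore test can be kept as a structured record so that successive tests are comparable. This is one possible shape, assuming the fields listed above; the class name, field names, and example values are illustrative only.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class RestoreTestRecord:
    test_date: date
    scenario: str
    restore_time: timedelta            # simulated failure to verified-healthy
    backup_age_at_failure: timedelta   # age of the backup that was restored
    rto: timedelta
    rpo: timedelta
    corrections: list[str] = field(default_factory=list)

    @property
    def rto_met(self) -> bool:
        return self.restore_time <= self.rto

    @property
    def rpo_met(self) -> bool:
        return self.backup_age_at_failure <= self.rpo


# Hypothetical result of one test.
record = RestoreTestRecord(
    test_date=date(2025, 1, 15),
    scenario="rebuild one region's data plane from an empty cluster",
    restore_time=timedelta(minutes=22),
    backup_age_at_failure=timedelta(minutes=35),
    rto=timedelta(minutes=15),
    rpo=timedelta(hours=1),
    corrections=["added missing step to restore the snapshot before verification"],
)
print(record.rto_met, record.rpo_met)
```

A record that shows a missed target is still useful evidence: it documents the gap and the runbook corrections made to close it before the next test.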
Step 6: load-test against agreed throughput and concurrency targets
Capacity planned on paper is an estimate until it is validated under load. A load test confirms that the deployment sustains the agreed throughput and concurrency while keeping latency and error rate within bounds, and it is the evidence that horizontal scaling actually delivers the planned capacity.
- Agree the targets with the field team: the peak requests per second, the peak concurrent in-flight requests, and the workload mix (token distribution and the streaming-versus-non-streaming split). The method for deriving these figures, and the reasons concurrency rather than monthly average drives sizing, are in Sizing and Scale.
- Build a load profile that reproduces the agreed mix against a representative set of models, rather than a single trivial request repeated, so that the test exercises the same per-request cost the real workload will.
- Drive the load against the deployment, ramping up to the agreed peak and holding it long enough for the data plane to reach steady state under sustained concurrency.
- Record the results that the criterion calls for: the 95th-percentile (p95) request latency and the error rate at the agreed peak. Because provider inference dominates end-to-end latency, separate the gateway's own contribution from provider time using the latency metric families described in Sizing and Scale, so that a latency result is not misattributed to the gateway when it originates upstream.
- Confirm horizontal scaling under the same test: add Agent Router replicas and verify that sustained throughput rises and per-replica load falls, demonstrating that capacity is added by scaling out rather than by enlarging a single instance.
A load test passes when the deployment holds the agreed peak with p95 latency and error rate within the agreed bounds, and the result is recorded alongside the RPO/RTO evidence as the deployment's validated capacity.
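The pass criterion can be computed from the raw samples a load-testing tool records. The sketch below assumes each sample is a `(latency_ms, succeeded)` pair taken during the hold at peak, uses the nearest-rank definition of a percentile, and computes p95 over all completed requests; the bounds are placeholders for the agreed values, and a different tool may define its percentiles slightly differently.

```python
import math


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples: list[tuple[float, bool]]) -> tuple[float, float]:
    """Return (p95 latency in ms, error rate as a fraction) for the samples."""
    latencies = [latency for latency, _ in samples]
    errors = sum(1 for _, ok in samples if not ok)
    return percentile_nearest_rank(latencies, 95), errors / len(samples)


def load_test_passes(samples: list[tuple[float, bool]],
                     p95_bound_ms: float, error_rate_bound: float) -> bool:
    p95, error_rate = summarize(samples)
    return p95 <= p95_bound_ms and error_rate <= error_rate_bound


# Hypothetical samples: 19 successes from 100 ms to 280 ms, one slow failure.
samples = [(100.0 + 10 * i, True) for i in range(19)] + [(900.0, False)]
print(summarize(samples))
print(load_test_passes(samples, p95_bound_ms=500, error_rate_bound=0.01))
```

Reporting p95 and error rate together matters because either can mask the other: a run that sheds load through errors can show a healthy p95, and a run with no errors can still breach the latency bound.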
What to do next
- Validate sizing assumptions. The load test in Step 6 depends on agreed throughput, concurrency, and peaking-factor figures. Confirm the open capacity questions before committing any number. See Sizing and Scale.
- Coordinate the regional fleet. Multi-region active-active operation is administered per instance, and configuration parity across regions is an operational discipline in its own right. See Run Multiple Platform Instances.
- Rehearse a regional drain. Exercising health-aware failover by draining a region and observing the redistribution complements the restore test in Step 5. See Load-Balance Across Regional Deployments.
Where to go next