Skip to main content

Load-balance across regional deployments

A single model is rarely served from a single place. The same model is frequently available from several regional deployments at once (an Azure OpenAI deployment in West Europe and another in East US, a self-hosted model running in two Kubernetes clusters on different continents, or a managed provider exposed through endpoints in three regions for capacity reasons). Treating those deployments as separate backends works, but it leaves throughput, latency, and resilience on the table. Tetrate Agent Router addresses this by letting several regional deployments of the same model sit behind one logical model and balancing requests across them at the data plane, using the Kubernetes Gateway API inference extension: an InferencePool that groups the regional endpoints, paired with an Endpoint Picker that chooses among them per request using live backend signals rather than fixed weights.


Persona: Platform operator working in the Admin Dashboard and the Kubernetes data plane.

Estimated time: 30--60 minutes, depending on how many regional backends are involved and whether a maintenance drain is rehearsed during the same session.

When this guide applies

This guide is relevant when the same model is reachable through more than one regional deployment and the operator wants the platform to use all of them:

SituationWhat it covers
One model is available from two or more regional deployments and only one is currently in useGrouping the deployments into a single InferencePool behind one logical model
A single deployment's rate limit or capacity caps aggregate throughputHow pooled endpoints raise the ceiling and how the Endpoint Picker spreads load to stay under per-endpoint limits
Callers are spread across geographies and latency varies by originHow proximity-aware and load-aware selection lowers tail latency
A regional incident takes down one deployment and traffic must stay on the survivorsHealth-aware selection and how it relates to fallback policies
A region needs to be taken offline for maintenance without dropping requestsDraining a region and confirming the redistribution

For a model served from a single place, this guide does not apply; there is nothing to balance, and a plain backend definition is sufficient. The pattern becomes worthwhile only once a second deployment of the same model exists.

Outcomes

By the end of this guide:

  • Several regional deployments of one model have been grouped into a single InferencePool exposed as one logical model.
  • The Endpoint Picker's role in per-request selection across the pooled endpoints is understood, along with the live signals it acts on.
  • The interaction between in-pool balancing, fallback policies, and traffic splitting is clear, including which layer acts first.
  • A region has been drained for maintenance, and the resulting shift in distribution has been observed.

Prerequisites

  • Administrator access to the Admin Dashboard, typically the super_admin role.
  • An Enterprise deployment, since pooling regional backends operates against the Kubernetes data plane. The pool definition lives in the data plane manifests, edited through whatever GitOps or manual process the deployment uses; the logical model that fronts it is managed in the Admin Dashboard. The single-versus-multi management split is described in Run multiple platform instances.
  • At least two regional deployments of the same model, each reachable from the cluster where the data plane runs. Connecting provider subscriptions in more than one region or cloud is covered in Connect provider subscriptions across clouds.
  • The Kubernetes context required to apply InferencePool and Endpoint Picker resources to the data plane.

Step 1: decide which deployments belong in one pool

A pool only makes sense for deployments that are interchangeable from the caller's point of view. The defining test is whether a request served by any member of the pool returns an equivalent result. Two Azure OpenAI deployments of the same model version in different regions pass this test; the same model from two different providers does not, because the response shape, behaviour, and identifiers diverge; that is a fallback or traffic-splitting concern, not a pool.

Three considerations decide pool membership:

  • Model equivalence. Every endpoint in the pool should serve the same model and, where it matters, the same model version. A pool that mixes versions silently turns load balancing into an uncontrolled A/B test; keep version experiments to the traffic-splitting mechanism described in Apply advanced routing rules.
  • Regional spread. The deployments should fail independently and sit at different distances from the caller population. Two endpoints in the same availability zone add capacity but little resilience; two endpoints in different regions add both.
  • Capacity and limits. Each endpoint carries its own rate limit and capacity. The aggregate ceiling of the pool is the sum of the members' limits, which is the throughput argument for pooling, but only if the Endpoint Picker spreads load rather than saturating one member first.

The output of this step is a short list: one logical model name, and the set of regional endpoints that will stand behind it.

Step 2: group the regional endpoints into an InferencePool

The platform balances across endpoints using the Kubernetes Gateway API inference extension. Rather than a static list of weighted backends, the regional deployments are grouped into an InferencePool, a set of endpoints serving the same model, and an Endpoint Picker chooses among them on every request. This is the same dynamic-selection mechanism described in Apply advanced routing rules; here it is applied deliberately across regional deployments rather than across replicas of a single deployment.

The InferencePool references the endpoints and names the Endpoint Picker that scores them. The exact resource shape depends on the inference-extension version installed with the data plane; the manifest below is illustrative of the structure rather than a literal field reference:

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
name: chat-model-regional
spec:
# The endpoints that serve this model in each region.
# Selector and endpoint wiring follow the inference-extension
# version installed with the data plane.
selector:
app: chat-model
extensionRef:
# The Endpoint Picker that scores candidates per request.
name: chat-model-endpoint-picker

Apply the pool and its Endpoint Picker through the data plane's normal change process:

  1. Define the InferencePool that groups the regional endpoints identified in Step 1.
  2. Reference the Endpoint Picker that will score those endpoints per request.
  3. Apply the manifests to the cluster through the GitOps or manual workflow the deployment uses.
  4. Confirm the pool reports its member endpoints as registered before any traffic is routed to it.

Once the pool is healthy, it is exposed to the rest of the platform as a single logical model.

Step 3: front the pool with one logical model

Callers should not address regions directly. The pool is presented to developers as one logical model in the Admin Dashboard catalogue, and the data plane resolves that name to the pool. The logical-model pattern is the same one used for version virtualisation in Apply advanced routing rules, a stable name the application speaks, resolved by the gateway, applied here so that a single name fans out across regions.

  1. In the Admin Dashboard model catalogue, confirm the logical model that maps to the InferencePool is enabled.
  2. Verify that the logical model resolves to the pool rather than to any single regional endpoint.
  3. Confirm developers reference only the logical model name in the model field of their requests.

With this in place, a request for the logical model is handed to the InferencePool, and the Endpoint Picker decides which regional endpoint serves it. No application or API-key change is needed when regions are later added to or removed from the pool; the logical name stays constant.

Step 4: understand how the Endpoint Picker distributes requests

The Endpoint Picker is what makes pooling more than a round-robin. On each request it evaluates live metrics for every candidate endpoint and routes to the one with the best capacity and cache affinity, which lowers latency and raises throughput compared with static weights. Three signals drive the decision:

SignalWhat it captures
KV-cache usageMemory pressure on each endpoint. Heavily loaded endpoints are deprioritised.
Queue depthNumber of pending requests on each endpoint. Less-loaded endpoints are preferred.
Prefix cache scoringHow well an endpoint's cache matches the request's prompt prefix. Better matches reduce latency.

Two consequences follow from this for regional pools specifically:

  • Throughput. Because the picker steers away from endpoints that are filling up, the pool absorbs more aggregate load before any single endpoint hits its rate limit. The effective ceiling approaches the sum of the members' limits rather than the limit of whichever endpoint was chosen first.
  • Latency. An endpoint that is geographically distant or already saturated tends to report higher queue depth and weaker cache affinity for local traffic, so the picker naturally favours the closer, warmer endpoint for a given caller, without any explicit geographic rule.

Selection operates only within the pool. The Endpoint Picker never routes outside the set of endpoints in the InferencePool; widening or narrowing the candidate set is a change to pool membership, not to the picker.

Step 5: layer fallback and traffic splitting correctly

In-pool balancing and the routing patterns from the developer guides are not alternatives; they stack, and the order in which they act matters.

  • In-pool balancing acts first. For a request routed to the logical model, the Endpoint Picker selects among the healthy regional endpoints inside the pool. This is the inner loop, and it handles the common case: one endpoint is busier or further away, so another serves the request.
  • Fallback acts second. A fallback policy, as described in Improve resilience with fallbacks, is the outer loop. The pool is treated as a single backend in the fallback chain. Only when the pool as a whole cannot serve a request (every regional endpoint is unhealthy or rate-limited) does the gateway walk to the next backend in the chain, which is typically a different provider or a different model class. In-pool balancing absorbs single-region trouble; cross-provider fallback absorbs whole-pool trouble.
  • Traffic splitting sits alongside. A weighted split, as covered in Apply advanced routing rules, selects which logical backend a request is sent to. When the selected backend is a pooled logical model, the Endpoint Picker then balances within it. A split can therefore send a percentage of traffic to a regional pool and the remainder elsewhere, with intra-pool balancing applied to the pool's share.

The mental model is three nested layers: traffic splitting chooses the logical backend by weight, fallback orders the logical backends for failure, and the Endpoint Picker balances among the physical endpoints inside a pooled backend. Each layer is configured independently and the boundaries do not blur; the picker never crosses a pool edge, and fallback never reaches inside a pool to pick an endpoint.

Step 6: rely on health-aware selection and drain a region for maintenance

The Endpoint Picker's signals double as a health filter. An endpoint that stops responding, returns errors, or reports saturation is deprioritised or excluded automatically, so a regional incident shifts load onto the survivors without operator action and without a fallback event at the chain level. This is the difference between in-pool resilience and chain-level fallback: a single failed region is handled silently inside the pool, whereas fallback is reserved for the case where the whole pool is unavailable.

Planned maintenance uses the same machinery deliberately. Draining a region means removing its endpoint from the pool's eligible set so that in-flight requests complete while new requests are steered elsewhere:

  1. Identify the regional endpoint to be taken offline.
  2. Remove or cordon that endpoint in the InferencePool through the data plane's change process, so the Endpoint Picker stops selecting it for new requests.
  3. Allow in-flight requests on the drained endpoint to complete rather than terminating them abruptly.
  4. Confirm the remaining endpoints absorb the redistributed load and that aggregate latency and error rate stay within the expected band.
  5. Perform the maintenance, then return the endpoint to the pool and confirm the Endpoint Picker resumes selecting it.

Because callers only ever address the logical model, a drain and a restore are both invisible to the application; the only observable effect is a shift in which region serves each request.

Step 7: observe the distribution

A balancing policy that is not observed cannot be tuned. Request Logs and usage analytics record the resolved backend for each request, which for a pooled logical model reveals which regional endpoint actually served it.

  1. In the Console, open Monitoring → Request Logs.
  2. Filter to traffic for the logical model that fronts the pool.
  3. Inspect the resolved backend per request to confirm requests are spread across the regional endpoints rather than concentrated on one.
  4. During a drain, confirm the drained region stops appearing as a resolved backend while the survivors pick up its share.

The healthy steady state is a spread across endpoints that tracks caller geography and per-endpoint load, not a perfectly even split, since the Endpoint Picker optimises for latency and capacity rather than for equal counts. A sudden collapse onto a single endpoint, or a cluster of chain-level fallback events, indicates that one or more regional endpoints have become unhealthy and is worth correlating against the provider's regional status. Richer filtering by time range, status, and resolved backend is covered in Monitor traffic and usage.

What to do next