Apply advanced routing rules
Introduction
Most production AI traffic is routed with a combination of fallback policies and traffic splitting. Those two patterns cover the common needs: keep requests successful when a provider fails, and distribute traffic deliberately for cost or evaluation. There is, however, a category of needs that neither tool addresses cleanly: routing decisions that depend on the request itself rather than on the static configuration of the chain, and lifecycle concerns such as decoupling application code from the exact provider model identifier in use today. These needs are covered by the platform's advanced routing surface. Advanced routing is not a single feature so much as a collection of capabilities that activate on different signals. Some are configured explicitly: in particular, model-name virtualisation, which lets the application speak a stable logical name while the gateway resolves it to whichever provider model is currently appropriate. Others are present by default and apply transparently to every request without configuration: session affinity, dynamic backend selection, and request and response transformation across provider APIs. This guide covers both: how to configure the surface that needs configuration, and what to expect from the surface that does not.
Persona: Developer working in the Agent Router Console.
Estimated time: 15--30 minutes, depending on whether model-name overrides and A/B or canary patterns are configured during the same session.
When this guide applies
Advanced routing is the right level to engage with when one or more of the following are true:
| Situation | Capability that addresses it |
|---|---|
| Application code keeps changing whenever the upstream provider releases a new model version | Model-name override: expose a stable logical model name and resolve it to the provider's identifier in the gateway |
| Two model versions need to be compared head-to-head under real production traffic | A/B with traffic splitting: combine model-name overrides with weight-based distribution |
| A new model version needs to be rolled out gradually to limit blast radius | Canary deployment: start with a small weight on the new model and shift over time |
| Multi-turn conversations or agent workflows are involved | Session affinity: handled automatically; no configuration is required |
| Backend load varies, and latency matters | Dynamic backend selection: handled automatically; no configuration is required |
| The application speaks OpenAI but should be free to route to any provider | Request and response transformation: handled automatically; no configuration is required |
For pure resilience or pure traffic distribution by weight, the simpler guides, Improve Resilience with Fallbacks and Reduce Cost with Traffic Splitting, are the right starting points. Advanced routing is layered on top of those patterns rather than replacing them.
Outcomes
By the end of this guide:
- At least one logical model name has been defined that decouples the application from a specific provider model identifier.
- The override has been combined with a traffic split to run an A/B between two model versions, or to operate a canary rollout for a new model version.
- The behaviours that operate automatically (session affinity, dynamic backend selection, and cross-provider transformation) are understood, even though they require no configuration.
Prerequisites
This guide builds on the routing configuration patterns established in the earlier dev guides. Specifically:
- A working API key with a routing configuration attached, as set up in Route Requests Across Providers.
- Familiarity with traffic splitting weights, as covered in Reduce Cost with Traffic Splitting. Model-version experiments are implemented as traffic splits over logical model names.
- At least two enabled models in the Admin Dashboard, or one model with multiple versions available for promotion.
Step 1: define a logical model name
The most direct piece of advanced routing to configure is the model-name override, exposed in the routing configuration as the modelNameOverride field: a mapping from a logical name the application speaks to the specific provider model identifier the gateway dispatches against. Without this, every provider model version bump becomes an application change; with it, version changes become a routing-configuration change inside the platform.
The override is attached to a route entry on the API key's routing configuration. The choice of logical name belongs to the application team and should be stable across versions. A few useful conventions:
| Logical name | Resolves to | When to use this pattern |
|---|---|---|
my-gpt4 | gpt-4o-2024-08-06 | A simple alias that hides the specific dated version from the application |
my-claude | claude-sonnet-4-20250514 | An alias that hides a cross-provider model identifier behind a stable name |
stable-chat | gpt-4o-mini-2024-07-18 | A long-lived alias used by parts of the system that prefer predictability over capability |
next-gen-chat | gpt-4o-2024-11-20 | A name reserved for whichever version is currently being evaluated for promotion |
Configure the override:
- Open the detail page for the API key whose routing configuration should expose the logical name.
- Open the routing configuration and add a route entry.
- Set the logical name in the route entry (for example,
my-gpt4). - Set the resolved provider model identifier (for example,
gpt-4o-2024-08-06). - Save the configuration.
The application now requests my-gpt4 in the model field of its OpenAI-compatible payload, and the gateway forwards the request to the configured provider model. The application code is unaware of the underlying version.
Step 2: combine overrides with traffic splitting
Logical names are most useful when they are paired with the traffic-splitting mechanics covered in the previous guide. Two routes can resolve the same logical name to two different provider models, and a weighted split distributes requests between them. The application speaks one stable name; the gateway runs an A/B in the background.
A typical A/B setup against the logical name chat-model:
| Route | Logical name | Resolves to | Weight |
|---|---|---|---|
| A | chat-model | gpt-4o-2024-08-06 | 50 |
| B | chat-model | gpt-4o-2024-11-20 | 50 |
Configure:
- In the routing configuration for the chosen API key, add two route entries with the same logical name and different resolved models.
- Switch the routing strategy to Traffic Splitting if it is not already set.
- Assign weights to each route entry (50/50 for an A/B; a heavier weight on the current version for an evaluation that should not perturb production much).
- Save and confirm the Active toggle is on.
Quality and performance comparisons are then made through Request Logs and Usage Analytics, both of which expose the resolved model per request even though the application only ever saw chat-model.
A canary deployment is structurally identical to an A/B but with intentionally lopsided weights. A common progression:
| Stage | Current version weight | New version weight |
|---|---|---|
| Initial canary | 95 | 5 |
| Hold-and-observe | 80 | 20 |
| Expand | 50 | 50 |
| Cutover | 0 | 100 |
The weights are adjusted in the Console at each stage; the application requires no change at any point. Once the cutover is complete, the route pointing at the old version can be removed entirely.
Route by compliance, cost, or latency policy
Weighted splits distribute one logical name across backends by chance. The same logical name can instead be resolved deterministically by keying the routing decision on attributes of the request, a policy condition rather than a weight. This is an application of the attribute-based dispatch already described: a routing rule inspects request attributes (for example, a tenant tag, a priority header, or a traffic classification) and selects the route entry whose condition matches, so that chat-model resolves to a different backend depending on the kind of request that arrived.
Three policy conditions cover the common cases:
| Condition on the request | Resolves chat-model to | Rationale |
|---|---|---|
| Tagged as regulated or residency-bound | An in-region, compliant provider | Keeps regulated traffic on a backend that satisfies data-residency and retention constraints |
| Tagged as low-priority or bulk | A cheaper backend | Reserves premium capacity for traffic that needs it and lowers cost on the rest |
| Tagged as interactive | A lower-latency backend | Protects the responsiveness of user-facing requests |
The route entries are configured exactly as in the preceding split, the same logical name mapped to different resolved models, but each entry carries a match condition instead of a weight, and the gateway dispatches to the first entry whose condition the request satisfies. Where cost is the driver, this attribute-based approach complements the proportional split in Reduce Cost with Traffic Splitting: the split distributes by chance, whereas a policy condition routes by a known property of the request. Compliance-driven routing is most often enforced on the operator side, where the residency and no-retention guarantees are configured for the backend itself; see Configure Data Residency and No-retention.
Step 3: understand the behaviour that requires no configuration
A meaningful portion of the platform's advanced routing happens behind the scenes. The behaviours below apply to every request and do not need to be enabled or tuned. This step is short on configuration but useful to read once, because the behaviour shapes how the gateway responds under load and across providers.
Dynamic backend selection
When more than one backend is eligible to serve a request (for example, two replicas of a self-hosted model behind a load-balanced endpoint), the gateway evaluates live backend metrics and chooses the best target. The mechanism is an InferencePool paired with an Endpoint Picker Provider: rather than relying solely on static weights or ordered fallback lists, the Endpoint Picker evaluates live metrics for each candidate and routes to the backend with the best capacity and cache affinity, reducing latency and improving throughput. Three signals contribute to the decision:
| Signal | What it captures |
|---|---|
| KV-cache usage | Memory pressure on each backend. Heavily loaded backends are deprioritised. |
| Queue depth | Number of pending requests on each backend. Less-loaded backends are preferred. |
| Prefix cache scoring | How well a backend's cache matches the request's prompt prefix. Better matches reduce latency. |
Dynamic selection operates only within the eligible set defined by the routing configuration. A fallback policy that limits requests to a specific provider still constrains dynamic selection to that provider; the policy boundary always takes precedence.
Session affinity
Multi-turn conversations, agent loops, and MCP sessions benefit from being processed by the same gateway instance throughout their lifetime, because state and cache accumulate locally. The gateway architecture guarantees this affinity:
- The proxy component and the external processor are deployed as a sidecar pair, so once a session is established, subsequent requests in that session route to the same processor.
- MCP sessions use encoded multi-backend session identifiers that pin the session to whichever combination of backends it was established against.
The affinity matters in three cases: multi-turn conversations where context accumulates on a specific backend, stateful agent interactions that maintain tool state across calls, and MCP sessions that manage connections to multiple tool servers.
No configuration is required to obtain this behaviour; it is a property of how the data plane is deployed.
Request and response transformation
The gateway exposes an OpenAI-compatible request surface but routes to a wide range of provider APIs that do not natively speak OpenAI. The translation between the two happens transparently in the gateway's processing pipeline:
| Transformation | Behaviour |
|---|---|
| Header mutations | Provider-specific authentication headers are set or replaced |
| Body mutations | JSON fields are added or rewritten, for example, injecting a default max_tokens if the provider requires one |
| Path rewriting | The OpenAI path is rewritten to the provider's native endpoint, for example, /v1/chat/completions becomes the Anthropic Messages endpoint at /anthropic/v1/messages |
| Model field rewriting | The model field is rewritten to the provider-specific model identifier, picking up any logical-name override from Step 1 |
| Response normalisation | The provider's response is translated back into the OpenAI-compatible shape the calling application expects |
The result is that an application written against the OpenAI SDK can route through the gateway to OpenAI, Anthropic, Google, Azure OpenAI, Mistral, or any other supported provider with no per-provider integration code.
Example transformation flow
When an OpenAI Chat Completions request is routed to Anthropic Claude, the fields are transformed in sequence:
- Path:
/v1/chat/completionsis rewritten to the Anthropic Messages endpoint. - Body: the
messagesarray is converted to the Anthropic message format, andmax_tokensis injected if absent. - Headers:
Authorization: Beareris replaced with the provider-specific auth header. - Model: the
modelfield is verified against the provider model identifier. - Response: the Anthropic response is translated back into the OpenAI Chat Completions format.
The calling application receives a response in the exact format it requested, regardless of which provider served it.
What to do next
- Use your own provider credentials: introduce BYOK credentials alongside the routes defined in this guide. The logical-name pattern composes naturally with BYOK, because the override is applied before credentials are selected. See Use Your Own Provider Credentials.
- Monitor traffic and usage: evaluate the results of an A/B or canary by examining the per-resolved-model breakdown in usage analytics. See Monitor Traffic and Usage.
- Export telemetry to an observability stack: push per-route metrics into an existing observability platform so quality comparisons can run alongside the rest of the application's data. See Export Telemetry to an Observability Stack.
Where to go next