Enterprise Tier
Set token, request, and concurrency rate limits
A single misbehaving consumer can degrade the platform for everyone. An agent framework caught in a retry loop hammers the gateway thousands of times a minute. A batch job opens hundreds of parallel connections and saturates the upstream provider's concurrency allowance. A new integration ships with no client-side throttling and sends traffic in bursts that the rest of the fleet then waits behind. Rate limits are the control that keeps any one consumer from consuming more than its share of finite capacity, so that the platform stays responsive under load and upstream providers are not pushed past their own quotas.
Tetrate Agent Router exposes three distinct rate limits, configured in the Admin Dashboard. Tokens per minute (TPM) caps how much model work a consumer can drive; requests per minute (RPM) caps how often it can call regardless of payload size; and a maximum-parallel-requests limit caps how many calls it can have in flight at once. Each addresses a different failure mode, and they are usually combined. This guide covers what each limit controls, how to apply limits at the model, API key, user, and group scopes, how limits at different scopes compose, what a caller experiences when a limit is hit, and how to confirm whether limits are actually being reached.
A rate limit is flow control, not a spend cap. It protects shared capacity by smoothing the rate of traffic; it does not stop a consumer from spending a budget over the course of a month. Spend caps are a separate control, covered in Working with Budgets and in Manage budgets for users and groups. The two are complementary: a rate limit bounds the worst-case burst, while a budget bounds the cumulative total. The difference is treated in more detail in Step 5 below.
Persona: Platform operator working in the Admin Dashboard.
Estimated time: 15 to 25 minutes for an initial pass; ongoing as workloads evolve.
When this guide applies
Rate limits are the right concern in any of these situations:
| Situation | Limit type that fits |
|---|---|
| A retry loop or runaway agent is sending far more requests than a healthy client would | Requests per minute (RPM) |
| A consumer drives large token volumes (long contexts and verbose completions) that strain upstream capacity | Tokens per minute (TPM) |
| A batch job opens many connections at once and saturates the provider's concurrency allowance | Maximum parallel requests |
| One team's traffic should never crowd out another team sharing the same deployment | A per-group limit across all of that group's keys |
| A new or untrusted integration should be bounded before its real traffic shape is known | A deliberately tight limit at the key scope |
| Spend over a month should be capped regardless of request rate | A budget, not a rate limit. See Working with Budgets |
Outcomes
By the end of this guide:
- The three limit types (TPM, RPM, and maximum parallel requests) are understood, along with the failure mode each one addresses.
- At least one rate limit is applied at the scope appropriate to the workload it protects.
- The rule for how limits at different scopes compose (the most restrictive applicable limit wins) is understood.
- The caller-side experience of a hit limit, an HTTP `429` response, is understood, along with the back-off behaviour applications are expected to implement.
- The distinction between a rate limit and a budget is clear, so the right control is reached for in each situation.
- A way to confirm whether limits are being hit is in place.
Prerequisites
- Administrator access to the Admin Dashboard: typically the `super_admin` role, or a role with permission to manage limits.
- At least one provider and model already provisioned. See Provision models and providers.
- For per-group limits, the groups that traffic is attributed to. Group membership is sourced from the corporate identity provider, as described in Map Entra ID groups to business functions.
- Coordination with the developer teams that own the affected keys. Limits are enforced inline, so a misconfigured limit produces visible production impact.
Step 1: choose the limit type
The three limits answer different questions about a consumer's traffic. Pick the one, or the combination, that matches the capacity being protected.
| Limit | What it caps | The failure mode it addresses |
|---|---|---|
| Tokens per minute (TPM) | Total tokens processed per minute, input plus output | Heavy model work: long contexts and verbose completions that consume upstream throughput out of proportion to the request count |
| Requests per minute (RPM) | Number of requests per minute, regardless of size | High call frequency: retry storms, tight polling loops, and chatty agents that make many small calls |
| Maximum parallel requests | Number of requests in flight at the same moment | Concurrency exhaustion: batch jobs and fan-out patterns that open many simultaneous connections and saturate the provider's concurrency allowance |
TPM and RPM are rate measures; they constrain traffic averaged over a window of time. Maximum parallel requests is a concurrency measure; it constrains an instantaneous count and is unconcerned with rate. A consumer can stay well under its RPM yet still exhaust the parallel-request limit by issuing a single large burst, and the reverse is equally possible. Because they catch different shapes of traffic, the limits are usually applied together.
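The difference between a rate measure and a concurrency measure can be illustrated with a small calculation. This is a minimal sketch, not the gateway's implementation: the function names, the `(start, end)` interval representation, and the sample traffic are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds) of one request


def requests_in_window(intervals: List[Interval], window_start: float,
                       window_len: float = 60.0) -> int:
    """Rate measure: how many requests started inside the window."""
    window_end = window_start + window_len
    return sum(1 for start, _ in intervals if window_start <= start < window_end)


def peak_in_flight(intervals: List[Interval]) -> int:
    """Concurrency measure: the largest number of requests open at one moment."""
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # At equal timestamps, process completions (-1) before new starts (+1).
    events.sort(key=lambda event: (event[0], event[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Hypothetical traffic shapes.
burst = [(0.0, 5.0)] * 50                                  # 50 calls at once, 5 s each
steady = [(i * 0.1, i * 0.1 + 0.05) for i in range(600)]   # 600 short calls, 100 ms apart

print(requests_in_window(burst, 0.0), peak_in_flight(burst))    # 50 50
print(requests_in_window(steady, 0.0), peak_in_flight(steady))  # 600 1
```

Under these assumed shapes, the burst would pass a 600 RPM limit but breach a parallel-request limit of 20, while the steady stream would sit right at a 600 RPM limit with only one request in flight at a time.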
A reasonable default for a new workload is a moderate RPM to absorb retry storms, a TPM sized to the expected token volume, and a parallel-request limit that reflects how much fan-out the workload legitimately needs. The values are tightened or relaxed once real traffic reveals the workload's shape; the sizing approach is covered in Step 4.
Step 2: choose the scope
The same three limit types can be applied at several scopes. The scope determines which traffic the limit is measured against.
| Scope | What the limit governs | Typical use |
|---|---|---|
| Per model | All traffic to one model, across every consumer | Protecting a single upstream model or deployment from aggregate overload, independent of who is calling it |
| Per API key | All traffic presented with one key | Bounding a specific application or integration to its expected envelope |
| Per user | All traffic across every key a user owns | Holding an individual developer's total footprint in check, regardless of how many keys they hold |
| Per group | All traffic across every member of a group | Reserving a fair share of capacity for a business function, so one team does not crowd out another |
Per-key limits are the most precise and the most common, because a key usually maps to a single workload. Per-user and per-group limits sit above the key scope and govern aggregate footprint: a user's group is resolved from the identity provider at request time, so traffic is attributed to the right group automatically once the mapping in Map Entra ID groups to business functions is in place. Per-model limits are orthogonal to the others; they cap total load on an upstream model regardless of which consumer is driving it, and are the right tool when a particular model or deployment has a known capacity ceiling.
Limits are set from the Admin Dashboard surface that manages the relevant entity: the model, the key, the user, or the group. The fields presented are the three limit types from Step 1; a limit left unset at a given scope is simply not enforced at that scope.
Step 3: understand how scopes compose
A single request can fall under limits at more than one scope at once. A request made with a particular key, by a particular user, who belongs to a particular group, against a particular model is subject to the limits set at all four scopes simultaneously.
The composition rule is straightforward: every applicable limit is evaluated independently, and the request is rejected if it would breach any one of them. The most restrictive applicable limit is therefore the one that takes effect. A generous per-group allowance does not loosen a tight per-key limit, and a generous per-key limit does not override a tight per-model ceiling. The limits do not add together, and a higher-scope limit does not raise a lower one.
A worked example makes the rule concrete:
- A group is allowed 10,000 RPM across all its members.
- A key belonging to a member of that group is allowed 500 RPM.
- The model that key targets is allowed 2,000 RPM across all consumers.
Traffic on that key is held to 500 RPM, because the per-key limit is the most restrictive of the three that apply. If a second key in the same group also runs near its own limit, the two together are still held under the group's 10,000 RPM; and all consumers of the model together are held under the model's 2,000 RPM. Each ceiling is enforced at its own scope, and a request must satisfy all of them to pass.
The practical consequence is that the tightest limit governs. When a limit appears not to be taking effect, the usual cause is a tighter limit at another scope firing first. The way to confirm which limit is firing is covered in Step 6.
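The composition rule can be expressed in a few lines. The sketch below is illustrative only and assumes a simple per-window counter at each scope; the `ScopeLimit` type and both functions are hypothetical names, not part of the product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScopeLimit:
    scope: str                 # "key", "user", "group", or "model"
    rpm_limit: Optional[int]   # None means unset, so not enforced at this scope
    used_this_window: int      # requests already counted at this scope


def breached_scopes(limits: List[ScopeLimit]) -> List[str]:
    """Scopes whose RPM ceiling one more request would exceed."""
    return [
        lim.scope
        for lim in limits
        if lim.rpm_limit is not None and lim.used_this_window + 1 > lim.rpm_limit
    ]


def tightest_ceiling(limits: List[ScopeLimit]) -> Optional[int]:
    """Upper bound on traffic subject to all limits: the minimum, never the sum."""
    configured = [lim.rpm_limit for lim in limits if lim.rpm_limit is not None]
    return min(configured) if configured else None


# The worked example: group 10,000 RPM, key 500 RPM, model 2,000 RPM.
# The usage counters and the unset per-user limit are assumed values.
example = [
    ScopeLimit("group", 10_000, used_this_window=500),
    ScopeLimit("key", 500, used_this_window=500),
    ScopeLimit("model", 2_000, used_this_window=1_200),
    ScopeLimit("user", None, used_this_window=500),
]
print(tightest_ceiling(example))  # 500
print(breached_scopes(example))   # ['key']
```

A request is admitted only when `breached_scopes` returns an empty list, which mirrors the rule that every applicable limit must be satisfied independently.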
Step 4: size and adjust limits
Sizing is the part of this work that takes the most judgement. Two failure modes are worth avoiding.
- Limits set too tight. Normal traffic hits the ceiling, healthy clients start receiving `429` responses, and the workload looks, from the consumer's side, as though the platform is failing. The cost of this failure mode is immediate and visible.
- Limits set too loose. A runaway consumer is never actually constrained, and the limit becomes a number that never fires. The cost surfaces later, as overloaded upstreams or a degraded experience for other consumers sharing the same capacity.
The dependable approach is to size each limit from observed traffic. The expected peak is read from the usage surface (see Monitor traffic and usage). The highest legitimate per-minute value over a representative window is taken as the baseline, and the limit is set comfortably above that baseline but well below any level that would constitute overload. Setting a limit at roughly two to three times the observed peak is a common starting point, adjusted for how bursty the workload is.
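The sizing arithmetic is simple enough to sketch. The helper below is a hypothetical illustration, not a platform feature: `suggest_limit`, its `headroom` default of 2.5 (the midpoint of the two-to-three-times guidance), and the sample values are all assumptions.

```python
import math
from typing import Optional, Sequence


def suggest_limit(per_minute_samples: Sequence[float], headroom: float = 2.5,
                  overload_ceiling: Optional[int] = None) -> int:
    """Suggest a limit as a multiple of the observed legitimate peak.

    overload_ceiling, if given, is the level that must never be exceeded,
    for example a known upstream capacity.
    """
    if not per_minute_samples:
        raise ValueError("no traffic history: start with a deliberately tight limit")
    suggested = math.ceil(max(per_minute_samples) * headroom)
    if overload_ceiling is not None:
        suggested = min(suggested, overload_ceiling)
    return suggested


# Hypothetical per-minute request counts from a representative window.
observed_rpm = [120, 180, 240, 200]
print(suggest_limit(observed_rpm))                        # 600
print(suggest_limit(observed_rpm, overload_ceiling=500))  # 500
```

A burstier workload would justify a larger `headroom`; a steady one could run closer to 2.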
For a brand-new workload with no traffic history, a deliberately tight limit is the safer starting point. Real traffic surfaces the true shape quickly, and the limit is relaxed once the baseline is known. Relaxing a limit that proved too tight is a low-risk adjustment; discovering that a generous limit allowed an overload is harder to recover from.
Limits are adjusted from the same surface used to set them. A change takes effect on subsequent requests, with no restart and no downtime window. Because the change is felt by live traffic, an adjustment to a production key is best coordinated with the team that owns it.
Some workloads are quiet most of the time and very loud occasionally, for example end-of-month batch runs and scheduled report generation. A limit sized for the quiet baseline will fire on the burst. The cleanest answer is usually to isolate the bursty work on its own key with its own, more generous limit, which keeps the steady-state key tight and the usage attribution clean. The patterns for those workloads are covered in Run batch and long-running jobs.
Step 5: distinguish rate limits from budgets
Rate limits and budgets are often confused because both impose a ceiling, but they control different things and fail in different ways.
| Aspect | Rate limit | Budget |
|---|---|---|
| What it controls | The rate and concurrency of traffic: TPM, RPM, and parallel requests | Cumulative spend over a period |
| What it protects | Shared capacity: the gateway and upstream providers | A financial ceiling |
| Time horizon | Per-minute window, or instantaneous for concurrency | A billing period: a day or a month |
| Effect when reached | Individual requests rejected with 429; traffic resumes in the next window | Spend is capped for the period; access is curtailed until the period resets or the budget is raised |
| Right control for | Smoothing bursts, preventing overload, bounding a runaway in real time | Keeping cumulative cost under a contractual or departmental ceiling |
The two are complementary rather than interchangeable. A rate limit bounds the worst case in any given minute but says nothing about the total spent over a month; a consumer running steadily just under its rate limit can still exhaust a monthly budget. A budget bounds the monthly total but does nothing to prevent a single bad minute from overloading an upstream. Most workloads warrant both: a rate limit for capacity protection and a budget for spend control.
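A short calculation shows why a rate limit alone does not bound monthly spend. Every figure here is hypothetical: the TPM limit, the utilisation, the per-token price, and the budget are invented for illustration and do not reflect any real provider's pricing.

```python
def monthly_spend(tokens_per_minute: float, price_per_million_tokens: float,
                  days: int = 30) -> float:
    """Spend for a consumer running at a constant token rate for a whole month."""
    minutes = days * 24 * 60
    return tokens_per_minute * minutes / 1_000_000 * price_per_million_tokens


# Hypothetical figures: a 100,000 TPM limit, a consumer running steadily at 90%
# of it, a blended price of 2.00 per million tokens, and a 5,000 monthly budget.
tpm_limit = 100_000
steady_tpm = 0.9 * tpm_limit
budget = 5_000.0

spend = monthly_spend(steady_tpm, price_per_million_tokens=2.0)
print(spend)           # 7776.0
print(spend > budget)  # True: never rate limited, yet over budget
```

The consumer in this sketch never receives a single 429, because it stays under its TPM limit the whole time, and still exceeds the assumed budget.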
Budgets are configured separately. For the full treatment of spend caps at the user and group scopes, see Manage budgets for users and groups; for the broader budgeting discipline, see Working with Budgets.
Step 6: monitor whether limits are being hit
A limit that never fires and a limit that fires constantly are both worth knowing about: the first may be set too loosely to matter, and the second is likely throttling healthy traffic. Both are visible in the platform's usage and traffic surfaces.
The signal to watch for is the rate of 429 responses, broken down by the scope the limit is set on. A workflow that pairs well with this guide:
- Open the usage and traffic view. See Monitor traffic and usage.
- Apply a time range that matches the cadence of the workload under review.
- Break the traffic down by the scope the limit is set on: by key, by user, by group, or by model.
- Look at the proportion of requests returning `429`. A small, occasional fraction during peaks is normal; a sustained high fraction indicates the limit is too tight for the legitimate workload.
- Correlate the `429` rate with the limit type. A spike concentrated in one limit type (a parallel-request limit firing while RPM stays clear, for instance) points to the specific control that needs adjustment.
When a limit is firing more than expected, the choices are to raise it if the traffic is legitimate (see Step 4), or to address the consumer if the traffic is not, by working with the owning team or, in the case of a suspected-compromised key, revoking it.
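Where request records can be exported for offline analysis, the breakdown described above amounts to a grouped ratio. The sketch below assumes a hypothetical record shape (dictionaries with `key`, `group`, and `status` fields); the platform's actual export format is not specified in this guide.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def rejection_share(records: Iterable[Mapping[str, object]],
                    scope_field: str) -> Dict[object, float]:
    """Fraction of requests answered with 429, grouped by the chosen scope field."""
    totals: Dict[object, int] = defaultdict(int)
    rejected: Dict[object, int] = defaultdict(int)
    for rec in records:
        scope_value = rec[scope_field]
        totals[scope_value] += 1
        if rec["status"] == 429:
            rejected[scope_value] += 1
    return {value: rejected[value] / totals[value] for value in totals}


# Hypothetical records for two keys in two groups.
sample = [
    {"key": "batch-key", "group": "analytics", "status": 429},
    {"key": "batch-key", "group": "analytics", "status": 200},
    {"key": "batch-key", "group": "analytics", "status": 429},
    {"key": "batch-key", "group": "analytics", "status": 200},
    {"key": "chat-key", "group": "support", "status": 200},
    {"key": "chat-key", "group": "support", "status": 200},
]
print(rejection_share(sample, "key"))  # {'batch-key': 0.5, 'chat-key': 0.0}
```

A sustained share like the 0.5 on the assumed `batch-key` would be the cue to revisit that key's limit or talk to its owning team.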
What a caller experiences when a limit is hit
When a request would breach an active limit, the gateway rejects it with an HTTP 429 Too Many Requests response rather than forwarding it upstream. The rejection is immediate and applies only to the request that crossed the threshold; once the window advances or in-flight requests drain, traffic flows again. A 429 is therefore a transient, recoverable signal, not a hard failure, though to a client that does not handle it, it presents the same way as an error.
Applications are expected to back off and retry rather than fail or retry immediately:
- Retry the request after a short delay rather than treating the `429` as fatal.
- Use exponential back-off with jitter (increasing the delay on each successive `429`, with a small random offset) so that many clients hitting the limit at once do not retry in lockstep and re-create the burst.
- Respect any retry-after guidance the response carries, where the client library surfaces it.
- Cap the number of retries so that a sustained limit does not turn into an unbounded retry loop, which is itself a source of the RPM pressure the limit exists to contain.
Most current AI client SDKs implement back-off of this kind by default, so a well-behaved application typically experiences a hit limit as added latency rather than as a visible error. The patterns matter most for custom integrations and for high-concurrency batch workloads, which are covered in Run batch and long-running jobs.
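For a custom integration, the back-off pattern listed above might look like the following. This is a minimal sketch under stated assumptions: `send` is a hypothetical callable that returns a `(status, retry_after_seconds, body)` tuple, and the base delay, cap, and jitter fraction are illustrative values, not platform recommendations.

```python
import random
import time
from typing import Callable, Optional, Tuple

Response = Tuple[int, Optional[float], object]  # (status, retry_after_seconds, body)


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  retry_after: Optional[float] = None) -> float:
    """Delay before the next retry; attempt counts from 0."""
    if retry_after is not None:
        return max(0.0, retry_after)          # server guidance takes precedence
    exponential = min(cap, base * (2 ** attempt))
    # Small random offset so concurrent clients do not retry in lockstep.
    return exponential + random.uniform(0.0, exponential * 0.25)


def call_with_backoff(send: Callable[[], Response], max_retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> Tuple[int, object]:
    """Call send(), retrying on 429 up to max_retries times, then give up."""
    for attempt in range(max_retries + 1):
        status, retry_after, body = send()
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(backoff_delay(attempt, retry_after=retry_after))
    raise RuntimeError("rate limited: retries exhausted")
```

The retry cap is what keeps a sustained limit from turning into an unbounded loop; once it is reached, the caller surfaces the failure instead of adding more RPM pressure.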
What to do next
- Set spend caps to complement the flow control established here. See Manage budgets for users and groups and Working with Budgets.
- Confirm group attribution is correct before relying on per-group limits. See Map Entra ID groups to business functions.
- Watch the effect of the limits on live traffic. See Monitor traffic and usage.
- Apply the back-off and isolation patterns to high-concurrency workloads. See Run batch and long-running jobs.
Where to go next