Sizing and scale
This page is a capacity reference for planning a Tetrate Agent Router deployment against a concrete scale target. It derives an approximate workload from a stated set of non-functional requirements, lists the dimensions that drive sizing, and describes the method used to translate those dimensions into a data-plane footprint. The figures are planning approximations intended to frame a load test; they are not capacity guarantees. Exact pod counts, replica sizes, and provider quotas should be validated against load testing of the platform on the target infrastructure before any number is committed. The architecture referenced throughout is described in Architecture Overview; the customer-managed data plane contains the Controller and the request proxy, while the Tetrate-hosted management plane stores routing rules, policies, and user configuration.
Reference workload
The reference target for this deployment is:
| Dimension | Target |
|---|---|
| Users | ~60,000 |
| Models | ~25 |
| LLM calls per month | ~1.9 million |
A monthly call volume converts to an average request rate as follows. A 30-day month contains roughly 2.6 million seconds (30 × 24 × 3600). Dividing 1.9 million calls by that figure gives an average of approximately 0.7 requests per second across the whole month.
| Derived quantity | Approximate value |
|---|---|
| Calls per month | 1,900,000 |
| Seconds per 30-day month | 2,592,000 |
| Average requests per second | ~0.7 |
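The conversion above can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic only; the function name `average_rps` is illustrative and not part of the product.

```python
SECONDS_PER_DAY = 24 * 3600


def average_rps(calls_per_month: int, days_in_month: int = 30) -> float:
    """Average requests per second if calls were spread evenly over the month."""
    return calls_per_month / (days_in_month * SECONDS_PER_DAY)


if __name__ == "__main__":
    # 1,900,000 calls over 2,592,000 seconds
    print(f"{average_rps(1_900_000):.2f} requests/s")  # prints "0.73 requests/s"
```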
An average of under one request per second is a modest steady-state load. Provisioning to the average would be a mistake. Real traffic from 60,000 users is not uniform across the month; it concentrates into working hours, time zones, and bursts driven by application behaviour. The peak is the figure that sizes the deployment: the peak request rate can sit one to two orders of magnitude above the monthly average, and peak concurrency follows from it.
The peaking factor, the ratio of peak rate to average rate, is the single most important unknown in this exercise. It depends on usage patterns that only the field and SME team can confirm for this engagement. A common planning approach is to assume traffic concentrates into a fraction of the day and to size against that window rather than the 30-day average:
| Concentration assumption | Effective window | Approximate peak requests per second |
|---|---|---|
| Traffic spread evenly across 30 days | 2,592,000 s | ~0.7 |
| Traffic within an 8-hour working day, 22 working days | 633,600 s | ~3 |
| Traffic within a 2-hour daily peak, 22 working days | 158,400 s | ~12 |
| Bursty interactive load (short spikes) | Not applicable | Well above the windowed figures |
These rows illustrate sensitivity to assumptions, not a prediction. The correct peaking factor for this engagement must be supplied by the field/SME team and confirmed against observed traffic before sizing is finalised.
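The windowed rows in the table can be recomputed as follows. This is a sketch of the sensitivity arithmetic, not a traffic model; the function names are illustrative, and the windows are the planning assumptions from the table rather than observed data.

```python
CALLS_PER_MONTH = 1_900_000


def windowed_rps(calls_per_month: int, hours_per_day: float, active_days: int) -> float:
    """Request rate if all monthly calls fall inside the stated daily window."""
    window_seconds = hours_per_day * 3600 * active_days
    return calls_per_month / window_seconds


def peaking_factor(peak_rps: float, average_rps: float) -> float:
    """Ratio of the peak request rate to the monthly average rate."""
    return peak_rps / average_rps


if __name__ == "__main__":
    baseline = windowed_rps(CALLS_PER_MONTH, 24, 30)
    scenarios = {
        "even across 30 days": (24, 30),
        "8-hour day, 22 working days": (8, 22),
        "2-hour peak, 22 working days": (2, 22),
    }
    for label, (hours, days) in scenarios.items():
        rate = windowed_rps(CALLS_PER_MONTH, hours, days)
        factor = peaking_factor(rate, baseline)
        print(f"{label}: {rate:.1f} req/s, peaking factor {factor:.1f}")
```

Under these assumptions the two working-hours rows correspond to peaking factors of roughly 4 and 16 relative to the 30-day average.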
Sizing dimensions
Request rate alone does not size a deployment. The dimensions below jointly determine the data-plane footprint and the provider-side quota required.
| Dimension | Why it matters |
|---|---|
| Concurrent requests | The number of in-flight requests at peak, not the monthly count, drives memory and connection-pool sizing. Long-running streaming requests hold resources for their full duration |
| Tokens per request | Larger prompts and completions increase per-request CPU, memory, and provider latency. Token distribution, not just request count, determines provider throughput and cost |
| Streaming vs. non-streaming | Streaming responses hold a connection open for the full generation. A workload that is predominantly streaming sustains far more concurrent connections than the same request rate served as discrete responses |
| Number of models and providers | The ~25 models map to a set of upstream providers, each with its own rate limits, latency profile, and quota. Routing and fallback configuration grows with this count |
| Regions | A multi-region deployment multiplies the data-plane footprint and introduces cross-region latency. Each region is sized for its own share of peak concurrency |
Data-plane sizing
The data plane scales horizontally. Both the Controller and the routing proxy add capacity by adding replicas rather than by enlarging a single instance.
- The Agent Router (the proxy component) handles the request data path. Its capacity is governed by peak concurrent connections and per-request token volume. Replicas are added behind a load balancer to absorb peak concurrency, with headroom held in reserve above the expected peak.
- The Controller manages configuration and the control path. It scales with the number of distinct routing rules, API keys, models, and providers rather than with raw request rate.
The sizing method is the same regardless of the target:
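- Establish the peak concurrent-request figure from the average rate, the agreed peaking factor, and the expected request duration. In-flight requests are approximately the peak arrival rate multiplied by the mean time a request stays open.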
- Characterise the workload mix (token distribution and the streaming/non-streaming split) because these set per-connection resource cost.
- Measure the throughput of a single Agent Router replica on the target infrastructure under a representative load profile.
- Divide peak concurrency by the measured per-replica capacity, expressed in the same unit, to obtain a replica count, then add headroom for failover, rolling upgrades, and traffic above the modelled peak.
Per-replica throughput is a function of the instance type, the workload mix, and the platform version. For that reason this page does not state a per-pod request rate. The figure must be measured against the deployment, not assumed.
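The method can be sketched in Python as below. Every numeric input in the example is a hypothetical placeholder, not a measured or recommended value: the mean request duration, the per-replica capacity, and the headroom fraction must come from the workload characterisation and the load test. The sketch assumes per-replica capacity is expressed as sustained concurrent requests so that it shares a unit with peak concurrency, and it uses Little's law to move from rate to concurrency.

```python
import math


def peak_concurrency(peak_rps: float, mean_duration_s: float) -> float:
    """Little's law: in-flight requests = arrival rate x mean time in system."""
    return peak_rps * mean_duration_s


def replica_count(
    peak_concurrent: float,
    per_replica_concurrent: float,
    headroom_fraction: float,
    min_replicas: int = 2,
) -> int:
    """Replicas needed for the peak, plus headroom, never below a floor."""
    base = math.ceil(peak_concurrent / per_replica_concurrent)
    with_headroom = math.ceil(base * (1 + headroom_fraction))
    return max(with_headroom, min_replicas)


if __name__ == "__main__":
    # All four inputs are hypothetical placeholders for illustration only.
    concurrent = peak_concurrency(peak_rps=12, mean_duration_s=20)
    replicas = replica_count(
        concurrent, per_replica_concurrent=100, headroom_fraction=0.5
    )
    print(f"peak concurrency {concurrent:.0f}, replicas {replicas}")
```

The `min_replicas` floor is an illustrative design choice so that a small computed count still leaves a spare replica during a rolling upgrade; the appropriate floor is a deployment decision.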
Throughput and latency
The gateway adds minimal overhead to each request. End-to-end latency is dominated by the upstream provider's model-inference time, which is typically orders of magnitude larger than the gateway's internal processing. Sizing for latency is therefore mostly a question of provider behaviour and provider-side quota, not of gateway capacity.
| Latency component | Relative contribution | Notes |
|---|---|---|
| Gateway processing | Small | Routing resolution, policy evaluation, and normalisation. Observable as the difference between gateway latency and backend latency in the metric stream |
| Provider inference | Dominant | Time the upstream provider spends generating the response. Scales with output token count and provider load |
| Network | Variable | Cross-region hops add to this component in a multi-region deployment |
The metric families that separate these components (request latency, backend latency, and time to first token) are documented in OTel Metrics. Because provider latency dominates, the practical latency ceiling at a given concurrency is set by how many in-flight requests the chosen providers can serve without queuing, which ties back to provider-side rate limits.
The gateway's normalisation, fallback, and error semantics under load are described in Gateway Behavior. Fallback chains add latency on the failed leg of a walk and should be accounted for when modelling worst-case latency at peak.
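A worst-case latency figure for a fallback walk can be approximated as below. This is a sketch under two assumptions that are not confirmed by this page: legs are tried one after another, and each failed leg costs up to its timeout before the next leg starts. The timeout and latency values are hypothetical; the actual semantics are those described in Gateway Behavior.

```python
from typing import Sequence


def worst_case_latency_s(
    failed_leg_timeouts_s: Sequence[float], final_leg_latency_s: float
) -> float:
    """Upper bound: every failed leg runs to its timeout, then one leg succeeds."""
    return sum(failed_leg_timeouts_s) + final_leg_latency_s


if __name__ == "__main__":
    # Hypothetical: two providers time out at 30 s each, the third answers in 8 s.
    print(worst_case_latency_s([30.0, 30.0], 8.0))  # prints 68.0
```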
Headroom for rate limits and budgets
At 60,000 users the platform's own rate-limit and budget controls are a sizing concern in their own right, separate from data-plane capacity.
- Per-key and per-group rate limits should be set with headroom above expected peak so that legitimate peak traffic is not throttled, while still bounding runaway clients. Limit configuration is covered in Set Rate Limits.
- Upstream provider quotas are a hard ceiling that the platform cannot exceed. The sum of configured per-key limits across all keys should be reconciled against the aggregate provider quota for each provider, with headroom retained for fallback traffic redirected from a degraded provider.
- Budget controls bound spend rather than throughput, but at this user count a budget set too tight will reject traffic that capacity could otherwise serve. Budgets and limits should be modelled together.
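The reconciliation of per-key limits against a provider quota can be sketched as a simple ratio check. The class, field names, and numbers below are hypothetical and do not correspond to any platform configuration schema; limits and quota are assumed to be expressed in the same unit (requests per minute here).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProviderPlan:
    name: str
    quota_rpm: float          # upstream provider quota, requests per minute
    fallback_reserve: float   # fraction of quota held back for fallback traffic
    key_limits_rpm: List[float] = field(default_factory=list)

    def usable_quota_rpm(self) -> float:
        """Quota left for direct traffic after the fallback reserve."""
        return self.quota_rpm * (1.0 - self.fallback_reserve)

    def commitment_ratio(self) -> float:
        """Sum of per-key limits relative to usable quota; above 1.0 is oversubscribed."""
        return sum(self.key_limits_rpm) / self.usable_quota_rpm()


if __name__ == "__main__":
    # Hypothetical provider and limits, for illustration only.
    plan = ProviderPlan("provider-a", quota_rpm=1000, fallback_reserve=0.2,
                        key_limits_rpm=[300, 300, 300])
    ratio = plan.commitment_ratio()
    status = "review" if ratio > 1.0 else "ok"
    print(f"{plan.name}: commitment ratio {ratio:.3f} ({status})")
```

A ratio above 1.0 is not necessarily wrong, since keys rarely peak together, but it marks a provider where simultaneous peaks could reach the quota ceiling.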
Observability at scale
Telemetry volume and metric cardinality grow with users, models, and API keys. The metric and trace schema is documented in OTel Metrics; the export configuration is covered in Export Telemetry to an Observability Stack.
Two cardinality drivers matter at this scale:
- High-cardinality metric labels (`api_key_id` and `requested_model`) multiply the number of distinct time series the metrics backend must store. With ~25 models and a large key population, per-key label cardinality is the dominant cost. Where the backend cannot absorb it, the per-key dimensions can be left out of the exported labels and read instead from the in-Console Usage Analytics surface, which aggregates them internally.
- Trace volume scales directly with request rate. At peak, exporting every trace can overwhelm the destination backend and inflate cost. Trace sampling below 100% is the standard mitigation for high-volume deployments. Metrics are not sampled and remain the complete record; sampling affects only the trace stream. Sampling behaviour is detailed in OTel Metrics.
The observability backend should be sized for peak time-series count and peak trace ingest, using the same peaking factor applied to the data plane.
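A rough upper bound on time-series count and sampled trace rate can be estimated as below. The number of metric families, the key population, and the sampling rate are hypothetical inputs. The estimate assumes every key uses every model and ignores other labels and histogram buckets, so it is a coarse planning figure, not a statement about the actual metric schema.

```python
def series_upper_bound(
    metric_families: int, models: int, api_keys: int, include_key_label: bool = True
) -> int:
    """Upper bound on distinct series: one per family x model x (key, if exported)."""
    per_family = models * (api_keys if include_key_label else 1)
    return metric_families * per_family


def exported_traces_per_second(peak_rps: float, sample_rate: float) -> float:
    """Expected trace export rate at peak for a given sampling fraction (0.0 to 1.0)."""
    return peak_rps * sample_rate


if __name__ == "__main__":
    # Hypothetical: 10 metric families, 25 models, 5,000 API keys.
    print(series_upper_bound(10, 25, 5_000))                           # prints 1250000
    print(series_upper_bound(10, 25, 5_000, include_key_label=False))  # prints 250
    print(f"{exported_traces_per_second(12, 0.10):.1f} traces/s at 10% sampling")
```

The gap between the two series counts illustrates why dropping the per-key label from exported metrics is the larger lever when the backend is cardinality-bound.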
Capacity questions to confirm
The figures on this page rest on assumptions that the field and SME team should confirm before any number is committed:
- What is the peaking factor (the ratio of peak request rate to the monthly average) for this engagement, and over what window does traffic concentrate?
- What fraction of requests is streaming, and what is the expected token distribution per request?
- How many upstream providers serve the ~25 models, and what is the rate limit and quota for each?
- Is the deployment single-region or multi-region, and how is the 60,000-user population distributed across regions?
- What infrastructure (instance types, node sizing) will host the data plane, so that per-replica throughput can be measured rather than assumed?
- What is the metrics backend's tolerance for high-cardinality labels, and what trace sampling rate is acceptable at peak?
- What headroom margin above modelled peak is required for failover and rolling upgrades?
Each answer narrows a planning approximation toward a number that can be validated by a load test against the target environment.
Related
- Architecture Overview: data plane and management plane components
- Gateway Behavior: error semantics and per-request data under load
- OTel Metrics: metric families, labels, and sampling
- Set Rate Limits: per-key and per-group limit configuration
- Export Telemetry to an Observability Stack: telemetry export configuration