Sizing and scale

This page is a capacity reference for planning a Tetrate Agent Router deployment against a concrete scale target. It derives an approximate workload from a stated set of non-functional requirements, lists the dimensions that drive sizing, and describes the method used to translate those dimensions into a data-plane footprint. The figures are planning approximations intended to frame a load test; they are not capacity guarantees. Exact pod counts, replica sizes, and provider quotas should be validated by load testing the platform on the target infrastructure before any number is committed. The architecture referenced throughout is described in Architecture Overview; the customer-managed data plane contains the Controller and the request proxy, while the Tetrate-hosted management plane stores routing rules, policies, and user configuration.

Reference workload

The reference target for this deployment is:

Dimension	Target
Users	~60,000
Models	~25
LLM calls per month	~1.9 million

A monthly call volume converts to an average request rate as follows. A 30-day month contains roughly 2.6 million seconds (30 × 24 × 3600). Dividing 1.9 million calls by that figure gives an average of approximately 0.7 requests per second across the whole month.

Derived quantity	Approximate value
Calls per month	1,900,000
Seconds per 30-day month	2,592,000
Average requests per second	~0.7
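
The conversion in the table can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic only; the function name `average_rps` is illustrative and not part of any Tetrate tooling.

```python
SECONDS_PER_DAY = 24 * 3600


def average_rps(calls_per_month: int, days: int = 30) -> float:
    """Average request rate if calls were spread evenly across the month."""
    return calls_per_month / (days * SECONDS_PER_DAY)


rate = average_rps(1_900_000)
print(f"{rate:.2f} requests/s")  # prints "0.73 requests/s"
```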

An average of under one request per second is a modest steady-state load. Provisioning to the average would be a mistake. Real traffic from 60,000 users is not uniform across the month; it concentrates into working hours, time zones, and bursts driven by application behaviour. Peak concurrency is the figure that sizes the deployment, and it can sit one to two orders of magnitude above the monthly average.

The peaking factor, the ratio of peak rate to average rate, is the single most important unknown in this exercise. It depends on usage patterns that only the field and SME team can confirm for this engagement. A common planning approach is to assume traffic concentrates into a fraction of the day and to size against that window rather than the 30-day average:

Concentration assumption	Effective window	Approximate peak requests per second
Traffic spread evenly across 30 days	2,592,000 s	~0.7
Traffic within an 8-hour working day, 22 working days	633,600 s	~3
Traffic within a 2-hour daily peak, 22 working days	158,400 s	~12
Bursty interactive load (short spikes)	Not applicable	well above the figures above

These rows illustrate sensitivity to assumptions, not a prediction. The correct peaking factor for this engagement must be supplied by the field/SME team and confirmed against observed traffic before sizing is finalised.
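
The concentration rows can be recomputed for any candidate window. The sketch below assumes, as the table does, that the entire monthly volume lands inside the stated window; the helper names and the scenario labels are illustrative.

```python
CALLS_PER_MONTH = 1_900_000


def window_seconds(hours_per_day: int, days: int) -> int:
    """Length of the window into which the month's traffic is assumed to fall."""
    return hours_per_day * 3600 * days


def rate_in_window(calls: int, hours_per_day: int, days: int) -> float:
    """Request rate if all calls arrive evenly inside the window."""
    return calls / window_seconds(hours_per_day, days)


# (hours per day, days per month) for each concentration assumption
scenarios = {
    "spread evenly across 30 days": (24, 30),
    "8-hour working day, 22 working days": (8, 22),
    "2-hour daily peak, 22 working days": (2, 22),
}

average = rate_in_window(CALLS_PER_MONTH, 24, 30)
for label, (hours, days) in scenarios.items():
    rps = rate_in_window(CALLS_PER_MONTH, hours, days)
    # The peaking factor is the ratio of the windowed rate to the monthly average.
    print(f"{label}: {window_seconds(hours, days)} s, "
          f"{rps:.1f} req/s, peaking factor {rps / average:.1f}x")
```

Under these assumptions the two concentrated windows correspond to peaking factors of roughly 4x and 16x. Short interactive bursts are not captured by this model and need observed traffic data.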

Sizing dimensions

Request rate alone does not size a deployment. The dimensions below jointly determine the data-plane footprint and the provider-side quota required.

Dimension	Why it matters
Concurrent requests	The number of in-flight requests at peak, not the monthly count, drives memory and connection-pool sizing. Long-running streaming requests hold resources for their full duration
Tokens per request	Larger prompts and completions increase per-request CPU, memory, and provider latency. Token distribution, not just request count, determines provider throughput and cost
Streaming vs. non-streaming	Streaming responses hold a connection open for the full generation. A workload that is predominantly streaming sustains far more concurrent connections than the same request rate served as discrete responses
Number of models and providers	The ~25 models map to a set of upstream providers, each with its own rate limits, latency profile, and quota. Routing and fallback configuration grows with this count
Regions	A multi-region deployment multiplies the data-plane footprint and introduces cross-region latency. Each region is sized for its own share of peak concurrency
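
The interaction between request rate, request duration, and the streaming split can be approximated with Little's Law (in-flight requests = arrival rate × mean time in system). The sketch below is a rough estimate under that steady-state assumption; the durations and the streaming fraction are hypothetical placeholders, not measured values for this platform.

```python
def concurrent_requests(
    peak_rps: float,
    streaming_fraction: float,
    streaming_seconds: float,
    discrete_seconds: float,
) -> float:
    """Estimate in-flight requests at peak using Little's Law (L = lambda * W)."""
    mean_duration = (
        streaming_fraction * streaming_seconds
        + (1 - streaming_fraction) * discrete_seconds
    )
    return peak_rps * mean_duration


# Hypothetical inputs: 12 req/s at peak, 20 s streaming and 2 s discrete requests.
half_streaming = concurrent_requests(12, 0.5, 20, 2)
no_streaming = concurrent_requests(12, 0.0, 20, 2)
print(half_streaming, no_streaming)  # prints "132.0 24.0"
```

With these placeholder durations, the same request rate holds several times more connections open when half the traffic streams, which is why the streaming split has to be confirmed before sizing.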

Data-plane sizing

The data plane scales horizontally. Both the Controller and the Agent Router proxy add capacity by adding replicas rather than by enlarging a single instance.

The Agent Router (the proxy component) handles the request data path. Its capacity is governed by peak concurrent connections and per-request token volume. Replicas are added behind a load balancer to absorb peak concurrency, with headroom held in reserve above the expected peak.
The Controller manages configuration and the control path. It scales with the number of distinct routing rules, API keys, models, and providers rather than with raw request rate.

The sizing method is the same regardless of the target:

Establish the peak concurrent-request figure from the average rate and the agreed peaking factor.
Characterise the workload mix (token distribution and the streaming/non-streaming split) because these set per-connection resource cost.
Measure the throughput of a single Agent Router replica on the target infrastructure under a representative load profile.
Divide peak concurrency by the per-replica capacity established in that measurement, expressed as concurrent requests so that the units match, to obtain a replica count, then add headroom for failover, rolling upgrades, and traffic above the modelled peak.

Per-replica throughput is a function of the instance type, the workload mix, and the platform version. For that reason this page does not state a per-pod request rate. The figure must be measured against the deployment, not assumed.
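
The final step of the method reduces to a small calculation once the measured inputs exist. The sketch below is illustrative only: `per_replica_capacity` stands in for the figure obtained from a load test and is deliberately a hypothetical number here, and the `min_replicas` floor of 2 is an assumption made for this example, not a documented platform requirement.

```python
import math


def replica_count(
    peak_concurrency: float,
    per_replica_capacity: float,
    headroom_fraction: float,
    min_replicas: int = 2,  # assumed floor so one replica can fail or be upgraded
) -> int:
    """Replicas needed to carry peak concurrency plus a headroom margin."""
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must come from a load test and be > 0")
    required = peak_concurrency * (1 + headroom_fraction) / per_replica_capacity
    return max(math.ceil(required), min_replicas)


# Hypothetical inputs: 132 in-flight requests at peak, 50 per replica, 50% headroom.
print(replica_count(132, 50, 0.5))  # prints "4"
```

Both quantities must be in the same unit (concurrent requests). If the load test reports requests per second instead, convert the peak figure to the same unit before dividing.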

Throughput and latency

The gateway adds minimal overhead to each request. End-to-end latency is dominated by the upstream provider's model-inference time, which is typically orders of magnitude larger than the gateway's internal processing. Sizing for latency is therefore mostly a question of provider behaviour and provider-side quota, not of gateway capacity.

Latency component	Relative contribution	Notes
Gateway processing	Small	Routing resolution, policy evaluation, and normalisation. Observable as the difference between gateway latency and backend latency in the metric stream
Provider inference	Dominant	Time the upstream provider spends generating the response. Scales with output token count and provider load
Network	Variable	Cross-region hops add to this component in a multi-region deployment

The metric families that separate these components (request latency, backend latency, and time to first token) are documented in OTel Metrics. Because provider latency dominates, the practical latency ceiling at a given concurrency is set by how many in-flight requests the chosen providers can serve without queuing, which ties back to provider-side rate limits.
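
Gateway overhead can be read as the difference between end-to-end request latency and backend latency. The sketch below shows that subtraction on made-up sample values; the function names are illustrative, and the actual metric names and units should be taken from OTel Metrics.

```python
def gateway_overhead_ms(request_latency_ms: float, backend_latency_ms: float) -> float:
    """Time attributable to the gateway: total latency minus provider latency."""
    return request_latency_ms - backend_latency_ms


def overhead_share(request_latency_ms: float, backend_latency_ms: float) -> float:
    """Gateway overhead as a fraction of end-to-end latency."""
    return gateway_overhead_ms(request_latency_ms, backend_latency_ms) / request_latency_ms


# Made-up sample: a 4,000 ms request of which the provider accounts for 3,980 ms.
print(gateway_overhead_ms(4000.0, 3980.0))       # prints "20.0"
print(f"{overhead_share(4000.0, 3980.0):.1%}")   # prints "0.5%"
```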

The gateway's normalisation, fallback, and error semantics under load are described in Gateway Behavior. When a request walks a fallback chain, each failed leg adds its own latency before the next provider is tried, so fallback should be accounted for when modelling worst-case latency at peak.
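
A simple way to bound worst-case latency is to add the time spent on each failed leg to the latency of the leg that finally succeeds. The sketch below assumes legs are attempted one after another and that each failed leg costs a known amount of time (for example, a timeout); both are modelling assumptions, and the actual semantics are those described in Gateway Behavior.

```python
from typing import Sequence


def worst_case_latency_s(failed_leg_seconds: Sequence[float], final_leg_seconds: float) -> float:
    """Total latency when every leg but the last fails, attempted sequentially."""
    return sum(failed_leg_seconds) + final_leg_seconds


# Hypothetical chain: two providers fail after 10 s each, the third answers in 8 s.
print(worst_case_latency_s([10.0, 10.0], 8.0))  # prints "28.0"
```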

Headroom for rate limits and budgets

At 60,000 users the platform's own rate-limit and budget controls are a sizing concern in their own right, separate from data-plane capacity.

Per-key and per-group rate limits should be set with headroom above expected peak so that legitimate peak traffic is not throttled, while still bounding runaway clients. Limit configuration is covered in Set Rate Limits.
Upstream provider quotas are a hard ceiling that the platform cannot exceed. The sum of configured per-key limits across all keys should be reconciled against the aggregate provider quota for each provider, with headroom retained for fallback traffic redirected from a degraded provider.
Budget controls bound spend rather than throughput, but at this user count a budget set too tight will reject traffic that capacity could otherwise serve. Budgets and limits should be modelled together.
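
The reconciliation of per-key limits against a provider quota can be sketched as a simple check. The unit (requests per minute), the reserve fraction, and all values below are hypothetical; real limits and quotas come from the platform configuration and the provider agreement.

```python
from typing import Sequence, Tuple


def reconcile_quota(
    per_key_limits_rpm: Sequence[float],
    provider_quota_rpm: float,
    fallback_reserve_fraction: float,
) -> Tuple[float, float, bool]:
    """Compare the sum of per-key limits with the quota left after a fallback reserve.

    Returns (committed, usable, fits), where `fits` is True when the committed
    limits stay within the usable share of the provider quota.
    """
    committed = sum(per_key_limits_rpm)
    usable = provider_quota_rpm * (1 - fallback_reserve_fraction)
    return committed, usable, committed <= usable


# Hypothetical: 100 keys at 60 rpm each, a 10,000 rpm quota, 25% held for fallback.
committed, usable, fits = reconcile_quota([60] * 100, 10_000, 0.25)
print(committed, usable, fits)  # prints "6000 7500.0 True"
```

A result of `False` does not necessarily mean the configuration is wrong, since not all keys peak together, but it flags a provider where legitimate peak traffic could hit the quota ceiling.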

Observability at scale

Telemetry volume and metric cardinality grow with users, models, and API keys. The metric and trace schema is documented in OTel Metrics; the export configuration is covered in Export Telemetry to an Observability Stack.

Two cardinality drivers matter at this scale:

High-cardinality metric labels (api_key_id and requested_model) multiply the number of distinct time series the metrics backend must store. With ~25 models and a large key population, per-key label cardinality is the dominant cost. Where the backend cannot absorb it, the per-key dimensions can be left out of the exported labels and read instead from the in-Console Usage Analytics surface, which aggregates them internally.
Trace volume scales directly with request rate. At peak, exporting every trace can overwhelm the destination backend and inflate cost. Trace sampling below 100% is the standard mitigation for high-volume deployments. Metrics are not sampled and remain the complete record; sampling affects only the trace stream. Sampling behaviour is detailed in OTel Metrics.

The observability backend should be sized for peak time-series count and peak trace ingest, using the same peaking factor applied to the data plane.
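
Both drivers can be estimated before the backend is chosen. The sketch below gives an upper bound on time-series count as the product of label cardinalities and a sampled trace rate at peak. The key count, the number of metric families, and the sampling ratio are hypothetical inputs; histogram buckets and any additional labels would raise the series count further.

```python
import math
from typing import Dict


def max_series(label_cardinalities: Dict[str, int], metric_families: int) -> int:
    """Upper bound on time series: every label combination, for every metric family."""
    return math.prod(label_cardinalities.values()) * metric_families


def sampled_traces_per_second(peak_rps: float, sampling_ratio: float) -> float:
    """Trace export rate at peak for a given head-sampling ratio (0.0 to 1.0)."""
    return peak_rps * sampling_ratio


# Hypothetical: 5,000 API keys, 25 models, 3 metric families.
labels = {"api_key_id": 5_000, "requested_model": 25}
print(max_series(labels, 3))                           # prints "375000"
print(max_series({"requested_model": 25}, 3))          # prints "75"
print(f"{sampled_traces_per_second(12, 0.1):.1f}")     # prints "1.2"
```

The second call shows the effect of leaving the per-key label out of the exported metrics: the bound drops from hundreds of thousands of series to tens.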

Capacity questions to confirm

The figures on this page rest on assumptions that the field and SME team should confirm before any number is committed:

What is the peaking factor (the ratio of peak request rate to the monthly average) for this engagement, and over what window does traffic concentrate?
What fraction of requests is streaming, and what is the expected token distribution per request?
How many upstream providers serve the ~25 models, and what is the rate limit and quota for each?
Is the deployment single-region or multi-region, and how is the 60,000-user population distributed across regions?
What infrastructure (instance types, node sizing) will host the data plane, so that per-replica throughput can be measured rather than assumed?
What is the metrics backend's tolerance for high-cardinality labels, and what trace sampling rate is acceptable at peak?
What headroom margin above modelled peak is required for failover and rolling upgrades?

Each answer narrows a planning approximation toward a number that can be validated by a load test against the target environment.

Related

Architecture Overview: data plane and management plane components
Gateway Behavior: error semantics and per-request data under load
OTel Metrics: metric families, labels, and sampling
Set Rate Limits: per-key and per-group limit configuration
Export Telemetry to an Observability Stack: telemetry export configuration

Where to go next

Plan high availability and disaster recovery

Build resilience and failover into the sized deployment.

Data plane installation

Deploy the data plane on the target infrastructure.
