Working with budgets

AI traffic has a habit of growing without anybody noticing until the invoice arrives. A research team starts a benchmark, leaves it running over a weekend, and bills three weeks of normal spend in two days. A new agent framework gets wired into a production path and starts making ten calls where a human would have made one. A leaked credential lands in a public repository and an attacker spends a few hours mining it before the security team notices. Each of these scenarios is preventable, but only if there is a clear answer to two questions: how much should this thing be allowed to spend, and who is responsible for noticing when it spends more.


The platform's answer to those questions is layered. There is no single "budgets" screen that owns the entire mechanism end-to-end; instead, the budgeting story is the combination of three things that already exist: per-key rate limits, which enforce hard ceilings on token consumption inline at the gateway; Usage Analytics, which is where spending against those ceilings becomes visible after the fact; and the API-key-per-purpose convention that makes the first two tools precise rather than blurry. This guide covers how to combine those mechanisms into an actual budgeting discipline, when each piece is the right tool, and the operational patterns that work across the most common spending scenarios.

Persona: Platform operator working in the Admin Dashboard, often in partnership with the developer teams that own specific API keys.

Estimated time: 15 to 25 minutes for an initial setup; ongoing as workloads evolve.

When this guide applies

Budgets are the right concern in any of these situations:

| Situation | Combination of mechanisms that fits |
| --- | --- |
| A production application's spend should not exceed a known monthly ceiling | Per-key rate limits with usage monitoring |
| A research or evaluation team should not run away with the bill | Aggressive per-key rate limits on dedicated keys |
| Each team's spend should be attributable to that team for chargeback | Per-team API keys with usage analytics by key |
| The platform's overall spend across all consumers needs to stay under a contractual ceiling | Sum of per-key limits, plus alerting on the aggregate via OTEL export |
| A leaked credential needs to be bounded in damage even before it is revoked | A per-key rate limit set at the consumer's known traffic level rather than at infinity |

For BYOK consumers (developers who present their own provider credentials, as described in Use Your Own Provider Credentials), spend is attributed to the BYOK provider account directly, not to the platform's own usage records. Budgeting for that traffic is the responsibility of the team that owns the BYOK account; the patterns in this guide cover platform-managed traffic only.

Outcomes

By the end of this guide:

  • At least one API key carries a meaningful rate-limit configuration aligned with the expected workload.
  • Usage Analytics is being read on a regular cadence, with per-key attribution as the unit of analysis.
  • The per-key naming and segregation conventions established in Onboard Developers and Issue Keys are in place, so usage data is precise.
  • The combination of rate limits and usage monitoring is understood as the platform's budgeting story, with telemetry export as the path to richer dashboarding for teams that need it.

Prerequisites

  • Administrator access to the Admin Dashboard: typically the super_admin or billing_admin role.
  • API keys that follow the per-purpose convention. A single platform-wide key that everyone uses makes per-purpose budgeting effectively impossible; the work this guide describes assumes the convention is already in place.
  • Coordination with the developer teams that own the affected keys. Rate limits are enforced inline, so a misconfigured limit produces visible production impact.

Step 1: decide the budgeting strategy

There are two ways to use the platform's mechanisms as a budgeting tool, and the right choice depends on the consumer.

| Strategy | What it does | When to use it |
| --- | --- | --- |
| Hard ceilings | Set per-key rate limits at the expected traffic plus a small margin. Requests beyond the limit receive a 429 response inline | Production workloads where unbounded spend is unacceptable; experimental keys where a runaway should be cut off automatically |
| Soft monitoring | Leave rate limits off or set very generously, but watch Usage Analytics actively and act when traffic moves outside expectations | Trusted production workloads where occasional bursts are legitimate; cases where false-positive rate limits would be worse than occasional overspend |

Most platforms end up using a mix. Production keys for critical applications get soft monitoring, because a sudden 429 cascade is the wrong failure mode for them. Research, evaluation, and CI keys get hard ceilings, because a runaway there costs more than an occasional 429.

Step 2: configure per-key rate limits

Per-key rate limits are configured on the Console side, by the developer who owns the key, but the operator-side decision about which keys should carry which limits is what makes the mechanism useful. In a typical workflow, the operator establishes the policy (for example, "research keys cap at 100K tokens per hour") and the developer applies it to the relevant keys.

For the operator side of that conversation, the limits available on each key are:

| Limit | What it caps |
| --- | --- |
| Total Tokens | Combined input plus output tokens per rolling hour |
| Input Tokens | Input (prompt) tokens per rolling hour |
| Output Tokens | Output (completion) tokens per rolling hour |

Any combination of the three can be enabled; limits that are not enabled are not enforced. A few useful patterns:

  • Total Tokens only. The simplest configuration. A single ceiling on combined consumption, easy to reason about. Appropriate when the cost shape is roughly proportional to total tokens.
  • Output Tokens primary, Total Tokens secondary. Output tokens are typically more expensive than input tokens. Setting an explicit output-token cap with a permissive total-token cap controls cost more precisely.
  • Input Tokens only. Less common but useful for workloads that send extremely large contexts: a misconfigured RAG pipeline that retrieves the entire corpus on every request is the classic case.
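
To see how these patterns translate into a spend bound, the following sketch computes the worst-case cost a key can incur in any rolling hour given its enabled limits. The `KeyLimits` class, the function name, and the per-million-token prices are hypothetical and chosen for illustration; they are not part of the platform, and real provider pricing will differ.

```python
from dataclasses import dataclass
from typing import Optional

HOURS_PER_30_DAY_MONTH = 24 * 30


@dataclass
class KeyLimits:
    """Per-key caps per rolling hour; None means the limit is not enabled."""
    total_tokens: Optional[int] = None
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None


def worst_case_hourly_cost(limits: KeyLimits,
                           input_usd_per_mtok: float,
                           output_usd_per_mtok: float) -> Optional[float]:
    """Upper bound on spend in any rolling hour, or None if spend is unbounded."""
    caps = {"input": limits.input_tokens, "output": limits.output_tokens}
    prices = {"input": input_usd_per_mtok, "output": output_usd_per_mtok}

    if limits.total_tokens is None:
        # Without a total cap, both directions must be capped to bound spend.
        if caps["input"] is None or caps["output"] is None:
            return None
        return sum(caps[k] * prices[k] for k in caps) / 1_000_000

    # With a total cap, the worst case spends the budget on the pricier
    # token type first, up to that type's own cap if one is enabled.
    remaining = limits.total_tokens
    cost = 0.0
    for kind in sorted(prices, key=prices.get, reverse=True):
        cap = caps[kind]
        used = remaining if cap is None else min(cap, remaining)
        cost += used * prices[kind] / 1_000_000
        remaining -= used
    return cost


# Hypothetical prices: 3 USD per million input tokens, 15 USD per million output.
total_only = KeyLimits(total_tokens=1_000_000)
output_primary = KeyLimits(total_tokens=1_000_000, output_tokens=200_000)
hourly_bound = worst_case_hourly_cost(output_primary, 3.0, 15.0)
monthly_bound = hourly_bound * HOURS_PER_30_DAY_MONTH
```

Under these assumed prices, adding the output-token cap lowers the hourly bound well below that of the total-only configuration, which is the effect the second pattern aims for. An input-only configuration returns `None` here because output spend is left uncapped.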

A rate limit operates on a rolling one-hour window. Each enabled limit is evaluated independently, and the request is rejected with a 429 Too Many Requests response if it would push any active limit over the threshold. The token counts include both the current request and everything consumed by the key in the preceding hour.
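
The evaluation described above can be sketched as a small in-memory limiter. This is an illustrative model, not the gateway's implementation: the class and method names are hypothetical, and it assumes the token counts for the current request are known at check time, which a real gateway may have to estimate or settle after the response.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional, Tuple

WINDOW_SECONDS = 3600  # rolling one-hour window


@dataclass
class RollingTokenLimiter:
    """Hypothetical model of per-key limits; None means not enabled."""
    total_limit: Optional[int] = None
    input_limit: Optional[int] = None
    output_limit: Optional[int] = None
    # Each event is (timestamp_seconds, input_tokens, output_tokens).
    _events: Deque[Tuple[float, int, int]] = field(default_factory=deque)

    def _evict(self, now: float) -> None:
        # Drop usage that has aged out of the preceding hour.
        while self._events and self._events[0][0] <= now - WINDOW_SECONDS:
            self._events.popleft()

    def check(self, now: float, input_tokens: int, output_tokens: int) -> int:
        """Return 200 and record the usage, or 429 without recording it."""
        self._evict(now)
        used_in = sum(e[1] for e in self._events) + input_tokens
        used_out = sum(e[2] for e in self._events) + output_tokens
        checks = [
            (self.input_limit, used_in),
            (self.output_limit, used_out),
            (self.total_limit, used_in + used_out),
        ]
        # Each enabled limit is evaluated independently; any breach rejects.
        if any(limit is not None and used > limit for limit, used in checks):
            return 429
        self._events.append((now, input_tokens, output_tokens))
        return 200


limiter = RollingTokenLimiter(total_limit=1000)
first = limiter.check(now=0, input_tokens=400, output_tokens=200)
rejected = limiter.check(now=10, input_tokens=300, output_tokens=200)
fits = limiter.check(now=10, input_tokens=300, output_tokens=100)
after_window = limiter.check(now=3601, input_tokens=400, output_tokens=200)
```

In this sketch a rejected request does not count against the window, and a request that lands exactly on the threshold is allowed; whether the platform treats either case the same way is not stated in this guide.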

The developer-side flow is documented in detail in Route Requests Across Providers under Rate Limiting; the configuration screen is reached by clicking Configure on a key in the Console's API Keys page.

Sizing the limit

Sizing rate limits is the part of this guide that requires the most judgment. Two failure modes are worth avoiding:

  • Limits set too tight. Production traffic hits the ceiling during normal operation, the workload starts seeing 429s, and the operator team gets paged. The cost of this failure mode is high because it looks like a platform outage from the consumer's side.
  • Limits set too loose. A runaway integration burns through a month's budget in an afternoon, and the rate limit was just a number that never fired. The cost of this failure mode is also high, but it surfaces later, in usage data rather than in an incident.

A reasonable default is to set the limit at two to three times the expected peak hourly consumption, adjusted for whatever margin the workload's burstiness requires. The expected peak is calculated from Usage Analytics: filter the by-API-key breakdown to the relevant key, look at the hourly traffic over the last week or two, and take the highest observed value as the baseline. The ceiling sits comfortably above that baseline but well below any spend level that would be catastrophic.

For brand-new workloads with no historical traffic, the right starting point is a deliberately tight limit. Production traffic will surface the real shape quickly, and the limit can be relaxed once the baseline is known. The reverse adjustment, discovering that a permissive limit was actually a security exposure, is much harder to recover from.
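
As a worked example of the sizing rule, the sketch below derives a suggested limit from historical usage. The record shape (a timestamp and a token count per request) and the function names are assumptions for illustration. It buckets by calendar hour, which can slightly understate the true rolling-hour peak, so the result is a starting point rather than a precise answer.

```python
import math
from collections import defaultdict
from datetime import datetime
from typing import Iterable, Tuple


def peak_hourly_tokens(records: Iterable[Tuple[datetime, int]]) -> int:
    """Highest token total observed in any calendar hour (0 if no history)."""
    buckets = defaultdict(int)
    for timestamp, tokens in records:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour] += tokens
    return max(buckets.values(), default=0)


def suggested_limit(peak: int, multiplier: float = 2.5, round_to: int = 1000) -> int:
    """Scale the observed peak and round up to a tidy multiple."""
    return math.ceil(peak * multiplier / round_to) * round_to


# Hypothetical usage for one key.
history = [
    (datetime(2024, 5, 6, 10, 5), 30_000),
    (datetime(2024, 5, 6, 10, 40), 12_000),
    (datetime(2024, 5, 6, 11, 15), 20_000),
]
peak = peak_hourly_tokens(history)
limit = suggested_limit(peak)
```

The default multiplier of 2.5 sits in the middle of the two-to-three-times range; a burstier workload would justify a value nearer 3.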

Step 3: monitor spend through usage analytics

Rate limits are the prevention. Usage Analytics is the observation. The two surfaces are mutually reinforcing: limits stop catastrophes, and monitoring catches the slower drifts that limits do not catch on their own.

The usage workflow that pairs well with this guide:

  1. Open Usage Analytics in the Admin Dashboard on a regular cadence: weekly is typical, daily for environments with high turnover.
  2. Apply a time range that matches the budgeting cadence (last 7 days, last 30 days, last billing cycle).
  3. Switch to the By API Key breakdown.
  4. Sort by cost descending.
  5. Review the top consumers. The shape of the list should match the operator's mental model of the platform; surprises in this view are the most common signal that something is worth investigating.

The breakdowns by user, by model, and by provider all support the same workflow at different levels of aggregation. By User answers "which team is spending the most"; By Provider answers "which contract is bearing the load"; By Model answers "which models are doing the actual work".

For longer-horizon reporting (chargeback statements, quarterly reviews, or compliance attestations), the Export function returns the underlying data as CSV, which can be processed in whatever billing or reporting system the organisation already runs. The audit trail for the budgeting decisions themselves (limit changes, key issuance, and key revocation) is captured in Audit Platform Activity and is the right place to attach the reasoning behind each rate-limit change for future reference.
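
The by-key review can also be reproduced against an exported CSV. The sketch below assumes the export contains `api_key_name` and `cost_usd` columns; those column names are hypothetical, so check the actual export header before adapting it.

```python
import csv
import io
from collections import defaultdict
from typing import List, Tuple


def top_consumers(csv_text: str, limit: int = 5) -> List[Tuple[str, float]]:
    """Sum cost per API key and return the largest spenders first."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Column names are assumptions about the export format.
        totals[row["api_key_name"]] += float(row["cost_usd"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]


# Hypothetical export contents.
sample_export = """api_key_name,cost_usd
research-eval,12.50
checkout-prod,3.25
research-eval,7.50
ci-nightly,1.00
"""
ranking = top_consumers(sample_export, limit=2)
```

The same aggregation can be keyed on a user, model, or provider column to mirror the other breakdowns, provided the export carries those fields.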

Step 4: export to a real dashboarding stack for ongoing oversight

The built-in Usage Analytics views are sufficient for periodic review and for active investigation. They are not the right place for ongoing, always-on visibility into spend trends across a large fleet of keys. For that, the OpenTelemetry export described in the developer-side Export Telemetry to an Observability Stack guide is the recommended path.

The platform emits per-request metrics with the dimensions that budgeting cares about: API key identifier, resolved model, provider, status, token counts. Pulled into a Grafana, Datadog, Honeycomb, or other observability stack, these dimensions support:

  • Continuous dashboards. Cost per key, per team, per model, refreshed on whatever cadence the observability platform supports.
  • Alerting. Rules that fire when a key's hourly cost exceeds a threshold, when a key's traffic shape changes unexpectedly, or when the aggregate platform spend crosses a contractual ceiling.
  • Cross-system correlation. Joining platform spend to the rest of the application stack so that a sudden cost spike can be tied to a specific application deployment, a specific feature release, or a specific incident.
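
The alerting rules above would normally be written in the observability platform's own rule language. As a language-neutral illustration of the logic, the following sketch evaluates a per-key hourly threshold and an aggregate ceiling over a batch of cost samples. The `Sample` shape, the thresholds, and the function name are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Sample(NamedTuple):
    """One cost observation derived from exported per-request metrics."""
    api_key: str
    hour: str       # for example "2024-05-06T10"
    cost_usd: float


def evaluate_alerts(samples: Iterable[Sample],
                    per_key_hourly_usd: Dict[str, float],
                    aggregate_ceiling_usd: float) -> List[str]:
    """Return alert messages for breached per-key and aggregate thresholds."""
    per_key_hour = defaultdict(float)
    total = 0.0
    for sample in samples:
        per_key_hour[(sample.api_key, sample.hour)] += sample.cost_usd
        total += sample.cost_usd

    alerts = []
    for (key, hour), cost in sorted(per_key_hour.items()):
        threshold = per_key_hourly_usd.get(key)
        # Keys without a configured threshold are not alerted on.
        if threshold is not None and cost > threshold:
            alerts.append(f"{key} spent {cost:.2f} USD in {hour} (threshold {threshold:.2f})")
    if total > aggregate_ceiling_usd:
        alerts.append(f"aggregate spend {total:.2f} USD exceeds ceiling {aggregate_ceiling_usd:.2f}")
    return alerts


samples = [
    Sample("research-eval", "2024-05-06T10", 40.0),
    Sample("research-eval", "2024-05-06T10", 25.0),
    Sample("checkout-prod", "2024-05-06T10", 10.0),
]
alerts = evaluate_alerts(samples, {"research-eval": 50.0}, aggregate_ceiling_usd=100.0)
```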

The split is intentional. Spending limits are enforced inline by the platform, where the cost of action is low. Spending oversight lives in the observability platform, where the cost of integrating with the rest of the organisation's tooling has already been paid.

Step 5: handle the edge cases

A few scenarios come up reliably enough to be worth addressing explicitly.

A leaked or suspected-compromised key

The fastest response is to revoke the key from the Admin API Keys surface; the mechanics are covered in Onboard Developers and Issue Keys. Revocation is immediate and irreversible.

A rate limit set at the key's normal traffic level is a useful secondary defence: even if the revocation is delayed by minutes, the limit caps the worst-case damage in the interim. Setting limits proactively on every production key, not just experimental ones, is partly a cost story and partly a security story.
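
The size of that cap follows from the rolling one-hour window: no hour-long interval can exceed the limit, so the exposure is bounded by the limit multiplied by the number of hours, rounded up, that the key stays live. A minimal sketch, with a hypothetical function name and figures:

```python
import math


def leak_exposure_tokens(hourly_limit: int, minutes_until_revoked: float) -> int:
    """Upper bound on tokens an attacker can consume before revocation."""
    # Each started hour can consume at most one full hourly limit.
    return hourly_limit * math.ceil(minutes_until_revoked / 60)


quick_response = leak_exposure_tokens(100_000, minutes_until_revoked=20)
slow_response = leak_exposure_tokens(100_000, minutes_until_revoked=90)
```

The bound is conservative because it ignores the legitimate traffic that also counts against the limit during the same period.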

A workload that legitimately needs to burst

Some workloads, such as end-of-month batch processing and scheduled report generation, are quiet most of the time and very loud occasionally. A rate limit sized for the quiet baseline will fire on the burst. Three options:

  1. Lift the limit during the burst window. Increase the rate limit before the scheduled batch and reduce it afterward. Operationally clumsy but reliable.
  2. Size the limit for the burst. Set the limit at the burst level and accept that the key has more headroom than usual during quiet periods.
  3. Split the workload onto multiple keys. A dedicated key for the bursty workload, with its own limit, and a separate key for steady-state traffic. The per-key isolation also makes the analytics cleaner.

The third option is usually best because it preserves the precision of the per-key story. The operational complication of issuing a second key is typically smaller than the complication of explaining why a single key's traffic shape is irregular.

A workload that should be paused entirely

For a temporary pause (a maintenance window, a vendor escalation, or a budget freeze), the right action is to revoke the API key and reissue when the pause ends. There is no "disable temporarily" state on a key; revocation is the available mechanism. The developer team coordinates with the operator on the timing.

For longer-term pauses, the user account itself can be marked inactive in the Users surface, which prevents further activity until the account is reactivated.

What to do next

The operator-side guides have now covered the full developer-facing surface from the operator's perspective: provisioning models and providers, onboarding developers, governing MCP access, auditing platform activity, communicating with users, configuring SSO, running multiple instances, and budgeting. From here: