OpenTelemetry metrics
The gateway emits two complementary streams of observability data. Trace data flows over OTLP to any OpenTelemetry-compatible backend; metrics are exposed by each data plane component on a Prometheus-compatible scrape endpoint, where they can be pulled by an existing metrics agent and forwarded as OTLP metrics by a collector if the destination requires it. This page documents the structure of both streams: the span model used for traces, the standard set of attributes that appear on every span and metric, and the metric families exposed by the gateway. For the configuration mechanics of trace export and the supported authentication modes, see Export Telemetry to an Observability Stack.
Trace structure
Every gateway request produces a single trace composed of nested spans. The span structure reflects the request's path through the data plane and the upstream provider.
| Span | Covers |
|---|---|
| Gateway request | The full lifecycle of the request inside the data plane, from arrival on the gateway endpoint to the final response delivered to the client |
| Provider routing | The time spent resolving the routing configuration and selecting an eligible backend |
| Model inference | The time the upstream provider spent generating the response |
For fallback chains, the trace includes one provider-routing span per backend attempted. A request that succeeds on the primary backend has a single provider-routing span; a request that falls through to a secondary backend has two, with the first marked as failed.
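The span layout for a fallback request can be sketched with a small in-memory model. This is an illustration only: the `Span` class, the lower-case span names, and the placement of the model-inference span under the successful provider-routing span are assumptions made for the example, not the gateway's exported schema.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical in-memory span; not the gateway's wire format."""
    name: str
    failed: bool = False
    children: list["Span"] = field(default_factory=list)


def routing_attempts(root: Span) -> list[Span]:
    """Provider-routing spans under the gateway-request span, in order."""
    return [s for s in root.children if s.name == "provider-routing"]


def used_fallback(root: Span) -> bool:
    """True when more than one backend was attempted."""
    return len(routing_attempts(root)) > 1


# A request whose primary backend failed and whose secondary backend succeeded.
trace = Span("gateway-request", children=[
    Span("provider-routing", failed=True),
    Span("provider-routing", children=[Span("model-inference")]),
])

print(used_fallback(trace))                          # two attempts were made
print([s.failed for s in routing_attempts(trace)])   # first attempt is marked failed
```

Counting provider-routing spans in this way gives the same signal as the fallback attribute on the gateway-request span, which can be useful when a backend only indexes span names.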
Standard span attributes
The following attributes appear on the gateway-request span and, where relevant, on the provider-routing and model-inference spans:
| Attribute | Type | Description |
|---|---|---|
| service.name | string | The service identifier configured on the OTel Export screen. Defaults to agent-router. |
| http.method | string | The HTTP method of the request (typically POST) |
| http.target | string | The request path, such as /v1/chat/completions |
| http.status_code | integer | The final HTTP status returned to the client |
| gateway.request_id | string | The gateway-assigned X-Request-ID UUID |
| gateway.client_request_id | string | The client-supplied X-Request-ID, echoed back as X-Client-Request-ID. Present only when the client supplied one |
| gateway.api_key_id | string | A non-secret identifier for the API key used. The raw key value is never exported |
| gateway.requested_model | string | The model identifier the client asked for, before routing resolution |
| gateway.resolved_model | string | The model that actually served the request, after routing resolution and any logical-name override |
| gateway.resolved_provider | string | The upstream provider that served the request |
| gateway.fallback_attempts | integer | The number of backends attempted and abandoned before the one that served the request. 0 indicates the primary backend served the request |
| llm.usage.input_tokens | integer | Input (prompt) token count |
| llm.usage.output_tokens | integer | Output (completion) token count |
| llm.usage.total_tokens | integer | Combined input plus output tokens |
| gateway.latency_ms | integer | End-to-end latency observed by the gateway, in milliseconds |
| gateway.time_to_first_token_ms | integer | Time from request receipt to the first response token, in milliseconds. Present for streaming responses |
Application-supplied custom headers are also forwarded as span attributes. Headers such as agent-session-id, user-id, or workflow-run-id appear on the trace under their original names and allow traces to be grouped by application-level concepts in the destination backend.
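As a rough illustration of grouping by an application-level header, the sketch below aggregates token usage per `agent-session-id` over a list of exported span attribute dictionaries. The sample values, the placeholder request IDs, and the helper names are hypothetical; a real backend would normally do this grouping in its own query language.

```python
from collections import defaultdict

# Hypothetical attributes from three gateway-request spans.
# Real gateway.request_id values are UUIDs; short placeholders are used here.
spans = [
    {"gateway.request_id": "req-1", "agent-session-id": "sess-a", "llm.usage.total_tokens": 120},
    {"gateway.request_id": "req-2", "agent-session-id": "sess-a", "llm.usage.total_tokens": 300},
    {"gateway.request_id": "req-3", "llm.usage.total_tokens": 80},  # client sent no custom header
]


def group_by_attribute(spans, key):
    """Group span attribute dicts by the value of one attribute, skipping spans without it."""
    groups = defaultdict(list)
    for attrs in spans:
        if key in attrs:
            groups[attrs[key]].append(attrs)
    return dict(groups)


def tokens_per_group(spans, key):
    """Sum llm.usage.total_tokens for each distinct value of the grouping attribute."""
    return {
        value: sum(a.get("llm.usage.total_tokens", 0) for a in members)
        for value, members in group_by_attribute(spans, key).items()
    }


print(tokens_per_group(spans, "agent-session-id"))
```

Spans that lack the header are left out of the grouping rather than collected under an empty key, which keeps ungrouped traffic from distorting per-session totals.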
Span events
Events on the gateway-request span record discrete moments in the request's lifecycle:
| Event | Recorded when |
|---|---|
| request.received | The gateway receives the request from the client |
| routing.resolved | The routing configuration has selected a backend |
| backend.attempted | A backend is dialled; this event repeats for fallback walks |
| backend.failed | A backend returned a retryable error and the gateway walked to the next backend |
| response.first_byte | The first response byte (or first SSE event) is sent to the client |
| response.completed | The response has been fully delivered to the client |
| request.failed | The request returned a non-2xx status to the client |
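The event sequence for a request that falls back once can be summarised with a few lines of code. The sketch below assumes events have already been read from the backend as `(name, timestamp_ms)` pairs; that representation, the timestamps, and the `summarize_events` helper are hypothetical.

```python
def summarize_events(events):
    """Summarise span events given as (name, timestamp_ms) tuples in recorded order."""
    first_seen = {}
    counts = {}
    for name, ts in events:
        first_seen.setdefault(name, ts)          # keep the earliest timestamp per event name
        counts[name] = counts.get(name, 0) + 1   # backend.* events may repeat
    start = first_seen["request.received"]
    first_byte = first_seen.get("response.first_byte")
    return {
        "backends_attempted": counts.get("backend.attempted", 0),
        "backends_failed": counts.get("backend.failed", 0),
        "first_byte_ms": None if first_byte is None else first_byte - start,
        "completed": "response.completed" in first_seen,
    }


# Hypothetical timeline: the primary backend fails, the secondary serves the request.
events = [
    ("request.received", 0),
    ("routing.resolved", 2),
    ("backend.attempted", 3),
    ("backend.failed", 120),
    ("backend.attempted", 121),
    ("response.first_byte", 340),
    ("response.completed", 900),
]

print(summarize_events(events))
```

The gap between the first `backend.attempted` and `response.first_byte` shows how much of the first-byte delay was spent on the failed attempt.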
OpenInference span attributes
Alongside the OpenTelemetry metrics and the standard span attributes listed above, the gateway emits LLM-specific span attributes that follow the OpenInference semantic conventions, the open standard for large-language-model trace attributes. These attributes describe the inference call in terms an LLM-aware backend understands, so tools such as Arize Phoenix, or any OpenInference-compatible viewer, can interpret the trace as a model invocation rather than a generic HTTP span.
The attributes are grouped as follows. Conventional, spec-aligned names are shown where they are well established; the exact key set is defined by the OpenInference specification and should be confirmed against the running version.
| Attribute group | Representative key | Description |
|---|---|---|
| Span kind | openinference.span.kind | Marks the span as an LLM call (LLM), distinguishing it from retrieval, tool, or chain spans |
| Model and provider identity | llm.model_name, llm.provider | The resolved model and upstream provider that served the inference, mirroring the gateway routing result |
| Token usage | llm.token_count.prompt, llm.token_count.completion, llm.token_count.total | Prompt, completion, and combined token counts for the call |
| Invocation parameters | llm.invocation_parameters | The request parameters passed to the provider, such as temperature and max tokens, captured as a structured value |
| Input and output messages | llm.input_messages, llm.output_messages | The captured prompt and completion content, subject to the configured logging mode |
The presence of the input and output message attributes depends on the request-logging configuration. When the logging mode excludes prompt and response content, the message groups are omitted while the remaining attributes (span kind, model identity, token usage, and invocation parameters) are still emitted. See Configuring Request Logs for the logging-mode controls.
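The effect of a content-excluding logging mode can be pictured as a filter over the attribute set. The sketch below is an assumption-laden illustration: the sample values are invented, the flattened indexed keys (`llm.input_messages.0.message.role`) and the JSON-string encoding of `llm.invocation_parameters` reflect common OpenInference usage and should be confirmed against the running version, and `without_message_content` is a hypothetical helper rather than a gateway function.

```python
import json

MESSAGE_PREFIXES = ("llm.input_messages", "llm.output_messages")

# Hypothetical attributes on a model-inference span with content logging enabled.
span_attrs = {
    "openinference.span.kind": "LLM",
    "llm.model_name": "example-model",
    "llm.provider": "example-provider",
    "llm.token_count.prompt": 42,
    "llm.token_count.completion": 17,
    "llm.token_count.total": 59,
    "llm.invocation_parameters": '{"temperature": 0.2, "max_tokens": 256}',
    "llm.input_messages.0.message.role": "user",
    "llm.input_messages.0.message.content": "Hello",
    "llm.output_messages.0.message.role": "assistant",
    "llm.output_messages.0.message.content": "Hi there",
}


def without_message_content(attrs):
    """Return the attributes expected when the logging mode excludes prompt and response content."""
    return {k: v for k, v in attrs.items() if not k.startswith(MESSAGE_PREFIXES)}


redacted = without_message_content(span_attrs)
params = json.loads(redacted["llm.invocation_parameters"])

print(sorted(redacted))
print(params["temperature"])
```

A dashboard that relies only on span kind, model identity, token counts, and invocation parameters keeps working in either logging mode.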
For the export configuration that delivers these spans to a backend, see Export Telemetry to an Observability Stack.
Metric families
The Prometheus scrape endpoint exposed by each data plane component publishes counters, gauges, and histograms covering the same activity that the trace stream describes. Each metric carries a standard set of labels that align with the span attributes above, so traces and metrics can be joined in the destination observability stack.
The specific metric names are defined by each data plane component and are subject to change between releases. Where exact names matter for an alerting rule or a dashboard, verify against the live scrape endpoint of the deployment. The families described below are stable.
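One way to check which names a deployment actually exposes is to read the `# TYPE` lines of the Prometheus text exposition format. The sketch below parses a sample payload; the metric names, label values, and the `metric_families` helper are hypothetical, and in practice the text would come from the component's scrape endpoint.

```python
# Hypothetical scrape output; real metric names must be read from the live endpoint.
SAMPLE = """\
# HELP gateway_requests_total Total requests received.
# TYPE gateway_requests_total counter
gateway_requests_total{resolved_provider="provider-a",http_status="200"} 1027
# TYPE gateway_active_requests gauge
gateway_active_requests 3
# TYPE gateway_request_latency_ms histogram
gateway_request_latency_ms_bucket{le="100"} 50
gateway_request_latency_ms_bucket{le="+Inf"} 100
"""


def metric_families(exposition: str) -> dict:
    """Map each metric family name to its declared type using the '# TYPE' comment lines."""
    families = {}
    for line in exposition.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            families[parts[2]] = parts[3]
    return families


print(metric_families(SAMPLE))
```

Running a check like this against each release makes renamed or removed metrics visible before an alerting rule silently stops matching.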
Request volume and outcomes
| Family | Type | Purpose |
|---|---|---|
| Total requests received | Counter | Increments once per request that arrives at the gateway. Carries the standard label set |
| Requests by HTTP status | Counter | Partitioned by the final HTTP status code; allows alert rules on elevated 4xx or 5xx ratios |
| Active requests | Gauge | The number of in-flight requests currently being processed |
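An alert on elevated 4xx or 5xx ratios compares counter increases between two scrapes. The sketch below shows the arithmetic under simplifying assumptions: the counter values are invented, the status codes are passed as strings keyed the way an `http_status` label would be, and no counter reset occurs between the two scrapes.

```python
def status_class_ratios(before, after):
    """Share of requests per status class (2xx, 4xx, 5xx) between two counter snapshots.

    before/after map an HTTP status code string to the counter value at each scrape.
    Assumes the counters did not reset between the scrapes.
    """
    increases = {}
    for status, value in after.items():
        status_class = f"{status[0]}xx"
        increases[status_class] = increases.get(status_class, 0) + value - before.get(status, 0)
    total = sum(increases.values())
    return {cls: n / total for cls, n in increases.items()} if total else {}


# Hypothetical counter values from two consecutive scrapes.
before = {"200": 1000, "429": 10, "503": 5}
after = {"200": 1180, "429": 25, "503": 10}

ratios = status_class_ratios(before, after)
print(ratios)
```

In this sample, 200 requests arrived between the scrapes, of which 15 were 4xx and 5 were 5xx, so a rule with a 5% threshold on 5xx would not fire.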
Latency
| Family | Type | Purpose |
|---|---|---|
| Request latency | Histogram | End-to-end latency observed by the gateway. Bucket boundaries are deployment-specific; percentiles such as p50, p95, and p99 can be estimated from the cumulative bucket counts |
| Time to first token | Histogram | The streaming-specific equivalent of request latency, capturing time from receipt to first response token |
| Backend latency | Histogram | The latency observed on the upstream provider call itself, excluding gateway-internal processing |
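Percentiles are estimated from a histogram by locating the bucket that contains the target rank and interpolating linearly inside it, which is the approach Prometheus takes in its `histogram_quantile` function. The sketch below reimplements that idea in simplified form; the bucket boundaries and counts are invented, and the function is not a drop-in replacement for the Prometheus one.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by bound,
    ending with an upper bound of math.inf, as in a Prometheus histogram.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the open-ended bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-latency buckets in milliseconds.
latency_buckets = [(100, 50), (250, 80), (500, 95), (1000, 99), (math.inf, 100)]

p50 = histogram_quantile(0.50, latency_buckets)
p90 = histogram_quantile(0.90, latency_buckets)
print(p50, p90)
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest; a p99 that falls in a wide top bucket is only a coarse estimate.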
Token usage
| Family | Type | Purpose |
|---|---|---|
| Input tokens | Counter | Total input tokens consumed |
| Output tokens | Counter | Total output tokens generated |
| Tokens per request | Histogram | Distribution of per-request token counts, useful for spotting outliers |
Routing and fallback
| Family | Type | Purpose |
|---|---|---|
| Fallback attempts | Counter | Increments each time the gateway walks to the next backend in a fallback chain. A sudden rise is a strong leading indicator of a provider incident |
| Fallback successes | Counter | Increments when a fallback attempt succeeds |
| Routing decisions | Counter | Partitioned by resolved model and provider; the canonical view of how traffic distributes after routing rules are applied |
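To watch for a rise in fallback activity, the increase in the fallback-attempts counter can be divided by the increase in total requests over the same window. The sketch below assumes a series of raw counter samples and treats any decrease as a counter reset (for example after a component restart); the sample values and function names are hypothetical.

```python
def counter_increase(samples):
    """Total increase across successive counter samples, tolerating resets.

    When a sample is lower than its predecessor the counter is assumed to have
    restarted from zero, so the new value itself is counted as the increase.
    """
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    return increase


def fallback_rate(fallback_samples, request_samples):
    """Fallback attempts per request over the sampled window."""
    requests = counter_increase(request_samples)
    return counter_increase(fallback_samples) / requests if requests else 0.0


# Hypothetical samples; both counters reset between the second and third scrape.
request_samples = [1000, 1100, 50, 150]
fallback_samples = [20, 22, 3, 28]

rate = fallback_rate(fallback_samples, request_samples)
print(rate)
```

Here the window saw 250 requests and 30 fallback attempts, with most of the attempts in the last interval, which is the kind of step change worth alerting on.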
Errors and provider behaviour
| Family | Type | Purpose |
|---|---|---|
| Provider errors | Counter | Provider-side failures normalised into the gateway's error categories |
| Gateway errors | Counter | Gateway-side failures, such as malformed requests or unknown models |
| Rate-limit responses | Counter | The subset of provider errors classified as rate-limit responses (HTTP 429) |
| Timeout responses | Counter | The subset of provider errors classified as timeouts |
Standard labels
Every metric family carries the same core label set. Where the cardinality cost is acceptable, additional labels are exposed for finer-grained queries.
| Label | Cardinality | Description |
|---|---|---|
| instance_name | Low | The platform instance name, useful when multiple instances export to the same destination |
| resolved_model | Medium | The model that served the request, post-routing |
| resolved_provider | Low | The provider that served the request, post-routing |
| api_key_id | High | The API key identifier; bounded by the number of keys in the deployment |
| http_status | Low | The HTTP status code returned to the client |
| requested_model | Medium | The model the client asked for, before routing resolution. Present where it differs from resolved_model |
The higher-cardinality labels (api_key_id and, to a lesser extent, requested_model) are valuable for per-key dashboards and chargeback reporting but can be expensive in some metrics backends. Where cardinality is a constraint, the in-Console Usage Analytics surface aggregates the same dimensions internally without requiring the labels to be exported.
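The cost of a label can be estimated by multiplying the number of distinct values of every label on a metric, which gives an upper bound on the number of time series for that metric. The counts below are invented for illustration, and `series_upper_bound` is a hypothetical helper.

```python
import math


def series_upper_bound(distinct_values):
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return math.prod(distinct_values.values())


# Hypothetical number of distinct values per label in one deployment.
labels = {
    "instance_name": 2,
    "resolved_model": 12,
    "resolved_provider": 4,
    "api_key_id": 500,
    "http_status": 8,
}

with_keys = series_upper_bound(labels)
without_keys = series_upper_bound({k: v for k, v in labels.items() if k != "api_key_id"})
print(with_keys, without_keys)
```

The true series count is usually lower because not every combination occurs, but the comparison shows why api_key_id dominates: dropping it reduces the bound by a factor equal to the number of keys. For histograms, each series is further multiplied by the number of buckets.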
Sampling
Trace sampling is configured on the OTel Export screen. The default is 100%; high-volume production deployments typically use a lower rate. Metrics are not sampled; every request is counted in every relevant family, regardless of the trace sampling rate.
For a deployment that sets the sampling rate below 100%, traces are a sample of the activity and metrics are the complete record. Investigations that need every request (forensic audit, billing reconciliation) should use Request Logs or the metric stream rather than the trace stream; investigations that need execution detail (latency breakdown, fallback walks) should use the trace stream.
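The relationship between the two streams can be checked with simple arithmetic: scaling the number of stored traces by the sampling rate gives an estimate of total requests, while the request counter gives the exact figure. The sketch below assumes uniform probabilistic sampling, which is an assumption about how the configured rate is applied; the numbers and function names are hypothetical.

```python
def estimate_total_from_traces(sampled_traces, sampling_rate):
    """Estimate total requests from a trace count, assuming uniform sampling at the given rate."""
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    return sampled_traces / sampling_rate


def observed_coverage(sampled_traces, metric_request_count):
    """Fraction of requests that have a stored trace, using the unsampled request counter."""
    return sampled_traces / metric_request_count if metric_request_count else 0.0


# Hypothetical window: 10% sampling, 1,230 stored traces, 12,000 requests counted by metrics.
estimate = estimate_total_from_traces(1230, 0.10)
coverage = observed_coverage(1230, 12000)
print(estimate, coverage)
```

The estimate differs from the counted total because sampling is random, which is why reconciliation work should start from the counter or Request Logs and use traces only for detail on individual requests.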
Related
- Gateway Behavior: correlation IDs and per-request data
- Audit Log Events: event schema for administrative actions
- OpenTelemetry Export: configuration surface for trace export
- Export Telemetry to an Observability Stack: developer-side configuration walkthrough
Where to go next