OpenTelemetry metrics

The gateway emits two complementary streams of observability data. Trace data flows over OTLP to any OpenTelemetry-compatible backend; metrics are exposed by each data plane component on a Prometheus-compatible scrape endpoint, where they can be pulled by an existing metrics agent and forwarded as OTLP metrics by a collector if the destination requires it. This page documents the structure of both streams: the span model used for traces, the standard set of attributes that appear on every span and metric, and the metric families exposed by the gateway. For the configuration mechanics of trace export and the supported authentication modes, see Export Telemetry to an Observability Stack.

Trace structure

Every gateway request produces a single trace composed of nested spans. The span structure reflects the request's path through the data plane and the upstream provider.

Span	Covers
Gateway request	The full lifecycle of the request inside the data plane, from arrival on the gateway endpoint to the final response delivered to the client
Provider routing	The time spent resolving the routing configuration and selecting an eligible backend
Model inference	The time the upstream provider spent generating the response

For fallback chains, the trace includes one provider-routing span per attempt walked. A request that succeeds on the primary backend has a single provider-routing span; a request that falls through to a secondary backend has two, with the first marked as failed.
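The fallback rule above can be checked mechanically against an exported trace. The following is a minimal sketch, assuming the spans of one trace have been loaded into simple records; the `Span` class and the span name strings are illustrative placeholders, not the gateway's actual export schema.

```python
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical record: field names and span names are assumptions.
    name: str
    failed: bool = False


def summarize_routing(spans):
    """Count provider-routing attempts in one trace and flag fallbacks."""
    attempts = [s for s in spans if s.name == "provider-routing"]
    return {
        "attempts": len(attempts),
        "failed": sum(1 for s in attempts if s.failed),
        "fell_back": len(attempts) > 1,
    }


# A request that fell through to a secondary backend:
# two provider-routing spans, the first marked as failed.
trace = [
    Span("gateway-request"),
    Span("provider-routing", failed=True),
    Span("provider-routing"),
    Span("model-inference"),
]
summary = summarize_routing(trace)
```

A trace with a single provider-routing span yields `fell_back` as `False`, matching the primary-backend case.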

Standard span attributes

The following attributes appear on the gateway-request span and, where relevant, on the provider-routing and model-inference spans:

Attribute	Type	Description
`service.name`	string	The service identifier configured on the OTel Export screen. Defaults to `agent-router`.
`http.method`	string	The HTTP method of the request (typically `POST`)
`http.target`	string	The request path, such as `/v1/chat/completions`
`http.status_code`	integer	The final HTTP status returned to the client
`gateway.request_id`	string	The gateway-assigned `X-Request-ID` UUID
`gateway.client_request_id`	string	The client-supplied `X-Request-ID`, echoed back as `X-Client-Request-ID`. Present only when the client supplied one
`gateway.api_key_id`	string	A non-secret identifier for the API key used. The raw key value is never exported
`gateway.requested_model`	string	The model identifier the client asked for, before routing resolution
`gateway.resolved_model`	string	The model that actually served the request, after routing resolution and any logical-name override
`gateway.resolved_provider`	string	The upstream provider that served the request
`gateway.fallback_attempts`	integer	The number of fallback backends attempted after the primary before the request succeeded. `0` indicates the primary backend served the request
`llm.usage.input_tokens`	integer	Input (prompt) token count
`llm.usage.output_tokens`	integer	Output (completion) token count
`llm.usage.total_tokens`	integer	Combined input plus output tokens
`gateway.latency_ms`	integer	End-to-end latency observed by the gateway, in milliseconds
`gateway.time_to_first_token_ms`	integer	Time from request receipt to the first response token, in milliseconds. Present for streaming responses

Application-supplied custom headers are also forwarded as span attributes. Headers such as `agent-session-id`, `user-id`, or `workflow-run-id` appear on the trace under their original names and allow traces to be grouped by application-level concepts in the destination backend.
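As an illustration of grouping by an application-level attribute, the sketch below sums token usage per session. It assumes span attributes have already been exported and loaded as plain dictionaries; the `group_by_attribute` helper and the sample values are hypothetical, while the attribute keys come from the table above.

```python
from collections import defaultdict


def group_by_attribute(spans, key):
    """Group span attribute dicts by one attribute; spans lacking it are skipped."""
    groups = defaultdict(list)
    for attrs in spans:
        if key in attrs:
            groups[attrs[key]].append(attrs)
    return dict(groups)


# Sample attribute dicts; all values are made up for illustration.
spans = [
    {"gateway.request_id": "req-1", "agent-session-id": "sess-a", "llm.usage.total_tokens": 120},
    {"gateway.request_id": "req-2", "agent-session-id": "sess-a", "llm.usage.total_tokens": 80},
    {"gateway.request_id": "req-3", "agent-session-id": "sess-b", "llm.usage.total_tokens": 50},
    {"gateway.request_id": "req-4", "llm.usage.total_tokens": 10},  # no custom header supplied
]

by_session = group_by_attribute(spans, "agent-session-id")
tokens_per_session = {
    session: sum(s["llm.usage.total_tokens"] for s in members)
    for session, members in by_session.items()
}
```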

Span events

Events on the gateway-request span record discrete moments in the request's lifecycle:

Event	Recorded when
`request.received`	The gateway receives the request from the client
`routing.resolved`	The routing configuration has selected a backend
`backend.attempted`	A backend is dialled; this event repeats for fallback walks
`backend.failed`	A backend returned a retryable error and the gateway walked to the next backend
`response.first_byte`	The first response byte (or first SSE event) is sent to the client
`response.completed`	The response has been fully delivered to the client
`request.failed`	The request returned a non-2xx status to the client
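The events above are enough to reconstruct a coarse timeline for a request. The following sketch assumes the events of one gateway-request span are available as `(name, timestamp_ms)` pairs; that representation and the `summarize_events` helper are assumptions for illustration, not the export format.

```python
def summarize_events(events):
    """Derive timings and attempt counts from (name, timestamp_ms) pairs."""
    first_seen = {}
    counts = {}
    for name, ts in events:
        first_seen.setdefault(name, ts)  # keep the earliest occurrence
        counts[name] = counts.get(name, 0) + 1

    received = first_seen["request.received"]

    def since_received(name):
        return first_seen[name] - received if name in first_seen else None

    return {
        "time_to_first_byte_ms": since_received("response.first_byte"),
        "total_ms": since_received("response.completed"),
        "backend_attempts": counts.get("backend.attempted", 0),
        "backend_failures": counts.get("backend.failed", 0),
    }


# A request whose first backend failed and whose second backend succeeded.
events = [
    ("request.received", 0),
    ("routing.resolved", 4),
    ("backend.attempted", 5),
    ("backend.failed", 205),
    ("backend.attempted", 206),
    ("response.first_byte", 480),
    ("response.completed", 1350),
]
timings = summarize_events(events)
```

Because `backend.attempted` repeats for fallback walks, counting its occurrences gives the number of backends dialled for the request.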

OpenInference span attributes

Alongside the standard span attributes listed above, the gateway emits LLM-specific span attributes that follow the OpenInference semantic conventions, the open standard for large-language-model trace attributes. These attributes describe the inference call in terms an LLM-aware backend understands, so tools such as Arize Phoenix, or any OpenInference-compatible viewer, can interpret the trace as a model invocation rather than a generic HTTP span.

The attributes are grouped as follows. Conventional, spec-aligned names are shown where they are well established; the exact key set is defined by the OpenInference specification and should be confirmed against the running version.

Attribute group	Representative key	Description
Span kind	`openinference.span.kind`	Marks the span as an LLM call (`LLM`), distinguishing it from retrieval, tool, or chain spans
Model and provider identity	`llm.model_name`, `llm.provider`	The resolved model and upstream provider that served the inference, mirroring the gateway routing result
Token usage	`llm.token_count.prompt`, `llm.token_count.completion`, `llm.token_count.total`	Prompt, completion, and combined token counts for the call
Invocation parameters	`llm.invocation_parameters`	The request parameters passed to the provider, such as temperature and max tokens, captured as a structured value
Input and output messages	`llm.input_messages`, `llm.output_messages`	The captured prompt and completion content, subject to the configured logging mode

The presence of the input and output message attributes depends on the request-logging configuration. When the logging mode excludes prompt and response content, the message groups are omitted while the remaining attributes (span kind, model identity, token usage, and invocation parameters) are still emitted. See Configuring Request Logs for the logging-mode controls.
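The effect of the logging mode on the attribute set can be pictured as a filter over the span's attributes. This is a sketch of the described behaviour, not the gateway's implementation: the `include_content` flag stands in for the logging mode, and the indexed message keys in the sample (such as `llm.input_messages.0.message.content`) are an assumption about how message attributes are flattened.

```python
# Prefixes of the attribute groups that carry prompt and response content.
MESSAGE_PREFIXES = ("llm.input_messages", "llm.output_messages")


def visible_attributes(attributes, include_content):
    """Return the attributes emitted under a content-including or content-excluding mode."""
    if include_content:
        return dict(attributes)
    return {
        key: value
        for key, value in attributes.items()
        if not key.startswith(MESSAGE_PREFIXES)
    }


# Sample attribute set; model, provider, and values are made up.
span_attributes = {
    "openinference.span.kind": "LLM",
    "llm.model_name": "example-model",
    "llm.provider": "example-provider",
    "llm.token_count.prompt": 12,
    "llm.token_count.completion": 30,
    "llm.token_count.total": 42,
    "llm.invocation_parameters": '{"temperature": 0.2, "max_tokens": 256}',
    "llm.input_messages.0.message.content": "Hello",
    "llm.output_messages.0.message.content": "Hi there",
}
redacted = visible_attributes(span_attributes, include_content=False)
```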

For the export configuration that delivers these spans to a backend, see Export Telemetry to an Observability Stack.

Metric families

The Prometheus scrape endpoint exposed by each data plane component publishes counters, gauges, and histograms covering the same activity that the trace stream describes. Each metric carries a standard set of labels that align with the span attributes above, so traces and metrics can be joined in the destination observability stack.

The specific metric names are defined by each data plane component and are subject to change between releases. Where exact names matter for an alerting rule or a dashboard, verify against the live scrape endpoint of the deployment. The families described below are stable.
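One way to verify names against a live endpoint is to read the `# TYPE` lines of the Prometheus text exposition format. The sketch below parses a captured scrape body; the metric names in the sample are invented placeholders, and in practice the text would be fetched from the deployment's own scrape URL.

```python
def metric_types(exposition_text):
    """Map metric family name to type using '# TYPE <name> <type>' lines."""
    types = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            types[parts[2]] = parts[3]
    return types


# Sample scrape body; metric names are hypothetical, not the gateway's.
SAMPLE = """\
# HELP gateway_requests_total Total requests received.
# TYPE gateway_requests_total counter
gateway_requests_total{http_status="200"} 1027
# TYPE gateway_active_requests gauge
gateway_active_requests 3
# TYPE gateway_request_latency_seconds histogram
gateway_request_latency_seconds_bucket{le="0.5"} 900
"""
families = metric_types(SAMPLE)
```

Comparing the resulting names and types against the ones referenced in alerting rules after each upgrade catches renames before they silently break a rule.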

Request volume and outcomes

Family	Type	Purpose
Total requests received	Counter	Increments once per request that arrives at the gateway. Carries the standard label set
Requests by HTTP status	Counter	Partitioned by the final HTTP status code; allows alert rules on elevated 4xx or 5xx ratios
Active requests	Gauge	The number of in-flight requests currently being processed
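Because counters are cumulative, an error ratio for an alert rule is computed from the difference between two scrapes rather than from the raw values. The sketch below assumes each scrape of the requests-by-status counter has been reduced to a `{status: count}` dictionary; the sample numbers are invented, and counter resets between scrapes are not handled.

```python
def error_ratio(before, after, status_prefix="5"):
    """Share of requests in the window whose status starts with status_prefix."""
    deltas = {status: after[status] - before.get(status, 0) for status in after}
    total = sum(deltas.values())
    if total == 0:
        return 0.0
    errors = sum(d for status, d in deltas.items() if status.startswith(status_prefix))
    return errors / total


# Two scrapes of a status-partitioned request counter (made-up values).
before = {"200": 1000, "429": 20, "500": 5}
after = {"200": 1900, "429": 50, "500": 25, "503": 50}

ratio_5xx = error_ratio(before, after)
ratio_4xx = error_ratio(before, after, status_prefix="4")
```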

Latency

Family	Type	Purpose
Request latency	Histogram	End-to-end latency observed by the gateway. Buckets are deployment-specific; expect percentiles such as p50, p95, p99 to be derivable from the histogram
Time to first token	Histogram	The streaming-specific equivalent of request latency, capturing time from receipt to first response token
Backend latency	Histogram	The latency observed on the upstream provider call itself, excluding gateway-internal processing
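Percentiles are derived from a histogram by locating the bucket that contains the target rank and interpolating linearly within it, which is the same general approach that Prometheus-style quantile estimation takes. The sketch below assumes cumulative `(upper_bound, count)` pairs sorted by bound and ending with an infinite bucket; the bucket bounds, their unit (seconds), and the counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # No upper edge to interpolate towards: report the last finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Cumulative counts per upper bound, as a histogram exposes them.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

p50 = histogram_quantile(0.50, latency_buckets)
p95 = histogram_quantile(0.95, latency_buckets)
p99 = histogram_quantile(0.99, latency_buckets)
```

The estimate is only as precise as the bucket layout: a percentile that lands in the open-ended top bucket is reported as the highest finite bound, so deployment-specific buckets should extend past the latencies that matter for alerting.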

Token usage

Family	Type	Purpose
Input tokens	Counter	Total input tokens consumed
Output tokens	Counter	Total output tokens generated
Tokens per request	Histogram	Distribution of per-request token counts, useful for spotting outliers

Routing and fallback

Family	Type	Purpose
Fallback attempts	Counter	Increments each time the gateway walks to the next backend in a fallback chain. A sudden rise is a strong leading indicator of a provider incident
Fallback successes	Counter	Increments when a fallback attempt succeeds
Routing decisions	Counter	Partitioned by resolved model and provider; the canonical view of how traffic distributes after routing rules are applied

Errors and provider behaviour

Family	Type	Purpose
Provider errors	Counter	Provider-side failures normalised into the gateway's error categories
Gateway errors	Counter	Gateway-side failures, such as malformed requests or unknown models
Rate-limit responses	Counter	The subset of provider errors classified as rate-limit responses (HTTP 429)
Timeout responses	Counter	The subset of provider errors classified as timeouts

Standard labels

Every metric family carries the same core label set. Where the cardinality cost is acceptable, additional labels are exposed for finer-grained queries.

Label	Cardinality	Description
`instance_name`	Low	The platform instance name, useful when multiple instances export to the same destination
`resolved_model`	Medium	The model that served the request, post-routing
`resolved_provider`	Low	The provider that served the request, post-routing
`api_key_id`	High	The API key identifier; bound by the number of keys in the deployment
`http_status`	Low	The HTTP status code returned to the client
`requested_model`	Medium	The model the client asked for, before routing resolution. Present where it differs from `resolved_model`

The higher-cardinality labels (`api_key_id` and, to a lesser degree, `requested_model`) are valuable for per-key dashboards and chargeback reporting but expensive in some metrics backends. Where cardinality is a constraint, the in-Console Usage Analytics surface aggregates the same dimensions internally without requiring the labels to be exported.
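The cost of a label can be estimated before enabling it: the worst-case number of time series for one metric family is the product of the distinct values of each label. The sketch below uses invented value counts; real deployments usually see fewer series because not every combination occurs, and histograms multiply the result by their bucket count.

```python
from math import prod


def max_series(label_value_counts):
    """Upper bound on time series for one family: product of distinct label values."""
    return prod(label_value_counts.values())


# Hypothetical distinct-value counts for the low and medium cardinality labels.
core_labels = {
    "instance_name": 2,
    "resolved_model": 12,
    "resolved_provider": 4,
    "http_status": 8,
}
# Adding the per-key label for a deployment with 500 API keys.
with_api_keys = {**core_labels, "api_key_id": 500}

core_series = max_series(core_labels)
per_key_series = max_series(with_api_keys)
```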

Sampling

Trace sampling is configured on the OTel Export screen. The default is 100%; high-volume production deployments typically use a lower rate. Metrics are not sampled; every request is counted in every relevant family, regardless of the trace sampling rate.

For a deployment that sets the sampling rate below 100%, traces are a sample of the activity and metrics are the complete record. Investigations that need every request (forensic audit, billing reconciliation) should use Request Logs or the metric stream rather than the trace stream; investigations that need execution detail (latency breakdown, fallback walks) should use the trace stream.
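The relationship between the two streams can be sanity-checked with simple arithmetic: dividing the number of sampled traces by the sampling rate should land close to the request counter for the same window. The sketch below uses invented numbers and assumes the sampling rate is expressed as a fraction between 0 and 1.

```python
def estimate_total_requests(sampled_trace_count, sampling_rate):
    """Scale a sampled trace count up to an estimate of total requests."""
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in the interval (0, 1]")
    return sampled_trace_count / sampling_rate


# Made-up window: 2,480 traces stored at a 25% sampling rate,
# while the total-requests counter grew by 10,000.
estimated = estimate_total_requests(2480, 0.25)
counted = 10_000
relative_gap = abs(counted - estimated) / counted
```

A small gap is expected from sampling variance; the counter remains the authoritative figure, which is why audit and billing work should not rely on the scaled trace count.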

Related

Gateway Behavior: correlation IDs and per-request data
Audit Log Events: event schema for administrative actions
OpenTelemetry Export: configuration surface for trace export
Export Telemetry to an Observability Stack: developer-side configuration walkthrough
