OpenTelemetry metrics

The gateway emits two complementary streams of observability data. Trace data flows over OTLP to any OpenTelemetry-compatible backend; metrics are exposed by each data plane component on a Prometheus-compatible scrape endpoint, where they can be pulled by an existing metrics agent and forwarded as OTLP metrics by a collector if the destination requires it. This page documents the structure of both streams: the span model used for traces, the standard set of attributes that appear on every span and metric, and the metric families exposed by the gateway. For the configuration mechanics of trace export and the supported authentication modes, see Export Telemetry to an Observability Stack.


Trace structure

Every gateway request produces a single trace composed of nested spans. The span structure reflects the request's path through the data plane and the upstream provider.

| Span | Covers |
| --- | --- |
| Gateway request | The full lifecycle of the request inside the data plane, from arrival on the gateway endpoint to the final response delivered to the client |
| Provider routing | The time spent resolving the routing configuration and selecting an eligible backend |
| Model inference | The time the upstream provider spent generating the response |

For fallback chains, the trace includes one provider-routing span per attempt walked. A request that succeeds on the primary backend has a single provider-routing span; a request that falls through to a secondary backend has two, with the first marked as failed.
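The fallback behaviour described above can be sketched as a small span tree. This is an illustration only: the `Span` class, the span names, and the placement of the model-inference span under the successful routing span are assumptions made for the example, not the gateway's exported schema.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical, simplified stand-in for an exported span."""
    name: str
    failed: bool = False
    children: list["Span"] = field(default_factory=list)


def routing_attempts(root: Span) -> list[Span]:
    """Return every provider-routing span in the order it appears in the trace."""
    found = []
    stack = [root]
    while stack:
        span = stack.pop()
        if span.name == "provider-routing":
            found.append(span)
        # Reverse so that children are visited in their original order.
        stack.extend(reversed(span.children))
    return found


# Assumed example: the primary backend failed and the secondary served the request.
trace = Span("gateway-request", children=[
    Span("provider-routing", failed=True),
    Span("provider-routing", children=[Span("model-inference")]),
])

attempts = routing_attempts(trace)
fell_back = len(attempts) > 1
```

A trace with more than one provider-routing span indicates that the request walked the fallback chain; the failed flag on the earlier spans identifies the backends that were abandoned.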

Standard span attributes

The following attributes appear on the gateway-request span and, where relevant, on the provider-routing and model-inference spans:

| Attribute | Type | Description |
| --- | --- | --- |
| service.name | string | The service identifier configured on the OTel Export screen. Defaults to agent-router. |
| http.method | string | The HTTP method of the request (typically POST) |
| http.target | string | The request path, such as /v1/chat/completions |
| http.status_code | integer | The final HTTP status returned to the client |
| gateway.request_id | string | The gateway-assigned X-Request-ID UUID |
| gateway.client_request_id | string | The client-supplied X-Request-ID, echoed back as X-Client-Request-ID. Present only when the client supplied one |
| gateway.api_key_id | string | A non-secret identifier for the API key used. The raw key value is never exported |
| gateway.requested_model | string | The model identifier the client asked for, before routing resolution |
| gateway.resolved_model | string | The model that actually served the request, after routing resolution and any logical-name override |
| gateway.resolved_provider | string | The upstream provider that served the request |
| gateway.fallback_attempts | integer | The number of backends attempted before the request succeeded. 0 indicates the primary backend served the request |
| llm.usage.input_tokens | integer | Input (prompt) token count |
| llm.usage.output_tokens | integer | Output (completion) token count |
| llm.usage.total_tokens | integer | Combined input plus output tokens |
| gateway.latency_ms | integer | End-to-end latency observed by the gateway, in milliseconds |
| gateway.time_to_first_token_ms | integer | Time from request receipt to the first response token, in milliseconds. Present for streaming responses |

Application-supplied custom headers are also forwarded as span attributes. Headers such as agent-session-id, user-id, or workflow-run-id appear on the trace under their original names and allow traces to be grouped by application-level concepts in the destination backend.
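As an illustration of grouping by an application-level header, the sketch below sums token usage per agent-session-id across a set of exported gateway-request spans. It assumes the spans have already been fetched from the backend and reduced to plain attribute dictionaries; the function name and the sample values are hypothetical.

```python
from collections import defaultdict


def tokens_by_session(spans):
    """Sum llm.usage.total_tokens per agent-session-id.

    `spans` is an iterable of attribute dictionaries, one per gateway-request
    span. Spans without the custom header are skipped.
    """
    totals = defaultdict(int)
    for attrs in spans:
        session = attrs.get("agent-session-id")
        if session is None:
            continue
        totals[session] += attrs.get("llm.usage.total_tokens", 0)
    return dict(totals)


# Hypothetical sample data, shaped after the attribute table above.
sample_spans = [
    {"agent-session-id": "sess-a", "llm.usage.total_tokens": 200},
    {"agent-session-id": "sess-a", "llm.usage.total_tokens": 150},
    {"agent-session-id": "sess-b", "llm.usage.total_tokens": 90},
    {"llm.usage.total_tokens": 40},  # client sent no session header
]

usage = tokens_by_session(sample_spans)
```

Most backends can express the same grouping in their own query language; the point is only that the header name is the attribute key.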

Span events

Events on the gateway-request span record discrete moments in the request's lifecycle:

| Event | Recorded when |
| --- | --- |
| request.received | The gateway receives the request from the client |
| routing.resolved | The routing configuration has selected a backend |
| backend.attempted | A backend is dialled; this event repeats for fallback walks |
| backend.failed | A backend returned a retryable error and the gateway walked to the next backend |
| response.first_byte | The first response byte (or first SSE event) is sent to the client |
| response.completed | The response has been fully delivered to the client |
| request.failed | The request returned a non-2xx status to the client |
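Because the events are timestamped, a consumer can reconstruct a coarse timeline from them. The following sketch assumes the events of one gateway-request span are available as (name, timestamp in milliseconds) pairs; that representation, the function name, and the sample timings are assumptions made for the example.

```python
def summarize_events(events):
    """Derive attempt counts and elapsed times from one span's events.

    `events` is a list of (event_name, timestamp_ms) pairs in recorded order.
    """
    first_seen = {}
    counts = {}
    for name, ts in events:
        first_seen.setdefault(name, ts)          # keep the earliest timestamp
        counts[name] = counts.get(name, 0) + 1   # repeated events are counted

    received = first_seen["request.received"]
    summary = {
        "backend_attempts": counts.get("backend.attempted", 0),
        "backend_failures": counts.get("backend.failed", 0),
    }
    if "response.first_byte" in first_seen:
        summary["ms_to_first_byte"] = first_seen["response.first_byte"] - received
    if "response.completed" in first_seen:
        summary["ms_to_completed"] = first_seen["response.completed"] - received
    return summary


# Hypothetical timeline: the first backend fails, the second one serves the request.
events = [
    ("request.received", 0),
    ("routing.resolved", 4),
    ("backend.attempted", 5),
    ("backend.failed", 130),
    ("backend.attempted", 131),
    ("response.first_byte", 420),
    ("response.completed", 980),
]

summary = summarize_events(events)
```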

OpenInference span attributes

Alongside the OpenTelemetry metrics and the standard span attributes listed above, the gateway emits LLM-specific span attributes that follow the OpenInference semantic conventions, the open standard for large-language-model trace attributes. These attributes describe the inference call in terms an LLM-aware backend understands, so tools such as Arize Phoenix, or any OpenInference-compatible viewer, can interpret the trace as a model invocation rather than a generic HTTP span.

The attributes are grouped as follows. Conventional, spec-aligned names are shown where they are well established; the exact key set is defined by the OpenInference specification and should be confirmed against the running version.

| Attribute group | Representative key | Description |
| --- | --- | --- |
| Span kind | openinference.span.kind | Marks the span as an LLM call (LLM), distinguishing it from retrieval, tool, or chain spans |
| Model and provider identity | llm.model_name, llm.provider | The resolved model and upstream provider that served the inference, mirroring the gateway routing result |
| Token usage | llm.token_count.prompt, llm.token_count.completion, llm.token_count.total | Prompt, completion, and combined token counts for the call |
| Invocation parameters | llm.invocation_parameters | The request parameters passed to the provider, such as temperature and max tokens, captured as a structured value |
| Input and output messages | llm.input_messages, llm.output_messages | The captured prompt and completion content, subject to the configured logging mode |

The presence of the input and output message attributes depends on the request-logging configuration. When the logging mode excludes prompt and response content, the message groups are omitted while the remaining attributes (span kind, model identity, token usage, and invocation parameters) are still emitted. See Configuring Request Logs for the logging-mode controls.
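A consumer that needs to know whether a span carries message content can test for the message key prefixes. The sketch below is written under two assumptions: the span attributes are available as a flat dictionary, and message attributes appear as keys that start with llm.input_messages or llm.output_messages. The helper name and the sample values are hypothetical, and the exact key set should be checked against the running version.

```python
MESSAGE_PREFIXES = ("llm.input_messages", "llm.output_messages")


def split_openinference(attrs):
    """Split span attributes into (metadata, message_content).

    Message content is any key under the input/output message prefixes;
    everything else (span kind, model identity, token counts, invocation
    parameters) is treated as metadata.
    """
    content = {k: v for k, v in attrs.items() if k.startswith(MESSAGE_PREFIXES)}
    metadata = {k: v for k, v in attrs.items() if not k.startswith(MESSAGE_PREFIXES)}
    return metadata, content


# Hypothetical span emitted while the logging mode excludes content.
redacted_span = {
    "openinference.span.kind": "LLM",
    "llm.model_name": "example-model",
    "llm.token_count.prompt": 120,
    "llm.token_count.completion": 45,
}

metadata, content = split_openinference(redacted_span)
content_logged = bool(content)
```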

For the export configuration that delivers these spans to a backend, see Export Telemetry to an Observability Stack.


Metric families

The Prometheus scrape endpoint exposed by each data plane component publishes counters, gauges, and histograms covering the same activity that the trace stream describes. Each metric carries a standard set of labels that align with the span attributes above, so traces and metrics can be joined in the destination observability stack.

The specific metric names are defined by each data plane component and are subject to change between releases. Where exact names matter for an alerting rule or a dashboard, verify against the live scrape endpoint of the deployment. The families described below are stable.

Request volume and outcomes

| Family | Type | Purpose |
| --- | --- | --- |
| Total requests received | Counter | Increments once per request that arrives at the gateway. Carries the standard label set |
| Requests by HTTP status | Counter | Partitioned by the final HTTP status code; allows alert rules on elevated 4xx or 5xx ratios |
| Active requests | Gauge | The number of in-flight requests currently being processed |

Latency

| Family | Type | Purpose |
| --- | --- | --- |
| Request latency | Histogram | End-to-end latency observed by the gateway. Buckets are deployment-specific; expect percentiles such as p50, p95, p99 to be derivable from the histogram |
| Time to first token | Histogram | The streaming-specific equivalent of request latency, capturing time from receipt to first response token |
| Backend latency | Histogram | The latency observed on the upstream provider call itself, excluding gateway-internal processing |
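To show how a percentile is derived from a cumulative histogram, the sketch below interpolates linearly inside the bucket that contains the target rank, which is the same approach PromQL's histogram_quantile function takes. The bucket boundaries and counts are invented for the example, since real buckets are deployment-specific.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with math.inf as the final bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper bound (or an empty bucket) to interpolate in:
                # fall back to the highest known finite boundary.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-latency buckets, in seconds.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

p50 = histogram_quantile(0.50, latency_buckets)
p95 = histogram_quantile(0.95, latency_buckets)
p99 = histogram_quantile(0.99, latency_buckets)
```

The estimate is only as precise as the bucket layout: a percentile that falls in the open-ended top bucket is reported as the last finite boundary, so the top finite bucket should sit above the latency objective being tracked.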

Token usage

| Family | Type | Purpose |
| --- | --- | --- |
| Input tokens | Counter | Total input tokens consumed |
| Output tokens | Counter | Total output tokens generated |
| Tokens per request | Histogram | Distribution of per-request token counts, useful for spotting outliers |

Routing and fallback

| Family | Type | Purpose |
| --- | --- | --- |
| Fallback attempts | Counter | Increments each time the gateway walks to the next backend in a fallback chain. A sudden rise is a strong leading indicator of a provider incident |
| Fallback successes | Counter | Increments when a fallback attempt succeeds |
| Routing decisions | Counter | Partitioned by resolved model and provider; the canonical view of how traffic distributes after routing rules are applied |
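A rise in fallback attempts is easier to alert on as a ratio of requests than as a raw count. The sketch below computes that ratio between two scrapes of the counters. The metric names requests_total and fallback_attempts_total are placeholders, not the gateway's actual names, which should be read from the live scrape endpoint.

```python
def counter_increase(previous, current):
    """Increase of a monotonic counter between two scrapes.

    A current value lower than the previous one means the process restarted
    and the counter reset, so the current value is the whole increase.
    """
    return current if current < previous else current - previous


def fallback_ratio(prev_scrape, curr_scrape):
    """Fallback attempts per request over the interval between two scrapes."""
    requests = counter_increase(
        prev_scrape["requests_total"], curr_scrape["requests_total"])
    fallbacks = counter_increase(
        prev_scrape["fallback_attempts_total"], curr_scrape["fallback_attempts_total"])
    return fallbacks / requests if requests else 0.0


# Hypothetical counter values from two consecutive scrapes.
earlier = {"requests_total": 1000, "fallback_attempts_total": 10}
later = {"requests_total": 1500, "fallback_attempts_total": 60}

ratio = fallback_ratio(earlier, later)
```

A Prometheus-style backend would normally express the same calculation with rate functions in a recording or alerting rule; the Python version only makes the arithmetic explicit.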

Errors and provider behaviour

| Family | Type | Purpose |
| --- | --- | --- |
| Provider errors | Counter | Provider-side failures normalised into the gateway's error categories |
| Gateway errors | Counter | Gateway-side failures, such as malformed requests or unknown models |
| Rate-limit responses | Counter | The subset of provider errors classified as rate-limit responses (HTTP 429) |
| Timeout responses | Counter | The subset of provider errors classified as timeouts |

Standard labels

Every metric family carries the same core label set. Where the cardinality cost is acceptable, additional labels are exposed for finer-grained queries.

| Label | Cardinality | Description |
| --- | --- | --- |
| instance_name | Low | The platform instance name, useful when multiple instances export to the same destination |
| resolved_model | Medium | The model that served the request, post-routing |
| resolved_provider | Low | The provider that served the request, post-routing |
| api_key_id | High | The API key identifier; bound by the number of keys in the deployment |
| http_status | Low | The HTTP status code returned to the client |
| requested_model | Medium | The model the client asked for, before routing resolution. Present where it differs from resolved_model |

The higher-cardinality labels (api_key_id and, to a lesser degree, requested_model) are valuable for per-key dashboards and chargeback reporting but can be expensive in some metrics backends. Where cardinality is a constraint, the in-Console Usage Analytics surface aggregates the same dimensions internally without requiring the labels to be exported.
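The cost of a label can be estimated before exporting it. For a single metric family, the number of time series is bounded above by the product of the distinct values of each label. The sketch below applies that bound; the per-label counts are invented for the example.

```python
import math


def max_series(distinct_values_per_label):
    """Upper bound on the time series of one metric family.

    Each combination of label values is a separate series, so the bound is
    the product of the distinct-value counts. Real deployments usually sit
    below it because not every combination occurs.
    """
    return math.prod(distinct_values_per_label.values())


# Hypothetical deployment: counts of distinct values observed per label.
core_labels = {
    "instance_name": 2,
    "resolved_model": 10,
    "resolved_provider": 3,
    "http_status": 6,
}

without_keys = max_series(core_labels)
with_keys = max_series({**core_labels, "api_key_id": 500})
```

In this example, adding api_key_id multiplies the bound by the number of keys, which is why the label dominates the cardinality budget.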


Sampling

Trace sampling is configured on the OTel Export screen. The default is 100 %; high-volume production deployments typically sample lower. Metrics are not sampled; every request is counted in every relevant family, regardless of the trace sampling rate.

For a deployment that sets the sampling rate below 100 %, traces are a sample of the activity and metrics are the complete record. Investigations that need every request (forensic audit, billing reconciliation) should use Request Logs or the metric stream rather than the trace stream; investigations that need execution detail (latency breakdown, fallback walks) use the trace stream.
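The relationship between the two streams can be checked numerically. Assuming uniform random sampling at a known rate, the sampled trace count divided by the rate should approximate the request counter from the metric stream. The sketch below makes that comparison; the function names and figures are hypothetical.

```python
def estimate_total_requests(sampled_traces, sampling_rate):
    """Extrapolate the request count from a sampled trace count.

    `sampling_rate` is a fraction in (0, 1], for example 0.05 for 5 %.
    """
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    return round(sampled_traces / sampling_rate)


def sampling_drift(sampled_traces, sampling_rate, metric_count):
    """Relative difference between the extrapolated and the metric-based count."""
    if metric_count == 0:
        return 0.0
    estimate = estimate_total_requests(sampled_traces, sampling_rate)
    return (estimate - metric_count) / metric_count


# Hypothetical window: 250 traces kept at a 5 % rate, 5100 requests counted by metrics.
estimate = estimate_total_requests(250, 0.05)
drift = sampling_drift(250, 0.05, 5100)
```

A small drift is expected from sampling noise. The extrapolated figure is an estimate only, which is why billing and audit work should rely on the metric stream or Request Logs.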