Build a chat app

Build a streaming chat application that uses Agent Router Enterprise for model routing, with multi-turn conversation support, streaming responses, and automatic failover.


Info: All code snippets in this guide are Python. Any OpenAI SDK is supported; Python is not required.

Architecture

Requests flow from the frontend through the application backend to the platform, which routes to a provider and streams the response back over server-sent events (SSE).

Built by the application | Handled by the platform
Frontend UI, conversation state, API route | Provider routing, streaming, failover, cost tracking

Step 1 - Set up the client

This step stands up the backend that the chat application calls. A FastAPI application exposes a single POST /api/chat route, and an OpenAI client is configured with the platform API key and base URL. On each request, the route reads the message array from the request body, opens a streaming completion against the platform, and relays each content delta to the caller as a server-sent event.

The stream_options={"include_usage": True} flag requests token-usage data in the final chunk. The result is returned as a StreamingResponse with the text/event-stream media type, which lets the frontend render the reply as it arrives rather than after the full response completes.

Python
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from openai import OpenAI
import os
import json

app = FastAPI()
client = OpenAI(
    # Environment variable name is an assumption; use whichever
    # variable holds your platform API key.
    api_key=os.environ["AGENT_ROUTER_API_KEY"],
    base_url="https://api.router.tetrate.ai/v1",
)

@app.post("/api/chat")
async def chat(request: Request):
    body = await request.json()
    messages = body.get("messages", [])

    # Plain (sync) generator: StreamingResponse iterates it in a worker
    # thread, so the blocking OpenAI client does not stall the event loop.
    def generate():
        stream = client.chat.completions.create(
            model="gpt-5.5",
            messages=messages,
            stream=True,
            stream_options={"include_usage": True},
        )
        for chunk in stream:
            # The usage-only final chunk has an empty choices list
            delta = chunk.choices[0].delta if chunk.choices else None
            if delta and delta.content:
                yield f"data: {json.dumps({'content': delta.content})}\n\n"
            # Usage comes in the final chunk
            usage = getattr(chunk, "usage", None)
            if usage:
                counts = {
                    "prompt_tokens": usage.prompt_tokens,
                    "completion_tokens": usage.completion_tokens,
                }
                yield f"data: {json.dumps({'usage': counts})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
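
To show what a caller sees on the wire, here is a minimal sketch of a consumer for the events the route emits. It assumes exactly the event shapes used in the route above (`content`, `usage`, and the `[DONE]` sentinel). `parse_sse_lines` and `collect_reply` are hypothetical helper names, and a browser frontend would more likely read the stream with its own fetch tooling.

```python
import json
from typing import Iterable, Iterator


def parse_sse_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data: "):
            continue  # skip the blank separator lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_reply(lines: Iterable[str]) -> tuple[str, dict]:
    """Join the content deltas into the full reply and keep the usage event."""
    text_parts: list[str] = []
    usage: dict = {}
    for event in parse_sse_lines(lines):
        if "content" in event:
            text_parts.append(event["content"])
        elif "usage" in event:
            usage = event["usage"]
    return "".join(text_parts), usage
```

The `lines` argument can be any iterable of decoded text lines, such as the line iterator of a streaming HTTP response.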

Step 2 - Multi-turn conversations

This step gives the assistant memory of the conversation so far. The chat completions API is stateless and retains nothing between calls, so the full conversation must accompany every request. A running conversation list holds that history: the system prompt seeds it, each user message is appended before the request, and each assistant reply is appended after it. Every turn therefore sends the complete dialogue, which is what lets the model resolve follow-up questions such as "Can you give me an example?" against everything said earlier. The same array is what the streaming route in Step 1 receives from the frontend.

The example below is non-streaming for brevity and keeps the history in a module-level list:

Python
# Reuses the client configured in Step 1
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
]

def chat_turn(user_message: str) -> str:
    conversation.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="gpt-5.5",
        messages=conversation,
    )

    assistant_message = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": assistant_message})
    return assistant_message

# Each turn includes full history
print(chat_turn("What's Agent Router?"))
print(chat_turn("How does fallback routing work?"))
print(chat_turn("Can you give me an example?"))
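
When the reply is streamed, as in Step 1, the assistant message has to be reassembled from its deltas before it can be appended to the history. A minimal sketch of that bookkeeping follows. `append_streamed_reply` is a hypothetical helper, and the `deltas` list stands in for the content strings pulled from a real stream.

```python
from typing import Iterable


def append_streamed_reply(conversation: list[dict], deltas: Iterable[str]) -> str:
    """Reassemble a streamed reply and record it as the assistant turn."""
    assistant_message = "".join(deltas)
    conversation.append({"role": "assistant", "content": assistant_message})
    return assistant_message


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say hello."},
]
# Placeholder deltas; in practice these arrive one by one from the stream
reply = append_streamed_reply(conversation, ["Hel", "lo"])
```

Appending only after the stream finishes keeps a partial reply out of the history if the stream is interrupted.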

Step 3 - Add fallback for production

This step protects the chat application against a single provider failing. A fallback policy is an ordered list of providers attached to the API key. When the primary provider returns a recoverable error, a rate-limit response, or a timeout, the platform automatically retries the next provider in the list, so one outage does not interrupt the conversation. The policy is configured in the platform's dashboard rather than in code, which is why the backend from Step 1 stays unchanged: the same chat.completions.create call gains failover the moment the policy is in place.
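
The ordered-fallback behavior can be pictured with a small simulation. This is purely illustrative and is not the platform's implementation: the provider names, the `RecoverableError` class, and `call_with_fallback` are all hypothetical, and none of this code needs to run in the application.

```python
from typing import Callable, Optional


class RecoverableError(Exception):
    """Stands in for a recoverable error, rate-limit response, or timeout."""


def call_with_fallback(
    providers: list[tuple[str, Callable[[], str]]]
) -> tuple[str, str]:
    """Try providers in order; return (name, reply) from the first success."""
    last_error: Optional[Exception] = None
    for name, call in providers:
        try:
            return name, call()
        except RecoverableError as exc:
            last_error = exc  # move on to the next provider in the list
    raise RuntimeError("all providers failed") from last_error


def primary() -> str:
    raise RecoverableError("rate limited")  # simulated outage


def secondary() -> str:
    return "ok"


served_by, reply = call_with_fallback(
    [("primary", primary), ("secondary", secondary)]
)
```

Here the simulated primary fails, so the reply is served by the second entry in the list, which mirrors what the policy does on the platform side.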

Step 4 - Track usage

This step surfaces what each conversation costs. Because every request passes through the platform, token usage and cost are recorded centrally, with no per-application instrumentation. When streaming with stream_options={"include_usage": True}, as set in Step 1, the prompt and completion token counts are included in the final SSE event, even for providers that do not report them natively, so the frontend can show a token or cost figure as soon as the reply finishes.
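
As a sketch of how an application might turn the final usage event into a displayed figure, the function below multiplies the token counts by per-million-token prices. The function name and both prices are hypothetical placeholders, not actual rates for any model; authoritative cost figures come from the platform.

```python
def estimate_cost_usd(
    usage: dict,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate the cost of one reply from its usage event."""
    prompt_cost = usage["prompt_tokens"] * input_price_per_million / 1_000_000
    completion_cost = (
        usage["completion_tokens"] * output_price_per_million / 1_000_000
    )
    return prompt_cost + completion_cost


# Shape matches the usage event emitted by the Step 1 route
usage_event = {"prompt_tokens": 1200, "completion_tokens": 300}

# Placeholder prices in USD per million tokens (assumed, not real rates)
cost = estimate_cost_usd(
    usage_event,
    input_price_per_million=2.0,
    output_price_per_million=8.0,
)
print(f"{cost:.4f}")  # prints 0.0048
```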

Per-model cost breakdowns are available in the platform's dashboard for review after the fact, or through the Usage API for programmatic reporting.