Build a chat app
Build a streaming chat application that uses Agent Router Enterprise for model routing, with multi-turn conversation support, streaming responses, and automatic failover.
The code snippets in this guide are in Python, but Python is not required: any OpenAI SDK is supported.
Architecture
Requests flow from the frontend through the application backend to the platform, which routes to a provider and streams the response back over server-sent events (SSE).
| Built by the application | Handled by the platform |
|---|---|
| Frontend UI, conversation state, API route | Provider routing, streaming, failover, cost tracking |
Step 1 - Set up the client
This step stands up the backend that the chat application calls. A FastAPI application exposes a single POST /api/chat route, and an OpenAI client is configured with the platform API key and base URL. On each request, the route reads the message array from the request body, opens a streaming completion against the platform, and relays each token to the caller as a server-sent event. The stream_options={"include_usage": True} flag requests token-usage data in the final chunk, and the result is returned as a StreamingResponse with the text/event-stream media type, which is what allows the frontend to render the reply as it arrives rather than after the full response completes.
import json
import os

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()

# The environment variable name is an assumption; use whichever variable
# holds your platform API key.
client = OpenAI(
    api_key=os.environ["AGENT_ROUTER_API_KEY"],
    base_url="https://api.router.tetrate.ai/v1",
)


@app.post("/api/chat")
async def chat(request: Request):
    body = await request.json()
    messages = body.get("messages", [])

    # A plain (sync) generator: StreamingResponse iterates it in a worker
    # thread, so the blocking SDK calls do not stall the event loop.
    def generate():
        stream = client.chat.completions.create(
            model="gpt-5.5",
            messages=messages,
            stream=True,
            stream_options={"include_usage": True},
        )
        for chunk in stream:
            delta = chunk.choices[0].delta if chunk.choices else None
            if delta and delta.content:
                yield f"data: {json.dumps({'content': delta.content})}\n\n"
            # Usage comes in the final chunk
            chunk_usage = getattr(chunk, "usage", None)
            if chunk_usage:
                usage = {
                    "prompt_tokens": chunk_usage.prompt_tokens,
                    "completion_tokens": chunk_usage.completion_tokens,
                }
                yield f"data: {json.dumps({'usage': usage})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")
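To check what the route emits, it can help to parse the event stream on the receiving side. The following is a minimal sketch, assuming the three event shapes produced in Step 1 (a `content` delta, a `usage` object, and the `[DONE]` sentinel). The helper names `parse_sse_lines` and `collect_reply` are hypothetical, and the sample lines stand in for a live response so the snippet runs without a server.

```python
import json
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_sse_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_reply(lines: Iterable[str]) -> Tuple[str, Optional[dict]]:
    """Join content deltas into the full reply and keep the usage event, if any."""
    parts: List[str] = []
    usage: Optional[dict] = None
    for event in parse_sse_lines(lines):
        if "content" in event:
            parts.append(event["content"])
        if "usage" in event:
            usage = event["usage"]
    return "".join(parts), usage


# Sample stream in the shape the Step 1 route produces (values are made up).
sample = [
    'data: {"content": "Hel"}',
    "",
    'data: {"content": "lo"}',
    "",
    'data: {"usage": {"prompt_tokens": 12, "completion_tokens": 2}}',
    "",
    "data: [DONE]",
]
reply, usage = collect_reply(sample)
print(reply, usage)
```

Against a running backend, the same functions could be fed the lines of a streamed HTTP response instead of the sample list. A real frontend would render each delta as it arrives rather than joining them at the end.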
Step 2 - Multi-turn conversations
This step gives the assistant memory of the conversation so far. The chat completions API is stateless and retains nothing between calls, so the full conversation must accompany every request. A running conversation list holds that history: the system prompt seeds it, each user message is appended before the request, and each assistant reply is appended after it. Every turn therefore sends the complete dialogue, which is what lets the model resolve follow-up questions such as "Can you give me an example?" against everything said earlier. The same array is what the streaming route in Step 1 receives from the frontend.
The example below keeps that history in a list and sends it in full on each call:
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
]


def chat_turn(user_message: str) -> str:
    conversation.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="gpt-5.5",
        messages=conversation,
    )

    assistant_message = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": assistant_message})
    return assistant_message


# Each turn includes full history
print(chat_turn("What's Agent Router?"))
print(chat_turn("How does fallback routing work?"))
print(chat_turn("Can you give me an example?"))
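To see how the history grows without calling the API, the completion call can be swapped for a stand-in. This is a sketch under stated assumptions: `make_chat` and `fake_reply` are hypothetical names, and `fake_reply` only reports how many messages it was given rather than producing a real answer.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def make_chat(reply_fn: Callable[[List[Message]], str], system_prompt: str):
    """Return (chat_turn, conversation) bound to their own history list."""
    conversation: List[Message] = [{"role": "system", "content": system_prompt}]

    def chat_turn(user_message: str) -> str:
        conversation.append({"role": "user", "content": user_message})
        # Pass a copy so the stand-in sees exactly what would be sent.
        assistant_message = reply_fn(list(conversation))
        conversation.append({"role": "assistant", "content": assistant_message})
        return assistant_message

    return chat_turn, conversation


def fake_reply(messages: List[Message]) -> str:
    # Hypothetical stand-in for the model: reports how much history it received.
    return f"received {len(messages)} messages"


chat_turn, conversation = make_chat(fake_reply, "You are a helpful assistant.")
first = chat_turn("What's Agent Router?")
second = chat_turn("Can you give me an example?")
print(first)
print(second)
print([m["role"] for m in conversation])
```

Binding the history to a closure instead of a module-level list is one way to keep separate conversations from sharing state, which matters once the backend serves more than one user.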
Step 3 - Add fallback for production
This step protects the chat application against a single provider failing. A fallback policy is an ordered list of providers attached to the API key. When the primary provider returns a recoverable error, a rate-limit response, or a timeout, the platform automatically retries the request with the next provider in the list, so one outage does not interrupt the conversation. The policy is configured in the platform's dashboard rather than in code, which is why the backend from Step 1 stays unchanged: the same chat.completions.create call gains failover the moment the policy is in place.
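The platform performs failover itself, so none of this logic belongs in the application. Purely to illustrate what an ordered fallback list means, here is a conceptual sketch. It is not the platform's implementation, and `RecoverableError`, `call_with_fallback`, and the two provider functions are hypothetical names invented for the example.

```python
from typing import Callable, List, Sequence, Tuple


class RecoverableError(Exception):
    """Hypothetical marker for errors worth retrying elsewhere (rate limit, timeout)."""


Provider = Tuple[str, Callable[[str], str]]


def call_with_fallback(providers: Sequence[Provider], prompt: str) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, reply) from the first success."""
    failures: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except RecoverableError as exc:
            failures.append(f"{name}: {exc}")  # remember why, then move on
    raise RuntimeError("all providers failed: " + "; ".join(failures))


def primary(prompt: str) -> str:
    raise RecoverableError("rate limited")  # simulated outage


def secondary(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = call_with_fallback([("primary", primary), ("secondary", secondary)], "hi")
print(used, reply)
```

Only errors marked recoverable trigger the next provider in this sketch; any other exception propagates immediately, which mirrors the idea that failover applies to transient provider problems rather than to every failure.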
Step 4 - Track usage
This step surfaces what each conversation costs. Because every request passes through the platform, token usage and cost are recorded centrally, with no per-application instrumentation. When streaming with stream_options={"include_usage": True}, as requested in Step 1, the prompt and completion token counts are included in the final SSE event, even for providers that do not include them natively, so the frontend can update a running token or cost figure at the end of each reply.
Per-model cost breakdowns are available in the platform's dashboard for review after the fact, or through the Usage API for programmatic reporting.
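A running figure of the kind described in this step could be kept by summing the usage events from each reply. The following is a minimal sketch: `UsageTally` is a hypothetical helper, and the per-token prices are placeholder values, not actual platform or provider pricing.

```python
from dataclasses import dataclass


@dataclass
class UsageTally:
    """Accumulates token counts across turns and estimates cost."""

    prompt_price_per_1k: float      # assumed price per 1,000 prompt tokens
    completion_price_per_1k: float  # assumed price per 1,000 completion tokens
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def add(self, usage: dict) -> None:
        """Add one usage event shaped like the one emitted in Step 1."""
        self.prompt_tokens += usage.get("prompt_tokens", 0)
        self.completion_tokens += usage.get("completion_tokens", 0)

    @property
    def estimated_cost(self) -> float:
        return (
            self.prompt_tokens / 1000 * self.prompt_price_per_1k
            + self.completion_tokens / 1000 * self.completion_price_per_1k
        )


# Placeholder prices and made-up usage events, for illustration only.
tally = UsageTally(prompt_price_per_1k=0.5, completion_price_per_1k=1.0)
tally.add({"prompt_tokens": 1000, "completion_tokens": 500})
tally.add({"prompt_tokens": 3000, "completion_tokens": 500})
print(tally.prompt_tokens, tally.completion_tokens, tally.estimated_cost)
```

Because each turn resends the full history, prompt tokens tend to grow with every turn. A client-side tally like this is only an estimate; the figures recorded by the platform remain the reference.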
Where to go next