Generate embeddings through the gateway
Introduction
Embeddings sit underneath semantic search, retrieval-augmented generation, clustering, deduplication, and recommendation: the workloads that turn a corpus of text into something a model can reason over. The mechanics are the same as any other gateway call: an OpenAI-compatible request is sent to one endpoint, an API key authenticates it, and the platform routes it to the enabled embedding model named in the request. Routing embedding traffic through Tetrate Agent Router means the same governance, cost tracking, and provider independence that apply to chat traffic apply to embeddings too: one endpoint, one credential, and one place to see what was spent and which model produced which vectors.

This guide covers the embedding path end to end: why an embedding model is reached through the platform rather than directly, how an enabled embedding model is selected from the catalogue, how the OpenAI-compatible embeddings endpoint is called from curl and from Python, how a batch of inputs is embedded in a single call, how the resulting vectors feed a semantic-search or RAG pipeline, why an embedding-model version is pinned so that vectors stay comparable over time, and how the call is confirmed in Request Logs.

It builds directly on the key and routing setup from Route requests across providers; the embeddings endpoint reuses that same key and proxy endpoint.
Persona: Developer working in the Agent Router Console and in the application's own code.
Estimated time: 15–20 minutes for the first run, including time to select a model in the Console and copy values into a terminal.
When this guide applies
This guide applies whenever an application needs vector representations of text and the goal is to obtain them through the platform rather than from a provider directly. Typical situations:
- A retrieval-augmented generation pipeline embeds documents at ingestion time and embeds queries at request time.
- A semantic-search feature ranks results by vector similarity instead of keyword match.
- A clustering, deduplication, or classification job needs a stable numeric representation of a text corpus.
- Embedding spend and usage need to be tracked and governed alongside chat traffic rather than billed and audited separately.
The one precondition is that an embedding model has been enabled by an operator. Chat models and embedding models are provisioned the same way, but an embedding model has to be exposed in the catalogue before it can be selected.
Outcomes
By the end of this guide:
- An enabled embedding model has been identified in the Console catalogue, with its exact identifier noted for use in requests.
- A request to POST /v1/embeddings returns one or more embedding vectors through the gateway.
- A batch of inputs has been embedded in a single call.
- The returned vectors are understood well enough to feed a semantic-search or RAG pipeline.
- An embedding-model version is pinned so that vectors generated now remain comparable to vectors generated later.
- The embedding call is visible in Request Logs, with the resolved model, token counts, latency, and cost recorded.
Prerequisites
- A working API key with a routing configuration attached, as set up in Route requests across providers. The embeddings endpoint uses the same key and proxy endpoint as chat traffic.
- At least one embedding model enabled in the catalogue. Operators enable embedding models the same way they enable chat models; see Provision custom and self-hosted models. If no embedding model appears in the catalogue, that step has to happen first.
- The gateway's proxy endpoint URL, displayed on the Console Dashboard. The examples below refer to it as YOUR_GATEWAY_URL and use YOUR_API_KEY for the key.
- A terminal with curl, or a Python environment with the openai package, for the request steps.
Step 1: select an embedding model from the catalogue
Embedding models are listed in the same model catalogue as chat models, alongside their provider, pricing, and status. The identifier shown there is the value that goes in the model field of an embeddings request, so the first task is to find an enabled embedding model and note its exact name.
- Sign in to the Agent Router Console.
- Open the model catalogue (Catalog → Model Catalog).
- Filter the list to embedding models. Searching for a known family name, for example embedding, narrows the list quickly, as does filtering by provider.
- Confirm the model's Status is enabled. A disabled model cannot be reached even if its identifier is used in a request.
- Note the exact model identifier and the output vector dimension. Both matter downstream: the identifier is sent on every request, and the dimension determines the width of the vector column in the vector store.
Embedding models differ from chat models in what they accept and return. An embedding model takes text and returns a fixed-length vector of floating-point numbers; it does not take a messages array and does not produce a chat completion. Two properties recorded in the catalogue are worth carrying forward:
| Property | Why it matters |
|---|---|
| Model identifier | The value sent in the model field. Vectors are only comparable when they come from the same identifier. |
| Output dimension | The length of every returned vector. The vector store's column width is sized to this value. |
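One way to carry these two properties into application code is a small configuration record that also checks returned vectors against the expected dimension. This is a minimal sketch, not part of the platform: the class name EmbeddingModelConfig is hypothetical, and the dimension of 1536 is a placeholder to be replaced with the value shown in the catalogue.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EmbeddingModelConfig:
    """Hypothetical record of the two catalogue properties noted in Step 1."""

    identifier: str  # exact value from the catalogue, sent in the "model" field
    dimension: int   # output dimension shown in the catalogue

    def check(self, vector: Sequence[float]) -> None:
        """Raise ValueError if a returned vector has an unexpected length."""
        if len(vector) != self.dimension:
            raise ValueError(
                f"{self.identifier}: expected {self.dimension} values, "
                f"got {len(vector)}"
            )


# Placeholder values: substitute the identifier and dimension from the catalogue.
config = EmbeddingModelConfig(identifier="EMBEDDING_MODEL", dimension=1536)
config.check([0.0] * 1536)  # passes silently when the length matches
```

Running the check before each write to the vector store turns a dimension mismatch into an immediate error instead of a failed insert later.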
Step 2: call the embeddings endpoint
The gateway exposes an OpenAI-compatible endpoint at /v1/embeddings. The request shape is the standard OpenAI embeddings payload, a model field and an input field, and the API key from the routing setup is presented as a bearer token. The same endpoint serves every enabled embedding model regardless of the upstream provider.
The exact gateway URL is environment-specific and is supplied during installation. The examples below assume YOUR_GATEWAY_URL is substituted with that value, YOUR_API_KEY with the API key, and EMBEDDING_MODEL with the identifier noted in Step 1.
Using curl
curl https://YOUR_GATEWAY_URL/v1/embeddings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "EMBEDDING_MODEL",
    "input": "The quick brown fox jumps over the lazy dog."
  }'
Using Python
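from openai import OpenAI

# Point the OpenAI client at the gateway instead of the provider.
client = OpenAI(
    base_url="https://YOUR_GATEWAY_URL/v1",
    api_key="YOUR_API_KEY",
)

# Embed a single string with the model identifier noted in Step 1.
response = client.embeddings.create(
    model="EMBEDDING_MODEL",
    input="The quick brown fox jumps over the lazy dog.",
)

vector = response.data[0].embedding
print(len(vector))  # the model's output dimension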
A successful call returns a response in the same shape the calling code would expect from OpenAI directly. The vector is found under data[0].embedding, and usage.prompt_tokens reports the tokens consumed:
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0023, -0.0091, 0.0145]
    }
  ],
  "model": "EMBEDDING_MODEL",
  "usage": {
    "prompt_tokens": 11,
    "total_tokens": 11
  }
}
The embedding array above is truncated for readability; a real response contains as many floating-point values as the model's output dimension. For the full embeddings request and response reference, see Supported APIs.
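For code that handles the raw HTTP response rather than the openai client object, the fields described above can be read with the standard library alone. The sketch below assumes a body in exactly the shape shown in this step; the function name parse_embeddings is illustrative, and the sample body reuses the truncated three-value embedding.

```python
import json

# Sample body in the shape shown above; the embedding is truncated to three values.
RAW_BODY = """
{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 0, "embedding": [0.0023, -0.0091, 0.0145]}
  ],
  "model": "EMBEDDING_MODEL",
  "usage": {"prompt_tokens": 11, "total_tokens": 11}
}
"""


def parse_embeddings(body: str):
    """Return (vectors in input order, model name, prompt tokens) from a response body."""
    payload = json.loads(body)
    # Sort by index so each vector lines up with its position in the input.
    items = sorted(payload["data"], key=lambda item: item["index"])
    vectors = [item["embedding"] for item in items]
    return vectors, payload["model"], payload["usage"]["prompt_tokens"]


vectors, model, prompt_tokens = parse_embeddings(RAW_BODY)
print(len(vectors), model, prompt_tokens)
```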
Step 3: embed a batch of inputs in one call
Embedding workloads are rarely one string at a time. Ingesting a corpus means embedding thousands of chunks, and issuing one HTTP request per chunk is slow and wasteful. The input field accepts an array, so a batch of texts is embedded in a single call. Each returned object carries an index that maps it back to its position in the input array, so order is preserved without extra bookkeeping.
Using curl
curl https://YOUR_GATEWAY_URL/v1/embeddings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "EMBEDDING_MODEL",
    "input": [
      "The quick brown fox jumps over the lazy dog.",
      "A fast auburn fox leaps above a sleepy hound.",
      "Interest rates were left unchanged at the latest meeting."
    ]
  }'
Using Python
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR_GATEWAY_URL/v1",
    api_key="YOUR_API_KEY",
)

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast auburn fox leaps above a sleepy hound.",
    "Interest rates were left unchanged at the latest meeting.",
]

# Passing a list embeds every document in a single request.
response = client.embeddings.create(
    model="EMBEDDING_MODEL",
    input=documents,
)

# Sort by index so vectors[i] corresponds to documents[i].
vectors = [item.embedding for item in sorted(response.data, key=lambda d: d.index)]
print(len(vectors), len(vectors[0]))  # number of inputs, output dimension
Batching reduces request overhead, and a batch call is counted as a single entry in Request Logs, with token usage aggregated across the batch. Provider limits apply to the number of inputs and the total tokens accepted per call, so very large corpora are split into batches sized to stay within those limits. Work that runs long enough to need scheduling and resumption is better handled as a job; see Run batch and long-running jobs.
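Splitting a corpus into bounded batches can be sketched as follows. The helper names batches and embed_corpus are illustrative, and the default of 64 inputs per call is an arbitrary placeholder rather than a documented limit. The sketch bounds only the number of inputs, so a per-call token limit would need a separate check. The embed_batch argument is a hypothetical callable that wraps the embeddings call and returns one vector per input, in input order.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batches(texts: Sequence[str], max_inputs: int) -> Iterator[List[str]]:
    """Yield consecutive slices of texts, each holding at most max_inputs items."""
    if max_inputs < 1:
        raise ValueError("max_inputs must be at least 1")
    for start in range(0, len(texts), max_inputs):
        yield list(texts[start:start + max_inputs])


def embed_corpus(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    max_inputs: int = 64,  # placeholder; size this to the provider's real limit
) -> List[Vector]:
    """Embed texts batch by batch and return the vectors in the original order."""
    vectors: List[Vector] = []
    for batch in batches(texts, max_inputs):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise RuntimeError("embedding count does not match batch size")
        vectors.extend(result)
    return vectors


# Hypothetical wiring with the client from the Python example above:
# embed_batch = lambda batch: [
#     d.embedding
#     for d in sorted(
#         client.embeddings.create(model="EMBEDDING_MODEL", input=batch).data,
#         key=lambda d: d.index,
#     )
# ]
```

Keeping the network call behind embed_batch means the batching logic can be exercised with a stub before any real traffic is sent.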
Step 4: use the vectors for semantic search and RAG
A vector on its own is not useful; value comes from comparing vectors. Texts with similar meaning produce vectors that sit close together, so similarity between two vectors approximates similarity in meaning. Cosine similarity is the usual measure. The end-to-end shape of a retrieval pipeline is consistent regardless of which embedding model produced the vectors:
- Each document in the corpus is split into chunks sized to the model's input limit and the retrieval granularity required.
- Every chunk is embedded (in batches, as in Step 3) and each vector is stored in a vector store alongside the source text and any metadata.
- At query time, the incoming query is embedded with the same model and version used for the corpus.
- The query vector is compared against the stored vectors, and the nearest matches are retrieved.
- For RAG, the retrieved text is supplied as context to a chat completion sent through the same gateway; see Route requests across providers.
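The comparison step in the pipeline above can be sketched in plain Python. This is an illustrative linear scan over an in-memory list, not how a production vector store works (those typically use an index for nearest-neighbour search). The function names and the two-dimensional toy vectors are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    corpus: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query, vector), text) for text, vector in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy two-dimensional vectors standing in for real embeddings.
corpus = [
    ("fox and dog", [0.9, 0.1]),
    ("interest rates", [0.0, 1.0]),
]
print(top_k([1.0, 0.0], corpus, k=1))
```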
The constraint that governs the whole pipeline is consistency: query vectors and corpus vectors are only comparable when they come from the same model and the same version. Mixing vectors from different models, or from different versions of the same model, produces meaningless distances. That constraint is what makes versioning in the next step a requirement rather than a nicety.
Step 5: pin an embedding-model version
Embedding vectors are comparable only within a single model version. When a provider releases a new version of an embedding model, vectors produced by the new version do not align with vectors produced by the old one; the same text maps to a different point in a different space. A corpus embedded under one version and a query embedded under another will not retrieve correctly, even though both calls succeed and both return vectors of the expected dimension.
Two practices keep a vector store internally consistent:
- Pin a specific version. Where the catalogue exposes a dated or otherwise versioned identifier, that exact identifier is used for every embedding call against a given store, rather than a floating alias that may advance to a newer version. Pinning guarantees that today's query vectors and last month's corpus vectors share the same space.
- Re-embed on a deliberate change. Moving to a different embedding model, or to a new version of the current one, is a corpus-wide operation: the entire corpus is re-embedded with the new model, and queries are switched to it only once that re-embedding is complete. Re-embedding is planned as a migration (run as a batch job, written to a separate index, and cut over atomically) rather than applied piecemeal.
Recording the model identifier and version as metadata alongside each stored vector makes a later migration straightforward: the records embedded under the old version are identifiable, and the cutover can be verified. Because the platform records the resolved model on every request, Request Logs also provides an independent record of which model actually served each embedding call.
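Storing the model identifier with each vector might look like the sketch below. The record type StoredVector, the helper names, and the identifiers example-embed-v1 and example-embed-v2 are all hypothetical; a real vector store would hold the same information in a metadata column or payload field.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class StoredVector:
    """Hypothetical stored record: source text, vector, and producing model."""

    text: str
    vector: Sequence[float]
    model: str  # exact, pinned identifier used to produce the vector


def assert_same_model(records: Sequence[StoredVector], query_model: str) -> None:
    """Refuse to query a store that holds vectors from any other model."""
    mismatched = sorted({r.model for r in records if r.model != query_model})
    if mismatched:
        raise ValueError(
            f"store contains vectors from {mismatched}; query uses {query_model!r}"
        )


def needs_reembedding(
    records: Sequence[StoredVector], target_model: str
) -> List[StoredVector]:
    """List the records that a migration to target_model still has to re-embed."""
    return [r for r in records if r.model != target_model]


store = [
    StoredVector("first chunk", [0.1, 0.2], "example-embed-v1"),
    StoredVector("second chunk", [0.3, 0.4], "example-embed-v1"),
]
assert_same_model(store, "example-embed-v1")  # passes: one model throughout
print(len(needs_reembedding(store, "example-embed-v2")))
```

During a migration, needs_reembedding gives the remaining work, and assert_same_model on the new index confirms the cutover is safe once that list is empty.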
Step 6: verify the call in Request Logs
Issuing an embeddings request is not the same as confirming which model served it and what it cost. Request Logs is the developer-facing record of every request that flowed through the gateway under a given API key, embeddings included.
- In the Console, open Request Logs (Monitoring → Request Logs).
- Locate the embeddings request, the most recent entry under the API key used for the call.
- Expand the row to view the detail panel.
- Confirm the following fields are populated and consistent with the model selected in Step 1:
| Field | What to check |
|---|---|
| Resolved model | Matches the embedding model identifier sent in the request, including its pinned version. |
| Provider | Matches the upstream provider for that model. |
| Token counts | Input tokens are present and non-zero; embeddings report prompt tokens only, with no completion tokens. |
| Latency | The end-to-end time the gateway observed for the call. |
| Cost | The computed cost based on the resolved model and token usage. |
A batch call from Step 3 appears as a single entry, with token usage aggregated across every input in the batch. Confirming the resolved model and version here is the fastest way to catch a request that was unintentionally sent to the wrong embedding model, the failure mode that silently corrupts a vector store. The aggregated view of embedding spend and volume by model and key is covered in Monitor traffic and usage.
What to do next
- Route requests across providers: the key and routing setup the embeddings endpoint reuses, and the path for the chat completions that consume retrieved context in a RAG pipeline.
- Run batch and long-running jobs: embed a large corpus or re-embed after a version change as a scheduled, resumable job rather than a single request.
- Provision custom and self-hosted models: the operator-side work that enables an embedding model, including self-hosted and custom embedding endpoints.
- Supported APIs: the full request and response reference for /v1/embeddings and the other supported formats.