Provision custom, self-hosted, and embedding models
Most provisioning work in Tetrate Agent Router starts with vendor models: the GPT, Claude, and Gemini families served through their respective providers. A point arrives, though, where the organisation's own models matter as much as the vendor catalogue: a fine-tuned model on internal data, a small self-hosted model for thousands of cheap classification calls, or an embedding model for search and retrieval-augmented generation. The platform treats a custom or self-hosted model the same way it treats a vendor model, as a model served by a provider, with the difference only in where the provider points. As long as that endpoint speaks the OpenAI-compatible protocol, the convention the platform's routing layer expects, the model behaves like any other entry in the catalogue once it is enabled. This guide covers registering such a model, the specific case of embedding models, how to version a model so that developers are insulated from churn, and how to confirm the result from the developer side.
Persona: Platform operator working in the Admin Dashboard.
Estimated time: 15--25 minutes per model, plus any upstream work to stand up or secure the endpoint itself.
When this guide applies
This guide is the right reference in any of these situations:
| Situation | Why this guide helps |
|---|---|
| A self-hosted or fine-tuned LLM needs to be routed through the gateway | The model is registered as a custom provider pointing at the endpoint that serves it |
| A small language model is wanted for cheap, low-latency tasks alongside frontier models | The small model is provisioned exactly like any other model and selected through routing rules |
| A search or RAG feature needs an embedding model | Embedding models are provisioned through the same provider-and-model layer, with their own verification path |
| A model is being upgraded and developers should not have to change their code | A stable logical name pinned to an explicit version absorbs the change |
| An embedding model is being replaced | The re-embedding implications have to be understood before the swap, not after |
For provisioning vendor models and the general mechanics of the provider-and-model layer, see Provision models and providers. For pointing a provider at an endpoint reached over a private cloud connection, see Connect provider subscriptions across clouds.
Outcomes
By the end of this guide:
- A custom or self-hosted model that exposes an OpenAI-compatible endpoint is registered as a provider and enabled as a model.
- An embedding model is provisioned and distinguished from chat-completion models in the catalogue.
- A versioning scheme is in place, with explicit versions pinned underneath a stable logical name, so that model upgrades do not break developer code.
- The new model is exposed to the intended audience and no wider.
- The model has been confirmed reachable through a developer-side request.
Prerequisites
- Administrator access to the Admin Dashboard, typically the
super_adminorprovider_adminrole. The role model is covered in Provision models and providers. - An endpoint that serves the model and speaks the OpenAI-compatible protocol: a self-hosted inference server, a fine-tuned deployment, or a third-party host. The endpoint must be reachable from the data plane.
- The credentials the endpoint expects, typically a bearer token or API key. Endpoints that require no authentication are supported but are appropriate only on a trusted private network.
- The logical model name the endpoint accepts in the request body. Most OpenAI-compatible servers expect a
modelfield, and the value entered during provisioning must match what that server recognises. - For endpoints reached over a private cloud connection, the network-level configuration is already in place. See Connect provider subscriptions across clouds.
Step 1: decide what is being added and why
The provisioning steps are the same for every custom model, but the decision that precedes them differs by intent. Three intents are common, and naming the intent up front makes the later choices (audience, versioning, and routing) straightforward.
| Intent | What it serves | What to keep in mind |
|---|---|---|
| Self-hosted or fine-tuned LLM | A model the organisation operates for control, data residency, or domain tuning | Capacity and availability are now the organisation's responsibility; the gateway routes to it but does not run it |
| Small language model | High-volume, latency-sensitive, or cost-sensitive tasks: classification, extraction, routing decisions | Best paired with routing rules that send only the appropriate traffic to it, rather than exposing it as a general-purpose model |
| Embedding model | Search, clustering, and retrieval-augmented generation | Returns vectors rather than text; provisioned the same way but verified and versioned differently (see Step 4) |
A self-hosted model rarely replaces the vendor catalogue. It sits alongside it, and the value comes from developers being able to choose between them through routing configuration rather than through separate integrations. The provisioning work below makes the model available; the selection logic lives on the developer side and is covered in Apply advanced routing rules and Route requests across providers.
Step 2: register the endpoint as a provider
A custom model is reached through a provider entry whose endpoint points at the model's host rather than at a vendor API. The provider carries the connectivity and credentials; the model entry, configured in the next step, carries the visibility decision.
- Sign in to the Admin Dashboard.
- Open the providers surface from the sidebar.
- Start a new provider entry.
- Give the provider a clear identifier and display name that signal it is custom: for example, an identifier such as
acme-internal-llmand a display name such asAcme Internal LLM (self-hosted). A name that distinguishes the entry from the vendor providers prevents later confusion in the catalogue and in analytics. - Set the endpoint to the model's OpenAI-compatible base URL. The base URL is the address the data plane dials; the platform appends the standard OpenAI-compatible paths to it, so the value entered is the root of the API rather than a specific route.
- Select the authentication method the endpoint expects and enter the credential. A bearer token or API key is the common case. Where the endpoint sits behind a private cloud connection, the base URL is the private-link address rather than a public one; see Connect provider subscriptions across clouds.
- Save the provider. The platform verifies the connection and reports the result on the provider entry.
A connection that fails to verify is most often a credential pasted with surrounding whitespace, a base URL that includes a trailing path the platform also appends (producing a doubled route), or an endpoint not reachable from the data plane's network. Each of these is distinguishable from the diagnostic detail on the provider entry, and re-entering the value from a known-good source is the fastest first attempt at recovery.
The OpenAI-compatible protocol is a convention, not a guarantee. Some self-hosted servers implement only a subset of it. A model that completes chat requests but rejects, for example, streaming or function calling will surface those gaps to developers at request time rather than during provisioning. Confirming which capabilities the endpoint actually supports before exposing it widely avoids surprises downstream.
Step 3: enable the model and choose its logical name
Once the provider verifies, the model it serves is registered and enabled. For a custom endpoint, the operator supplies the model identity rather than selecting it from a discovered vendor catalogue.
- Open the models surface from the sidebar.
- Add a model entry against the provider just created.
- Set the model identifier to the logical name developers will call. This is the value that appears in the
modelfield of a developer's request, and it is the contract between the developer and the platform. A clear, stable name such asacme-internal-llmis preferable to one that encodes a version or a hostname. - Set the upstream model name to the value the endpoint itself recognises, if the platform distinguishes the two. The endpoint may expect a different string in its own request body than the logical name developers use; mapping the logical name to the upstream name at this layer is what lets the developer-facing name stay stable across upstream changes.
- Record the model's characteristics where the platform captures them: context window and whether the model serves chat completions or embeddings. Accurate metadata keeps the catalogue and analytics meaningful and helps developers choose the right model.
- Enable the model.
A newly enabled model becomes selectable to developers in line with the audience rules covered in Step 5. Until then it exists in the catalogue but is not yet reachable by the intended consumers.
Step 4: provision an embedding model
An embedding model is provisioned through the same provider-and-model layer, but three things differ and each one matters.
- An embedding model returns vectors, not text. It is called through the embeddings route of the OpenAI-compatible protocol rather than the chat-completions route, and it is useful only to features that consume vectors: semantic search, clustering, and retrieval-augmented generation. It is not a substitute for a chat model and should be labelled clearly so that developers do not select it by mistake.
- The vector dimension is a fixed property of the model. Every vector an embedding model produces has the same length, and that length is part of the contract with whatever vector store holds the results. A vector store provisioned for one dimension cannot hold vectors of another. The dimension is therefore worth recording alongside the model so that developers and operators alike can see it without inspecting a response.
- Embedding output is not portable across models. Vectors from one embedding model are not comparable to vectors from another, even when the dimension happens to match. This is what makes the versioning discipline in Step 4a more than a convenience for embedding models; it is a correctness requirement.
To provision an embedding model, register its endpoint as a provider as in Step 2, then add a model entry as in Step 3, marking the model as an embedding model where the platform captures the distinction. Developers consume the result through the embeddings route; the developer-side mechanics are covered in Generate embeddings.
Step 4a: version models behind a stable logical name
Models change. Vendors release new versions, self-hosted deployments are retrained, and endpoints move. The goal of versioning is that none of this churn reaches developer code. The pattern that achieves it is the same for chat and embedding models:
- Pin explicit versions. Where a model has a version (a vendor's dated revision, an internal training run, or a tag on a self-hosted image), provision it under a model identifier that names the version explicitly. An explicit version is reproducible: a request routed to it today behaves the same as a request routed to it next month.
- Expose a stable logical name as an alias. Alongside the pinned versions, expose one logical name that developers call (
acme-internal-llmrather thanacme-internal-llm-2026-04) and point it at the version the organisation currently considers current. Developers code against the stable name; the operator moves what it points at. An upgrade becomes a single operator action with no developer-side change. - Keep the previous version enabled during a transition. Retiring the old version the instant the alias moves leaves no fallback if the new version misbehaves. Keeping both enabled for a window allows a clean cut-over and an equally clean roll-back.
For embedding models the alias carries one extra obligation. Because vectors are not portable across models, moving an embedding alias to a new model silently invalidates every vector already stored against the old one. New text is embedded with the new model and compared against vectors produced by the old one, and the comparison is meaningless. Changing the embedding model therefore implies re-embedding the corpus: the stored vectors are regenerated with the new model before, or as part of, the cut-over. Plan the re-embedding as part of the upgrade rather than discovering the need for it after search quality has degraded. A practical sequence is to provision the new embedding model under an explicit version, re-embed the corpus into a separate index, and move the alias only once the new index is populated and verified.
Step 5: expose the model to the right audience
A registered, enabled model is governed by the same visibility rules as any other model in the catalogue. Exposing a custom or self-hosted model to the right audience, and no wider, is the same governance lever described for vendor models in Provision models and providers.
- A small or experimental model is often best exposed to a single team or a pilot group before any broader release, so that its behaviour and cost are understood on real traffic first.
- A self-hosted model with finite capacity should be exposed only to the audience that capacity can serve. The gateway routes whatever traffic it is given; it does not protect an undersized endpoint from being overwhelmed.
- An embedding model should be exposed to the teams building search or retrieval features and not offered as a general option, both to avoid misuse as a chat model and to keep the embedding dimension stable for the consumers that depend on it.
Where the model should reach developers, pair the exposure with a notification so that the audience knows it is available and under what name. Where a model is being introduced as a cheaper or faster alternative to an existing one, the cut-over is usually best done gradually on the developer side using the patterns in Apply advanced routing rules.
Step 6: verify from the developer side
Provisioning is confirmed when the model answers a request sent the way a developer would send it. The check differs slightly between a chat model and an embedding model.
- Open the Console and use, or create, an API key whose routing configuration targets the new model by its logical name. The mechanics are covered in Route requests across providers.
- For a chat model, send a short completion request and confirm a coherent text response.
- For an embedding model, send a short text through the embeddings route and confirm that a vector of the expected dimension is returned. A response whose length does not match the recorded dimension points to the wrong model being targeted or the metadata being incorrect.
- Return to the Admin Dashboard and open the usage surface. Within a short delay, the test request appears, attributed to the new model and its provider.
A request that fails with an upstream authentication error points to the provider credential; one that fails because the model is unavailable points to a missed enablement step or an audience rule that excludes the test key; one that reaches the endpoint but is rejected as an unsupported operation points to the OpenAI-compatible gap noted in Step 2. All three are distinguishable from the Console request logs.
What to do next
- Route traffic to the new model deliberately. A custom model earns its place through the routing rules that send the right requests to it. See Apply advanced routing rules and Route requests across providers.
- Build on an embedding model. Once an embedding model is provisioned, the developer-side workflow for producing and consuming vectors is covered in Generate embeddings.
- Connect endpoints over a private cloud link. Where the model's endpoint should be reached without crossing the public internet, see Connect provider subscriptions across clouds.
- Return to the provisioning baseline. For the general model-and-provider mechanics this guide builds on, see Provision models and providers.
Where to go next