Run batch and long-running jobs
Not all AI work is interactive. A large share of it is offline: scoring an overnight queue of support tickets, generating embeddings for a million-document corpus, running an evaluation suite across every model under consideration. Batch processing fits this shape of work: instead of one request per item held open until it returns, a single file of requests is submitted, processed asynchronously by the provider, and collected later as a single file of results. Tetrate Agent Router exposes this through an OpenAI-compatible batch interface, so the same gateway, API key, and observability surfaces that handle synchronous traffic also handle batch traffic. This guide covers when batching is the right choice, how a batch job is submitted, polled, and retrieved, how large workloads are structured into manageable batches, how batch traffic interacts with rate limits and budgets, and how it appears in the Console's monitoring surfaces.
Persona: Developer working in the Agent Router Console and in application code.
Estimated time: 20 to 30 minutes to submit a first batch and retrieve its results; batch completion itself runs asynchronously and may take from minutes to hours depending on size and provider.
When this guide applies
Batching fits high-volume, latency-tolerant work. Synchronous and streaming calls remain the right tool when a human or a downstream system is waiting on the answer. The distinction usually comes down to who, or what, is blocked on the result.
| Situation | Recommended approach |
|---|---|
| An offline corpus of thousands of items is processed on a schedule | Batch |
| Embeddings are generated in bulk for a vector index | Batch |
| An evaluation suite is run across many prompts or many models | Batch |
| A nightly or weekly summarisation, classification, or enrichment job | Batch |
| A user is waiting on a response in a UI | Synchronous, often streaming |
| Tokens should appear progressively as they are produced | Streaming |
| A single ad-hoc request is being tested or debugged | Synchronous |
| Per-item latency matters more than throughput or unit cost | Synchronous |
The decision is rarely permanent. A pipeline that starts as a synchronous loop during prototyping is a natural candidate to convert to batch once its volume grows and its latency tolerance becomes clear.
Outcomes
By the end of this guide:
- The conditions under which batch processing is preferable to synchronous or streaming calls are understood.
- A batch input file has been constructed in the OpenAI-compatible JSONL format and submitted through the gateway.
- A batch job's status has been polled to completion and its output file retrieved and parsed.
- A long-running workload has been structured into appropriately sized batches.
- The interaction between batch traffic, per-key rate limits, and budgets is understood.
- Batch traffic has been located in Request Logs and Usage Analytics.
Prerequisites
- A working API key with a routing configuration attached, as set up in Route requests across providers.
- Confirmation that the target model is enabled in the platform and that its upstream provider supports batch processing. Batch availability is provider-specific; a model that serves synchronous traffic does not necessarily expose a batch endpoint.
- A terminal with curl, or a Python environment with the openai package, for the submission and retrieval steps.
- For bulk-embedding workloads, familiarity with the embeddings endpoint described in Supported APIs.
The exact gateway URL is environment-specific and is supplied during installation. The examples below assume YOUR_GATEWAY_URL is substituted with that value, and YOUR_API_KEY with a key from the Console.
Step 1: decide whether to batch
The first decision is whether the workload belongs in a batch at all. Three properties together make a workload a good fit:
- Volume: the work consists of many independent items, typically hundreds to millions, rather than a handful.
- Latency tolerance: nothing is blocked waiting on an individual result. The job can complete minutes or hours after submission without affecting a user or a time-sensitive process.
- Independence: each item is self-contained. Batch processing does not preserve any conversational state between items, and items are not guaranteed to be processed in submission order.
Where all three hold, batching converts a fragile, rate-limit-prone loop into a single asynchronous submission. Where any one of them fails (a user is waiting, the items depend on each other, or there are only a few of them), a synchronous call through /v1/chat/completions, or a streaming call for progressive output, remains the correct choice. The synchronous request path is the subject of Route requests across providers; streaming is documented in Supported APIs.
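The three-property test above can be written down as a small helper. This is only an illustrative sketch: the `Workload` type, the `recommend_mode` function, and the threshold of 100 items are hypothetical choices made here, not part of the gateway or its SDK.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    item_count: int           # number of requests in the job
    caller_is_waiting: bool   # True if a user or downstream system blocks on each result
    items_independent: bool   # True if no item needs another item's output


def recommend_mode(workload: Workload, min_batch_items: int = 100) -> str:
    """Return "batch" only when volume, latency tolerance, and independence all hold."""
    if workload.caller_is_waiting:
        return "synchronous"   # latency tolerance fails
    if not workload.items_independent:
        return "synchronous"   # independence fails
    if workload.item_count < min_batch_items:
        return "synchronous"   # volume fails (assumed threshold)
    return "batch"
```

A nightly job of 50,000 independent tickets with nobody waiting comes back as "batch"; the same job with a user blocked on each answer comes back as "synchronous".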
Step 2: build the batch input file
The gateway accepts batch input in the OpenAI-compatible JSONL format: one JSON object per line, each describing a single request. Every line carries a custom_id used to correlate the eventual result back to its input, the HTTP method and url of the endpoint being called, and a body containing the request payload that would otherwise be sent synchronously.
A chat-completions batch with two requests:
{"custom_id": "ticket-001", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Classify this ticket: cannot reset password"}]}}
{"custom_id": "ticket-002", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "Classify this ticket: invoice shows wrong amount"}]}}
The custom_id must be unique within the file. Because results are not guaranteed to return in input order, the custom_id is the only reliable way to match an output line back to the request that produced it; deriving it from a stable key in the source data (a record ID, a document path) avoids a brittle reliance on line position.
The body of each line is the same payload accepted by the corresponding synchronous endpoint, so any request shape valid for /v1/chat/completions is valid here. The model field within each body is subject to the same routing configuration as a synchronous request; the platform resolves it according to the rules described in Route requests across providers.
Generating the file programmatically keeps it consistent at scale:
import json

records = load_source_records()  # application-specific loader

with open("batch_input.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        line = {
            # str() keeps the ID a string even if the source key is numeric
            "custom_id": str(record["id"]),
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o",
                "messages": [
                    {"role": "user", "content": record["prompt"]},
                ],
            },
        }
        f.write(json.dumps(line) + "\n")
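Because a malformed line or a duplicate custom_id is cheaper to catch before upload than after, a local pre-flight check can be worthwhile. The sketch below is an assumption about what such a check might look like; `validate_batch_lines` is a hypothetical helper, and it checks only the structural rules stated in this step, not anything a particular provider additionally enforces.

```python
import json

REQUIRED_KEYS = {"custom_id", "method", "url", "body"}


def validate_batch_lines(lines, endpoint="/v1/chat/completions"):
    """Return a list of problems found; an empty list means the lines look well-formed."""
    problems = []
    seen_ids = set()
    for number, raw in enumerate(lines, start=1):
        if not raw.strip():
            problems.append(f"line {number}: empty line")
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(entry, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"line {number}: missing keys {sorted(missing)}")
            continue
        custom_id = entry["custom_id"]
        if not isinstance(custom_id, str):
            problems.append(f"line {number}: custom_id is not a string")
            continue
        if custom_id in seen_ids:
            problems.append(f"line {number}: duplicate custom_id {custom_id!r}")
        seen_ids.add(custom_id)
        if entry["url"] != endpoint:
            problems.append(f"line {number}: url {entry['url']!r} does not match {endpoint!r}")
    return problems
```

An open file object can be passed directly as `lines`, since iterating a text file yields one line at a time.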
Step 3: submit the batch job
Submitting a batch is a two-stage operation: the input file is uploaded first, then a batch job is created that references the uploaded file by its identifier.
Upload the input file
curl https://YOUR_GATEWAY_URL/v1/files \
-H "Authorization: Bearer YOUR_API_KEY" \
-F purpose="batch" \
-F file="@batch_input.jsonl"
The response contains a file identifier, conventionally prefixed file-. That identifier is the handle for the next call.
Create the batch
curl https://YOUR_GATEWAY_URL/v1/batches \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"input_file_id": "file-abc123",
"endpoint": "/v1/chat/completions",
"completion_window": "24h"
}'
The same two stages in Python:
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR_GATEWAY_URL/v1",
    api_key="YOUR_API_KEY",
)

# Stage 1: upload the JSONL input file (the with-block closes the handle afterwards)
with open("batch_input.jsonl", "rb") as f:
    input_file = client.files.create(file=f, purpose="batch")

# Stage 2: create the batch job that references the uploaded file
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
The create call returns immediately with a batch identifier and an initial status; the work itself proceeds asynchronously on the provider side. The endpoint field declares the endpoint every line in the file targets, and completion_window expresses the window within which completion is requested. The application does not hold a connection open for the duration; the batch identifier is the only state that needs to be retained.
Batch support and the specific values accepted for endpoint and completion_window are provider-dependent. If a batch creation call is rejected, the most common cause is that the resolved model's upstream provider does not offer a batch endpoint for the requested endpoint type. Confirm batch availability for the target model before building a pipeline around it.
Step 4: poll status and retrieve results
A batch job moves through a sequence of states (typically validating, in progress, finalising, and completed), and the application discovers its progress by polling rather than by holding a connection open. Failed, expired, and cancelled are also terminal states that any poller must handle.
import time

while True:
    batch = client.batches.retrieve(batch.id)
    print(batch.status, batch.request_counts)
    if batch.status in ("completed", "failed", "expired", "cancelled"):
        break
    time.sleep(30)
A polling interval of roughly 30 seconds to a few minutes is appropriate; tighter intervals add load without materially improving the time to discover completion. The retrieve response also reports request counts (total, completed, and failed), which gives a coarse sense of progress while the job is in flight.
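For jobs that may run for hours, the fixed-interval loop can be wrapped into a function that widens its interval over time and gives up after a deadline. The following is a sketch under stated assumptions: `wait_for_batch` is a hypothetical helper, and the 1.5x growth factor, 5-minute cap, and 24-hour timeout are illustrative defaults rather than values prescribed by the gateway. The `retrieve` argument is any callable that returns an object with a `status` attribute.

```python
import time

TERMINAL_STATES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(retrieve, batch_id, initial_interval=30.0, max_interval=300.0,
                   timeout=24 * 3600, sleep=time.sleep, clock=time.monotonic):
    """Poll until the batch reaches a terminal state and return the final batch object.

    Raises TimeoutError if the next sleep would run past the deadline.
    """
    deadline = clock() + timeout
    interval = initial_interval
    while True:
        batch = retrieve(batch_id)
        if batch.status in TERMINAL_STATES:
            return batch
        if clock() + interval > deadline:
            raise TimeoutError(f"batch {batch_id} still {batch.status!r} after {timeout}s")
        sleep(interval)
        # Back off gradually: long-running jobs do not need tight polling.
        interval = min(interval * 1.5, max_interval)
```

With the client from Step 3, a call would look like `wait_for_batch(client.batches.retrieve, batch.id)`. The `sleep` and `clock` parameters exist only so the function can be exercised in tests without real waiting.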
Once the status reaches completed, the batch object carries an output_file_id for successful results and, where any requests failed, an error_file_id for the failures. Both are retrieved through the same files interface used for the upload:
output = client.files.content(batch.output_file_id)
with open("batch_output.jsonl", "wb") as f:
    f.write(output.read())
The output file is JSONL, one line per processed request. Each line echoes the custom_id from the input alongside the response body, which is why a stable custom_id matters: results are correlated by that field, not by position. A line whose request failed carries an error rather than a response, so each line is checked individually rather than assuming the whole batch succeeded or failed as a unit.
import json

# handle_failure and handle_success are application-specific callbacks
with open("batch_output.jsonl") as f:
    for line in f:
        result = json.loads(line)
        custom_id = result["custom_id"]
        if result.get("error"):
            handle_failure(custom_id, result["error"])
        else:
            handle_success(custom_id, result["response"]["body"])
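Since correlation is by custom_id rather than by position, it can help to index the output once and then reconcile it against the set of IDs that were submitted, so that requests with no output line at all are noticed as well. This is a minimal sketch: `index_results` and `reconcile` are hypothetical helpers, and the line shape they expect (`custom_id`, `response.body`, `error`) is the one described in this step.

```python
import json


def index_results(output_lines):
    """Map each custom_id to ("ok", response_body) or ("error", error_object)."""
    indexed = {}
    for raw in output_lines:
        if not raw.strip():
            continue
        result = json.loads(raw)
        if result.get("error"):
            indexed[result["custom_id"]] = ("error", result["error"])
        else:
            indexed[result["custom_id"]] = ("ok", result["response"]["body"])
    return indexed


def reconcile(submitted_ids, indexed):
    """Group submitted IDs into ok, error, and missing (no output line found)."""
    report = {"ok": [], "error": [], "missing": []}
    for custom_id in submitted_ids:
        outcome = indexed.get(custom_id)
        report[outcome[0] if outcome else "missing"].append(custom_id)
    return report
```

Where failures arrive in a separate error file, its lines can be fed through `index_results` too and the two dictionaries merged before reconciling.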
A batch that has not yet completed can be cancelled if the work is no longer needed:
curl -X POST https://YOUR_GATEWAY_URL/v1/batches/batch_abc123/cancel \
-H "Authorization: Bearer YOUR_API_KEY"
Step 5: structure long-running and large workloads
A workload of a few thousand items maps cleanly onto a single batch. Larger workloads (bulk embeddings for a large corpus, an evaluation sweep across many models and prompts, or an enrichment pass over a full dataset) benefit from being split into several smaller batches rather than submitted as one enormous file. Smaller batches fail in smaller, more recoverable units; they make progress observable as each batch completes; and they sidestep any per-file size or request-count ceilings the upstream provider enforces.
A few patterns make large workloads manageable:
- Chunk the input: the source dataset is divided into batches of a consistent size (for example, a few thousand to tens of thousands of requests each), and each chunk is submitted as its own job. The chunk index is folded into the custom_id so that results from different batches remain unambiguously correlated.
- Track batch identifiers durably: each batch identifier is persisted alongside the chunk it represents, so a long-running job survives a restart of the submitting process. Recovery becomes a matter of re-polling known identifiers rather than resubmitting work.
- Process results as each batch completes: rather than waiting for every batch, each output file is consumed as its batch reaches completed, which spreads the downstream work and surfaces problems early.
- Bulk embeddings: an embeddings workload follows the same shape, with each line targeting /v1/embeddings and its body carrying the input text. The endpoint is described in Supported APIs.
- Evaluations: an evaluation sweep is expressed as a batch whose lines vary the model field across the candidates under test, with a custom_id encoding both the prompt and the model so that results can be pivoted by either dimension afterwards.
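The evaluation pattern in the last item can be sketched as follows. The `sweep_lines` and `split_custom_id` helpers and the `::` separator are hypothetical choices for illustration; the separator is assumed not to occur in any prompt ID, and every model named must be enabled in the platform and support batch processing.

```python
import json

SEPARATOR = "::"  # assumed never to appear inside a prompt ID


def sweep_lines(prompts, models, endpoint="/v1/chat/completions"):
    """Yield one JSONL line per (prompt, model) pair.

    `prompts` maps a prompt ID to its text; `models` is a list of model names.
    """
    for prompt_id, text in prompts.items():
        for model in models:
            yield json.dumps({
                "custom_id": f"{prompt_id}{SEPARATOR}{model}",
                "method": "POST",
                "url": endpoint,
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": text}],
                },
            })


def split_custom_id(custom_id):
    """Recover (prompt_id, model) from a custom_id so results can be pivoted either way."""
    prompt_id, model = custom_id.split(SEPARATOR, 1)
    return prompt_id, model
```

Splitting on the first separator only means a model name may itself contain `::` without breaking the round trip.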
Splitting also interacts favourably with rate limits and budgets, which is the subject of the next step: several moderate batches submitted in sequence are easier to keep within a key's quota than one batch large enough to exhaust it.
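The chunking and durable-tracking patterns from this step might be combined along these lines. Everything here is a sketch: `chunk_requests`, `manifest_entry`, and `batches_to_repoll` are hypothetical names, the `chunkNNNN-` prefix is one possible way to fold the chunk index into the custom_id, and the manifest is assumed to be a simple JSONL file the application writes and reads itself.

```python
import json


def chunk_requests(records, chunk_size, model="gpt-4o", endpoint="/v1/chat/completions"):
    """Yield (chunk_index, lines), where lines are the JSONL strings for one batch file."""
    lines = []
    chunk_index = 0
    for record in records:
        lines.append(json.dumps({
            # The chunk index in the custom_id keeps results unambiguous across batches.
            "custom_id": f"chunk{chunk_index:04d}-{record['id']}",
            "method": "POST",
            "url": endpoint,
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": record["prompt"]}],
            },
        }))
        if len(lines) == chunk_size:
            yield chunk_index, lines
            lines = []
            chunk_index += 1
    if lines:
        yield chunk_index, lines


def manifest_entry(chunk_index, batch_id):
    """Serialise one chunk-to-batch mapping as a line for a durable JSONL manifest."""
    return json.dumps({"chunk": chunk_index, "batch_id": batch_id})


def batches_to_repoll(manifest_lines, finished_batch_ids=()):
    """After a restart, list the manifest's batch IDs not yet known to be finished."""
    finished = set(finished_batch_ids)
    entries = (json.loads(line) for line in manifest_lines if line.strip())
    return [e["batch_id"] for e in entries if e["batch_id"] not in finished]
```

In use, each chunk's lines are written to their own file and submitted as in Step 3, and the returned batch identifier is appended to the manifest immediately, before the next chunk is submitted, so that a crash between submissions loses no identifiers.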
Step 6: understand rate limits and budgets
Batch traffic is metered the same way as synchronous traffic. The tokens consumed by every request in a batch count against the same per-key rate limits and contribute to the same usage totals; submitting work as a batch does not exempt it from a key's quota. A per-key rate limit set on a rolling hourly window can therefore reject batch requests with a 429 response in the same way it rejects synchronous ones, particularly when a large batch lands within a single window.
Several practices keep batch work inside its budget:
- Size batches against the key's limit: where a key carries an hourly token ceiling, batches are sized so that a single submission does not blow through the window. Splitting a large workload across several batches, as in Step 5, is the primary lever.
- Isolate batch work on its own key: a dedicated key for batch pipelines keeps their consumption separate from interactive traffic, so a heavy overnight job cannot starve a user-facing path of its quota, and the cost of the batch work is attributable on its own.
- Treat the provider's batch discount as part of the budget: many providers price batch work below synchronous work. Where that discount applies, the saving shows up in the usage totals for the batch key and can be planned for rather than discovered after the fact.
The mechanics of per-key rate limits (the rolling window, the independent token sliders, and the 429 behaviour) and the broader discipline of bounding spend are covered in Working with budgets. Coordinate with the platform operator who owns those limits before pointing a high-volume batch pipeline at a production key.
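Sizing batches against a token ceiling requires some estimate of each request's token cost before submission. The sketch below uses a deliberately crude heuristic of roughly four characters per token plus a fixed allowance for the response; both numbers are assumptions made for illustration, not figures from the gateway or any provider, and a real tokenizer would give tighter estimates. `estimate_tokens` and `pack_by_token_budget` are hypothetical helpers.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; a heuristic stand-in for a real tokenizer."""
    return max(1, len(text) // chars_per_token)


def pack_by_token_budget(prompts, budget_tokens, reserve_per_request=256):
    """Greedily group prompts so each group's estimated tokens fit within the budget.

    `reserve_per_request` is an assumed allowance for each response's output tokens.
    """
    groups = []
    current = []
    used = 0
    for prompt in prompts:
        cost = estimate_tokens(prompt) + reserve_per_request
        if cost > budget_tokens:
            raise ValueError("a single request exceeds the per-batch token budget")
        if current and used + cost > budget_tokens:
            groups.append(current)   # close the current group and start a new one
            current, used = [], 0
        current.append(prompt)
        used += cost
    if current:
        groups.append(current)
    return groups
```

Setting `budget_tokens` somewhat below the key's actual ceiling leaves headroom for estimation error; the right margin is a judgment call to make with the operator who owns the limit.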
Step 7: observe batch traffic in the Console
Batch requests flow through the same gateway as synchronous requests, so they appear in the same monitoring surfaces with the same detail. Once a batch has completed, its constituent requests are visible in the Console.
- Request Logs: each request within a batch is recorded as an individual row, with its resolved model, token counts, latency, cost, and status, exactly as a synchronous request would be. Filtering by the batch key isolates the batch traffic from everything else; scanning the status column surfaces any requests that failed within an otherwise successful batch.
- Usage Analytics: the aggregate view attributes the batch's volume and spend to the submitting key, broken down by model and provider. With batch work isolated on its own key, the cost of an entire batch run is read directly from the per-key totals.
Because batch and synchronous traffic share these surfaces, the API-key-per-purpose convention is what keeps them distinguishable: a key reserved for batch pipelines turns "how much did last night's job cost" into a single filtered reading rather than a forensic exercise. The full treatment of both surfaces, including the per-purpose key pattern, is in Monitor traffic and usage.
What to do next
- Monitor traffic and usage: inspect the individual requests a batch produced and read its aggregate cost in Usage Analytics.
- Working with budgets: set and tune the per-key rate limits that batch traffic is metered against.
- Supported APIs: the full endpoint reference, including the embeddings endpoint used for bulk-embedding batches and the streaming surfaces that batch deliberately trades away.
- Integrate the gateway with an app: wire the batch submission and retrieval flow into an application's SDK configuration.