Detect and block prompt injection
Prompt injection is the manipulation of a model's behaviour through instructions smuggled into the text it processes, causing it to ignore its system prompt, leak data, or take actions it should not. It is a distinct threat from the harmful-content categories that guardrails address. A content guardrail asks whether a piece of text is unsafe in itself; prompt-injection detection asks whether a piece of text is trying to subvert the model that reads it. The two concerns overlap rarely and must be handled separately.
The threat arrives by more than one path. A direct jailbreak is an injection in the user's own prompt: the familiar "ignore previous instructions" family of attacks, role-play framings, and encoded instructions intended to escape the system prompt. Indirect injection is more dangerous because it does not require a hostile user. It hides instructions in content the model is given to work with: a passage pulled in by retrieval-augmented generation (RAG), a web page summarised on the user's behalf, or the output of a tool the model called. When a Model Context Protocol (MCP) server returns text and that text is fed back to the model, any instructions buried in it are read with the same authority as the rest of the context. A trusted user, asking an ordinary question, can trigger an attack that was planted in a document or a tool response long before.
Tetrate Agent Router applies injection detection inline in the same filter path as its guardrails, inspecting not only inbound prompts but also the retrieved context and the tool and MCP responses that flow back toward the model. This guide covers turning that detection on, scoping it, choosing what happens on a detection (block, sanitize, or flag), and wiring the resulting decision event into an alert.
Persona: Platform operator working in the Admin Dashboard, often alongside the security stakeholders who own the threat model.
Estimated time: 20--40 minutes for an initial configuration, including time spent testing.
When this guide applies
This guide is the right starting point in any of these situations:
| Situation | What it covers |
|---|---|
| Defending against direct jailbreak attempts in user prompts | Enabling detection on inbound prompts and the patterns it covers |
| Defending against instructions hidden in retrieved RAG context | Applying detection to retrieved context before it reaches the model |
| Defending against instructions returned by a tool or MCP server | Applying detection to tool and MCP responses on their way back to the model |
| Choosing what should happen when an injection is detected | Comparing the block, sanitize, and flag responses |
| Alerting when an injection is detected | The decision event emitted on a detection and how it reaches an alert |
For the harmful-content side of safety, and for how injection detection composes with it, see Configure custom guardrails for PII and content and Configure vendor guardrails. Because tool and MCP outputs are a primary injection vector, the way MCP access is governed and aggregated is directly relevant: see Govern MCP server access and Aggregate MCP servers into a profile.
Outcomes
By the end of this guide:
- Injection detection is enabled and applied to at least inbound prompts and tool or MCP responses.
- The patterns the detection covers, common jailbreak framings and indirect-injection attempts, are understood, along with its limits.
- The response on a detection (block, sanitize, or flag) is chosen deliberately for each surface.
- The decision event emitted on a detection is understood, and an alert is configured against it.
- The relationship between injection detection, content guardrails, and data-loss prevention (DLP) is clear.
Prerequisites
- Administrator access to the Admin Dashboard, typically the
super_adminrole, or a role granting guardrail and safety configuration. - At least one provider configured with a healthy connection and at least one model enabled, so detection has live traffic to act on. Provisioning is covered in Provision models and providers.
- For coverage of indirect injection through tools, an understanding of which MCP servers are reachable from the platform and which profiles expose them. See Govern MCP server access and Aggregate MCP servers into a profile.
- A few representative injection samples for the testing step: both a direct jailbreak prompt and an indirect attempt embedded in document-like or tool-output-like text.
Step 1: understand where detection is applied
Injection detection inspects text at the points where untrusted instructions can enter the model's context. Three surfaces matter, and they are not interchangeable.
| Surface | What it inspects | Threat it addresses |
|---|---|---|
| Inbound prompt | The user's request on its way to the model | Direct jailbreaks: instructions the user supplies to escape the system prompt |
| Retrieved context | Passages added to the request by RAG before the model sees them | Indirect injection planted in documents, knowledge bases, or pages the retrieval step pulls in |
| Tool and MCP response | Output returned by a tool or MCP server, on its way back to the model | Indirect injection returned by a tool, a primary vector, because tool output is read with the same authority as the rest of the context |
The inbound prompt is the surface most operators think of first, but it is the least sufficient on its own. A direct jailbreak requires a hostile user; indirect injection does not. The retrieved-context and tool-response surfaces are where a trusted user, asking an ordinary question, can be turned into the delivery mechanism for an attack that was planted elsewhere. Detection that runs only on inbound prompts leaves the indirect paths open.
Tool and MCP responses warrant particular attention. When the platform mediates a model's call to an MCP server, the server's response re-enters the model's context as authoritative text. A compromised or untrusted MCP server, or a legitimate one returning data an attacker controls, can therefore inject instructions without ever touching the user's prompt. Detection on this surface inspects that returning text before the model acts on it. Which MCP servers a given profile may reach is itself a control, covered in Govern MCP server access.
Step 2: understand what the detection covers
Injection detection recognises the patterns characteristic of an attempt to subvert the model, rather than the harmful-content categories a guardrail addresses. The patterns fall into two broad families.
- Direct jailbreak framings: instructions that try to override the system prompt or the model's role. Common forms include explicit override phrasing ("ignore previous instructions", "disregard your rules"), role-play and persona framings that ask the model to assume an unrestricted identity, and instructions encoded or obfuscated to slip past naive matching.
- Indirect-injection patterns: imperative instructions appearing where only data is expected. Text retrieved by RAG or returned by a tool is meant to be information for the model to use, not commands for it to follow. Instructions embedded in that text (directing the model to exfiltrate context, call a tool, or change its behaviour) are the signature of an indirect attack.
Two limits are worth stating plainly, because treating detection as absolute leads to misplaced confidence.
- Detection is heuristic and probabilistic. It raises the cost of a successful injection; it does not reduce it to zero. Novel phrasing and adversarial obfuscation will sometimes evade it, and benign text will sometimes resemble an attack.
- Detection is one layer. It is most effective combined with least-privilege tool access, scoped MCP profiles, and the content and DLP controls described in Step 6, so that an injection that evades detection still cannot reach a high-value action or exfiltrate sensitive data.
Step 3: choose the response on a detection
When an injection is detected, the platform can respond in one of three ways. The response is configurable and should be chosen per surface, because the right answer on an inbound prompt is not always the right answer on a retrieved passage.
| Response | What happens on a detection | When it fits |
|---|---|---|
| Block | The request is rejected, or the offending content is withheld, and the model does not act on it | The detection is high-confidence, and proceeding is unacceptable: the default for direct jailbreaks on inbound prompts |
| Sanitize | The offending instructions are stripped or neutralised and processing continues with the cleaned content | Indirect injection in a retrieved passage or tool response, where the surrounding data is still wanted but the embedded instruction must not be followed |
| Flag | The content passes unchanged, but the detection is recorded as an event | Establishing a baseline rate before enforcing, or monitoring a low-confidence surface without disrupting traffic |
Block is the safest response where a detection means the interaction itself should not proceed, which is the usual case for a clear jailbreak in a user's prompt. Sanitize fits the indirect surfaces: a retrieved document or a tool response often contains legitimate data alongside an injected instruction, and discarding the whole response would break the task, so removing only the instruction preserves the useful content. Flag is the right starting point for any newly enabled surface; it produces the decision event without changing what callers or models experience, which makes it the natural mode for the pilot in Step 5.
Step 4: enable detection and scope it
With the surfaces and responses decided, detection is enabled from the Admin Dashboard.
- Sign in to the Admin Dashboard.
- Open the guardrails and safety surface from the sidebar.
- Create an injection-detection control, distinguishing it from a content guardrail so that its purpose is clear in later review.
- Select the surfaces it inspects: at minimum the inbound prompt and tool or MCP responses, and the retrieved context where RAG is in use.
- Set the response for each surface as decided in Step 3: typically block on inbound prompts and sanitize on the indirect surfaces, or flag everywhere for an initial pilot.
- Scope the control to the traffic it should cover. As with guardrails, a control can apply per model, per routing policy, or platform-wide; injection detection is usually a baseline concern and so is scoped broadly, with narrower exceptions layered on where a specific model or policy needs different handling.
- Save the configuration.
Detection takes effect on subsequent requests within its scope. Requests already in flight complete under the configuration active when they were admitted; there is no service restart.
Scope is a governance decision. Because indirect injection can arrive through any RAG source or any reachable MCP server, a platform-wide baseline is the safest default, narrowed only where a deliberate exception is justified. Pairing broad detection scope with tight MCP access, so that profiles can reach only the servers they need, limits both the chance of an injection and the damage a successful one could do.
Step 5: test with a sample injection
A detection control that has never been exercised against a real injection is an assumption, not a defence. The platform provides a way to evaluate the control against sample content before it is allowed to enforce, mirroring the guardrail testing flow.
- Submit a direct jailbreak sample (for example, a prompt instructing the model to ignore its previous instructions) on the inbound surface, and confirm the control detects it and applies the configured response.
- Submit an indirect-injection sample (text shaped like a retrieved passage or a tool response, with an embedded instruction directing the model to act) on the corresponding surface, and confirm it is detected and that sanitize removes the instruction while preserving the surrounding data.
- Submit benign content that resembles an attack without being one (a document that legitimately discusses prompt injection, for instance) and confirm it is not acted on. This catches the over-eager control that would disrupt real traffic.
- Run the control in flag mode against live traffic for a period before switching any surface to block or sanitize. Flag mode emits the same decision event without affecting callers or models, turning "will this misfire in production?" into an observation rather than a gamble.
Resolving false detections and missed detections at this stage is far cheaper than discovering them once the control is acting on real requests. The decision events produced during this testing are the same events that drive alerting in Step 6.
Step 6: alert on the decision event
Every detection emits a decision event, regardless of the response chosen: block, sanitize, and flag all produce one. The event is what makes injection detection observable and alertable rather than merely active. A flag-mode control that no one is watching provides no protection; the event is how a detection becomes a signal someone can act on.
- The decision event records that an injection was detected, the surface it was detected on (inbound prompt, retrieved context, or tool or MCP response), the response applied, and the request context, without recording the offending payload in a way that would itself become a liability.
- These events are reviewable alongside the platform's other administrative and safety events. The workflow for reviewing them is covered in Audit platform activity.
- For alerting, the events are exported through the platform's observability path to the destination an operator already uses (the OpenTelemetry export described in Export telemetry to an observability stack) where a threshold or a pattern can raise an alert. A detection on a tool or MCP response, in particular, warrants a prompt alert, because it indicates an indirect attack reaching the platform through a backend rather than a user.
Alerting on the decision event is what satisfies the requirement that detection be more than passive: a detection that fires into a log no one watches is indistinguishable from no detection at all. An alert on the event closes that gap.
Step 7: compose detection with content guardrails and DLP
Injection detection is one control among several, and it is most effective as part of a layered safety posture rather than on its own. Each layer addresses a concern the others cannot.
- Injection detection asks whether incoming text, from a user, a retrieved source, or a tool, is trying to subvert the model. It is the subject of this guide.
- Content guardrails ask whether text is unsafe or disallowed in itself. Vendor guardrails cover the provider's notion of safety; custom guardrails cover the organisation's, including keyword and regular-expression rules. See Configure vendor guardrails and Configure custom guardrails for PII and content.
- Data-loss prevention asks whether sensitive data is leaving the boundary. PII detection with a redact action, configured as a custom guardrail, is the platform's DLP mechanism and limits what a successful injection could exfiltrate even if the injection itself evades detection.
These layers operate in the same inline filter path, and all must permit a request for it to proceed. They are complementary by design: injection detection stops the model being subverted, content guardrails stop unsafe content passing in either direction, and DLP stops sensitive data leaving. Deciding which layer owns a given concern, rather than duplicating intent across all three, keeps the overall policy coherent and auditable. Tightening MCP access so that a subverted model can reach only the tools it genuinely needs, covered in Govern MCP server access, reduces the blast radius further still.
What to do next
- Configure custom guardrails: add PII redaction and keyword or regular-expression rules alongside injection detection, including the DLP controls referenced here. See Configure custom guardrails for PII and content.
- Configure vendor guardrails: enable the provider-native safety filters that complement injection detection. See Configure vendor guardrails.
- Govern MCP server access: limit which MCP servers a profile can reach, reducing the indirect-injection surface and the blast radius of a successful attack. See Govern MCP server access.
- Audit platform activity: review the decision events emitted by the detection configured here. See Audit platform activity.
- Protect requests with guardrails: the developer-side view of how safety controls appear in application code. See Protect requests with guardrails.
Where to go next