Glossary

What are agent guardrails?

Agent guardrails are the mechanisms that constrain what an AI agent backed by a large language model is allowed to do. They sit at four layers: input, model, output, and tool calls. Each layer catches a different class of misuse with a different degree of confidence.

The interesting question is not whether to have guardrails. It is which layer carries the load. The first three operate on text and can in principle be defeated by a prompt injection. The fourth operates on the action and does not.

The four kinds of guardrails

Input guardrails. Code that runs before the prompt reaches the model. Pattern matching, classifiers trained to spot adversarial prompts, allow-lists on topics. Raises the cost of obvious attacks. Catches a lot. Cannot catch what it has not seen.

Model guardrails. The model's own refusal behaviour, shaped by system prompts, fine-tuning, and safety training. The first thing most teams ship and often the only thing. Effective against clumsy prompts; defeatable by a creative one, because the model itself is what is being attacked.

Output guardrails. Code that runs after the model produces a response. Strip personal data, block disallowed content, refuse output that names confidential systems. Useful, but the response has already happened: a refusal here is a fallback, not a prevention.

Tool-call guardrails. Code between the agent and the system of record. When the agent asks to call a tool, this layer evaluates the call against the rules and either executes or refuses. It does not read the chat. It cannot be argued with by a prompt, because the prompt does not reach it.

Which layer holds when the prompt is broken

The first three layers all run as text against text. Each one reduces probability. None of them yields a guarantee, because the attack surface is exactly the same as the defence surface: language.

The tool-call layer is structurally different. It is code, not language. It runs on typed parameters, not on the text of the conversation. A rule like "discount percentage must be at most 30" can be checked the same way every time, regardless of how persuasive the user was upstream. The model can be talked into asking; the layer decides whether the ask passes.

That is the property that makes deterministic execution the load-bearing kind of guardrail for agents connected to real systems.

Defence in depth, honestly

All four layers belong in a production stack. Input filters cut volume. Model guardrails keep the easy cases easy. Output filters catch what slipped through. The tool-call layer keeps the consequences bounded.

The honest framing: defence in depth works because the bottom layer does not depend on the layers above it succeeding. If the bottom layer is also probabilistic, depth is just more of the same thing. If the bottom layer is deterministic, depth is meaningful.

This is the inversion most "AI security" stacks miss. They stack more text-layer defences on top of each other. A jailbroken model that has been told ten times not to do a thing is still a jailbroken model.

How dhino implements tool-call guardrails

dhino exposes data and business actions to AI agents through Model Context Protocol tools. Each tool is a named action with typed parameters. The agent never reaches the underlying database; it can only call the actions the platform has exposed.

Above each action sits a policy enforcement layer: per-agent scoping (which tools may this agent call), per-call rules on parameters (a discount must be no more than 30%, a date range must fit a window), and per-call audit (input parameters and result recorded for review).

See dhino Trust for the product overview, and dhino vs Microsoft's Dataverse MCP server for how this differs from a connector that exposes raw CRUD.

Common questions about agent guardrails

What are the different types of AI agent guardrails?

Four broad categories. Input guardrails scan or transform what reaches the model. Model guardrails are the model’s own refusal behaviour from system prompts and safety training. Output guardrails scan or transform what leaves the model. Tool-call guardrails are deterministic rules in code, between the agent and the system of record, that decide whether each action is allowed. The first three operate on text; only the last operates on the action.

Do AI guardrails actually work?

It depends on the kind. Text-layer guardrails reduce the probability of misuse but cannot prove a determined prompt will not get through. Tool-call guardrails, because they run in code outside the model, decide the same way every time regardless of how the model was prompted. For an agent that takes real action on enterprise data, this is the only kind whose outcome is auditable.

What is the difference between a guardrail and a policy?

Loose usage treats them as synonyms. The useful distinction: a guardrail is any mechanism that constrains an LLM agent’s behaviour, including text-layer filters and model-level refusals. A policy is a specific rule about what an action may or may not do, enforced in code at the action layer. Every policy is a guardrail; not every guardrail is a policy.

How do guardrails fit into defence in depth for LLM agents?

Each layer of guardrails catches a different failure mode at a different probability. Input and output filters raise the cost of obvious attacks. Model-level refusals catch most clumsy jailbreaks. The tool-call layer is the last line of defence: even if everything above it fails, an action with rejected parameters does not execute. Defence in depth holds when the bottom layer is deterministic, because it does not depend on the layers above it succeeding.

See deterministic guardrails in practice

Tell us where one of your AI agents calls a real action. We will show you the difference between a guardrail in the prompt and one in the layer below it.