What's the difference between input and output guardrails?

Input guardrails filter and validate what the user sends to the agent — blocking malicious instructions, exposed sensitive data, or out-of-scope requests. Output guardrails validate what the agent produces before it's executed or presented — blocking destructive actions, incorrect data, or inappropriate content.

Does human-in-the-loop always slow down the product?

Not necessarily. The key is defining which actions need approval and which are safe for full automation. Reversible, low-risk actions (read data, generate text) can be fully autonomous. Actions with permanent effects (send email, execute payment, delete record) should have human confirmation.

Is prompt injection a real problem in production?

Yes, especially in agents that process third-party content (emails, PDFs, web pages). An attacker can include instructions in content the agent will read to try to manipulate its behaviour. Mitigation involves separating content from instructions in the prompt architecture and validating outputs independently.

What tools exist to implement guardrails?

Guardrails AI (Python) is the most mature library for declarative output validation. For TypeScript, the Vercel AI SDK with Zod schemas covers most cases. Anthropic also offers Constitution AI and output prefilling as native techniques. For enterprise, Azure AI Content Safety and AWS Bedrock Guardrails are managed options.

Guardrails for AI agents in B2B products: what needs to exist

Why enterprise AI agents need different controls

A chatbot that answers questions has limited risk: at worst, it gives a wrong answer. An agent that executes actions — sends emails, updates records, calls external APIs — has a completely different risk profile.

In a B2B context, the consequences of a misbehaving agent include: customer data exposure, irreversible actions on critical systems, and regulatory compliance violations. This isn't hype — it's systems engineering.

This article covers the minimum set of guardrails that any B2B product with AI agents must implement before going to production.

Layer 1: Clear scope definition and permissions

The first guardrail is architectural: the agent should only have access to what it needs for its function. The principle of least privilege applied to AI.

In practice: a customer support agent shouldn't have write permissions to the payments database. A contract analysis agent shouldn't be able to send external communications. Permissions are defined before the agent's code.

Implement this control at the infra level (OAuth scopes, API keys with restricted permissions, row-level security) and not just as a system prompt instruction — instructions can be bypassed, infra-level access controls cannot.

Layer 2: Input validation before reaching the model

Everything the user sends to the agent must be validated before entering the model's context. This includes:

Basic sanitisation: maximum length, forbidden characters, expected format
Prompt injection detection: known patterns of instruction manipulation attempts
Intent classification: is the request within the scope defined for the agent?
PII filtering: prevent sensitive data from being sent unnecessarily to the model

Layer 3: Output validation before executing any action

An agent may decide to execute a tool call — call an API, write to a file, send a message. This decision must be validated before execution.

The most robust pattern: structured output with strict schema for tool calls. The agent cannot invoke a tool not in the defined schema, nor with arguments outside the permitted values.

For actions with permanent effects, implement a 'dry run' layer that validates what would happen without actually executing, and logs the decision for auditing.

Layer 4: Human-in-the-loop for high-risk actions

Explicitly define which actions require human approval before executing. The risk classification must come from the product, not the model.

A simple framework for classification:

Green (full automation): read actions, draft generation, analysis without external output
Yellow (implicit confirmation): reversible actions with limited impact — user confirms before execution
Red (explicit approval): sending external emails, payments, data deletion, calls to third-party APIs with permanent effects

Layer 5: Logging and complete traceability

In a B2B context, 'the agent did this' is not a sufficient answer for a customer or regulator. You need a complete trail: which prompt was used, what context was available, what decision the model made, and what action was executed.

The minimum viable logging for agents in production:

Session and request IDs linked to the user and business context
Snapshot of the complete prompt (with injected context) for each execution
Tool calls attempted and executed with the actual arguments
Latency and tokens per step for cost and performance diagnostics
Errors and activated fallbacks

Layer 6: Rate limiting and circuit breakers

Autonomous agents can enter loops — a logic bug, an unexpected edge case, or a malicious user can cause the agent to execute hundreds of actions in seconds.

Implement explicit limits: maximum tool calls per session, maximum external calls per minute, and a circuit breaker that stops the agent and alerts a human when a threshold is reached.

This control is trivial to implement and prevents cost and data incidents that are hard to reverse.

What is not a guardrail

Important to distinguish: a system prompt instruction ('never do X') is not a guardrail — it's a suggestion the model can ignore, especially under adversarial input. Real guardrails are controls outside the model: schema validation, access permissions, human approvals, rate limiting.

A secure system doesn't depend on the model obeying instructions. It depends on the architecture making problematic actions impossible to execute, regardless of what the model decides.