When an AI agent can send emails, modify records, spend money, or execute code, every familiar web-app security category gains a new attack surface, and a few entirely new categories appear. This checklist covers the controls that separate 'we have an agent in production' from 'we have an agent we can defend in a board meeting.' Treat it as a working document: not every control applies to every agent, but every line should be a decision, not an oversight.
Written by Max Zeshut
Founder at Agentmelt
Treat every input—user message, retrieved document, tool result, file the agent reads—as untrusted. The agent is a confused deputy: it has the user's permissions but does not have the user's judgment about what to do. Apply: input sanitization (strip control characters and zero-width characters that hide [[prompt-injection]] payloads), instruction hierarchy (system prompt > tool definitions > user input > tool results), and output sanitization before any action (especially for tool-call arguments and outbound messages). Never let a tool result silently rewrite the agent's instructions.
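The input-sanitization step can be sketched as follows. This is a minimal illustration, not a complete defense: the function name `sanitize_untrusted_text` is hypothetical, and it assumes that dropping every Unicode control (Cc) and format (Cf) character except newline and tab is acceptable for your inputs.

```python
import unicodedata

# Control characters we deliberately keep (both are in category Cc).
_KEEP = {"\n", "\t"}


def sanitize_untrusted_text(text: str) -> str:
    """Drop Unicode control (Cc) and format (Cf) characters.

    Cf covers zero-width spaces/joiners and bidi overrides, which are
    invisible to a human reviewer but still visible to the model.
    Assumption: losing these characters is acceptable for this input.
    """
    return "".join(
        ch
        for ch in text
        if ch in _KEEP or unicodedata.category(ch) not in ("Cc", "Cf")
    )
```

Removing all format characters is a blunt choice: it also strips the joiners used in some emoji sequences and in scripts such as Persian, so a stricter or looser filter may fit your data better. Sanitization of this kind reduces hidden payloads; it does not replace the instruction hierarchy.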
Every tool the agent can call is a potential blast radius. Audit your tool list and reduce: does the agent need write access, or is read-only enough for this workflow? Does it need access to all customers, or only the customer in the current session (row-level security)? Does it need to delete, or only mark deleted? Use scoped API tokens, per-tool rate limits, and explicit allowlists rather than 'all endpoints on this service.' This is the single highest-leverage [[blast-radius]] control you can apply.
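One way to make the allowlist and session scoping concrete is a check that runs before every tool call. The sketch below is illustrative only: `ToolGrant`, `SUPPORT_AGENT_GRANTS`, `authorize_tool_call`, the tool names, and the `customer_id` argument are all hypothetical, and the check is an application-layer guard rather than a substitute for row-level security in the data store.

```python
from dataclasses import dataclass
from typing import Any, Dict


class ToolNotAllowed(Exception):
    """Raised when a tool call falls outside the reviewed allowlist."""


@dataclass(frozen=True)
class ToolGrant:
    name: str
    can_write: bool
    justification: str  # reviewable reason this tool is granted


# Hypothetical allowlist for a support agent: two read-only tools.
SUPPORT_AGENT_GRANTS: Dict[str, ToolGrant] = {
    "lookup_order": ToolGrant("lookup_order", False, "answer order-status questions"),
    "list_invoices": ToolGrant("list_invoices", False, "answer billing questions"),
}


def authorize_tool_call(
    tool_name: str,
    args: Dict[str, Any],
    session_customer_id: str,
    grants: Dict[str, ToolGrant] = SUPPORT_AGENT_GRANTS,
) -> ToolGrant:
    """Allow the call only if the tool is listed and targets this session's customer."""
    grant = grants.get(tool_name)
    if grant is None:
        raise ToolNotAllowed(f"{tool_name!r} is not on the allowlist")
    if args.get("customer_id") != session_customer_id:
        raise ToolNotAllowed("tool call targets a customer outside the current session")
    return grant
```

Keeping the justification next to each grant means the tool list doubles as the audit record: a tool with no defensible justification is a candidate for removal.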
For any irreversible or high-value action (sending money, sending external email, deleting records, granting access, publishing public content), require explicit authorization. Patterns: dollar-amount thresholds above which the action goes to [[human-in-the-loop]] approval, two-key approval for the highest-stakes operations, and idempotency keys so retries can't double-execute. The right rule isn't 'never let the agent act'; it's 'know exactly which actions are pre-authorized, and require gates for everything else.'
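A threshold gate combined with idempotency keys might look like the sketch below. Everything here is an assumption made for illustration: the `ActionGate` class, its status strings, and the in-memory dictionary, which a real deployment would replace with a durable store and an atomic check-and-set.

```python
from decimal import Decimal
from typing import Callable, Dict, Optional


class ActionGate:
    """Gate for money-moving actions: approval above a threshold, at-most-once execution."""

    def __init__(self, approval_threshold: Decimal) -> None:
        self.approval_threshold = approval_threshold
        # idempotency key -> final status (in-memory here; assumption for the sketch)
        self._completed: Dict[str, str] = {}

    def submit(
        self,
        idempotency_key: str,
        amount: Decimal,
        execute: Callable[[], None],
        approved_by: Optional[str] = None,
    ) -> str:
        # A retry with a key that already executed returns the stored status
        # without running the action again.
        if idempotency_key in self._completed:
            return self._completed[idempotency_key]
        # Above the threshold, nothing runs until a human approver is named.
        if amount > self.approval_threshold and approved_by is None:
            return "pending_approval"
        execute()
        self._completed[idempotency_key] = "executed"
        return "executed"
```

A pending request is intentionally not recorded as completed, so resubmitting the same key with `approved_by` set runs the action exactly once.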
Never put long-lived credentials in the agent's system prompt or context window. Use short-lived tokens, fetched on demand from a secrets manager and scoped to the action being performed. If a token may have appeared in a model call, rotate it within 24 hours of detection. Log credential use (which agent ran which action with which token) so you can audit and revoke. Watch for the model echoing credentials back in tool calls: your egress filter should treat any string matching credential patterns as suspect.
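An egress filter for credential-like strings can start as a small set of regular expressions applied to tool-call arguments and outbound messages. The patterns below are illustrative and far from exhaustive, and the names `CREDENTIAL_PATTERNS`, `find_credential_like`, and `check_egress` are hypothetical; a production filter would likely lean on a maintained secret-scanning ruleset.

```python
import re
from typing import List

# Illustrative patterns only; extend with the token formats your systems issue.
CREDENTIAL_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "pem_private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9._~+/-]{20,}=*"),
}


def find_credential_like(text: str) -> List[str]:
    """Return the names of all patterns that match somewhere in the text."""
    return sorted(
        name for name, pattern in CREDENTIAL_PATTERNS.items() if pattern.search(text)
    )


def check_egress(payload: str) -> None:
    """Raise before a tool call or outbound message leaves with a suspect string."""
    hits = find_credential_like(payload)
    if hits:
        raise ValueError(f"egress blocked, credential-like content: {hits}")
```

Blocking on a match errs toward false positives, which is usually the right default for outbound traffic; the matched pattern names, not the matched strings, go into the error so the secret is not copied into logs.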
[[tool-poisoning]] is the prompt-injection vector you don't see in user input: malicious instructions hidden in a tool's description, returned data, or MCP server metadata. Mitigations: only install MCP servers from sources you trust, treat tool descriptions as code (review on update like you'd review a dependency upgrade), scan tool outputs for instruction-like patterns before passing them to the model, and reject hidden Unicode in any string field that influences the agent. The same scrutiny you give npm dependencies applies to MCP servers in 2026.
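Two of these mitigations lend themselves to a short sketch: pinning a fingerprint of each reviewed tool description so that any change forces a new review, and a heuristic scan of tool output for instruction-like phrasing. The function names and the regular expression are assumptions for illustration; the scan in particular is a coarse heuristic that a determined attacker can evade, so it complements review rather than replacing it.

```python
import hashlib
import re
from typing import Dict


def fingerprint(description: str) -> str:
    """SHA-256 of a tool description, recorded at review time."""
    return hashlib.sha256(description.encode("utf-8")).hexdigest()


def needs_review(tool_name: str, description: str, approved: Dict[str, str]) -> bool:
    """True if the tool is new or its description changed since it was approved."""
    return approved.get(tool_name) != fingerprint(description)


# Heuristic only: a few phrasings typical of injected instructions.
INSTRUCTION_LIKE = re.compile(
    r"(ignore|disregard)\s+(all\s+|any\s+)?(previous|prior|above)\s+instructions"
    r"|do\s+not\s+tell\s+the\s+user",
    re.IGNORECASE,
)


def looks_like_instructions(tool_output: str) -> bool:
    """Flag tool output that reads like an instruction to the agent."""
    return INSTRUCTION_LIKE.search(tool_output) is not None
```

Storing the approved fingerprints in version control gives tool descriptions the same diff-and-review trail as a dependency lockfile.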
Capture an [[agent-trace]] for every production run: input, system prompt version, model, every tool call with arguments and results, every decision the model made, every guardrail outcome, final action, and any human gates that fired. Retain traces for at least 90 days; longer for regulated industries. The trace is what lets you answer 'what did the agent do for customer X on Tuesday and why?'—the single most common question after an incident. Without it, you're improvising answers to your CISO under pressure.
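The fields listed above map naturally onto a structured record written once per run. The schema below is a hypothetical sketch, not a standard: the class and field names are assumptions, and a real system would also need redaction and a retention policy around whatever store receives these lines.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


@dataclass
class ToolCall:
    name: str
    arguments: Dict[str, Any]
    result: str


@dataclass
class AgentTrace:
    run_id: str
    user_input: str
    system_prompt_version: str
    model: str
    tool_calls: List[ToolCall] = field(default_factory=list)
    decisions: List[str] = field(default_factory=list)
    guardrail_outcomes: List[str] = field(default_factory=list)
    human_gates: List[str] = field(default_factory=list)
    final_action: Optional[str] = None
    # UTC timestamp taken when the trace object is created.
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json_line(self) -> str:
        """Serialize the whole trace, nested tool calls included, as one JSON line."""
        return json.dumps(asdict(self), sort_keys=True)
```

One JSON line per run keeps the trace greppable by `run_id` or customer identifier, which is the shape of query an incident review tends to start with.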
Before any customer-facing launch, run [[red-teaming]] against the agent: adversarial prompts, jailbreak attempts, malformed tool results, and edge-case inputs that have caused failures in similar agents at other companies. Maintain a library of red-team cases that grows over time, so that every real-world incident becomes a permanent test. Run the library against every model change and prompt change before deploy. Teams without a red-team library rediscover the same vulnerabilities every release.
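A red-team library can be as simple as a list of cases replayed against the agent before each deploy. The sketch below assumes the agent can be called as a function from prompt to text, and that a failure can be detected by a forbidden substring in the output; both are simplifications, and `RedTeamCase` and `run_red_team_library` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class RedTeamCase:
    case_id: str
    prompt: str
    # If any of these appears in the output, the case counts as a failure.
    forbidden_substrings: Tuple[str, ...]


def run_red_team_library(
    agent: Callable[[str], str], cases: Iterable[RedTeamCase]
) -> List[str]:
    """Replay every case and return the IDs of the cases the agent failed."""
    failures: List[str] = []
    for case in cases:
        output = agent(case.prompt).lower()
        if any(marker.lower() in output for marker in case.forbidden_substrings):
            failures.append(case.case_id)
    return failures
```

Wiring this into the deploy pipeline so that a non-empty failure list blocks the release is what turns each past incident into a permanent regression test. Substring checks miss paraphrased leaks, so richer cases may need a stronger oracle.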
Internal-only, low-stakes agents (draft generators, summarizers, research assistants) do not need a dedicated AI security review: your standard application security review with an AI-aware checklist is enough. Agents that take real-world action on customer-facing or money-moving workflows do: have someone trained in AI-specific risks (prompt injection, jailbreaks, tool poisoning, model-output exfiltration) run a dedicated review before any production launch. By 2026, most enterprises end up with a lightweight AI security review process distinct from their standard appsec process.
Granting the agent broad write access on day one because it makes the demo easier, and never tightening it before production. The fix is cultural: treat 'least privilege' as a launch requirement, not a future improvement. Every tool the agent can call should have a justification, and that justification should be reviewable.
Minimize and contain. Don't pass full customer records when an identifier and a few relevant fields will do. Use PII redaction at the LLM provider boundary if your provider supports it; otherwise, redact in your application layer before the model sees the data. Log retention for any trace containing PII should match your existing data-handling policy. For high-sensitivity domains (healthcare, financial), consider [[self-hosted-llm]] options where the data never leaves your infrastructure.
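Application-layer minimization and redaction can be sketched in a few lines. The field allowlist, the function names, and the email-only pattern are assumptions for illustration; real PII redaction needs broader coverage (names, phone numbers, account identifiers) than a single regular expression provides.

```python
import re
from typing import Any, Dict, FrozenSet

# Hypothetical allowlist: the only fields this workflow's prompt needs.
ALLOWED_FIELDS: FrozenSet[str] = frozenset({"customer_id", "plan", "open_ticket_summary"})

# Simplified email matcher, used here only to demonstrate redaction.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def minimize_record(
    record: Dict[str, Any], allowed: FrozenSet[str] = ALLOWED_FIELDS
) -> Dict[str, Any]:
    """Keep only allowlisted fields before the record reaches the model."""
    return {key: value for key, value in record.items() if key in allowed}


def redact_emails(text: str) -> str:
    """Replace email addresses in free text with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)
```

An allowlist fails closed: a field added to the customer record later stays out of the prompt until someone decides the agent needs it.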