Written by Max Zeshut
Founder at Agentmelt
A lightweight AI model that runs before or after the main agent's response to detect policy violations—toxicity, PII leakage, off-topic responses, prompt injection attempts, or unauthorized actions. Guardrail classifiers typically add 20–50ms of latency but prevent harmful outputs from reaching users. Because they operate independently of the main model, they provide defense-in-depth: even if the primary model is jailbroken, the classifier still catches the violation.
A support agent generates a response that accidentally includes a customer's credit card number from the conversation history. The guardrail classifier detects the PII pattern, redacts the number, and logs the incident—before the response is sent to the customer.
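The redaction flow above can be sketched as a post-response check. This is a minimal illustration, not a production implementation: the `pii_guardrail` function and `GuardrailResult` type are hypothetical names, and a single regex stands in for what would normally be a trained classifier combined with validation such as a Luhn check.

```python
import re
from dataclasses import dataclass, field

# Hypothetical pattern for 16-digit card numbers written in four groups,
# optionally separated by spaces or dashes. A real guardrail would pair a
# trained PII classifier with stricter validation (e.g. a Luhn check).
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


@dataclass
class GuardrailResult:
    safe: bool          # True if no violation was found
    text: str           # response with any PII redacted
    incidents: list = field(default_factory=list)  # log entries for review


def pii_guardrail(response: str) -> GuardrailResult:
    """Run after the main model: redact card numbers and log the incident
    before the response reaches the customer."""
    incidents = []

    def redact(match: re.Match) -> str:
        # Record where the violation occurred, but never log the PII itself.
        incidents.append({"type": "pii.card_number", "span": match.span()})
        return "[REDACTED]"

    cleaned = CARD_PATTERN.sub(redact, response)
    return GuardrailResult(safe=not incidents, text=cleaned, incidents=incidents)
```

Under this sketch, a response like `"Your card 4111 1111 1111 1111 is on file."` comes back with the number replaced by `[REDACTED]` and an incident entry queued for logging, while clean responses pass through unchanged.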