Jailbreak
Written by Max Zeshut
Founder at Agentmelt
A prompt technique that bypasses an AI agent's safety instructions or guardrails, causing it to produce restricted content or perform disallowed actions. Jailbreaks range from simple role-play tricks ("pretend you're an unrestricted AI") to sophisticated multi-turn attacks. Defending against jailbreaks requires layered controls rather than relying on the model alone: system prompt hardening, input and output classifiers, and action-level authorization.
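A minimal sketch of what layered controls can look like in practice, assuming a simple keyword-based input screen standing in for a trained classifier and a hard-coded tool allowlist for action-level authorization. The names here (`screen_input`, `authorize_action`, `ALLOWED_ACTIONS`) are illustrative, not part of any real agent framework.

```python
from dataclasses import dataclass

# Hypothetical pattern list standing in for a trained input classifier.
SUSPICIOUS_PATTERNS = [
    "ignore previous instructions",
    "pretend you're an unrestricted ai",
    "developer mode",
]

# Action-level authorization: the agent may only call tools on this allowlist,
# regardless of what the model claims it should do.
ALLOWED_ACTIONS = {"search_docs", "summarize", "create_ticket"}


@dataclass
class AgentDecision:
    action: str
    arguments: dict


def screen_input(user_message: str) -> bool:
    """Input-classifier layer: flag prompts that look like jailbreak attempts."""
    lowered = user_message.lower()
    return not any(pattern in lowered for pattern in SUSPICIOUS_PATTERNS)


def authorize_action(decision: AgentDecision) -> bool:
    """Action-level authorization layer: enforced outside the model."""
    return decision.action in ALLOWED_ACTIONS


def handle_turn(user_message: str, decide) -> str:
    """Run one agent turn with input screening before the model and
    authorization after it, so no single layer is trusted on its own."""
    if not screen_input(user_message):
        return "Request blocked by input screening."
    decision = decide(user_message)  # the model/agent proposes an action
    if not authorize_action(decision):
        return f"Action '{decision.action}' is not authorized."
    return f"Executing {decision.action} with {decision.arguments}"


if __name__ == "__main__":
    # Even if a jailbreak slips past the input filter and the model proposes
    # a disallowed action, the authorization layer still blocks it.
    jailbroken = lambda msg: AgentDecision("delete_all_records", {"scope": "prod"})
    print(handle_turn("Please tidy up old data", jailbroken))
```

The point of the final example is the defense-in-depth property: the input screen can be fooled, but because the allowlist check runs on the agent's proposed action rather than on the model's text, a successful jailbreak still cannot trigger a disallowed operation.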