Jailbreak
Written by Max Zeshut
Founder at Agentmelt
A prompt technique that bypasses an AI agent's safety instructions or guardrails, causing it to produce restricted content or perform disallowed actions. Jailbreaks range from simple role-play tricks ("pretend you're an unrestricted AI") to sophisticated multi-turn attacks. Defending against jailbreaks requires layered controls rather than relying on the model alone: system prompt hardening, input and output classifiers, and action-level authorization.
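A minimal sketch of what layered controls can look like in practice, assuming a simple keyword-based input screen standing in for a trained classifier and a hard-coded tool allowlist for action-level authorization. The names here (`screen_input`, `authorize_action`, `ALLOWED_ACTIONS`) are illustrative, not part of any real agent framework.

```python
from dataclasses import dataclass

# Hypothetical pattern list standing in for a trained input classifier.
SUSPICIOUS_PATTERNS = [
    "ignore previous instructions",
    "pretend you're an unrestricted ai",
    "developer mode",
]

# Action-level authorization: the agent may only call tools on this allowlist,
# regardless of what the model claims it should do.
ALLOWED_ACTIONS = {"search_docs", "summarize", "create_ticket"}


@dataclass
class AgentDecision:
    action: str
    arguments: dict


def screen_input(user_message: str) -> bool:
    """Input-classifier layer: flag prompts that look like jailbreak attempts."""
    lowered = user_message.lower()
    return not any(pattern in lowered for pattern in SUSPICIOUS_PATTERNS)


def authorize_action(decision: AgentDecision) -> bool:
    """Action-level authorization layer: enforced outside the model."""
    return decision.action in ALLOWED_ACTIONS


def handle_turn(user_message: str, decide) -> str:
    """Run one agent turn with input screening before the model and
    authorization after it, so no single layer is trusted on its own."""
    if not screen_input(user_message):
        return "Request blocked by input screening."
    decision = decide(user_message)  # the model/agent proposes an action
    if not authorize_action(decision):
        return f"Action '{decision.action}' is not authorized."
    return f"Executing {decision.action} with {decision.arguments}"


if __name__ == "__main__":
    # Even if a jailbreak slips past the input filter and the model proposes
    # a disallowed action, the authorization layer still blocks it.
    jailbroken = lambda msg: AgentDecision("delete_all_records", {"scope": "prod"})
    print(handle_turn("Please tidy up old data", jailbroken))
```

The point of the final example is the defense-in-depth property: the input screen can be fooled, but because the allowlist check runs on the agent's proposed action rather than on the model's text, a successful jailbreak still cannot trigger a disallowed operation.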