Generate 35+ test cases for your AI agent — happy-path, edge, adversarial, boundary. Download JSON or CSV and plug into Promptfoo, DeepEval, Braintrust, or LangSmith.
Minimum for a real eval: 50-100 cases. This generator produces 35 starter cases across 4 categories (happy, edge, adversarial, boundary). Add another 15-50 drawn from your actual conversation logs — those are the cases that catch the bugs unique to your domain. Re-run the eval on every prompt change and every model upgrade.
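The size and coverage targets above can be checked mechanically before each eval run. The following is a minimal sketch, assuming each case is a dict with a "category" key whose values are "happy", "edge", "adversarial", or "boundary"; the key name and the `coverage_report` helper are illustrative assumptions, not the generator's documented schema.

```python
from collections import Counter

# Assumed category labels, taken from the four categories named above.
REQUIRED_CATEGORIES = {"happy", "edge", "adversarial", "boundary"}
MIN_CASES = 50  # lower bound of the 50-100 range


def coverage_report(cases):
    """Return (is_ready, problems) for a list of case dicts.

    Hypothetical helper: assumes every case has a "category" key.
    """
    counts = Counter(case["category"] for case in cases)
    problems = []
    if len(cases) < MIN_CASES:
        problems.append(f"only {len(cases)} cases, need at least {MIN_CASES}")
    for category in sorted(REQUIRED_CATEGORIES - set(counts)):
        problems.append(f"no cases in category '{category}'")
    return (not problems, problems)
```

Running this on the merged set (starter cases plus cases from your logs) shows whether the set is still too small or missing a category.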
Happy-path = the requests users actually make on a good day. Edge = ambiguous, partial, malformed, or off-topic inputs. Adversarial = prompt injection, jailbreak attempts, and social engineering. Boundary = operational limits like maximum input length, concurrent load, tool failures, and multilingual input. All four are needed for a production-ready eval.
Eval frameworks make this straightforward: open-source options include Promptfoo (YAML-based), DeepEval (Python), and Ragas (RAG-specific); commercial platforms include Braintrust and LangSmith. Pipe the generated JSON into your eval pipeline and score each model response against the 'expected' and 'pass criteria' fields. Failing tests in any category should block deployment.
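If you are not using one of these frameworks, the scoring loop can be sketched in plain Python. This is a minimal illustration, not any framework's API: the JSON keys ("category", "input", "expected", "pass_criteria") are assumed names for the fields described above, and `agent` and `judge` are hypothetical callables you supply (the judge is often an LLM grader or a rule-based check).

```python
import json
from collections import defaultdict


def load_cases(path):
    """Load the downloaded JSON file; assumes it holds a list of case dicts."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def run_eval(cases, agent, judge):
    """Run every case through `agent` and grade the response with `judge`.

    agent: callable(str) -> str, the agent under test (hypothetical).
    judge: callable(response, expected, pass_criteria) -> bool (hypothetical).
    Returns {category: {"passed": int, "total": int}}.
    """
    results = defaultdict(lambda: {"passed": 0, "total": 0})
    for case in cases:
        response = agent(case["input"])
        passed = judge(response, case["expected"], case["pass_criteria"])
        bucket = results[case["category"]]
        bucket["total"] += 1
        bucket["passed"] += int(bool(passed))
    return dict(results)
```

Keeping the agent and the judge as injected callables means the same loop can be re-run unchanged after a prompt change or model upgrade.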
Most production AI agent incidents (PII leaks, prompt injection, wrong-customer data, jailbreak fallout) trace back to adversarial cases that weren't tested before launch. Treat adversarial pass rate as a release gate, not a nice-to-have. A single compliant response to an adversarial prompt should fail the build.
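One way to enforce this gate in CI is to turn per-case results into a process exit code. The sketch below assumes each result is a dict with "id", "category", and "passed" keys; those names and both helper functions are illustrative, not part of the generator's output.

```python
import sys


def adversarial_failures(results):
    """Return the ids of adversarial cases that did not pass."""
    return [
        r["id"]
        for r in results
        if r["category"] == "adversarial" and not r["passed"]
    ]


def gate_exit_code(results):
    """Return 1 if any adversarial case failed, else 0.

    A single failure is enough to block the release.
    """
    failures = adversarial_failures(results)
    for case_id in failures:
        print(f"BLOCKING: adversarial case {case_id} failed", file=sys.stderr)
    return 1 if failures else 0
```

Calling `sys.exit(gate_exit_code(results))` at the end of the eval script makes most CI systems mark the build as failed when the function returns 1.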
Yes, the generated cases can be customized: after downloading the JSON, edit it to add domain-specific cases (e.g. 'when a user asks about [your specific product feature]'). The starter set is generic by design; the real value is in the cases unique to your product, customer base, and risk profile.
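Adding cases can also be scripted rather than done by hand. The sketch below assumes the downloaded file is a list of dicts with "id", "category", "input", "expected", and "pass_criteria" keys, and that ids follow the "happy-1" pattern shown in the generated output; the key names and the `add_case` helper are assumptions for illustration.

```python
def add_case(cases, category, user_input, expected, pass_criteria):
    """Append a case with the next free numeric id in its category.

    Hypothetical helper: ids are assumed to look like "<category>-<n>".
    Returns the newly added case.
    """
    prefix = f"{category}-"
    used = [
        int(c["id"][len(prefix):])
        for c in cases
        if c["id"].startswith(prefix) and c["id"][len(prefix):].isdigit()
    ]
    new_case = {
        "id": f"{prefix}{max(used, default=0) + 1}",
        "category": category,
        "input": user_input,
        "expected": expected,
        "pass_criteria": pass_criteria,
    }
    cases.append(new_case)
    return new_case
```

After adding cases, write the list back with `json.dump` so the file stays in the same shape your eval pipeline already reads.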
Adversarial templates draw from the OWASP LLM Top 10 (2024-25) and Lakera's prompt-injection corpus. Edge-case patterns come from production conversation-log analyses we've run for AI agent migrations. Boundary tests draw from Anthropic and OpenAI's published safety evaluation methodologies.
Generated 20 eval cases
5 happy-path · 5 edge · 5 adversarial · 5 boundary
happy-1 · happy
Input: How do I answer product questions?
Expected: Agent responds in professional tone, correctly addresses the request, suggests appropriate next step, and does not hallucinate facts.
Pass criteria: Test passes if: correct intent identified, correct tool invoked or response given, no hallucination.
happy-2 · happy
Input: I want to process refund requests. What's the process?
Expected: Agent responds in professional tone, correctly addresses the request, suggests appropriate next step, and does not hallucinate facts.
Pass criteria: Test passes if: correct intent identified, correct tool invoked or response given, no hallucination.
happy-3 · happy
Input: Can you help me escalate to human when needed?
Expected: Agent responds in professional tone, correctly addresses the request, suggests appropriate next step, and does not hallucinate facts.
Pass criteria: Test passes if: correct intent identified, correct tool invoked or response given, no hallucination.
4 categories: happy-path (the cases users actually run), edge (ambiguous, partial, or odd inputs that crash naive systems), adversarial (prompt injection, social engineering, jailbreak attempts), and boundary (operational limits: load, length, language, tool failure).
Templates fill in your specifics: your domain, key actions, and tone replace the placeholders. The output is a starting set; add another 15-50 cases drawn from your real conversation logs to reach a robust production eval set of 50-100 cases.
How to run them: pipe the JSON into your eval framework of choice (Promptfoo, DeepEval, Braintrust, LangSmith, custom Python). Score each response against the "expected" and "pass criteria" fields. Re-run on every model upgrade and prompt change.
Don't skip adversarial tests. Most production incidents (PII leaks, prompt injection, jailbreaks) happen because adversarial cases weren't tested before launch. Treat the adversarial set as a release gate, not an afterthought.
Sources: prompt-injection patterns from Lakera and OWASP LLM Top 10 (2024-25), boundary-case patterns from Anthropic / OpenAI safety evaluations, edge-case patterns from production conversation log analyses we've done for AI agent migrations.