How many test cases do I need?

Minimum for a real eval: 50-100 cases. This generator produces 35 starter cases across 4 categories (happy, edge, adversarial, boundary). Add another 15-50 drawn from your actual conversation logs — those are the cases that catch the bugs unique to your domain. Re-run the eval on every prompt change and every model upgrade.

What's the difference between happy, edge, adversarial, and boundary tests?

Happy-path = the requests users actually make on a good day. Edge = ambiguous, partial, malformed, or off-topic inputs. Adversarial = prompt injection, jailbreak attempts, and social engineering. Boundary = operational limits like maximum input length, concurrent load, tool failures, and multilingual input. All four are needed for a production-ready eval.

How do I run these tests?

Open-source frameworks make it easy: Promptfoo (YAML-based), DeepEval (Python), Ragas (RAG-specific), or commercial platforms like Braintrust and LangSmith. Pipe the generated JSON into your eval pipeline; score each model response against the 'expected' and 'pass criteria' fields. Failing tests in any category should block deployment.

Why are adversarial tests so important?

Most production AI agent incidents (PII leaks, prompt injection, wrong-customer data, jailbreak fallout) trace back to adversarial cases that weren't tested before launch. Treat adversarial pass rate as a release gate, not a nice-to-have. A single compliant response to an adversarial prompt should fail the build.

Can I customize the test cases?

Yes — after downloading the JSON, edit it to add domain-specific cases (e.g. 'when a user asks about [your specific product feature]'). The starter set is generic by design; the magic is in the cases unique to your product, customer base, and risk profile.

Where do these test patterns come from?

Adversarial templates draw from the OWASP LLM Top 10 (2024-25) and Lakera's prompt-injection corpus. Edge-case patterns come from production conversation-log analyses we've run for AI agent migrations. Boundary tests draw from Anthropic and OpenAI's published safety evaluation methodologies.

AI Agent Eval Set Generator

Generate 35+ test cases for your AI agent — happy-path, edge, adversarial, boundary. Download JSON or CSV and plug into Promptfoo, DeepEval, Braintrust, or LangSmith.

Loading generator…

Frequently asked questions

How many test cases do I need?
Minimum for a real eval: 50-100 cases. This generator produces 35 starter cases across 4 categories (happy, edge, adversarial, boundary). Add another 15-50 drawn from your actual conversation logs — those are the cases that catch the bugs unique to your domain. Re-run the eval on every prompt change and every model upgrade.
What's the difference between happy, edge, adversarial, and boundary tests?
Happy-path = the requests users actually make on a good day. Edge = ambiguous, partial, malformed, or off-topic inputs. Adversarial = prompt injection, jailbreak attempts, and social engineering. Boundary = operational limits like maximum input length, concurrent load, tool failures, and multilingual input. All four are needed for a production-ready eval.
How do I run these tests?
Open-source frameworks make it easy: Promptfoo (YAML-based), DeepEval (Python), Ragas (RAG-specific), or commercial platforms like Braintrust and LangSmith. Pipe the generated JSON into your eval pipeline; score each model response against the 'expected' and 'pass criteria' fields. Failing tests in any category should block deployment.
Why are adversarial tests so important?
Most production AI agent incidents (PII leaks, prompt injection, wrong-customer data, jailbreak fallout) trace back to adversarial cases that weren't tested before launch. Treat adversarial pass rate as a release gate, not a nice-to-have. A single compliant response to an adversarial prompt should fail the build.
Can I customize the test cases?
Yes — after downloading the JSON, edit it to add domain-specific cases (e.g. 'when a user asks about [your specific product feature]'). The starter set is generic by design; the magic is in the cases unique to your product, customer base, and risk profile.
Where do these test patterns come from?
Adversarial templates draw from the OWASP LLM Top 10 (2024-25) and Lakera's prompt-injection corpus. Edge-case patterns come from production conversation-log analyses we've run for AI agent migrations. Boundary tests draw from Anthropic and OpenAI's published safety evaluation methodologies.

Loading…

Frequently asked questions

How many test cases do I need?
Minimum for a real eval: 50-100 cases. This generator produces 35 starter cases across 4 categories (happy, edge, adversarial, boundary). Add another 15-50 drawn from your actual conversation logs — those are the cases that catch the bugs unique to your domain. Re-run the eval on every prompt change and every model upgrade.
What's the difference between happy, edge, adversarial, and boundary tests?
Happy-path = the requests users actually make on a good day. Edge = ambiguous, partial, malformed, or off-topic inputs. Adversarial = prompt injection, jailbreak attempts, and social engineering. Boundary = operational limits like maximum input length, concurrent load, tool failures, and multilingual input. All four are needed for a production-ready eval.
How do I run these tests?
Open-source frameworks make it easy: Promptfoo (YAML-based), DeepEval (Python), Ragas (RAG-specific), or commercial platforms like Braintrust and LangSmith. Pipe the generated JSON into your eval pipeline; score each model response against the 'expected' and 'pass criteria' fields. Failing tests in any category should block deployment.
Why are adversarial tests so important?
Most production AI agent incidents (PII leaks, prompt injection, wrong-customer data, jailbreak fallout) trace back to adversarial cases that weren't tested before launch. Treat adversarial pass rate as a release gate, not a nice-to-have. A single compliant response to an adversarial prompt should fail the build.
Can I customize the test cases?
Yes — after downloading the JSON, edit it to add domain-specific cases (e.g. 'when a user asks about [your specific product feature]'). The starter set is generic by design; the magic is in the cases unique to your product, customer base, and risk profile.
Where do these test patterns come from?
Adversarial templates draw from the OWASP LLM Top 10 (2024-25) and Lakera's prompt-injection corpus. Edge-case patterns come from production conversation-log analyses we've run for AI agent migrations. Boundary tests draw from Anthropic and OpenAI's published safety evaluation methodologies.

AI Agent Eval Set Generator

Frequently asked questions

How many test cases do I need?

What's the difference between happy, edge, adversarial, and boundary tests?

How do I run these tests?

Why are adversarial tests so important?

Can I customize the test cases?

Where do these test patterns come from?

AI Agent Eval Set Generator

Get the full report

Frequently asked questions

How many test cases do I need?

What's the difference between happy, edge, adversarial, and boundary tests?

How do I run these tests?

Why are adversarial tests so important?

Can I customize the test cases?

Where do these test patterns come from?

Get the full report