How AI Coding Agents Generate Unit Tests (and When to Trust Them)
March 18, 2026
By AgentMelt Team
Writing unit tests is one of the most common—and most effective—uses of AI coding agents. The tedious part of testing isn't figuring out what to test; it's writing the boilerplate, mocking dependencies, and covering edge cases. AI agents handle this well.
But they also make predictable mistakes. Here's how to use them effectively.
What AI coding agents do well
Boilerplate and setup
AI agents excel at generating the scaffolding: imports, test file structure, describe/it blocks, beforeEach hooks, mock setup, and teardown. This is pure mechanical work that follows patterns.
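As a minimal sketch of what that scaffolding looks like, here is a Python stdlib `unittest` version, where `setUp` plays the role of a beforeEach hook. `Cart` is a hypothetical class standing in for real project code:

```python
import unittest

class Cart:
    """Hypothetical class under test (stand-in for real project code)."""
    def __init__(self):
        self.items = []

    def add(self, name, price):
        self.items.append((name, price))

    def total(self):
        return sum(price for _, price in self.items)

class TestCart(unittest.TestCase):
    def setUp(self):
        # Runs before each test, like a beforeEach hook.
        self.cart = Cart()

    def test_starts_empty(self):
        self.assertEqual(self.cart.total(), 0)

    def test_add_updates_total(self):
        self.cart.add("book", 12.5)
        self.assertEqual(self.cart.total(), 12.5)
```

None of this requires judgment about the domain, which is exactly why agents produce it reliably.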
Happy path coverage
Given a function, agents reliably generate tests for the expected behavior. They read the function signature, understand the return type, and write assertions that verify the function does what it's supposed to do.
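For example, given a small pricing function (hypothetical, for illustration), an agent will typically produce a pytest-style happy-path test like this:

```python
def apply_discount(price, percent):
    """Return price after a percentage discount (stand-in for real code)."""
    return round(price * (1 - percent / 100), 2)

def test_apply_discount_happy_path():
    # Assertions derived straight from the signature and intended behavior.
    assert apply_discount(100.0, 20) == 80.0
    assert apply_discount(59.99, 0) == 59.99
```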
Edge case enumeration
Agents are surprisingly good at identifying edge cases: null inputs, empty arrays, boundary values, type coercion issues, and off-by-one errors. They generate more edge cases than most developers think of on a first pass.
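A sketch of what that enumeration looks like in practice, for a hypothetical string helper:

```python
def first_word(text):
    """Return the first whitespace-separated word, or '' if there is none."""
    if text is None:
        return ""
    parts = text.split()
    return parts[0] if parts else ""

# The kind of edge-case spread an agent generates on a first pass:
def test_none_input():
    assert first_word(None) == ""

def test_empty_string():
    assert first_word("") == ""

def test_whitespace_only():
    assert first_word("   ") == ""

def test_single_word():
    assert first_word("hello") == "hello"

def test_leading_whitespace():
    assert first_word("  hi there") == "hi"
```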
Mock generation
For functions with dependencies (database calls, API requests, file I/O), agents generate mocks and stubs that match the expected interface. They handle common patterns like Jest mocks, Sinon stubs, and dependency injection.
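A sketch of the pattern using Python's stdlib `unittest.mock`, with a hypothetical function that depends on an injected HTTP-like client:

```python
from unittest.mock import Mock

def fetch_username(client, user_id):
    """Function under test: depends on an injected client (hypothetical)."""
    response = client.get(f"/users/{user_id}")
    return response["name"]

def test_fetch_username_with_stubbed_client():
    # The mock only needs to match the interface the function actually uses.
    client = Mock()
    client.get.return_value = {"name": "ada"}
    assert fetch_username(client, 42) == "ada"
    client.get.assert_called_once_with("/users/42")
```

The same shape applies with Jest mocks or Sinon stubs; only the syntax changes.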
What they get wrong
Testing implementation, not behavior
AI agents sometimes test how a function works internally rather than what it produces. Tests that assert on internal method calls or specific execution order are brittle and break during refactoring.
Fix: Review generated tests and ask: "Would this test still pass if I refactored the implementation but kept the same behavior?" If not, rewrite the assertion.
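To make the distinction concrete, here is a contrived Python example: the brittle test spies on an internal call, while the robust one checks only the output. Refactoring `normalize` to use `.sort()` would break the first test but not the second.

```python
from unittest.mock import patch

def normalize(names):
    """Lowercase and sort a list of names (hypothetical function)."""
    return sorted(n.lower() for n in names)

# Brittle: asserts on HOW the function works internally.
def test_normalize_brittle():
    with patch("builtins.sorted", wraps=sorted) as spy:
        normalize(["B", "a"])
        spy.assert_called_once()  # breaks if we refactor to list.sort()

# Robust: asserts on WHAT the function produces.
def test_normalize_behavior():
    assert normalize(["B", "a"]) == ["a", "b"]
```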
Hallucinated API surfaces
When testing code that interacts with external libraries, agents sometimes hallucinate method signatures or configuration options that don't exist. The test looks right but fails because the mock doesn't match reality.
Fix: Run the tests immediately after generation. Compilation errors and missing-method failures surface these problems quickly.
Insufficient assertions
Agents sometimes write tests that exercise the code but don't assert meaningful outcomes. A test that calls a function and checks it "doesn't throw" provides very little value.
Fix: Check that each test asserts on the return value, side effects, or state changes that matter. If a test has no assertions (or only checks truthiness), strengthen it.
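A before-and-after sketch of this fix, using a hypothetical CSV helper:

```python
def parse_csv_row(row):
    """Split a CSV row into stripped fields (hypothetical function)."""
    return [field.strip() for field in row.split(",")]

# Weak: exercises the code but only checks truthiness.
def test_parse_weak():
    result = parse_csv_row(" a , b ,c ")
    assert result  # would still pass if stripping were broken

# Strong: pins down the exact outcome that matters.
def test_parse_strong():
    assert parse_csv_row(" a , b ,c ") == ["a", "b", "c"]
```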
Over-mocking
Agents tend to mock aggressively—sometimes mocking the very thing you're trying to test. Over-mocked tests pass regardless of whether the underlying code works.
Fix: Only mock external boundaries (network, database, file system). Let internal modules run as-is.
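A sketch of that boundary, with a hypothetical checkout flow: the external payment gateway is mocked, but the internal pricing logic runs for real, so the test actually verifies it.

```python
from unittest.mock import Mock

def total_price(items):
    """Internal module logic: sum of price * quantity."""
    return sum(i["price"] * i["qty"] for i in items)

def checkout(items, payment_gateway):
    """Charges the computed total via an external gateway (hypothetical)."""
    amount = total_price(items)
    payment_gateway.charge(amount)
    return amount

def test_checkout_mocks_only_the_boundary():
    # Mock the external boundary...
    gateway = Mock()
    # ...and let the internal pricing logic run as-is.
    items = [{"price": 5.0, "qty": 2}, {"price": 1.5, "qty": 1}]
    assert checkout(items, gateway) == 11.5
    gateway.charge.assert_called_once_with(11.5)
```

If `total_price` were mocked too, the test would pass even with a broken pricing bug underneath.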
A practical workflow
Here's the workflow teams report getting the best results with:

1. Generate tests from the function
Point your coding agent at a function and ask for unit tests. Be specific about the framework (Jest, pytest, etc.) and testing patterns your project uses.
2. Run immediately
Don't review AI-generated tests manually first. Run them. Failing tests reveal hallucinations and incorrect assumptions faster than reading.
3. Fix failures
Fix the obvious issues: wrong imports, hallucinated APIs, missing mocks. This usually takes 2–5 minutes per file.
4. Review assertions
Now read the passing tests. Are they asserting meaningful outcomes? Remove or rewrite tests that don't verify actual behavior. Add assertions to tests that exercise code but don't check results.
5. Add what's missing
AI agents rarely achieve 100% branch coverage on the first pass. Check coverage, identify untested branches, and ask the agent to generate tests for specific scenarios you've identified.
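A small sketch of what this step produces, with a hypothetical function whose error branch the first pass missed:

```python
def safe_divide(a, b):
    """Divide, returning None for a zero divisor (hypothetical function)."""
    if b == 0:  # branch a first AI pass often leaves untested
        return None
    return a / b

# First-pass generated test covered only the happy path:
def test_divide():
    assert safe_divide(10, 4) == 2.5

# Follow-up test requested for the specific uncovered branch:
def test_divide_by_zero_branch():
    assert safe_divide(10, 0) is None
```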
6. Commit
The final test file is a collaboration: AI generated the structure and the majority of cases; you refined the assertions and filled the gaps.
Metrics from real teams
Teams using AI for test generation consistently report:
- 60–80% of generated tests pass on first run
- Test writing time decreases by 50–70%
- Coverage increases because developers actually write tests for code they previously skipped
- Net quality: comparable to human-written tests after review, with better edge case coverage
When NOT to use AI-generated tests
- Complex integration tests: Tests that require specific database state, network conditions, or multi-service coordination. AI doesn't understand your infrastructure.
- Performance tests: AI can't meaningfully predict performance thresholds for your system.
- Tests for correctness-critical code: Financial calculations, security logic, or safety-critical systems deserve hand-written tests with explicit reasoning about each case.
The bottom line
AI coding agents are excellent at generating the 80% of unit tests that follow patterns. Use them for boilerplate, happy paths, and edge case enumeration. Invest your time in reviewing assertions, catching over-mocking, and covering the complex scenarios that require domain knowledge.
The goal isn't to trust AI-generated tests blindly. It's to use them as a high-quality first draft that you refine into production-grade tests in a fraction of the time.