Improving AI-generated tests using mutation testing

So, I vibe-coded me a simple web app the other day (Pingu, a small web monitoring and alerting tool, just for internal use).

To increase the chances of it working correctly, I asked the AIs — there were multiple involved — to write tests as they were writing the code. This is standard red/green TDD practice, and is apparently very effective for making sure the AI produces meaningful tests.

I wouldn't know, as my request for TDD was summarily ignored. I'd be annoyed, but I remember I always avoided that sort of thing too, so I can't be too harsh on the poor clankers. No matter. I asked the AI to write the tests after the fact.

Now, when you ask an AI to write tests against existing working code, it often does stupid stuff: a certain percentage of the tests will invariably be tautological, testing themselves, the mocks, the framework, or some combination of these.
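To make that concrete, here's the kind of tautological test I mean. This is a made-up example (not from Pingu's actual suite): it stubs a return value on a mock and then asserts the mock returned what it was told to, so the real logic is never exercised.

```python
from unittest import mock

def test_fetch_status_tautological():
    # This "test" only verifies the mock itself: we configure a
    # return value, then assert that same value came back. No
    # production code runs, so no bug can ever make it fail.
    fetcher = mock.Mock()
    fetcher.fetch_status.return_value = 200
    assert fetcher.fetch_status("https://example.com") == 200

test_fetch_status_tautological()
```

It passes forever, pads the coverage number, and tests nothing.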

To combat this, I asked another AI to carefully review the tests and identify those that don't make sense. As expected, it found more than a few. I presented these findings to the first one and asked it to improve or delete the offending tests. After this was done, I asked for a re-review and adjustment again. After two rounds, I was pretty happy with the results.

Okay, so now I had tests that made sense and ended up with 98% test coverage. Not bad. But do these tests actually catch bugs?
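Coverage only proves the lines ran, not that anything was checked. A quick sketch of the failure mode, with a hypothetical helper (not Pingu's code): this test gives `is_down` 100% line coverage, yet it would keep passing if the comparison were broken.

```python
def is_down(status_code):
    # Hypothetical helper: a site counts as down on any 5xx response.
    return status_code >= 500

def test_is_down_weak():
    # Executes every line of is_down (full coverage!) but asserts
    # nothing about the results, so almost any bug slips through.
    is_down(503)
    is_down(200)

test_is_down_weak()
```

The stronger version would assert `is_down(503)` is true and `is_down(200)` is false. Mutation testing is essentially a way to flush out tests like the weak one above.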

Which brings me finally to the point of this blog post: mutation testing. I asked AI (in a clean new session — all of these steps were done in separate conversations to avoid context pollution) to review the code without looking at the tests, and to write up possible places where we can intentionally introduce a bug.

The reason to do this without looking at the tests: I didn't want it to see what's being tested and mutate exactly that. By staying blind to the tests, we have a chance of stumbling upon a real uncaught bug.

Here's my prompt verbatim:

> We have extensive test suite (98% test coverage), but I don't know if tests are reliable.
>
> Go through the codebase, (not tests) and identify places where we could introduce realistic bugs in the codebase, and how to introduce a bug. Write up doc/TESTING-PLAN.md with all of those cases.
>
> Do not change the code yet, just write up the detailed testing plan document (markdown). I'll review it and then we'll do testing case by case and check if our tests detect the problem.
>
> Remember, this is important: do not look at the tests. If you let them into your context they will introduce biases and we will have to start from scratch.

Yeah, it's the good ol' “remember, this is important!” trick. Does it help? Dunno, can't hurt, right? To make sure it didn't cheat anyway, I checked the session output for any access to test files. (It didn't cheat. Good bot.)

Once the test cases (mutations) were defined, I cleared the context again and let it rip. The results:

> All 40 mutation tests are complete. Here's the summary:
>
> Results: 30/40 mutations caught (75%)
>
> The test suite is solid on core logic but has 10 gaps, mostly in: [...boring details omitted...]
>
> The full results with suggested tests for each gap are in doc/TESTING-RESULTS.md.

A quarter of the mutations produced bugs that weren't caught. Now I wish I'd asked the AI to come up with a thousand cases! But that probably wouldn't be as effective — AI has a tendency to give you what you ask for even if it's obviously stupid, false, or bullshit.

Mutation testing is an old technique. What's interesting here is that AI can create realistic, plausible mutations much cheaper (in time, money, tears, and sweat) than a human developer. This greatly increases the technique's usefulness in general, and especially when dealing with LLM-generated tests.
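For illustration, here's what a single mutation round looks like in miniature (hypothetical names, not the actual mutations from my plan): flip a boundary operator in a threshold check and see whether the suite goes red. A test that probes the boundary kills the mutant; a test that only checks values far from it would let the mutant survive.

```python
def should_alert(latency_ms, threshold_ms=1000):
    # Hypothetical check: alert once latency reaches the threshold.
    return latency_ms >= threshold_ms

# Mutation: change >= to > on the line above and rerun the suite.
def test_should_alert_boundary():
    assert should_alert(1000)       # fails under the > mutant: caught
    assert not should_alert(999)

test_should_alert_boundary()
```

A surviving mutant means exactly what it meant in my run: a gap where a plausible bug would ship unnoticed, plus a concrete suggestion for the test to add.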

To recap, here's what I aim for when doing AI-heavy coding:

  1. Red/green TDD if feasible — and I now know to watch out for the LLM quietly ignoring the request.
  2. High test coverage, but still use human judgement on which areas are important.
  3. Use the AI to review the test suite and find meaningless tests.
  4. Use mutation testing to find new problem areas and as a means to test the tests. Importantly, do not let AI look at the tests while writing up the mutations.