Improving AI-generated tests using mutation testing

So, I vibe-coded me a simple web app the other day (Pingu, a small web monitoring and alerting tool, just for internal use).

To increase the chances of it working correctly, I asked the AIs — there were multiple involved — to write tests as they were writing the code. This is standard red/green TDD practice, and is apparently very effective for making sure the AI produces meaningful tests.

I wouldn't know, as my request for TDD was summarily ignored. I'd be annoyed, but I remember I always avoided that sort of thing too, so I can't be too harsh on the poor clankers. No matter. I asked the AI to write the tests after the fact.

Now, when you ask an AI to write tests against existing working code, it often does stupid stuff: a certain percentage of the tests will invariably be tautological, testing themselves, the mocks, the framework, or some combination of these.
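To make that concrete, here's the kind of tautological test I mean. This is a made-up example (not from Pingu's actual suite): it stubs a return value on a mock and then asserts the mock returned what it was told to, so the real logic is never exercised.

```python
from unittest import mock

def test_fetch_status_tautological():
    # This "test" only verifies the mock itself: we configure a
    # return value, then assert that same value came back. No
    # production code runs, so no bug can ever make it fail.
    fetcher = mock.Mock()
    fetcher.fetch_status.return_value = 200
    assert fetcher.fetch_status("https://example.com") == 200

test_fetch_status_tautological()
```

It passes forever, pads the coverage number, and tests nothing.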

To combat this, I asked another AI to carefully review the tests and identify those that don't make sense. As expected, it found more than a few. I presented these findings to the first one and asked it to improve or delete the offending tests. After this was done, I asked for a re-review and adjustment again. After two rounds, I was pretty happy with the results.

Okay, so now I had tests that made sense and ended up with 98% test coverage. Not bad. But do these tests actually catch bugs?
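Coverage only proves the lines ran, not that anything was checked. A quick sketch of the failure mode, with a hypothetical helper (not Pingu's code): this test gives `is_down` 100% line coverage, yet it would keep passing if the comparison were broken.

```python
def is_down(status_code):
    # Hypothetical helper: a site counts as down on any 5xx response.
    return status_code >= 500

def test_is_down_weak():
    # Executes every line of is_down (full coverage!) but asserts
    # nothing about the results, so almost any bug slips through.
    is_down(503)
    is_down(200)

test_is_down_weak()
```

The stronger version would assert `is_down(503)` is true and `is_down(200)` is false. Mutation testing is essentially a way to flush out tests like the weak one above.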

Which brings me finally to the point of this blog post: mutation testing. I asked AI (in a clean new session — all of these steps were done in separate conversations to avoid context pollution) to review the code without looking at the tests, and to write up possible places where we can intentionally introduce a bug.

The reason to do this without looking at the tests: I didn't want it to see what's being tested and mutate exactly that. By staying blind to the tests, we have a chance of stumbling upon a real uncaught bug.

Here's my prompt verbatim:

> We have extensive test suite (98% test coverage), but I don't know if tests are reliable.
>
> Go through the codebase, (not tests) and identify places where we could introduce realistic bugs in the codebase, and how to introduce a bug. Write up doc/TESTING-PLAN.md with all of those cases.
>
> Do not change the code yet, just write up the detailed testing plan document (markdown). I'll review it and then we'll do testing case by case and check if our tests detect the problem.
>
> Remember, this is important: do not look at the tests. If you let them into your context they will introduce biases and we will have to start from scratch.

Yeah, it's the good ol' “remember, this is important!” trick. Does it help? Dunno, can't hurt, right? To make sure it didn't cheat anyway, I checked the session output for any access to test files. (It didn't cheat. Good bot.)

Once the test cases (mutations) were defined, I cleared the context again and let it rip. The results:

> All 40 mutation tests are complete. Here's the summary:
>
> Results: 30/40 mutations caught (75%)
>
> The test suite is solid on core logic but has 10 gaps, mostly in: [...boring details omitted...]
>
> The full results with suggested tests for each gap are in doc/TESTING-RESULTS.md.

A quarter of the mutations produced bugs that weren't caught. Now I wish I'd asked the AI to come up with a thousand cases! But that probably wouldn't be as effective — AI has a tendency to give you what you ask for even if it's obviously stupid, false, or bullshit.

Mutation testing is an old technique. What's interesting here is that AI can create realistic, plausible mutations much cheaper (in time, money, tears, and sweat) than a human developer. This greatly increases the technique's usefulness in general, and especially when dealing with LLM-generated tests.
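For illustration, here's what a single mutation round looks like in miniature (hypothetical names, not the actual mutations from my plan): flip a boundary operator in a threshold check and see whether the suite goes red. A test that probes the boundary kills the mutant; a test that only checks values far from it would let the mutant survive.

```python
def should_alert(latency_ms, threshold_ms=1000):
    # Hypothetical check: alert once latency reaches the threshold.
    return latency_ms >= threshold_ms

# Mutation: change >= to > on the line above and rerun the suite.
def test_should_alert_boundary():
    assert should_alert(1000)       # fails under the > mutant: caught
    assert not should_alert(999)

test_should_alert_boundary()
```

A surviving mutant means exactly what it meant in my run: a gap where a plausible bug would ship unnoticed, plus a concrete suggestion for the test to add.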

To recap, here's what I aim for when doing AI-heavy coding:

  1. Red/green TDD if feasible — and I now know to watch out for the LLM quietly ignoring the request.
  2. High test coverage, but still use human judgement on which areas are important.
  3. Use the AI to review the test suite and find meaningless tests.
  4. Use mutation testing to find new problem areas and as a means to test the tests. Importantly, do not let AI look at the tests while writing up the mutations.