AI Makes Code Easy. Testing It Is Still Hard.
AI has made spinning up new tools trivially easy. Need an MCP server? A webhook handler? A CLI tool? You can have working code in minutes. The barrier to creating software has collapsed.
Traditional test infrastructure is hard. Writing reliable, deterministic test suites requires deep understanding of the system, careful mocking, and ongoing maintenance as the code evolves. It’s the kind of work that doesn’t benefit much from AI assistance because the understanding is the hard part, not the typing.
But how can we be sure that AI-generated code actually works?
This document is not a recommendation for how we should start testing at Vantage. Its intention is to introduce a concept that has been surprisingly effective in the cases where I’ve applied it.
Traditional Tests Should Be Written by a Human
There’s a temptation to have AI write the tests too, but that temptation should be resisted. Tests define requirements.
When AI writes both the implementation and the tests, you get a closed loop — the tests validate the AI’s interpretation of the requirements, not the actual requirements. Bugs become features because the test says so.
Writing tests forces humans to think critically about edge cases, failure modes, and what “correct” actually means. This is one of the jobs humans will be needed for in the foreseeable future: defining the spec, setting the bar, deciding what matters.
AI can help you run the tests and even suggest what to test, but the human must own the definition of done.
A Different Approach: QA Guidelines
Leverage AI to verify software works — the same way a human would.
It Feels Wrong. But It Works.
The idea sounds wrong at first — and then you see the results. The objections are real:
- Non-deterministic — An LLM interprets the QA.md and decides how to run checks
- Expensive — Each QA run costs API tokens
- Slow — Agent reasoning adds latency compared to unit tests

And so are the benefits:
- Low maintenance — QA.md files describe what to verify, not how to mock
- Readable — Anyone can understand and update the test plan
- Flexible — The agent handles environment variance and judgment calls
- Catches real bugs — Tests run against actual built artifacts
The insight: for tools that get spun up quickly and may not live long, the ROI on traditional test infrastructure is poor. A QA markdown file takes 10 minutes to write and can catch most regressions.
How It Works
QA guidelines are built upon two components: QA.md files that describe what to verify, and a QA Agent that runs them.
QA.md Files
Each component — an app, service, tool, CLI — gets a QA.md file that acts as an executable spec:
- Prerequisites — defines the MCP tools and agent skills necessary for QA
- Steps — numbered test steps, each explaining how to verify a piece of the component
- Completion criteria — requirements that must be met for QA to pass
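A minimal QA.md for a hypothetical CLI tool might look like this (every name and command here is illustrative, not a real component):

```markdown
# QA: example-cli

## Prerequisites
- The `example-cli` binary is built (`make build`)

## Steps
1. Run `example-cli --help` and confirm the usage text lists every subcommand.
2. Run `example-cli fetch --id 123` and confirm it prints a JSON object with an `id` field.
3. Run `example-cli fetch --id bogus` and confirm it exits non-zero with a readable error.

## Completion criteria
- All steps pass
- No step produced an unhandled stack trace
```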
The user kicks it off. A natural language request like “QA the datadog mcp” is all it takes; the agent takes over from there.
The orchestrator discovers components. qa.sh scans the repo for QA.md files. It accepts --strict mode (fail on any issue) and --component=PATH to target specific components.
Discovery finds all testable components. Every directory with a QA.md file is identified as a component that can be QA’d.
QA runs in parallel. Each component gets its own agent that reads the QA.md, executes each step, and records pass/fail results. Components run concurrently for speed.
Results aggregate into a report. Pass/fail per component, timestamped, with branch info. Reports track QA state over time — you can see exactly when something broke.
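The flow above can be sketched in shell. Everything here is an illustrative assumption, not the real qa.sh: `run_qa_agent` is a placeholder for invoking the actual QA agent, and the flag handling mirrors the `--strict` and `--component=PATH` options described above.

```shell
#!/usr/bin/env bash
# Illustrative sketch of a qa.sh orchestrator (assumed names throughout).

# Discovery: every directory containing a QA.md is a testable component.
discover_components() {
  local root="${1:-.}"
  find "$root" -name QA.md -exec dirname {} \; | sort
}

# Placeholder for the real QA agent, which would read the component's
# QA.md, execute each step, and print PASS or FAIL for the component.
run_qa_agent() {
  echo "PASS $1"
}

# Fan out one agent per component, then aggregate a timestamped report.
qa_run() {
  local root="${1:-.}" strict="${2:-0}"
  local tmp i=0 rc=0
  tmp="$(mktemp -d)"
  while IFS= read -r dir; do
    run_qa_agent "$dir" > "$tmp/$i" &   # components run concurrently
    i=$((i + 1))
  done < <(discover_components "$root")
  wait
  # Report header carries timestamp and branch info for tracking over time.
  echo "# qa report $(date +%Y-%m-%dT%H:%M:%S) branch=$(git rev-parse --abbrev-ref HEAD 2>/dev/null || echo unknown)"
  cat "$tmp"/* 2>/dev/null
  # --strict: any FAIL line fails the whole run.
  if [ "$strict" = 1 ] && grep -q '^FAIL' "$tmp"/* 2>/dev/null; then
    rc=1
  fi
  rm -rf "$tmp"
  return "$rc"
}

# CLI entry; a real script would end with `main "$@"`.
main() {
  local strict=0 root=.
  for arg in "$@"; do
    case "$arg" in
      --strict)      strict=1 ;;
      --component=*) root="${arg#--component=}" ;;
    esac
  done
  qa_run "$root" "$strict"
}
```

Writing per-component results to separate temp files before concatenating keeps concurrent agents from interleaving output in the final report.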
Why It Works
- Component authors verify their own QA steps work (deterministically, once)
- The agent handles the tedium of re-running those steps later
- Reports track QA state over time by branch and timestamp
For web apps, QA files should indicate that the Playwright MCP tool is required. The agent will use browser automation to verify UI behavior.
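For example, a web app's QA.md might declare the dependency and lean on browser steps like these (component and steps are hypothetical):

```markdown
## Prerequisites
- Playwright MCP tool (required for browser automation)

## Steps
1. Open the app's login page and confirm the form renders.
2. Submit invalid credentials and confirm an error message appears.
```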