AI Makes Code Easy. Testing It Is Still Hard.
AI has made spinning up new tools trivially easy. Need an MCP server? A webhook handler? A CLI tool? You can have working code in minutes. The barrier to creating software has collapsed.
Traditional test infrastructure is hard. Writing reliable, deterministic test suites requires deep understanding of the system, careful mocking, and ongoing maintenance as the code evolves. It’s the kind of work that doesn’t benefit much from AI assistance because the understanding is the hard part, not the typing.
But how can we be sure that AI-generated code actually works?
This document is not a recommendation for how we should start testing at Vantage. Its intention is to introduce a concept that has been surprisingly effective in the cases where I’ve applied it.
Traditional Tests Should Be Written by a Human
There’s a temptation to have AI write the tests too, but that temptation should be resisted. Tests define requirements.
When AI writes both the implementation and the tests, you get a closed loop — the tests validate the AI’s interpretation of the requirements, not the actual requirements. Bugs become features because the test says so.
Writing tests forces humans to think critically about edge cases, failure modes, and what “correct” actually means. This is one of the jobs humans will be needed for in the foreseeable future: defining the spec, setting the bar, deciding what matters.
AI can help you run the tests and even suggest what to test, but the human must own the definition of done.
A Different Approach: QA Guidelines
Leverage AI to verify software works — the same way a human would.
It Feels Wrong. But It Works.
The idea sounds wrong at first — and then you see the results. The objections are real:
- Non-deterministic — An LLM interprets the QA.md and decides how to run checks
- Expensive — Each QA run costs API tokens
- Slow — Agent reasoning adds latency compared to unit tests

And so are the benefits:
- Low maintenance — QA.md files describe what to verify, not how to mock
- Readable — Anyone can understand and update the test plan
- Flexible — The agent handles environment variance and judgment calls
- Catches real bugs — Tests run against actual built artifacts
The insight: for tools that get spun up quickly and may not live long, the ROI on traditional test infrastructure is poor. A QA markdown file takes 10 minutes to write and can catch most regressions.
How It Works
QA guidelines are built upon two components: QA.md files that describe what to verify, and a QA Agent that runs them.
QA.md Files
Each component — an app, service, tool, CLI — gets a QA.md file that acts as an executable spec:
- Prerequisites — defines the MCP tools and agent skills necessary for QA
- Steps — numbered test steps, each explaining how to verify a piece of the component
- Completion criteria — requirements that must be met for QA to pass
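A minimal QA.md for a hypothetical CLI tool might look like this (every name and command here is illustrative, not a real component):

```markdown
# QA: example-cli

## Prerequisites
- The `example-cli` binary is built (`make build`)

## Steps
1. Run `example-cli --help` and confirm the usage text lists every subcommand.
2. Run `example-cli fetch --id 123` and confirm it prints a JSON object with an `id` field.
3. Run `example-cli fetch --id bogus` and confirm it exits non-zero with a readable error.

## Completion criteria
- All steps pass
- No step produced an unhandled stack trace
```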
The user kicks it off. A natural language request like “QA the datadog mcp” is all it takes; the agent takes over from there.
The orchestrator discovers components. qa.sh scans the repo for QA.md files. It accepts --strict mode (fail on any issue) and --component=PATH to target specific components.
Discovery finds all testable components. Every directory with a QA.md file is identified as a component that can be QA’d.
QA runs in parallel. Each component gets its own agent that reads the QA.md, executes each step, and records pass/fail results. Components run concurrently for speed.
Results aggregate into a report. Pass/fail per component, timestamped, with branch info. Reports track QA state over time — you can see exactly when something broke.
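The flow above can be sketched in shell. Everything here is an illustrative assumption, not the real qa.sh: `run_qa_agent` is a placeholder for invoking the actual QA agent, and the flag handling mirrors the `--strict` and `--component=PATH` options described above.

```shell
#!/usr/bin/env bash
# Illustrative sketch of a qa.sh orchestrator (assumed names throughout).

# Discovery: every directory containing a QA.md is a testable component.
discover_components() {
  local root="${1:-.}"
  find "$root" -name QA.md -exec dirname {} \; | sort
}

# Placeholder for the real QA agent, which would read the component's
# QA.md, execute each step, and print PASS or FAIL for the component.
run_qa_agent() {
  echo "PASS $1"
}

# Fan out one agent per component, then aggregate a timestamped report.
qa_run() {
  local root="${1:-.}" strict="${2:-0}"
  local tmp i=0 rc=0
  tmp="$(mktemp -d)"
  while IFS= read -r dir; do
    run_qa_agent "$dir" > "$tmp/$i" &   # components run concurrently
    i=$((i + 1))
  done < <(discover_components "$root")
  wait
  # Report header carries timestamp and branch info for tracking over time.
  echo "# qa report $(date +%Y-%m-%dT%H:%M:%S) branch=$(git rev-parse --abbrev-ref HEAD 2>/dev/null || echo unknown)"
  cat "$tmp"/* 2>/dev/null
  # --strict: any FAIL line fails the whole run.
  if [ "$strict" = 1 ] && grep -q '^FAIL' "$tmp"/* 2>/dev/null; then
    rc=1
  fi
  rm -rf "$tmp"
  return "$rc"
}

# CLI entry; a real script would end with `main "$@"`.
main() {
  local strict=0 root=.
  for arg in "$@"; do
    case "$arg" in
      --strict)      strict=1 ;;
      --component=*) root="${arg#--component=}" ;;
    esac
  done
  qa_run "$root" "$strict"
}
```

Writing per-component results to separate temp files before concatenating keeps concurrent agents from interleaving output in the final report.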
Why It Works
- Component authors verify their own QA steps work (deterministically, once)
- The agent handles the tedium of re-running those steps later
- Reports track QA state over time by branch and timestamp
For web apps, QA files should indicate that the Playwright MCP tool is required. The agent will use browser automation to verify UI behavior.
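For example, a web app's QA.md might declare the dependency and lean on browser steps like these (component and steps are hypothetical):

```markdown
## Prerequisites
- Playwright MCP tool (required for browser automation)

## Steps
1. Open the app's login page and confirm the form renders.
2. Submit invalid credentials and confirm an error message appears.
```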