How testers should evaluate agent output

When a team starts using AI agents in earnest, a new task shows up that wasn’t there before - verifying what the agent produced. For a tester that’s actually good news, because critical review of output is something we do every day anyway. What changes is the source and the failure patterns hiding behind that source.

This post lays out how I approach that evaluation systematically. The material under review varies: a test scenario generated by an agent, a changelog analysis, a proposed release readiness checklist, generated API documentation. The evaluation mechanism is the same - only the specifics on the page in front of me change.

One disclaimer upfront to avoid confusion. The goal isn’t to “catch the AI lying”. The goal is to have a stable set of questions I ask of every output before it moves further - to the team, to the client, or into the repo.

Five dimensions of evaluation

In practice I look at five dimensions. Each answers a different question, and each catches a different class of problem.

Completeness

The first dimension is simple: does the output cover what it was supposed to cover. An agent asked for test scenarios for a new coupon feature can produce eight excellent scenarios - and alongside them, four silent omissions. The gaps won’t surface by staring at what was written. They surface when you compare against what should be there.

What works best for me is writing down the list of areas before calling the agent. For scenarios that list includes happy path, main negative paths, edge cases, validation errors, interactions with other modules, offline behaviour and non-functional requirements. When the result comes back, I check it against the list item by item.
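The item-by-item comparison can be sketched in a few lines of Python. This is only an illustration: the area names are shortened from the list above, and the scenario titles and their tags are hypothetical, assigned by hand while reading the output.

```python
# Areas written down BEFORE calling the agent (shortened from the list above).
EXPECTED_AREAS = [
    "happy path",
    "negative path",
    "edge case",
    "validation error",
    "module interaction",
    "offline behaviour",
    "non-functional",
]


def missing_areas(scenarios, expected=EXPECTED_AREAS):
    """Return the expected areas that no scenario is tagged with, in list order."""
    covered = {area for scenario in scenarios for area in scenario["areas"]}
    return [area for area in expected if area not in covered]


# Hypothetical agent output, tagged by the reviewer.
scenarios = [
    {"title": "valid coupon lowers the total", "areas": ["happy path"]},
    {"title": "expired coupon is rejected", "areas": ["negative path", "validation error"]},
    {"title": "coupon at exactly the minimum order value", "areas": ["edge case"]},
]

print(missing_areas(scenarios))
# ['module interaction', 'offline behaviour', 'non-functional']
```

The point of the sketch is the direction of the comparison: it starts from what should be there, not from what the agent wrote.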

Worth noting one consistent bias: agents lean hard toward positive paths. If you don’t explicitly ask for an error class, it will almost always be skipped.

Factual correctness

The most obvious dimension, and paradoxically the most neglected during review. The mechanism is simple: the output sounds credible, so we stop questioning the details.

Three approaches work for me. First - cross-check against the code. If the agent describes a function’s behaviour, I compare three randomly chosen claims with the implementation; if all three hold, I provisionally trust the rest. Second - cross-check with the spec or documentation, especially when the agent refers to requirements. Good tools cite their sources; when they don’t, I stay alert. Third - the standalone-claim test. I pick one sentence from the output and ask the agent for its source or a reconstruction of the reasoning. If no good answer comes back, that’s a signal that the rest needs closer verification too.
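To keep the spot-check honest, I find it helps to let a random number generator pick the claims rather than my own eye, which drifts toward the easy ones. A minimal sketch, assuming the claims have already been copied out of the output by hand; the claim texts and the function name are made up for illustration.

```python
import random


def pick_claims(claims, k=3, seed=None):
    """Pick up to k distinct claims at random for manual verification."""
    rng = random.Random(seed)  # a fixed seed makes the pick reproducible
    return rng.sample(claims, min(k, len(claims)))


# Hypothetical claims extracted from an agent's description of a function.
claims = [
    "returns None when the coupon code is unknown",
    "raises ValueError for a negative amount",
    "rounds the discount to two decimal places",
    "ignores case in the coupon code",
    "logs every rejected coupon",
]

for claim in pick_claims(claims):
    print("verify against the implementation:", claim)
```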

The most dangerous category is what I call “precise but false”. A statement like “endpoint /api/v2/discounts accepts a max_uses field” sounds specific and authoritative, but sometimes it’s simply invented. The more detailed a technical claim looks, the more cautiously I take it.
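Part of this check can be mechanised when a list of real endpoints is at hand. The sketch below is an assumption-heavy illustration: the contents of `KNOWN_ENDPOINTS` are hypothetical and would in practice be copied from the actual spec, and the regular expression only recognises paths that start with `/api/`.

```python
import re

# Hypothetical: endpoints copied by hand from the real API spec.
KNOWN_ENDPOINTS = {"/api/v2/coupons", "/api/v2/orders"}

# Matches paths such as /api/v2/discounts or /api/v2/orders/{id}.
ENDPOINT_PATTERN = re.compile(r"/api/[\w/{}-]+")


def unknown_endpoints(text, known=KNOWN_ENDPOINTS):
    """Return endpoint paths mentioned in the text that are not in the spec."""
    found = set(ENDPOINT_PATTERN.findall(text))
    return sorted(found - known)


agent_output = (
    "Endpoint /api/v2/discounts accepts a max_uses field. "
    "The total is then read from /api/v2/orders."
)

print(unknown_endpoints(agent_output))
# ['/api/v2/discounts']
```

A hit here doesn’t prove the claim is invented, since the spec copy may be stale, but it tells me which claim to verify by hand first.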

Domain alignment

The third dimension is about the conventions of your project, your team, your product. Agents routinely miss this, because they simply don’t know those conventions. That knowledge isn’t in their training data - unless you explicitly supply it through AGENTS.md, documentation or examples.

In practice I check four things. Naming - does the scenario follow your convention, for example should ... when ... instead of a descriptive sentence. Selectors and identifiers - did the agent use data-testid rather than CSS classes, if that’s the rule you agreed on. Product terminology, especially where a distinction matters commercially (“user” vs “customer” vs “merchant”). Structural conventions - where the file lives, what the header looks like, which imports belong there.

Skipping this dimension leads to output that is technically correct but doesn’t fit the team. It comes back in review even when everything else is fine.
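Two of the four checks, naming and selectors, are mechanical enough to sketch in code. The example assumes the team convention really is should ... when ... and data-testid; the function name and sample values are hypothetical, and terminology or file structure would still need a human eye.

```python
import re

# Assumed team convention: test names read "should <outcome> when <condition>".
NAME_PATTERN = re.compile(r"^should .+ when .+$")


def convention_issues(test_name, selectors):
    """Return a list of human-readable convention violations."""
    issues = []
    if not NAME_PATTERN.match(test_name):
        issues.append(f"name does not follow 'should ... when ...': {test_name!r}")
    for selector in selectors:
        if "data-testid" not in selector:
            issues.append(f"selector does not use data-testid: {selector!r}")
    return issues


# Hypothetical test generated by an agent.
for issue in convention_issues("Coupon rejection test", [".btn-primary"]):
    print(issue)
```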

Traceability to sources

The fourth dimension is critical for anything the agent generates on the basis of evidence - log analyses, bug history, documentation. Without traceability the reviewer has no way to verify correctness.

Good output points to concrete sources: a ticket ID, a log line, a file path, a commit hash. Links are clickable. Where version or date matters, it’s stated explicitly. Bad output traffics in phrases like “our logs show…” without specifying which; “the docs state…” without a reference; “in recent commits…” without hashes.

It sounds strict, and it is. Without traceability we can’t tell an evidence-based conclusion from a hallucination, and in QA that difference costs visibly more than the extra fifteen seconds spent adding a citation.
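A rough first pass for this dimension is to flag sentences that use an evidence phrase but carry nothing that looks like a reference. The sketch below is deliberately crude, and its patterns are assumptions: the ticket format (`QA-1423` style) is hypothetical, and the commit-hash pattern will also match ordinary words made only of the letters a-f.

```python
import re

EVIDENCE_PHRASES = ("our logs show", "the docs state", "in recent commits", "our data shows")

REFERENCE_PATTERNS = [
    re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b"),  # ticket ID such as QA-1423 (assumed format)
    re.compile(r"\b[0-9a-f]{7,40}\b"),      # abbreviated or full commit hash
    re.compile(r"https?://\S+"),            # link
    re.compile(r"\b[\w./-]+\.\w+:\d+\b"),   # file with line number, e.g. checkout.log:212
]


def untraceable_claims(sentences):
    """Return sentences that make an evidence claim without any reference."""
    flagged = []
    for sentence in sentences:
        lowered = sentence.lower()
        makes_evidence_claim = any(phrase in lowered for phrase in EVIDENCE_PHRASES)
        has_reference = any(p.search(sentence) for p in REFERENCE_PATTERNS)
        if makes_evidence_claim and not has_reference:
            flagged.append(sentence)
    return flagged


# Hypothetical sentences from an agent's log analysis.
sentences = [
    "Our logs show timeouts in checkout.",
    "Our logs show timeouts in checkout (checkout.log:212).",
    "In recent commits the retry logic changed (3f2a9c1).",
    "The docs state the limit is 5 uses.",
]

for sentence in untraceable_claims(sentences):
    print("needs a source:", sentence)
```

It only tells me a reference is present, not that the reference is real or current, so opening the cited source stays a manual step.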

The “pretty nonsense” risk

The last dimension is meta - it’s about the reviewer’s self-awareness. Well-phrased, neatly structured, stylistically consistent text creates an illusion of correctness. After two hours of review, a tester starts trusting form rather than content. That’s the moment review quality starts sliding quietly.

The antidote is boring but effective: I pick random fragments and check them aggressively. If a random fragment passes three deep checks, I treat the rest as provisionally trustworthy. If it breaks, the rest needs deeper verification, not shallow acceptance.

The second mechanism is simple counting. How many times in my review did I say “looks reasonable”? More than twice for a single output means I’m reviewing form, not content, and I need to stop and reset.
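If review notes are kept as text, the counting rule is trivial to automate. A minimal sketch, assuming one list of notes per reviewed output; the function name and the sample notes are made up.

```python
def form_over_content(review_notes, phrase="looks reasonable", limit=2):
    """Return True when the phrase appears more than `limit` times in the notes."""
    count = sum(note.lower().count(phrase) for note in review_notes)
    return count > limit


# Hypothetical notes from a single review.
notes = [
    "Scenario 1 looks reasonable.",
    "Scenario 2: checked against the spec, field name is wrong.",
    "Scenario 3 looks reasonable.",
    "Scenario 4 looks reasonable too.",
]

if form_over_content(notes):
    print("Stop and reset: reviewing form, not content.")
```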

A practical review checklist

The five dimensions compose into a fifteen-item checklist I keep as a template.

Completeness

Output covers all areas from the list I wrote before calling the agent.
No obvious missing scenario classes or threads.
Scope matches what I asked for.

Correctness

Spot-check three claims against code or spec.
No detailed-but-unverifiable facts (endpoint names, fields, constants).
Numeric values, where present, have a source.

Domain alignment

Naming matches the project’s conventions.
Selectors and file structure match the repo.
Product terminology is consistent.

Traceability

Every evidence-based claim has a citation.
Cited sources are openable and current.
No “our data shows…” without a link.

Pretty nonsense

Form didn’t mask missing content - I did a random sample.
I didn’t accept anything “because it looks reasonable”.
I hand-verified the three most precise claims.

Fifteen points and fifteen minutes, once you do it regularly. The first passes are longer because you still need to learn where output typically breaks.
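The template can also live as plain data, so that the items still open are listed automatically. The sketch below copies the fifteen items from the checklist above; the `open_items` helper and the way items get ticked are assumptions for illustration, not a prescribed tool.

```python
CHECKLIST = {
    "Completeness": [
        "Output covers all areas from the list I wrote before calling the agent.",
        "No obvious missing scenario classes or threads.",
        "Scope matches what I asked for.",
    ],
    "Correctness": [
        "Spot-check three claims against code or spec.",
        "No detailed-but-unverifiable facts (endpoint names, fields, constants).",
        "Numeric values, where present, have a source.",
    ],
    "Domain alignment": [
        "Naming matches the project's conventions.",
        "Selectors and file structure match the repo.",
        "Product terminology is consistent.",
    ],
    "Traceability": [
        "Every evidence-based claim has a citation.",
        "Cited sources are openable and current.",
        'No "our data shows..." without a link.',
    ],
    "Pretty nonsense": [
        "Form didn't mask missing content - I did a random sample.",
        "I didn't accept anything \"because it looks reasonable\".",
        "I hand-verified the three most precise claims.",
    ],
}


def open_items(ticked):
    """Return (dimension, item) pairs that have not been ticked yet."""
    return [
        (dimension, item)
        for dimension, items in CHECKLIST.items()
        for item in items
        if item not in ticked
    ]


# Example: the reviewer has only finished the completeness pass.
ticked = set(CHECKLIST["Completeness"])
for dimension, item in open_items(ticked):
    print(f"[ ] {dimension}: {item}")
```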

Situations where I reject without discussion

There are cases I don’t spend time reviewing carefully - I just send the output back. I treat them as red flags.

No sources for evidence claims. Output like “last month we had flaky tests in area X”, without naming which tests. Nothing to debate - I send it back and ask for specifics.

Invented API, field or file names. One such instance is enough to make the whole output suspect. I send it back and regenerate with an explicit citation requirement.

Internal contradictions. “The test should verify that the coupon is single-use” - and three lines later “…after multiple uses the coupon still works”. Both lines might make sense in isolation, but the agent didn’t notice the conflict. Send it back.

Non-compliance with explicit instructions. I asked for BDD-style scenarios and got an imperative list of steps. I don’t fix it by hand - I send it back. Otherwise the deviation never gets corrected at the source and shows up again in the next output.

Output that’s too generic. “The system should be reliable” where I asked for concrete scenarios. Send it back with a request for specificity.

The economics are simple: sending it back usually costs thirty seconds of prompting, while a manual fix costs thirty minutes of work. Sending it back is the choice that protects quality.

Scaling review in a larger team

At some point doing this yourself stops scaling - there are too many outputs and testers have other work too. Two practices have worked for me.

The first is AI reviewing AI as a pre-filter. A second model walks the same checklist and flags fragments that need human attention. It doesn’t eliminate human review - it removes the routine layer. When the pre-filter says “overall looks coherent, but three evidence claims have no citation”, the reviewer knows exactly where to start.
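The plumbing around such a pre-filter is small. The sketch below leaves out the model call itself, since that depends on whichever provider the team uses; it only builds the prompt and parses the reply. The reply format (`<number>: PASS` or `<number>: FLAG - <reason>`) is an assumption I impose through the prompt, and both function names are hypothetical.

```python
def build_prefilter_prompt(output_text, checklist_items):
    """Build a prompt asking a second model to walk the checklist."""
    lines = [
        "Review the text below against each checklist item.",
        "Answer one line per item: '<number>: PASS' or '<number>: FLAG - <reason>'.",
        "",
        "Checklist:",
    ]
    lines += [f"{i}. {item}" for i, item in enumerate(checklist_items, start=1)]
    lines += ["", "Text under review:", output_text]
    return "\n".join(lines)


def parse_flags(model_reply):
    """Return {item number: reason} for every line the model flagged."""
    flags = {}
    for line in model_reply.splitlines():
        number, _, verdict = line.partition(":")
        verdict = verdict.strip()
        if number.strip().isdigit() and verdict.upper().startswith("FLAG"):
            flags[int(number)] = verdict[4:].lstrip(" -")
    return flags


# Hypothetical reply from the second model.
reply = "1: PASS\n2: FLAG - three evidence claims have no citation\n3: PASS"
print(parse_flags(reply))
# {2: 'three evidence claims have no citation'}
```

Lines that don’t follow the format are ignored rather than guessed at, so a malformed reply shows up as missing verdicts for the human reviewer to notice.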

The second is feeding your own review notes back into the process. I keep a plain log after each review: “third time I’ve flagged a missing citation”, “another made-up API field”, “another positive scenario without a negative counterpart”. After a few weeks I have a map of the weak spots of a particular agent or a particular prompt. Some of those observations go into the system prompt as an instruction, some into AGENTS.md, some stay permanently on the checklist.
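The map of weak spots falls out of a simple tally. A minimal sketch, assuming each log entry has been reduced to a short tag; the tags and the threshold of two are illustrative.

```python
from collections import Counter


def recurring_issues(log_entries, min_count=2):
    """Return (tag, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(log_entries)
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


# Hypothetical tags collected over a few weeks of reviews.
review_log = [
    "missing citation",
    "made-up API field",
    "missing citation",
    "no negative counterpart",
    "missing citation",
    "made-up API field",
]

for tag, count in recurring_issues(review_log):
    print(f"{count}x {tag}")
# 3x missing citation
# 2x made-up API field
```

Anything that clears the threshold is a candidate for the system prompt or AGENTS.md; one-off findings stay in the log.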

Evaluation quality grows with process maturity. The first reviews are long and demand focus. By the tenth, they’re fast, because you know where output typically breaks and where there’s no need to over-verify.

Closing

The five dimensions - completeness, factual correctness, domain alignment, traceability and the “pretty nonsense” risk - each catch a different class of error. Skipping any of them leaves a steady quality leak. The fifteen-point checklist fits into a fifteen-minute review. The five send-back situations save manual repair work. Scaling comes from an AI pre-filter plus systematically feeding your own observations back into the system prompt and AGENTS.md.

In the next post - a different thread: how to turn an article like this into a thirty-second video explainer embedded under the piece and reused across social and training.