POST · 2026.05.29

AI-augmented test failure triage:
a Claude Code agent that re-runs your specs and reads your diffs.

Most teams don't have a running-tests problem. CI runs them. The dashboard lights up red. What they have is a triage problem. Every failing spec is a question: flake, script bug, env down, stale test, or real regression? Here's how I taught an agent to answer that, accurately, without bias.

date: 2026.05.29 · reading: ~13 min · tags: qa, claude code, flake detection, playwright, pgvector, ci/cd

TL;DR

My first attempt was a cheap LLM (gpt-4o-mini) reading stack traces and PR file lists. The prompt produced well-formed output, but the verdicts were often wrong. Shallow signals are not real evidence.
The pivot: stop asking for a smarter model, give the model the tools a senior QA uses. Re-run the spec. Read the diff hunks. Respect author tags. Compare subsystems, not filenames.
Cheap pgvector substrate (~$0.06 one-time backfill, ~$0.01/month) handles the easy 30% via kNN. The Claude Code agent handles the ambiguous remainder.
Circuit breaker for env-wide failures: one report stands for the cluster instead of twenty redundant containers and a Slack storm.
Same docker-agent infrastructure as the previous post, different slash command, different scope.

In the previous post I described an autonomous QA agent that does deploy-time manual exploratory testing in a Docker container. This post is about the other half of the QA day: triaging the test failures your existing regression suite already produces.

If your team is at any kind of scale, you don't have a running-tests problem. CI runs them. Playwright reports the failures. The dashboard lights up red. What you have is a triage problem. Every failing spec is a question: is this flaky, was it me, is the environment down, did the test go stale, or is this a real regression? Answering that question accurately, fast, and without bias is the bottleneck.

I built an agent that does it. Same containerized-Claude-Code pattern from the last post, different slash command, different scope. This is how it works, why the obvious cheaper approach failed for me, and the specific decisions I'd repeat.

01 / the naive approach: ask a cheap LLM

My first attempt at automated triage was the boring, sensible one. For each failing spec:

Pull the stack trace from the database.
Pull the merged PR's changed-file list.
Send both to gpt-4o-mini with a structured prompt asking for a verdict (flake / script-bug / app-bug / environment / infrastructure) and a short paragraph of reasoning.
Render the result in the Slack notification.

It worked. Cost was about a dollar a month. Latency was a second per failure. The Slack message looked great.

The verdicts were frequently wrong in a way that's worth describing in detail, because it taught me what these classifiers actually do and don't see.

A real example, anonymized: a test failed during the data-prep step with a database CHECK constraint violation. The test author had literally written [INFRASTRUCTURE] in the error message, their own tagging convention to say "this isn't an app bug, my test setup hit a DB state I didn't anticipate." The merged PR for that deploy touched a controller named OrdersController.php, completely unrelated to the failing data-prep step (different namespace, different endpoint, no shared code).

The cheap classifier saw:

Failing spec is about "order creation".
PR changed OrdersController.php.
Filename has "Order" in it.
Verdict: app bug, likely caused by the merged PR.

The author tag, the namespace, the actual semantics of the code change: none of it factored in. The classifier did shallow filename matching and called it a day. A developer reading that Slack message would have spent thirty minutes investigating a PR that had nothing to do with the failure.

THE LESSON

This isn't a "use a smarter model" story. It's a "shallow signals are not real evidence" story. The structure of the prompt mattered far less than the structure of the inputs.

02 / the pivot: have the agent do real work

The insight that flipped this around was small. The reason a human QA engineer rarely gets this kind of triage wrong isn't that they're a smarter model. It's that they do different things. They:

Re-run the failing test, on the same environment, with the same data, and see if it reproduces.
Read the actual diff text (not just the file list, but the diff hunks) to see which methods and lines changed.
Cross-check the spec's subsystem against the diff's subsystem (admin endpoint vs. customer endpoint, write path vs. read path).
Notice the author's inline tags.
Check whether other specs in the same run also failed with the same error fingerprint (which would mean it's environmental, not specific to this spec).

None of those steps requires a smarter model. They require that the model have the tools to carry them out. Claude Code in a Docker container has those tools: shell, file reads, network access. So instead of giving a cheap LLM a static prompt, I gave Claude Code a slash command that does the real work.

03 / the architecture (same as before, different head)

The pattern is the one I described in the previous post. A small Node orchestrator listens for CI events. When a regression run completes with failing specs, it inserts queued rows into a failure_diagnoses table. A single in-process worker picks them up, spawns a Docker container per failing spec (capped at three in parallel), and posts the report to Slack when each container exits.
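The "capped at three in parallel" part is the only subtle bit of that worker. The orchestrator described here is Node; the sketch below shows the same bounded-concurrency idea in Python with asyncio, purely as an illustration. The FAILING_SPEC environment variable and the function names are hypothetical, and the volume mounts are left out.

```python
import asyncio
from typing import Awaitable, Callable, Iterable

MAX_PARALLEL = 3  # the cap mentioned in the text


async def run_container(spec_path: str) -> int:
    """Spawn one diagnose container for a failing spec; return its exit code.

    FAILING_SPEC is a hypothetical variable name, and mounts are omitted.
    """
    proc = await asyncio.create_subprocess_exec(
        "docker", "run", "--rm",
        "-e", f"FAILING_SPEC={spec_path}",
        "qa-diagnose:latest",
    )
    return await proc.wait()


async def drain_queue(
    specs: Iterable[str],
    runner: Callable[[str], Awaitable[int]] = run_container,
    max_parallel: int = MAX_PARALLEL,
) -> dict[str, int]:
    """Run `runner` for every queued spec, at most `max_parallel` at a time."""
    gate = asyncio.Semaphore(max_parallel)
    results: dict[str, int] = {}

    async def diagnose(spec: str) -> None:
        async with gate:  # waits here while three containers are already running
            results[spec] = await runner(spec)

    await asyncio.gather(*(diagnose(spec) for spec in specs))
    return results
```

Passing the runner as a parameter keeps the cap testable without Docker: swap in a coroutine that sleeps and count how many are active at once.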

The container itself is qa-diagnose:latest, same base image strategy as the manual-test agent, but pre-configured for failure triage. It has the test repo and the QA workbench mounted from /srv/repos. Inside, the entrypoint runs:

claude --print --dangerously-skip-permissions "/investigate-failure-headless"

The slash command's job is structured: re-run the failing test in the same environment, read the merged PR's diff text, query the orchestrator for historical context, classify, emit a structured report.

Per failing spec, the agent typically takes 60 to 90 seconds, most of it the Playwright re-run. Total wall-clock time per regression run: about one re-run's worth for the cheap cases (one or two failures, diagnosed in parallel), maybe two minutes for the noisy ones (three concurrent containers). Each spec's wall-clock budget is capped at ten minutes via timeout in the entrypoint, in case the re-run hangs.
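The cap itself lives in the container entrypoint. As an illustration of the same guard, here is a minimal Python sketch that runs a re-run command under a hard time budget and reports whether it passed, failed, or hung. The names RerunResult and rerun are hypothetical, and treating exit code 0 as "passed" is an assumption about the test runner.

```python
import subprocess
from dataclasses import dataclass

RERUN_TIMEOUT_S = 600  # ten-minute budget per spec


@dataclass
class RerunResult:
    outcome: str  # "PASSED", "FAILED", or "TIMED_OUT"
    output: str   # captured stdout plus stderr (partial stdout on timeout)


def rerun(cmd: list[str], timeout_s: float = RERUN_TIMEOUT_S) -> RerunResult:
    """Run the re-run command, killing it if it exceeds the time budget."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired as exc:
        partial = exc.stdout or ""
        if isinstance(partial, bytes):  # can be bytes even with text=True
            partial = partial.decode(errors="replace")
        return RerunResult("TIMED_OUT", partial)
    outcome = "PASSED" if proc.returncode == 0 else "FAILED"
    return RerunResult(outcome, proc.stdout + proc.stderr)
```

A call might look like rerun(["npx", "playwright", "test", "tests/orders.spec.ts"]), where the spec path is made up. A TIMED_OUT outcome is the "investigation failed to complete" case rather than a verdict about the spec.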

04 / the vector substrate

There's a piece of this that doesn't fit cleanly into the docker-agent story but is worth describing on its own, because it's what makes the rest cheap.

For each new failure ingested from CI, the orchestrator embeds the stack trace plus error message using text-embedding-3-small (1536 dimensions) and stores the vector in a pgvector column on the failures table. An HNSW index makes cosine-similarity lookups fast.

The substrate cost a one-time six cents to backfill ~14,000 historical Playwright failures, plus negligible ongoing cost (~$0.01/month for new failures). What it buys you:

Semantic clustering of failures. "Show me other failures that look like this one" is a vector query, not a string match. Stack traces with different dynamic IDs cluster correctly.
Cheap nearest-neighbor classification. Once a small seed set of failures is labeled (you can mine a surprising number of labels from existing data: test authors who used inline tags like [INFRASTRUCTURE], your investigation notes from past triages, your "auto-closed: re-run passed" notes), a kNN classifier on those labels gets ~70% accuracy for free, with no AI call per query.
A skip signal. If the kNN classifier confidently labels a new failure as flake based on five nearest neighbors that were all flakes, the docker agent doesn't need to fire. The cheap path handles the easy cases. The expensive path handles the ambiguous ones.
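The classification and skip logic in those three points is small enough to sketch. In the setup described here the neighbour lookup would be a pgvector query (its `<=>` operator is cosine distance); the Python below is an in-memory stand-in so the logic can be read on its own. The function names and the "all k neighbours agree" reading of "confidently" are assumptions.

```python
import math
from collections import Counter
from typing import Optional, Sequence

Vector = Sequence[float]
Labeled = Sequence[tuple[Vector, str]]  # (embedding, verdict label)


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_label(query: Vector, labeled: Labeled, k: int = 5) -> tuple[Optional[str], float]:
    """Return (majority label, share of the k nearest neighbours that agree)."""
    if not labeled:
        return None, 0.0
    nearest = sorted(
        labeled, key=lambda item: cosine_similarity(query, item[0]), reverse=True
    )[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)


def should_skip_agent(query: Vector, labeled: Labeled, k: int = 5) -> bool:
    """Skip the docker agent only when k neighbours exist and all are flakes."""
    if len(labeled) < k:
        return False
    label, agreement = knn_label(query, labeled, k)
    return label == "flake" and agreement == 1.0
```

Requiring unanimous agreement is deliberately conservative: a wrong skip hides a real failure, while a missed skip only costs one extra container run.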

This was the most "infrastructure" piece of the work, and it pays for itself by routing the easy ~30% of failures away from the AI agent entirely.

05 / the slash command's decision rules

The /investigate-failure-headless slash command isn't a thousand-word prompt. It's a tight set of decision rules:

Step 1. Initial hypothesis from static signals
  - Read the spec file. Know what it exercises.
  - Scan the error message for author tags ([INFRASTRUCTURE],
    [FUNCTIONAL TEST], [FLAKY]). Author tags are NEAR-AUTHORITATIVE.
    Override only with strong contrary evidence.
  - Note the failing assertion and line number.

Step 2. Re-run the spec against the same environment
  Capture full output. Record PASSED / FAILED (same error) /
  FAILED (different error).

Step 3. Branch on the result
  - If PASSED on re-run, confirmed transient. Verdict: flake.
    Skip diff investigation.
  - If FAILED, continue to Step 4.

Step 4. Inspect the actual deploy diff (only if re-run reproduced)
  gh pr diff <pr-number> --repo <owner>/<repo>
  For each changed file, identify which class/method/route it
  modifies. Same FILENAME does NOT mean same CODE PATH.
  Compare against what the failing spec exercises.

Step 5. Classify
  flake | script_bug | app_bug | environment | infrastructure | unknown

Step 6. Emit ONE-LINE SUMMARY followed by structured report.
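Steps 1 through 3 reduce to a small amount of control flow, sketched below in Python. In the real system these rules are prompt text the agent follows, not code, so this is only a way to read them precisely. The tag-to-verdict mapping is an assumption (no verdict is given for [FUNCTIONAL TEST], so the sketch leaves it unmapped), and the function names are hypothetical.

```python
import re
from typing import Optional

KNOWN_TAGS = ("INFRASTRUCTURE", "FUNCTIONAL TEST", "FLAKY")
TAG_PATTERN = re.compile(
    r"\[(" + "|".join(re.escape(tag) for tag in KNOWN_TAGS) + r")\]"
)

# Assumed mapping from author tag to initial hypothesis.
TAG_HYPOTHESIS = {"INFRASTRUCTURE": "infrastructure", "FLAKY": "flake"}


def author_tags(error_message: str) -> list[str]:
    """Step 1: collect the author's inline tags, in order of appearance."""
    return TAG_PATTERN.findall(error_message)


def triage_plan(rerun_outcome: str, error_message: str) -> tuple[Optional[str], bool]:
    """Return (working hypothesis, whether to inspect the deploy diff).

    Step 3: a passing re-run settles it as a flake and skips the diff.
    Otherwise the author tag, if any, is the near-authoritative starting
    point that Step 4 may override only with strong contrary evidence.
    """
    if rerun_outcome == "PASSED":
        return "flake", False
    for tag in author_tags(error_message):
        if tag in TAG_HYPOTHESIS:
            return TAG_HYPOTHESIS[tag], True
    return None, True
```

Note that a failed re-run always proceeds to the diff, even with an author tag present: the tag sets the prior, and the diff is where contrary evidence would have to come from.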

Three clauses do most of the work:

"Author tags are NEAR-AUTHORITATIVE." If the test author wrote [INFRASTRUCTURE] in their error message, they almost certainly know more about the failure mode than the agent does. Override only with strong contrary evidence. This single rule eliminated most of the false-positive "app bug" verdicts the cheap classifier produced.
"Same FILENAME does NOT mean same CODE PATH." Multiple controllers can share a filename (e.g., OrdersController.php exists in both admin and customer namespaces). The agent has to read the patch hunks and identify the actual class and method, not pattern-match on basename.
"If PASSED on re-run, skip diff investigation." Don't waste time investigating a PR that's almost certainly innocent. The re-run is the cheapest definitive signal you have.

06 / the circuit breaker

There's a failure mode the per-spec design has to handle: the environment is down. A regression run completes with twenty of thirty specs failing, all with similar error fingerprints (HTTP 502, socket hang up, gateway timeout). Spawning twenty diagnose containers in parallel, each one trying to re-run a spec against a dead environment, is wasteful, slow, and produces twenty redundant "the env is down" reports.

The orchestrator has a circuit breaker that checks the run-level failure rate. If more than ~20% of specs in the regression run failed, it switches strategy:

Pick the most-representative failing spec (the one whose trace_hash cluster has the most matches across the run: the dominant error pattern).
Spawn one diagnose container for that spec only.
The single report stands for the whole cluster. Slack message says "the env is down" once, not twenty times.

The trade-off is real. If twenty specs fail with twenty different errors, the circuit breaker still picks one and assumes the others are related. In practice that almost never happens (when a regression run has twenty failures, they're almost always env-correlated). The cost of being wrong is small (one developer manually checks two or three other specs). The cost of not having the circuit breaker is a Slack channel storm and a budget spike.
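A minimal Python sketch of that decision, assuming each failure row carries its spec path and a trace_hash. The Failure type and function name are hypothetical, and ties between equally large clusters are broken by whichever hash appears first, which is a choice of this sketch.

```python
from collections import Counter
from typing import NamedTuple

FAILURE_RATE_THRESHOLD = 0.20  # "more than ~20%" of specs in the run


class Failure(NamedTuple):
    spec: str
    trace_hash: str


def specs_to_diagnose(
    failures: list[Failure],
    total_specs: int,
    threshold: float = FAILURE_RATE_THRESHOLD,
) -> list[str]:
    """Return the specs that should each get a diagnose container."""
    if not failures or total_specs <= 0:
        return []
    if len(failures) / total_specs <= threshold:
        # Normal path: one container per failing spec.
        return [failure.spec for failure in failures]
    # Breaker tripped: find the dominant error pattern across the run
    # and diagnose only the first spec that exhibits it.
    dominant_hash, _ = Counter(f.trace_hash for f in failures).most_common(1)[0]
    representative = next(f for f in failures if f.trace_hash == dominant_hash)
    return [representative.spec]
```

The twenty-different-errors case from the paragraph above is visible here: every cluster has size one, so the "dominant" hash is simply the first one seen.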

07 / the Slack notification

This is where the system meets the human. I won't repeat the design from the last post, but two specifics for this use case:

One colored attachment per failing spec. Red for app bug (the developer needs to look). Amber for flake and script bug (caution, not urgent). Gray for infrastructure and environment (out of dev's hands).
@-mention only when an investigation failed to complete (timeouts, container errors). Successful investigations don't ping anyone. A run whose failures all turn out to be flakes pings no one. Real issues are loud.

The principle: make the silence meaningful. A flat unstyled "5 specs failed, see report" tells the developer nothing about urgency. The colored bars communicate priority before the developer reads a word.
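As a sketch of those two rules, the Python below builds a Slack message payload with one colored attachment per spec and a mention only when something did not finish. It assumes Slack's attachment `color` field and the `<@USERID>` mention syntax; the hex values, the Report type, the gray fallback for unknown or incomplete reports, and the wording of the text are all illustrative choices.

```python
from dataclasses import dataclass

# Only the red / amber / gray grouping comes from the text; hex values are illustrative.
VERDICT_COLORS = {
    "app_bug": "#d32f2f",         # red: the developer needs to look
    "flake": "#ffa000",           # amber: caution, not urgent
    "script_bug": "#ffa000",
    "infrastructure": "#9e9e9e",  # gray: out of dev's hands
    "environment": "#9e9e9e",
}
FALLBACK_COLOR = "#9e9e9e"  # assumption: unknown verdicts render gray


@dataclass
class Report:
    spec: str
    verdict: str
    summary: str
    completed: bool = True  # False on timeouts or container errors


def build_slack_message(reports: list[Report], owner_user_id: str) -> dict:
    """Build a message payload: one attachment per spec, ping only on incomplete runs."""
    attachments = [
        {
            "color": VERDICT_COLORS.get(report.verdict, FALLBACK_COLOR),
            "title": report.spec,
            "text": report.summary,
        }
        for report in reports
    ]
    incomplete = [report for report in reports if not report.completed]
    text = f"{len(reports)} failing spec(s) triaged."
    if incomplete:
        text = (
            f"<@{owner_user_id}> {len(incomplete)} investigation(s) "
            f"did not complete. {text}"
        )
    return {"text": text, "attachments": attachments}
```

Keeping the mention decision separate from the color mapping means a red app-bug bar stays visually loud without paging anyone.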

08 / the economics, revisited

For comparison with the cheap-classifier approach from the start of this post:

Criterion	gpt-4o-mini (naive)	Claude Code agent (what I run)
Cost per failure	~$0.001	flat-rate subscription, no marginal cost
Per-failure latency	< 1s	60 to 90s (Playwright re-run dominated)
Reads PR diff text?	no (just file list)	yes (gh pr diff)
Re-runs the spec?	no	yes
Respects author tags?	weakly	yes (explicit rule)
OrdersController-class case	wrong	correct
How flakes are identified	guesses from history	confirmed by passing re-run

The agent's accuracy is high because it does the work a senior QA does. That includes the cheapest and most decisive piece of work: re-running the spec. Most failures resolve themselves on re-run. The ones that don't are the ones worth a developer's time.

09 / what this unlocks

Three things, in order of how surprising they were to me:

Slack channel quiet again. When the agent runs first, the developer only sees the failures that matter. The "is this me?" reflex stops firing on every red CI run.
Pre-PR confidence. Some developers now wait for the agent's report before opening the PR for review. The agent's verdict is a first-pass smell test that catches obvious regressions cheaply.
Triage knowledge compounds. Every Claude Code investigation writes a structured verdict and reasoning to the database. Over time that's a corpus of "here's what this kind of failure usually means," queryable, diffable. The next iteration of the agent can read its own history.

10 / closing, and what's next

The two posts in this series describe two parallel uses of the same idea: containerized Claude Code agents as a workhorse for QA. Manual exploratory testing on every deploy. Failure triage on every regression. Both run on the same docker-agent infrastructure, both use the same slash-command-as-DSL pattern, both produce structured Slack reports, both bill against a flat-rate subscription.

THE TAKEAWAY

If you take one thing from this series, take this: the unlock isn't a smarter model. It's giving the model the tools to do the work a senior engineer does. A Claude Code agent in a Docker container, running a slash command, with shell and file-system and network access. That's a different capability than a chat tool. Most of the value in these two posts isn't the AI. It's the agent + tools + structured workflow + integrations. The AI is just the controller.

The next post in this series, if I write one, will probably be about the workbench repo as institutional knowledge: how I keep slash commands fresh, how I onboard a new QA engineer onto it, and how the version-controlled corpus is starting to look like documentation that maintains itself.