POST · 2026.05.29

AI-augmented test failure triage:
a Claude Code agent that re-runs your specs and reads your diffs.

Most teams don't have a running-tests problem. CI runs them. The dashboard lights up red. What they have is a triage problem. Every failing spec is a question: flake, script bug, env down, stale test, or real regression? Here's how I taught an agent to answer that, accurately, without bias.

TL;DR

In the previous post I described an autonomous QA agent that does deploy-time manual exploratory testing in a Docker container. This post is about the other half of the QA day: triaging the test failures your existing regression suite already produces.

If your team is at any kind of scale, you don't have a running-tests problem. CI runs them. Playwright reports the failures. The dashboard lights up red. What you have is a triage problem. Every failing spec is a question: is this flaky, was it me, is the environment down, did the test go stale, or is this a real regression? Answering that question accurately, fast, and without bias is the bottleneck.

I built an agent that does it. Same containerized-Claude-Code pattern from the last post, different slash command, different scope. This is how it works, why the obvious cheaper approach failed for me, and the specific decisions I'd repeat.

01 / the naive approach: ask a cheap LLM

My first attempt at automated triage was the boring, sensible one. For each failing spec:

  1. Pull the stack trace from the database.
  2. Pull the merged PR's changed-file list.
  3. Send both to gpt-4o-mini with a structured prompt asking for a verdict (flake / script-bug / app-bug / environment / infrastructure) and a short paragraph of reasoning.
  4. Render the result in the Slack notification.
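
To make the weakness concrete, here is a minimal sketch of what the request in step 3 might have looked like. The type names, field names, and prompt wording are all hypothetical; only the two inputs (stack trace, changed-file list) and the verdict list come from the steps above.

```typescript
// Hypothetical names throughout: an illustration of the inputs, not the original prompt.
type Verdict = "flake" | "script-bug" | "app-bug" | "environment" | "infrastructure";

const VERDICTS: Verdict[] = ["flake", "script-bug", "app-bug", "environment", "infrastructure"];

interface TriageInput {
  specPath: string;       // the failing spec
  stackTrace: string;     // pulled from the failures database
  changedFiles: string[]; // file NAMES from the merged PR, no diff text
}

// Build the static prompt that would be sent to the cheap model.
function buildTriagePrompt(input: TriageInput): string {
  return [
    "Classify this test failure.",
    `Allowed verdicts: ${VERDICTS.join(" / ")}`,
    "Reply with one verdict and a short paragraph of reasoning.",
    "",
    `Spec: ${input.specPath}`,
    "Stack trace:",
    input.stackTrace,
    "",
    "Files changed in the merged PR:",
    ...input.changedFiles.map((file) => `- ${file}`),
  ].join("\n");
}
```

Everything the model learns about the deploy is a list of file names. Nothing in this payload tells it which code paths changed or whether the failure reproduces.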

It worked. Cost was about a dollar a month. Latency was a second per failure. The Slack message looked great.

The verdicts were frequently wrong in a way that's worth describing in detail, because it taught me what these classifiers actually do and don't see.

A real example, anonymized: a test failed during the data-prep step with a database CHECK CONSTRAINT violation. The test author had literally written [INFRASTRUCTURE] in the error message, their own tagging convention to say "this isn't an app bug, my test setup hit a DB state I didn't anticipate." The merged PR for that deploy touched a controller named OrdersController.php, completely unrelated to the failing data-prep step (different namespace, different endpoint, no shared code).

The cheap classifier saw:

The author tag, the namespace, the actual semantics of the code change: none of it factored in. The classifier did shallow filename matching and called it a day. A developer reading that Slack message would have spent thirty minutes investigating a PR that had nothing to do with the failure.

THE LESSON

This isn't a "use a smarter model" story. It's a "shallow signals are not real evidence" story. The structure of the prompt mattered far less than the structure of the inputs.

02 / the pivot: have the agent do real work

The insight that flipped this around was small. The reason a human QA engineer rarely gets this kind of triage wrong isn't that they're a smarter model. It's that they do different things. They:

None of those steps requires a smarter model. They require that the model have the tools to do them. Claude Code in a Docker container has those tools: shell, file reads, network access. So instead of giving a cheap LLM a static prompt, I gave Claude Code a slash command that does the real work.

03 / the architecture (same as before, different head)

The pattern is the one I described in the previous post. A small Node orchestrator listens for CI events. When a regression run completes with failing specs, it inserts queued rows into a failure_diagnoses table. A single in-process worker picks them up, spawns a Docker container per failing spec (capped at three in parallel), and posts the report to Slack when each container exits.
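
The "capped at three in parallel" part can be sketched as a small promise pool. This is an illustration under assumptions, not the orchestrator's actual code: runWithLimit and diagnoseSpec are hypothetical names, and the real worker would spawn a container where the stub returns a string.

```typescript
// Run `task` over `items` with at most `limit` tasks in flight at once.
// Results come back in the same order as `items`.
async function runWithLimit<T, R>(
  items: T[],
  limit: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane claims one item at a time until the queue is empty.
  // Claiming is safe without locks because JavaScript runs one callback at a time.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index]);
    }
  }

  const laneCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Hypothetical stand-in for "spawn a qa-diagnose container and wait for it to exit".
async function diagnoseSpec(specPath: string): Promise<string> {
  return `report for ${specPath}`;
}

// Usage: diagnose every failing spec, three containers at a time.
// runWithLimit(failingSpecs, 3, diagnoseSpec).then(postReportsToSlack);
```

In this sketch a rejected task rejects the whole batch. A production worker would more likely catch per-spec errors and record them on the queued row.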

The container itself is qa-diagnose:latest, same base image strategy as the manual-test agent, but pre-configured for failure triage. It has the test repo and the QA workbench mounted from /srv/repos. Inside, the entrypoint runs:

claude --print --dangerously-skip-permissions "/investigate-failure-headless"

The slash command's job is structured: re-run the failing test in the same environment, read the merged PR's diff text, query the orchestrator for historical context, classify, emit a structured report.

Per failing spec, the agent typically takes 60 to 90 seconds. Most of that is the Playwright re-run. Total wall-clock time per regression run: roughly a minute or so for the cheap cases (one or two failures, run in parallel), maybe two minutes for the noisy ones (three concurrent containers). Each spec is capped at ten minutes of wall-clock time via timeout in the entrypoint, in case the re-run hangs.

04 / the vector substrate

There's a piece of this that doesn't fit cleanly into the docker-agent story but is worth describing on its own, because it's what makes the rest cheap.

For each new failure ingested from CI, the orchestrator embeds the stack trace plus error message using text-embedding-3-small (1536 dimensions) and stores the vector in a pgvector column on the failures table. An HNSW index makes cosine-similarity lookups fast.
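
A rough sketch of the lookup side, assuming a failures table with an id column and an embedding vector(1536) column. The index name, column names, and helper functions are hypothetical. The hnsw index type, the vector_cosine_ops operator class, and the <=> cosine-distance operator are standard pgvector.

```typescript
// Assumed schema: failures(id, embedding vector(1536)). Names are illustrative.
const CREATE_INDEX_SQL = `
  CREATE INDEX IF NOT EXISTS failures_embedding_hnsw
  ON failures USING hnsw (embedding vector_cosine_ops)`;

// `<=>` returns cosine DISTANCE, so similarity is 1 minus that value.
const NEAREST_SQL = `
  SELECT id, 1 - (embedding <=> $1::vector) AS similarity
  FROM failures
  ORDER BY embedding <=> $1::vector
  LIMIT $2`;

// pgvector accepts a vector as a text literal such as "[0.1,0.2,0.3]".
function toVectorLiteral(embedding: number[], dims = 1536): string {
  if (embedding.length !== dims) {
    throw new Error(`expected ${dims} dimensions, got ${embedding.length}`);
  }
  if (!embedding.every((x) => Number.isFinite(x))) {
    throw new Error("embedding contains a non-finite value");
  }
  return `[${embedding.join(",")}]`;
}

// What the index approximates: cosine similarity between two vectors.
// Returns NaN if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

With a Postgres client, the query would be run with the literal from toVectorLiteral as $1 and the desired neighbor count as $2.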

Backfilling ~14,000 historical Playwright failures cost six cents, one time, and the ongoing cost for new failures is negligible (~$0.01/month). What it buys you:

This was the most "infrastructure" piece of the work, and it pays for itself by routing the easy ~30% of failures away from the AI agent entirely.

05 / the slash command's decision rules

The /investigate-failure-headless slash command isn't a thousand-word prompt. It's a tight set of decision rules:

Step 1. Initial hypothesis from static signals
  - Read the spec file. Know what it exercises.
  - Scan the error message for author tags ([INFRASTRUCTURE],
    [FUNCTIONAL TEST], [FLAKY]). Author tags are NEAR-AUTHORITATIVE.
    Override only with strong contrary evidence.
  - Note the failing assertion and line number.

Step 2. Re-run the spec against the same environment
  Capture full output. Record PASSED / FAILED (same error) /
  FAILED (different error).

Step 3. Branch on the result
  - If PASSED on re-run, confirmed transient. Verdict: flake.
    Skip diff investigation.
  - If FAILED, continue to Step 4.

Step 4. Inspect the actual deploy diff (only if re-run reproduced)
  gh pr diff <pr-number> --repo <owner>/<repo>
  For each changed file, identify which class/method/route it
  modifies. Same FILENAME does NOT mean same CODE PATH.
  Compare against what the failing spec exercises.

Step 5. Classify
  flake | script_bug | app_bug | environment | infrastructure | unknown

Step 6. Emit ONE-LINE SUMMARY followed by structured report.
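
The mechanical parts of Steps 1 to 3 can be sketched as plain functions. This is an illustration only: in the real system the agent follows the rules as prompt text, and every name here is hypothetical. The way two errors are compared (first line, trimmed) is an assumption too, since the rules do not say how "same error" is decided.

```typescript
type RerunOutcome = "PASSED" | "FAILED_SAME_ERROR" | "FAILED_DIFFERENT_ERROR";

const AUTHOR_TAGS = ["INFRASTRUCTURE", "FUNCTIONAL TEST", "FLAKY"] as const;
type AuthorTag = (typeof AUTHOR_TAGS)[number];

// Step 1: find author tags such as "[INFRASTRUCTURE]" in the error message.
function extractAuthorTags(errorMessage: string): AuthorTag[] {
  return AUTHOR_TAGS.filter((tag) => errorMessage.includes(`[${tag}]`));
}

// Assumption: two errors count as "the same" when their first lines match.
function firstLine(error: string): string {
  return error.split("\n")[0].trim();
}

// Step 2: record how the re-run compares with the original failure.
// `rerunError` is null when the re-run passed.
function rerunOutcome(originalError: string, rerunError: string | null): RerunOutcome {
  if (rerunError === null) return "PASSED";
  return firstLine(rerunError) === firstLine(originalError)
    ? "FAILED_SAME_ERROR"
    : "FAILED_DIFFERENT_ERROR";
}

type NextStep =
  | { kind: "verdict"; verdict: "flake" }
  | { kind: "inspect_diff"; authorTags: AuthorTag[] };

// Step 3: a passing re-run ends the investigation; anything else goes on to
// Step 4, carrying the author tags along as near-authoritative evidence.
function branchOnRerun(outcome: RerunOutcome, authorTags: AuthorTag[]): NextStep {
  if (outcome === "PASSED") return { kind: "verdict", verdict: "flake" };
  return { kind: "inspect_diff", authorTags };
}
```

Steps 4 and 5 are deliberately left out. Deciding whether a diff touches the code path a spec exercises is the judgment call the agent is there to make.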

The interesting clauses are the ones in the static-hypothesis step:

06 / the circuit breaker

There's a failure mode the per-spec design has to handle: the environment is down. A regression run completes with twenty of thirty specs failing, all with similar error fingerprints (HTTP 502, socket hang up, gateway timeout). Spawning twenty diagnose containers in parallel, each one trying to re-run a spec against a dead environment, is wasteful, slow, and produces twenty redundant "the env is down" reports.

The orchestrator has a circuit breaker that checks the run-level cluster size. If more than ~20% of specs in the regression run failed, it switches strategy:

  1. Pick the most-representative failing spec (the one whose trace_hash cluster has the most matches across the run: the dominant error pattern).
  2. Spawn one diagnose container for that spec only.
  3. The single report stands for the whole cluster. Slack message says "the env is down" once, not twenty times.
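
The selection logic above can be sketched as one function, assuming each failed spec carries a trace_hash. The FailedSpec shape and function name are hypothetical, and the 0.2 default stands in for the "~20%" threshold.

```typescript
interface FailedSpec {
  specPath: string;
  traceHash: string; // fingerprint of the error, shared by similar failures
}

// Returns the specs that should each get a diagnose container.
// Below the threshold: all of them. Above it: one representative
// from the largest trace_hash cluster.
function pickSpecsToDiagnose(
  failed: FailedSpec[],
  totalSpecs: number,
  threshold = 0.2,
): FailedSpec[] {
  if (totalSpecs === 0 || failed.length / totalSpecs <= threshold) {
    return failed;
  }

  // Count how many failures share each trace hash.
  const counts = new Map<string, number>();
  for (const spec of failed) {
    counts.set(spec.traceHash, (counts.get(spec.traceHash) ?? 0) + 1);
  }

  // Keep the first spec belonging to the dominant cluster.
  let best = failed[0];
  for (const spec of failed) {
    if ((counts.get(spec.traceHash) ?? 0) > (counts.get(best.traceHash) ?? 0)) {
      best = spec;
    }
  }
  return [best];
}
```

When two clusters tie for largest, this sketch keeps whichever appears first in the run. That is an arbitrary choice, not something the design above specifies.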

The trade-off is real. If twenty specs fail with twenty different errors, the circuit breaker still picks one and assumes the others are related. In practice that almost never happens (when a regression run has twenty failures, they're almost always env-correlated). The cost of being wrong is small (one developer manually checks two or three other specs). The cost of not having the circuit breaker is a Slack channel storm and a budget spike.

07 / the Slack notification

This is where the system meets the human. I won't repeat the design from the last post, but two specifics for this use case:

The principle: make the silence meaningful. A flat, unstyled "5 specs failed, see report" tells the developer nothing about urgency. The colored bars communicate priority before the developer reads a word.

08 / the economics, revisited

For comparison with the cheap-classifier approach from the start of this post:

                              gpt-4o-mini (naive)     Claude Code agent (what I run)
Cost per failure              ~$0.001                 flat-rate subscription, no marginal cost
Per-failure latency           < 1s                    60 to 90s (Playwright re-run dominated)
Reads PR diff text?           no (just file list)     yes (gh pr diff)
Re-runs the spec?             no                      yes
Respects author tags?         weakly                  yes (explicit rule)
OrdersController-class case   wrong                   correct
Accuracy on confirmed flakes  guesses from history    confirmed by passing re-run
The agent's accuracy is high because it does the work a senior QA engineer does. That includes the cheapest and most decisive piece of work: re-running the spec. Most failures resolve themselves on re-run. The ones that don't are the ones worth a developer's time.

09 / what this unlocks

Three things, in order of how surprising they were to me:

10 / closing, and what's next

The two posts in this series describe two parallel uses of the same idea: containerized Claude Code agents as a workhorse for QA. Manual exploratory testing on every deploy. Failure triage on every regression. Both run on the same docker-agent infrastructure, both use the same slash-command-as-DSL pattern, both produce structured Slack reports, both bill against a flat-rate subscription.

THE TAKEAWAY

If you take one thing from this series, take this: the unlock isn't a smarter model. It's giving the model the tools to do the work a senior engineer does. A Claude Code agent in a Docker container, running a slash command, with shell and file-system and network access. That's a different capability than a chat tool. Most of the value in these two posts isn't the AI. It's the agent + tools + structured workflow + integrations. The AI is just the controller.

The next post in this series, if I write one, will probably be about the workbench repo as institutional knowledge: how I keep slash commands fresh, how I onboard a new QA engineer onto it, and how the version-controlled corpus is starting to look like documentation that maintains itself.
