POST · 2026.06.03

From bug ticket to opened PR:
a Claude Code agent pipeline that ships its own fixes.

A nine-phase pipeline that takes a bug (ticket ID or free-form Slack message) and ends with an open pull request. Multi-agent committee review with deterministic rubrics. Worktree isolation. Mechanical PR-body assembly. Clean stops at every gate.

date: 2026.06.03 · reading: ~15 min · tags: qa, claude code, multi-agent review, hyrum's law, yagni, ai code review

TL;DR

The first two posts (manual exploratory testing, failure triage) covered observer agents. This post is about an actor agent that writes code and opens PRs.
One slash command (/auto-fix-bug) orchestrates seven others through a fixed nine-phase pipeline: repro → plan + fix → verify → manual QA → regression → commit + push → pre-PR audit → review → open PR.
Multi-agent committee review in two stages: three specialist rubrics (blast radius, root cause, YAGNI) that can VETO, then three identical reviewers voting holistically for stability.
Every fix runs in an isolated git worktree. The developer's checkout is never touched. The worktree is deleted on a gate stop, except after a failed verification, where it is kept for inspection.
The PR body is concatenated verbatim from per-phase report files. The orchestrator never writes prose.

The first post in this series was about a Claude Code agent that does deploy-time manual exploratory testing. The second was about an agent that triages test failures by re-running the spec and reading the PR diff. Both of those are observer agents. They look at code and report. They don't change anything.

This post is about the third agent. It writes code. It takes a bug (an internal ticket ID, or a free-form description in a Slack message) and runs a nine-phase pipeline that ends with an open pull request against the development branch. The PR body is assembled mechanically from per-phase reports. The orchestrator doesn't write prose. There are clean stops at every gate. The work happens in an isolated git worktree so nothing contaminates the developer's checkout.

I won't pretend this is comfortable to deploy. Most teams will (and should) put a higher trust bar on "AI ships code" than "AI observes code." This post is mostly about what makes the trust bar achievable: the design choices that turn "AI fixes bugs" from a parlor trick into something a careful team can actually run.

01 / what `/auto-fix-bug` actually is

A single slash command, /auto-fix-bug, that takes a bug as its argument and orchestrates seven other slash commands in sequence. The seven sub-commands are independent skills. Each one lives as its own markdown file under .claude/commands/, with its own gates and its own structured output. The orchestrator is the thin sequencing and assembly layer that decides flow, stops cleanly on any block, and concatenates per-phase reports into the final PR body.

INPUT (ticket-id or free-form bug description)
   │
   ▼
/repro-bug          → REPLICATED? no → STOP (no_repro)
   │
   ▼
/plan-fix           → PLAN_APPROVED? no → STOP (plan_rejected)
   │ (multi-agent committee review, fix worktree + patch)
   ▼
/verify-fix         → FIX_VERIFIED? no → STOP (fix_failed_verify)
   │ (3x sequential re-runs of the repro on the patched worktree)
   ▼
/manual-testing     → MANUAL_QA blocking? → STOP (manual_qa_blocking)
   │ (autonomous mode, scope-by-diff)
   ▼
/check-regression   → REGRESSION_CLEAN? no → STOP (regression_detected)
   │ (pre-fix baseline + post-fix run)
   ▼
COMMIT + PUSH
   │
   ▼
/pre-pr             → CLEAN? no → STOP (branch_contaminated)
   │
   ▼
/review-pr          → NO CRITICAL? no → STOP (review_critical)
   │ (mechanical checks + LLM review, tier by diff size)
   ▼
gh pr create        → PR URL captured
   │
   ▼
OUTPUT (PR URL + assembled body)

Each gate exists because something can plausibly go wrong at this step, and the right move when something does is to stop and tell the human, not to push through. Bugs that can't be reproduced get logged and stopped. Plans the committee rejects get logged and stopped. Patches that don't pass post-fix regression get logged and stopped. A clean exit at any phase is the success case for that phase's gate. The failure case is "merging despite the warning."
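Here's a minimal sketch of that stop-on-first-block sequencing. Only the phase order and the stop codes come from the diagram above. The `Phase`, `Outcome`, and `run_pipeline` names, and the idea of modeling each gate as a callable that returns a boolean, are assumptions made for illustration, not the actual orchestrator.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Phase:
    name: str                 # e.g. "/repro-bug"
    stop_code: str            # e.g. "no_repro", as written in the diagram
    gate: Callable[[], bool]  # hypothetical: returns True when the gate clears


@dataclass
class Outcome:
    completed: List[str]         # phases that cleared, in order
    stopped_with: Optional[str]  # stop code of the first failed gate, or None


def run_pipeline(phases: List[Phase]) -> Outcome:
    """Run phases in order and stop at the first gate that does not clear."""
    completed: List[str] = []
    for phase in phases:
        if not phase.gate():
            # No retry and no recovery: surface the code and stop.
            return Outcome(completed, phase.stop_code)
        completed.append(phase.name)
    return Outcome(completed, None)


# Stand-in gates; real gates would invoke the slash commands.
demo = run_pipeline([
    Phase("/repro-bug", "no_repro", lambda: True),
    Phase("/plan-fix", "plan_rejected", lambda: False),
    Phase("/verify-fix", "fix_failed_verify", lambda: True),
])
print(demo)
```

In this run the third gate is never called, because the loop returns as soon as the plan gate fails.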

02 / reproduce first, fix later

The first phase is /repro-bug. Before the agent considers writing a single line of fix, it has to demonstrate that the bug exists with a runnable script that produces the failing assertion against a real environment. The output of this phase is a repro-report.md file plus a checked-in reproduction script. Both ride along into the final PR so the reviewer can run the script themselves.

This is the single most important constraint in the whole pipeline. A surprising number of "bugs" reported via ticket aren't actually bugs in the system as it exists. They're misremembered behavior, environment confusion, or a missing feature being mistaken for a regression. If you skip the reproduce step you waste the rest of the pipeline. Worse, you generate plausible-looking fixes for non-bugs, which is a much harder failure mode for a reviewer to catch than "the agent gave up."

If /repro-bug can't reproduce the issue, /auto-fix-bug stops with block_no_repro and emits the agent's investigation notes for the operator. That happens often, and it's fine. That's the gate doing its job.

03 / plan and two-stage committee review

This phase is where the design gets interesting. The agent doesn't just generate a fix. It generates a plan for a fix, and the plan goes through two stages of multi-agent review before any code is touched.

Stage A. Three parallel specialists, deterministic rubrics. Three reviewer agents run in parallel, each applying a different deterministic rubric:

/review-blast-radius: applies Hyrum's Law thinking. "All observable behaviors of your system will be depended on by somebody." The reviewer's job is to find load-bearing implicit contracts the fix might disturb.
/review-root-cause: applies a 5 Whys-style probe. "Does this plan address the root cause, or is it a band-aid one layer up?"
/review-minimalism: applies YAGNI. "Does the plan do more than the bug requires? Strip everything that isn't necessary to make the failing assertion pass."

Each specialist emits a structured verdict: PASS, CONCERN, VETO, or SKIP. Any VETO sends the drafter agent back to revise the plan, and Stage A re-runs against the revised plan. If specialists are still vetoing after 3 iterations, the orchestrator stops with block_reviewer_irreconcilable. That's not the agents disagreeing. It's the agents collectively saying "this bug isn't safely fixable by a small patch," which is a real signal.
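The veto loop can be sketched roughly like this. The four verdict values, the three-iteration cap, and the block code are from the text. Everything else is assumed: `run_stage_a`, the specialist and drafter callables, and the choice to run reviewers sequentially here (the real ones run in parallel).

```python
from enum import Enum
from typing import Callable, List, Tuple


class Verdict(Enum):
    PASS = "PASS"
    CONCERN = "CONCERN"
    VETO = "VETO"
    SKIP = "SKIP"


MAX_ITERATIONS = 3  # from the text: give up after three vetoed rounds

Specialist = Callable[[str], Verdict]          # hypothetical: plan -> verdict
Drafter = Callable[[str, List[Verdict]], str]  # hypothetical: returns revised plan


def run_stage_a(plan: str, specialists: List[Specialist],
                revise: Drafter) -> Tuple[str, str]:
    """Return (status, plan); status is 'cleared' or the block code."""
    for attempt in range(1, MAX_ITERATIONS + 1):
        verdicts = [review(plan) for review in specialists]
        if Verdict.VETO not in verdicts:
            return "cleared", plan
        if attempt < MAX_ITERATIONS:
            # Only revise if another review round is still allowed.
            plan = revise(plan, verdicts)
    return "block_reviewer_irreconcilable", plan


# Toy reviewers standing in for the three rubric agents.
def minimalism(plan: str) -> Verdict:
    return Verdict.VETO if "refactor" in plan else Verdict.PASS


def no_objection(plan: str) -> Verdict:
    return Verdict.PASS


def drop_refactor(plan: str, verdicts: List[Verdict]) -> str:
    return plan.replace(" + refactor", "")


status, final_plan = run_stage_a(
    "fix null check + refactor",
    [no_objection, no_objection, minimalism],
    drop_refactor,
)
stuck_status, _ = run_stage_a(
    "refactor everything", [minimalism], lambda plan, verdicts: plan
)
```

The first call clears on the second round once the drafter strips the extra work. The second call never satisfies the reviewer and ends in the block code.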

Stage B. Three identical generalist reviewers, holistic vote. Once Stage A clears (no vetoes), the plan and its drafted patch (not yet applied anywhere) go to a second committee: three identical reviewer agents with the same prompt, voting holistically on the (possibly revised) plan + patch. Two of three must approve. A 2-1 split passes with the dissent annotated in the PR body; a 1-2 or 0-3 split rejects.
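The Stage B tally is simple enough to write down. This is a sketch under assumptions: `stage_b_vote` is a hypothetical name, and representing each reviewer's vote as a boolean is my simplification.

```python
from typing import List, Tuple


def stage_b_vote(approvals: List[bool]) -> Tuple[bool, List[int]]:
    """Tally three holistic votes.

    Returns (passed, dissenters). Dissenters are the indexes of reviewers
    who voted no on a plan that still passed, so their objections can be
    annotated in the PR body.
    """
    if len(approvals) != 3:
        raise ValueError("Stage B expects exactly three reviewer votes")
    yes_votes = sum(1 for approved in approvals if approved)
    passed = yes_votes >= 2  # two of three must approve
    dissenters = (
        [i for i, approved in enumerate(approvals) if not approved]
        if passed else []
    )
    return passed, dissenters


unanimous = stage_b_vote([True, True, True])
split_pass = stage_b_vote([True, True, False])
split_reject = stage_b_vote([True, False, False])
```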

DECISION · TWO REVIEW SHAPES, NOT ONE

Stage A is specialist: three different rubrics, deliberately diverse perspectives, deterministic logic. Stage B is generalist: same prompt, three samples, voting for stability against single-LLM variance. You catch different kinds of mistakes with different review structures, and you want both.

When Stage A and Stage B both clear, the agent applies the patch to a dedicated git worktree (not the developer's main checkout) and proceeds.

04 / verify

/verify-fix re-runs the reproduction script from Phase 1 against the patched worktree, three sequential times. All three runs must pass for the gate to clear. Three is enough to catch the basic flakiness modes. If a "fix" passes on first run but fails on second, that's almost always a real problem (the fix introduced timing variability, or only works on a clean state and not after re-entry).
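A rough sketch of the three-run check, assuming the reproduction script signals "the previously failing assertion now passes" with exit code 0. That exit-code convention and the `verify_fix` name are my assumptions. The demo commands are stand-ins for a real repro script.

```python
import subprocess
import sys
from typing import Optional, Sequence

REQUIRED_RUNS = 3  # from the text: three sequential runs, all must pass


def verify_fix(repro_cmd: Sequence[str], cwd: Optional[str] = None) -> bool:
    """Run the repro command sequentially; any non-zero exit fails the gate."""
    for _ in range(REQUIRED_RUNS):
        result = subprocess.run(list(repro_cmd), cwd=cwd)
        if result.returncode != 0:
            # Stop at the first failing run; do not retry or iterate on the fix.
            return False
    return True


# Stand-in commands: one always exits 0, the other always exits 1.
always_passes = verify_fix([sys.executable, "-c", "raise SystemExit(0)"])
always_fails = verify_fix([sys.executable, "-c", "raise SystemExit(1)"])
```

Running sequentially rather than in parallel matters here: a fix that only works on clean state shows up as a failure on the second or third run.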

If verification fails, the worktree is preserved for inspection but the orchestrator stops with block_fix_failed_verify. The agent does NOT iterate on the fix here. That's intentional: at this phase we already passed plan review, so a failure to verify means something we didn't anticipate. Better to stop and surface the evidence than to start improvising patches.

05 / manual exploratory QA and regression

Phases 4 and 5 delegate to the manual-testing agent (from post #1) and a regression-check skill. The manual-testing agent runs in autonomous mode with scope restricted to the changed files in the patch. It tests behavior around the fix, not the targeted bug itself (which we already proved is fixed). It looks for input-fuzzing failures on changed endpoints, role-boundary violations on touched routes, breakage in adjacent code paths that share helpers, and integration problems with the surrounding system.

Regression-check runs a chosen slice of the regression suite twice: once on the pre-fix baseline, once on the post-fix patched worktree. The diff between the two runs is the actual signal. If any spec fails post-fix that passed on the pre-fix baseline, the fix introduced a regression and we stop, even if the total failure count is unchanged or lower. If every post-fix failure was already failing on the baseline, we proceed.
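One way to read the pre/post comparison is as a set difference over failing spec names rather than a comparison of counts. The sketch below assumes each run can be reduced to a set of failing spec identifiers. The `regression_gate` name and the spec names in the demo are made up.

```python
from typing import Dict, List, Set, Union


def regression_gate(pre_fix_failures: Set[str],
                    post_fix_failures: Set[str]) -> Dict[str, Union[bool, List[str]]]:
    """Compare failing specs from the baseline run and the post-fix run."""
    new_failures = sorted(post_fix_failures - pre_fix_failures)
    fixed = sorted(pre_fix_failures - post_fix_failures)
    return {
        "clean": not new_failures,  # the gate clears only with no new failures
        "new_failures": new_failures,
        "fixed": fixed,
    }


# Same failure count before and after, but one failure is new.
same_count = regression_gate(
    {"checkout.spec", "login.spec"},
    {"login.spec", "search.spec"},
)
# A pre-existing flake that fails in both runs does not block the fix.
known_flake = regression_gate({"login.spec"}, {"login.spec"})
```

The first case is why counting alone isn't enough: two failures before, two after, and the fix still broke something.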

DECISION · PRE-FIX BASELINE, NOT HISTORICAL

Comparing a fix's regression run against a baseline from the same moment is much stronger than comparing against historical pass rates. Environments drift. Flakes happen. A post-fix vs. pre-fix diff isolates the fix's contribution.

06 / commit, push, pre-PR audit

Once Phases 1 through 5 pass, the orchestrator commits in the worktree, pushes the branch, and runs a pre-PR audit that checks:

The branch matches the expected naming pattern.
No stray commits from unrelated work.
No uncommitted leftovers.

Auto-bug-fixer branches almost never fail this gate, because the work happens in a dedicated worktree where contamination should not be possible. The check exists anyway: if it does fail, something deeper has gone wrong, and silently opening a PR full of unrelated commits is worse than stopping with a clear error.

07 / deep diff review

/review-pr runs an automated review against the pushed branch, tiered by diff size:

INLINE (<200 lines changed): no sub-agents. The orchestrator does mechanical checks plus a light LLM pass inline. ~1 minute. Most auto-bug-fixer fixes land here because of the YAGNI gate in plan review.
FAST (200 to 499 lines): spawns Haiku-class reviewer agents in parallel. ~3 to 5 min.
FULL (500+ lines): Sonnet-class reviewers. ~5 to 8 min.
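Tier selection reduces to a threshold lookup. A minimal sketch: the `pick_review_tier` name is hypothetical, and the boundary handling (exactly 200 lines goes to FAST, exactly 500 goes to FULL) is my reading of the thresholds.

```python
def pick_review_tier(lines_changed: int) -> str:
    """Map a diff size to a review tier name."""
    if lines_changed < 0:
        raise ValueError("lines_changed cannot be negative")
    if lines_changed < 200:
        return "INLINE"  # mechanical checks plus a light inline LLM pass
    if lines_changed < 500:
        return "FAST"    # parallel Haiku-class reviewers
    return "FULL"        # Sonnet-class reviewers


tiers = {n: pick_review_tier(n) for n in (1, 199, 200, 499, 500, 700)}
```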

The review verdict gates PR creation: YES passes; YES WITH CONDITIONS passes with concerns surfaced in the PR body for the human reviewer; NO stops with block_review_critical and deletes the just-pushed branch (an orphaned remote branch with a rejected change is worse than no branch).

08 / open PR

The PR body is assembled by concatenating the per-phase report files, in a fixed order:

## Summary             ← from repro-report.md ¶1 + action-plan.md ¶1
## Replication         ← from repro-report.md
## Changes             ← from action-plan.md
## Regression (pre)    ← from regression-baseline-report.md
## Manual QA           ← from manual-testing-report.md
## Regression (post)   ← from regression-postfix-report.md
## Risks & rollback    ← from action-plan.md
## Code review         ← from review-pr-report.md

The orchestrator emits the section headers. Everything else is verbatim from the report files written by the phases. Claude prose in stdout is for the operator. The report files are for downstream consumption. Same principle that makes the manual-testing and triage agents from the earlier posts work: structured output is a contract, prose is not.
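A sketch of the mechanical assembly, covering only the sections that map one-to-one onto a report file. Summary, Changes, and Risks & rollback are left out on purpose: they pull parts of files, and how those parts are delimited isn't specified here. The `assemble_pr_body` name, the single report directory, and the trimming of leading and trailing whitespace are assumptions.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

# (header, report file) pairs in the fixed order listed above.
WHOLE_FILE_SECTIONS: List[Tuple[str, str]] = [
    ("## Replication", "repro-report.md"),
    ("## Regression (pre)", "regression-baseline-report.md"),
    ("## Manual QA", "manual-testing-report.md"),
    ("## Regression (post)", "regression-postfix-report.md"),
    ("## Code review", "review-pr-report.md"),
]


def assemble_pr_body(report_dir: Path) -> str:
    """Emit headers and paste each report as-is; fail if a report is missing."""
    parts: List[str] = []
    for header, filename in WHOLE_FILE_SECTIONS:
        report = report_dir / filename
        if not report.is_file():
            # A missing report means a phase did not write its contract.
            raise FileNotFoundError(f"missing phase report: {filename}")
        # Only surrounding whitespace is trimmed; the content is not rewritten.
        text = report.read_text(encoding="utf-8").strip()
        parts.append(f"{header}\n\n{text}")
    return "\n\n".join(parts) + "\n"


# Demo with placeholder report files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for _, name in WHOLE_FILE_SECTIONS:
        (demo_dir / name).write_text(f"contents of {name}\n", encoding="utf-8")
    body = assemble_pr_body(demo_dir)
```

Raising on a missing file rather than emitting an empty section keeps the "if it isn't in a file, it doesn't exist" rule enforceable.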

09 / design choices worth stealing

A few patterns I'd repeat on any future agent pipeline like this:

Worktree isolation per fix. Git worktrees are a free, lightweight way to give each automated change a clean room. The developer's main checkout is never touched. The worktree gets deleted if the pipeline stops at a gate; the one exception is a failed verification, where it is kept for inspection.
Reports as inter-phase contracts. Phase N writes a structured .md file; phase N+1 reads it. The orchestrator never tries to summarize Phase N's output from memory. If a phase's output isn't in a file, downstream phases can't use it.
Clean stops at every gate. The orchestrator stops on the first gate that says stop. It does NOT try to recover, retry, or improvise. Failure modes are surfaced verbatim. Each block_* code is greppable and routable.
Multi-agent review with diverse rubrics. Three specialists with different prompts catch different problems than three samples of the same prompt. Use both: specialists for diversity, identical-prompt committee for stability.
Tiered review by diff size. The same review rigor doesn't make sense for a one-line typo fix and a 700-line refactor. Tier the reviewer agents (inline, Haiku-class, Sonnet-class) and let the diff size pick.
YAGNI as a structural gate, not advice. The minimalism reviewer in Stage A doesn't just prefer small diffs. It can VETO if the plan does more than the bug requires. That single rule keeps auto-bug-fixer PRs almost always in the INLINE tier, which keeps review fast, which keeps the trust feedback loop tight.
Pre-PR audit is mandatory. Even when you "know" the branch is clean (because you created it yourself in a dedicated worktree), run the audit. The branch you "know" is clean is exactly the branch that's not.
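The worktree pattern from the first item can be sketched as a context manager around `git worktree add` and `git worktree remove`. This is an illustration under assumptions, not the actual orchestrator: `fix_worktree`, `GateStop`, the injectable `run` parameter, and the path and branch names in the demo are all hypothetical. It implements only the delete-on-stop rule and leaves the worktree in place on success.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List

Runner = Callable[[List[str]], None]


class GateStop(Exception):
    """Hypothetical signal raised when a gate stops the pipeline."""


def run_git(args: List[str]) -> None:
    subprocess.run(["git", *args], check=True)


@contextmanager
def fix_worktree(path: str, branch: str, base: str,
                 run: Runner = run_git) -> Iterator[str]:
    """Create a worktree on a new branch; remove it if a gate stops the run."""
    # git worktree add -b <branch> <path> <base>
    run(["worktree", "add", "-b", branch, path, base])
    try:
        yield path
    except GateStop:
        # Clean room discarded; the main checkout was never touched.
        run(["worktree", "remove", "--force", path])
        raise


# Demo with a recording runner instead of real git calls.
calls: List[List[str]] = []
try:
    with fix_worktree("/tmp/fix-123", "auto-fix/123", "development",
                      run=calls.append):
        raise GateStop("no_repro")
except GateStop:
    pass
```

Passing the runner in makes the sequencing testable without a repository; with the default runner the same code shells out to git.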

10 / the economics, again

I've covered this in the previous posts: the agent runs against subscription auth in a Docker container, so the marginal cost per PR is effectively zero. What matters is the trust economics, not the dollar economics. A pipeline like this is acceptable when:

It opens small, single-concern PRs (the YAGNI gate enforces this).
Every PR ships with a runnable reproduction and a clean regression diff.
The human reviewer has the same information the agent had, plus the agent's own audit trail.
Failures stop instead of pushing through.

It is not acceptable as a "ship-it-while-I-sleep" button, and the gates are there to make that distinction unambiguous.

11 / closing, and the series so far

Three posts, three Claude Code agents, one infrastructure pattern:

Deploy-time manual testing: an observer agent that watches every deploy.
Failure triage: an observer agent that watches every failed regression run.
Auto-bug-fixer: an actor agent that writes patches and opens PRs.

The two observers are easy to deploy because their blast radius is small. The third is the bold one, and the architecture is shaped to bound the boldness: committee reviews with deterministic rubrics, worktree isolation, mechanical PR-body assembly, clean stops at every gate, the same regression suite running pre-fix and post-fix to isolate the fix's contribution.

THE TAKEAWAY

The thing that surprises people when I describe this is that most of the engineering isn't AI engineering. It's pipeline engineering: file contracts between phases, gate logic, branch hygiene, structured outputs, audit trails. The AI is the controller of each phase. The pipeline is what makes the controller's output safe to ship.

The next post, if I write one, is probably about the workbench repo as a living institutional asset. How slash commands compose, how skills mature over time, how a QA org's knowledge becomes searchable, diffable, reviewable. The infrastructure pattern is mostly the same. The lens shifts from what the agents do to what the workbench becomes when you maintain it for a year.