From bug ticket to opened PR:
a Claude Code agent pipeline that ships its own fixes.
A nine-phase pipeline that takes a bug (ticket ID or free-form Slack message) and ends with an open pull request. Multi-agent committee review with deterministic rubrics. Worktree isolation. Mechanical PR-body assembly. Clean stops at every gate.
- The first two posts (manual exploratory testing, failure triage) covered observer agents. This post is about an actor agent that writes code and opens PRs.
- One slash command (/auto-fix-bug) orchestrates seven others through a fixed pipeline: repro → plan → fix → verify → manual QA → regression → commit → pre-PR audit → review → open PR.
- Multi-agent committee review in two stages: three specialist rubrics (blast radius, root cause, YAGNI) that can VETO, then three identical reviewers voting holistically for stability.
- Every fix runs in an isolated git worktree. The developer's checkout is never touched. The worktree is deleted on any gate stop.
- The PR body is concatenated verbatim from per-phase report files. The orchestrator never writes prose.
The first post in this series was about a Claude Code agent that does deploy-time manual exploratory testing. The second was about an agent that triages test failures by re-running the spec and reading the PR diff. Both of those are observer agents. They look at code and report. They don't change anything.
This post is about the third agent. It writes code. It takes a bug (an internal ticket ID, or a free-form description in a Slack message) and runs a nine-phase pipeline that ends with an open pull request against the development branch. The PR body is assembled mechanically from per-phase reports. The orchestrator doesn't write prose. There are clean stops at every gate. The work happens in an isolated git worktree so nothing contaminates the developer's checkout.
I won't pretend this is comfortable to deploy. Most teams will (and should) put a higher trust bar on "AI ships code" than "AI observes code." This post is mostly about what makes the trust bar achievable: the design choices that turn "AI fixes bugs" from a parlor trick into something a careful team can actually run.
01 / what /auto-fix-bug actually is
A single slash command, /auto-fix-bug, that
takes a bug as its argument and orchestrates seven other
slash commands in sequence. The seven sub-commands are
independent skills. Each one lives as its own markdown file
under .claude/commands/, with its own gates and
its own structured output. The orchestrator is the thin
sequencing and assembly layer that decides flow, stops
cleanly on any block, and concatenates per-phase reports
into the final PR body.
INPUT (ticket-id or free-form bug description)
│
▼
/repro-bug → REPLICATED? no → STOP (no_repro)
│
▼
/plan-fix → PLAN_APPROVED? no → STOP (plan_rejected)
│ (multi-agent committee review, fix worktree + patch)
▼
/verify-fix → FIX_VERIFIED? no → STOP (fix_failed_verify)
│ (3x sequential re-runs of the repro on the patched worktree)
▼
/manual-testing → MANUAL_QA blocking? → STOP (manual_qa_blocking)
│ (autonomous mode, scope-by-diff)
▼
/check-regression → REGRESSION_CLEAN? no → STOP (regression_detected)
│ (pre-fix baseline + post-fix run)
▼
COMMIT + PUSH
│
▼
/pre-pr → CLEAN? no → STOP (branch_contaminated)
│
▼
/review-pr → NO CRITICAL? no → STOP (review_critical)
│ (mechanical checks + LLM review, tier by diff size)
▼
gh pr create → PR URL captured
│
▼
OUTPUT (PR URL + assembled body)
Each gate exists because something can plausibly go wrong at this step, and the right move when something does is to stop and tell the human, not to push through. Bugs that can't be reproduced get logged and stopped. Plans the committee rejects get logged and stopped. Patches that don't pass post-fix regression get logged and stopped. A clean exit at any phase is the success case for that phase's gate. The failure case is "merging despite the warning."
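The stop-on-first-block sequencing can be sketched in a few lines of Python. This is a minimal sketch, not the real orchestrator (which is a markdown slash command): `Phase`, `GateResult`, `PipelineOutcome`, and `run_pipeline` are hypothetical names, and each phase is assumed to return a pass/fail flag plus the path of the report file it wrote.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class GateResult:
    passed: bool
    report_path: str  # per-phase report file written by the phase


@dataclass
class Phase:
    name: str
    block_code: str  # code emitted if this phase's gate says stop
    run: Callable[[], GateResult]


@dataclass
class PipelineOutcome:
    block_code: Optional[str]  # None means every gate cleared
    stopped_at: Optional[str]
    reports: List[str] = field(default_factory=list)


def run_pipeline(phases: List[Phase]) -> PipelineOutcome:
    """Run phases in order and stop at the first gate that fails.

    There is deliberately no retry or recovery branch here.
    """
    outcome = PipelineOutcome(block_code=None, stopped_at=None)
    for phase in phases:
        result = phase.run()
        outcome.reports.append(result.report_path)
        if not result.passed:
            outcome.block_code = phase.block_code
            outcome.stopped_at = phase.name
            return outcome
    return outcome
```

The loop has no `except` and no retry path on purpose: a failed gate returns immediately, and later phases never run.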
02 / reproduce first, fix later
The first phase is /repro-bug. Before the agent
considers writing a single line of fix, it has to
demonstrate that the bug exists with a runnable script that
produces the failing assertion against a real environment.
The output of this phase is a repro-report.md
file plus a checked-in reproduction script. Both ride along
into the final PR so the reviewer can run it themselves.
This is the single most important constraint in the whole pipeline. A surprising number of "bugs" reported via ticket aren't actually bugs in the system as it exists. They're misremembered behavior, environment confusion, or a missing feature being mistaken for a regression. If you skip the reproduce step you waste the rest of the pipeline. Worse, you generate plausible-looking fixes for non-bugs, which is a much harder failure mode for a reviewer to catch than "the agent gave up."
If /repro-bug can't reproduce the issue,
/auto-fix-bug stops with
block_no_repro and emits the agent's
investigation notes for the operator. That happens often,
and it's fine. That's the gate doing its job.
03 / plan and two-stage committee review
This phase is where the design gets interesting. The agent doesn't just generate a fix. It generates a plan for a fix, and the plan goes through two stages of multi-agent review before any code is touched.
Stage A. Three parallel specialists, deterministic rubrics. Three reviewer agents run in parallel, each applying a different deterministic rubric:
- /review-blast-radius: applies Hyrum's Law thinking. "All observable behaviors of your system will be depended on by somebody." The reviewer's job is to find load-bearing implicit contracts the fix might disturb.
- /review-root-cause: applies a 5 Whys-style probe. "Does this plan address the root cause, or is it a band-aid one layer up?"
- /review-minimalism: applies YAGNI. "Does the plan do more than the bug requires? Strip everything that isn't necessary to make the failing assertion pass."
Each specialist emits a structured verdict: PASS,
CONCERN, VETO, or SKIP.
Any VETO sends the drafter agent back to revise
the plan. Stage A re-runs against the revised plan. If
specialists keep vetoing across 3 iterations, the orchestrator
stops with block_reviewer_irreconcilable. That's
not the agents disagreeing; it's the agents collectively
saying "this bug isn't safely fixable by a small patch,"
which is a real signal.
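The Stage A loop can be sketched as follows. This is a rough sketch under assumptions: `run_stage_a`, `specialists`, and `revise` are hypothetical names, "3 iterations" is read as three review rounds in total, and the reviewers are called one after another here for simplicity even though the real ones run in parallel.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple


class Verdict(Enum):
    PASS = "PASS"
    CONCERN = "CONCERN"
    VETO = "VETO"
    SKIP = "SKIP"


MAX_STAGE_A_ROUNDS = 3  # assumption: three review rounds in total


def run_stage_a(
    plan: str,
    specialists: Dict[str, Callable[[str], Verdict]],
    revise: Callable[[str, List[str]], str],
) -> Tuple[str, Optional[str]]:
    """Return (final_plan, block_code); block_code is None when no one vetoes."""
    for round_number in range(1, MAX_STAGE_A_ROUNDS + 1):
        verdicts = {name: review(plan) for name, review in specialists.items()}
        vetoers = [name for name, v in verdicts.items() if v is Verdict.VETO]
        if not vetoers:
            # PASS, CONCERN, and SKIP all let the plan through Stage A.
            return plan, None
        if round_number < MAX_STAGE_A_ROUNDS:
            # Send the plan back to the drafter along with who vetoed it.
            plan = revise(plan, vetoers)
    return plan, "block_reviewer_irreconcilable"
```

Only VETO triggers a revision; a CONCERN travels forward with the plan rather than blocking it.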
Stage B. Three identical generalist reviewers, holistic vote. Once Stage A clears (no vetoes), the plan and patch go to a second committee: three identical reviewer agents with the same prompt, voting holistically on the (possibly revised) plan + patch. Two of three must approve. A 2-1 split passes with the dissent annotated in the PR body; a 1-2 or 0-3 split rejects.
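The two-of-three rule is small enough to write out. In this sketch `stage_b_vote` is a hypothetical name, and each reviewer's holistic judgment is assumed to reduce to a single approve/reject boolean.

```python
from typing import List, Tuple


def stage_b_vote(approvals: List[bool]) -> Tuple[bool, List[int]]:
    """Apply the two-of-three rule.

    Returns (passed, dissenters). dissenters lists the indexes of reviewers
    who rejected a plan that still passed, so their notes can be annotated
    in the PR body. It is empty when the plan is rejected outright.
    """
    if len(approvals) != 3:
        raise ValueError("Stage B expects exactly three reviewer votes")
    passed = sum(approvals) >= 2
    dissenters = [i for i, ok in enumerate(approvals) if not ok] if passed else []
    return passed, dissenters
```

For example, `stage_b_vote([True, True, False])` passes with reviewer index 2 recorded as the dissent, while `stage_b_vote([True, False, False])` rejects.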
Stage A is specialist: three different rubrics, deliberately diverse perspectives, deterministic logic. Stage B is generalist: same prompt, three samples, voting for stability against single-LLM variance. You catch different kinds of mistakes with different review structures, and you want both.
When Stage A and Stage B both clear, the agent applies the patch to a dedicated git worktree (not the developer's main checkout) and proceeds.
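For readers who haven't used worktrees, here is a small sketch of the two git invocations involved, built as argument lists that could be handed to `subprocess.run`. The helper names, paths, and the branch name in the usage note are hypothetical; `git worktree add -b` and `git worktree remove --force` are standard git subcommands.

```python
from typing import List


def worktree_add_cmd(
    repo_dir: str, worktree_path: str, branch: str, base_ref: str
) -> List[str]:
    """Create a new branch from base_ref, checked out in its own directory."""
    return [
        "git", "-C", repo_dir,
        "worktree", "add", "-b", branch, worktree_path, base_ref,
    ]


def worktree_remove_cmd(repo_dir: str, worktree_path: str) -> List[str]:
    """Delete the worktree directory, discarding uncommitted changes in it."""
    return ["git", "-C", repo_dir, "worktree", "remove", "--force", worktree_path]
```

Calling `worktree_add_cmd("/repo", "/tmp/fix-wt", "auto-fix/TICKET-1", "origin/develop")` yields a command that leaves the files in `/repo` untouched, because the new branch is checked out only under `/tmp/fix-wt`.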
04 / verify
/verify-fix re-runs the reproduction script
from Phase 1 against the patched worktree, three sequential
times. All three runs must pass for the gate to clear. Three
is enough to catch the basic flakiness modes. If a "fix"
passes on first run but fails on second, that's almost
always a real problem (the fix introduced timing
variability, or only works on a clean state and not after
re-entry).
If verification fails, the worktree is preserved for
inspection but the orchestrator stops with
block_fix_failed_verify. The agent does NOT
iterate on the fix here. That's intentional: at this phase
we already passed plan review, so a failure to verify means
something we didn't anticipate. Better to stop and surface
the evidence than to start improvising patches.
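A minimal sketch of the three-run gate, assuming the reproduction script can be wrapped in a callable that returns `True` when the bug no longer reproduces. `verify_fix` and `run_repro` are hypothetical names.

```python
from typing import Callable, Tuple


def verify_fix(
    run_repro: Callable[[], bool], required_passes: int = 3
) -> Tuple[bool, int]:
    """Run the repro sequentially and require every run to pass.

    Returns (verified, attempts_made). The function stops at the first
    failing run and never tries to patch or retry.
    """
    for attempt in range(1, required_passes + 1):
        if not run_repro():
            return False, attempt
    return True, required_passes
```

A fix that passes once and then fails comes back as `(False, 2)`, which is the evidence the operator needs: the run number where it broke.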
05 / manual exploratory QA and regression
Phases 4 and 5 delegate to the manual-testing agent (from post #1) and a regression-check skill. The manual-testing agent runs in autonomous mode with scope restricted to the changed files in the patch. It tests behavior around the fix, not the targeted bug itself (which we already proved is fixed). It probes four things: input fuzzing on changed endpoints, role boundaries on touched routes, adjacent code paths that share helpers, and integration with the surrounding system.
Regression-check runs a chosen slice of the regression suite twice: once on the pre-fix baseline, once on the post-fix patched worktree. The diff between the two runs is the actual signal. If the post-fix run fails any spec that passed on the pre-fix baseline, the fix introduced a regression and we stop. If every post-fix failure was already failing on the baseline, we proceed.
Comparing a fix's regression run against a baseline from the same moment is much stronger than comparing against historical pass rates. Environments drift. Flakes happen. A post-fix vs. pre-fix diff isolates the fix's contribution.
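The pre-fix vs. post-fix comparison is a set difference over failing spec names. This sketch assumes each run can be reduced to a set of failing spec identifiers; `regression_gate` and the spec names in the usage note are hypothetical.

```python
from typing import Dict, List, Set, Union


def regression_gate(
    baseline_failures: Set[str], postfix_failures: Set[str]
) -> Dict[str, Union[bool, List[str]]]:
    """Compare failing specs before and after the fix.

    A spec failing in both runs is pre-existing noise (flake or drift) and
    is not attributed to the fix. Only failures that are new after the fix
    block the pipeline.
    """
    new_failures = postfix_failures - baseline_failures
    newly_passing = baseline_failures - postfix_failures
    return {
        "clean": not new_failures,
        "new_failures": sorted(new_failures),
        "newly_passing": sorted(newly_passing),
    }
```

With a baseline of `{"checkout.spec", "flaky.spec"}` and a post-fix run of `{"flaky.spec", "login.spec"}`, the failure count is unchanged at two, yet the gate reports `login.spec` as a new failure and stops. That is why the comparison is by spec identity rather than by count.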
06 / commit, push, pre-PR audit
Once Phases 1 through 5 pass, the orchestrator commits in the worktree, pushes the branch, and runs a pre-PR audit that checks:
- The branch matches the expected naming pattern.
- No stray commits from unrelated work.
- No uncommitted leftovers.
Auto-bug-fixer branches almost never fail this gate (we work in a dedicated worktree, so contamination shouldn't be possible), but the check exists because something deeper has gone wrong if it does, and silently opening a PR full of unrelated commits is worse than stopping with a clear error.
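The three audit checks can be sketched as one pure function. Everything specific here is an assumption for illustration: the `auto-fix/` branch pattern, the function and parameter names, and the idea that the orchestrator knows which commits it created. The only git fact relied on is that `git status --porcelain` prints nothing for a clean tree.

```python
import re
from typing import List

# Hypothetical naming pattern; the real one isn't given in this post.
BRANCH_PATTERN = re.compile(r"auto-fix/[A-Za-z0-9._-]+")


def pre_pr_audit(
    branch: str,
    commits_ahead: List[str],
    expected_commits: List[str],
    porcelain_status: str,
) -> List[str]:
    """Return a list of problems; an empty list means the branch is clean."""
    problems = []
    if not BRANCH_PATTERN.fullmatch(branch):
        problems.append(f"unexpected branch name: {branch}")
    stray = [sha for sha in commits_ahead if sha not in expected_commits]
    if stray:
        problems.append("stray commits: " + ", ".join(stray))
    if porcelain_status.strip():
        # Any output from `git status --porcelain` means leftovers exist.
        problems.append("uncommitted changes in worktree")
    return problems
```

Any non-empty result would map to the branch_contaminated stop shown in the flow diagram.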
07 / deep diff review
/review-pr runs an automated review against
the pushed branch, tiered by diff size:
- INLINE (<200 lines changed): no sub-agents. The orchestrator does mechanical checks plus a light LLM pass inline. ~1 minute. Most auto-bug-fixer fixes land here because of the YAGNI gate in plan review.
- FAST (200 to 500 lines): spawns Haiku-class reviewer agents in parallel. ~3 to 5 min.
- FULL (500+ lines): Sonnet-class reviewers. ~5 to 8 min.
The review verdict gates PR creation: YES
passes; YES WITH CONDITIONS passes with
concerns surfaced in the PR body for the human reviewer;
NO stops with block_review_critical
and deletes the just-pushed branch (an orphaned remote
branch with a rejected change is worse than no branch).
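Tier selection and the verdict gate can be sketched together. The names `pick_tier`, `review_gate`, and `ReviewDecision` are hypothetical. The tier list above puts 500 lines in both FAST and FULL, so this sketch assumes exactly 500 lands in FULL.

```python
from dataclasses import dataclass
from typing import Optional


def pick_tier(lines_changed: int) -> str:
    """Map diff size to a review tier."""
    if lines_changed < 200:
        return "INLINE"
    if lines_changed < 500:  # assumption: exactly 500 goes to FULL
        return "FAST"
    return "FULL"


@dataclass(frozen=True)
class ReviewDecision:
    open_pr: bool
    surface_conditions: bool  # copy reviewer concerns into the PR body
    block_code: Optional[str]
    delete_remote_branch: bool


def review_gate(verdict: str) -> ReviewDecision:
    """Translate the review verdict into what the orchestrator does next."""
    if verdict == "YES":
        return ReviewDecision(True, False, None, False)
    if verdict == "YES WITH CONDITIONS":
        return ReviewDecision(True, True, None, False)
    if verdict == "NO":
        return ReviewDecision(False, False, "block_review_critical", True)
    raise ValueError(f"unknown review verdict: {verdict!r}")
```

An unrecognized verdict raises instead of defaulting to a pass, in keeping with the rule that structured output is a contract.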
08 / open PR
The PR body is assembled by concatenating the per-phase report files, in a fixed order:
## Summary ← from repro-report.md ¶1 + action-plan.md ¶1
## Replication ← from repro-report.md
## Changes ← from action-plan.md
## Regression (pre) ← from regression-baseline-report.md
## Manual QA ← from manual-testing-report.md
## Regression (post) ← from regression-postfix-report.md
## Risks & rollback ← from action-plan.md
## Code review ← from review-pr-report.md
The orchestrator emits the section headers. Everything else is verbatim from the report files written by the phases. Claude prose in stdout is for the operator. The report files are for downstream consumption. Same principle that makes the manual-testing and triage agents from the earlier posts work: structured output is a contract, prose is not.
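Mechanical assembly amounts to string concatenation. In this sketch the function names are hypothetical, and because the post doesn't spell out how a report such as action-plan.md is sliced between two sections, the assembler takes already-sliced verbatim fragments rather than guessing at that logic.

```python
from typing import List, Tuple


def first_paragraph(report_text: str) -> str:
    """Return the first blank-line-delimited paragraph, verbatim."""
    return report_text.strip().split("\n\n", 1)[0]


def assemble_pr_body(sections: List[Tuple[str, List[str]]]) -> str:
    """Concatenate report fragments under fixed headers.

    The only text this function contributes is the '## ' header lines and
    the blank lines between parts. Everything else is copied from reports.
    """
    parts = []
    for header, fragments in sections:
        parts.append(f"## {header}")
        parts.extend(fragment.strip() for fragment in fragments)
    return "\n\n".join(parts) + "\n"
```

A Summary section would then be built as `("Summary", [first_paragraph(repro_text), first_paragraph(plan_text)])`, followed by `("Replication", [repro_text])` and so on in the fixed order above.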
09 / design choices worth stealing
A few patterns I'd repeat on any future agent pipeline like this:
- Worktree isolation per fix. Git worktrees are a free, lightweight way to give each automated change a clean room. The developer's main checkout is never touched. The worktree gets deleted if the pipeline stops at any gate.
- Reports as inter-phase contracts. Phase N writes a structured .md file; phase N+1 reads it. The orchestrator never tries to summarize Phase N's output from memory. If a phase's output isn't in a file, downstream phases can't use it.
- Clean stops at every gate. The orchestrator stops on the first gate that says stop. It does NOT try to recover, retry, or improvise. Failure modes are surfaced verbatim. Each block_* code is greppable and routable.
- Multi-agent review with diverse rubrics. Three specialists with different prompts catch different problems than three samples of the same prompt. Use both: specialists for diversity, identical-prompt committee for stability.
- Tiered review by diff size. The same review rigor doesn't make sense for a one-line typo fix and a 700-line refactor. Tier the reviewer agents (inline, Haiku-class, Sonnet-class) and let the diff size pick.
- YAGNI as a structural gate, not advice. The minimalism reviewer in Stage A doesn't just prefer small diffs. It can VETO if the plan does more than the bug requires. That single rule keeps auto-bug-fixer PRs almost always in the INLINE tier, which keeps review fast, which keeps the trust feedback loop tight.
- Pre-PR audit is mandatory. Even when you "know" the branch is clean (because you created it yourself in a dedicated worktree), run the audit. The branch you "know" is clean is exactly the branch that's not.
10 / the economics, again
I've covered this in the previous posts: the agent runs against subscription auth in a Docker container, so the marginal cost per PR is effectively zero. What matters is the trust economics, not the dollar economics. A pipeline like this is acceptable when:
- It opens small, single-concern PRs (the YAGNI gate enforces this).
- Every PR ships with a runnable reproduction and a clean regression diff.
- The human reviewer has the same information the agent had, plus the agent's own audit trail.
- Failures stop instead of pushing through.
It is not acceptable as a "ship-it-while-I-sleep" button, and the gates are there to make that distinction unambiguous.
11 / closing, and the series so far
Three posts, three Claude Code agents, one infrastructure pattern:
- Deploy-time manual testing: an observer agent that watches every deploy.
- Failure triage: an observer agent that watches every failed regression run.
- Auto-bug-fixer: an actor agent that writes patches and opens PRs.
The two observers are easy to deploy because their blast radius is small. The third is the bold one, and the architecture is shaped to bound the boldness: committee reviews with deterministic rubrics, worktree isolation, mechanical PR-body assembly, clean stops at every gate, the same regression suite running pre-fix and post-fix to isolate the fix's contribution.
The thing that surprises people when I describe this is that most of the engineering isn't AI engineering. It's pipeline engineering: file contracts between phases, gate logic, branch hygiene, structured outputs, audit trails. The AI is the controller of each phase. The pipeline is what makes the controller's output safe to ship.
The next post, if I write one, is probably about the workbench repo as a living institutional asset. How slash commands compose, how skills mature over time, how a QA org's knowledge becomes searchable, diffable, reviewable. The infrastructure pattern is mostly the same. The lens shifts from what the agents do to what the workbench becomes when you maintain it for a year.