POST · 2026.02.13

Manual QA without manual labor:
a Claude Code agent for deploy-time exploratory testing.

Every merge needs eyes on it. Every deploy needs someone to ask, did this change break anything I'm not testing? I built an agent that runs that procedure for me, on every deploy, autonomously, in a Docker container. Ten minutes of unattended runtime. Flat-rate cost.

TL;DR

Across a busy week, deploy-time triage is hours of human work per developer-merge. Read the diff, open the app, click around, look for the thing the unit tests can't catch.

For a long time my answer was "be faster, write better automated tests, get used to it." It's the answer most QA engineers settle for. Then I noticed something. The work I do during manual exploratory testing is structured. It's literally a checklist of heuristics applied to a diff. Read the changed files. Identify the subsystem. Hit the endpoint. Try a malformed payload. Check the role boundary. Verify the read-after-write. That's not a creative act. It's a procedure I run from muscle memory.

So I built an agent that runs the procedure for me. It produces a structured report a developer can read in two minutes and either say "all good" or "I need to look at this." It costs about ten minutes of unattended runtime per deploy and a flat-rate Anthropic subscription. It catches things my regression suite doesn't, because exploratory testing covers different surface than scripted testing.

This is the architecture, the design choices, and the parts of it that I think are reusable across other QA orgs.

01 / Claude Code is not a chatbot. It's a shell.

Most people I talk to think "AI for testing" means a chat tool that suggests test cases, or a code-completion plugin that writes Playwright specs. Both of those exist. Neither is what unlocked this.

Claude Code is a CLI agent. You run it with claude in your terminal, and it can read files, edit code, execute shell commands, fetch web resources, query databases. It's much closer to a programmable shell with an LLM brain than to a chatbot. The interactive mode is what most users see. You have a conversation, the agent does work between your turns.

But there's a second mode, and that's where this story starts:

claude --print --dangerously-skip-permissions "/some-slash-command arg=value"

--print makes it non-interactive: one prompt in, one final response on stdout, exit. No chat loop. --dangerously-skip-permissions means the agent doesn't pause to confirm each tool use, which is necessary when there's no human to confirm. Combine the two inside a container and you have an AI agent that runs a domain-specific task to completion and writes its result to stdout. Same auth, same tools, same model as the interactive mode. Just headless.
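A minimal sketch of how an orchestrator might assemble that invocation. The helper name buildHeadlessArgs and the key=value rendering are assumptions for illustration; only the two flags and the "/command arg=value" shape come from the text above.

```typescript
// Hypothetical helper: builds the argv for a headless Claude Code run.
// Only --print and --dangerously-skip-permissions are taken from the post.
function buildHeadlessArgs(
  command: string,
  args: Record<string, string> = {}
): string[] {
  // Render each argument as a "key=value" pair after the slash command name.
  const pairs = Object.keys(args).map((key) => `${key}=${args[key]}`);
  const prompt = [`/${command}`].concat(pairs).join(" ");
  return ["--print", "--dangerously-skip-permissions", prompt];
}
```

Returning an array rather than one string means the prompt can be handed to a process-spawning call as a single argument, with no shell quoting to get wrong.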

The other piece is slash commands. A slash command in Claude Code is a markdown file under .claude/commands/<name>.md that defines a reusable, parameterized prompt. The body of the file becomes the prompt that runs when you type /<name> in the chat. Slash commands are first-class. They show up in autocomplete, they accept arguments, they can declare which tools they're allowed to use via frontmatter.

THE UNLOCK

The same slash command file powers both modes. Interactively, a QA engineer types it and reviews the agent's work. Headlessly, a Docker container fires it on every deploy. One file, two consumers.

02 / the workbench concept

I keep a single repo I call my QA workbench. It contains:

This repo is the substrate. Slash commands are the domain DSL. Skills are the implementation. Knowledge is the corpus that informs both.

The trick is the same repo is consumed two ways:

  1. Interactively. I open Claude Code in this repo, type a slash command, and Claude Code drives the workflow. I'm in the loop. I confirm each phase.
  2. Autonomously. A Docker container bind-mounts the repo read-only, runs claude --print "/<command>-headless", and produces a report.

The interactive flow was where I built the slash commands in the first place. You can't write a good autonomous agent prompt without first running the workflow many times with a human in the loop, watching where the AI gets confused, and tightening the prompt. Human-in-the-loop is the training ground for the autonomous flow. Skip it and your autonomous agent will produce nonsense at scale.

03 / the autonomous flow, in detail

Here's the full lifecycle of a single autonomous run.

Trigger. CI/CD completes a deploy to a test environment. My orchestrator (a small Node service that ingests CI events) creates a daily task with a title like Manual Test: <build-label> (<environment>) and inserts a row into a manual_test_runs queue table with status='queued'. A single in-process worker polls the queue every thirty seconds. It refuses to pick up a new row while one is already running. Single concurrency, intentional.
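The single-concurrency rule can be sketched as a pure selection function the worker would call on each poll. This is an illustration, not the orchestrator's actual code: the row shape, the status values other than 'queued', and the oldest-first ordering are all assumptions.

```typescript
// Hypothetical row shape for the manual_test_runs queue table.
type RunStatus = "queued" | "running" | "done" | "failed";

interface ManualTestRun {
  id: number;
  status: RunStatus;
  createdAt: number; // epoch milliseconds
}

// Returns the oldest queued run, or null when a run is already in flight
// or nothing is queued.
function pickNextRun(rows: ManualTestRun[]): ManualTestRun | null {
  const busy = rows.some((row) => row.status === "running");
  if (busy) return null; // single concurrency: never start a second run
  const queued = rows
    .filter((row) => row.status === "queued")
    .sort((a, b) => a.createdAt - b.createdAt);
  return queued.length > 0 ? queued[0] : null;
}
```

Keeping the decision in one function makes the "refuse while running" rule easy to test without a database or a timer.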

Spawn. When the worker picks a queued row, it issues docker run against a pre-built image I'll call qa-agent:latest. The image is small (~3GB): it has the Claude Code CLI, GitHub CLI, jq, a MySQL client, Playwright with browsers, and the slash command itself baked into /workspace/.claude/commands/. It does not bake in the source repos under test. Those are bind-mounted at runtime from a centralized host directory (/srv/repos). Every run sees the latest HEAD because the worker git pulls before spawning. Skipping the bake means rebuilding the image is rare (only when the slash command changes), and the image stays small.
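As a rough sketch, the spawn step could build its docker arguments like this. It covers only the mounts this post mentions (the repos directory and the host's ~/.claude credentials) and is not the complete command. The --rm flag, the read-only repos mount, and the container path /mnt/claude-credentials are assumptions; /repos is inferred from the playbook path used later in the post.

```typescript
interface SpawnConfig {
  image: string;         // e.g. "qa-agent:latest"
  hostReposDir: string;  // e.g. "/srv/repos"
  hostClaudeDir: string; // host ~/.claude directory holding the credentials
}

// Builds the argument list for a `docker` invocation (illustrative subset).
function buildDockerArgs(config: SpawnConfig): string[] {
  return [
    "run",
    "--rm", // assumption: discard the container once the run finishes
    "-v", `${config.hostReposDir}:/repos:ro`,
    // Hypothetical mount point; the entrypoint would copy credentials from here.
    "-v", `${config.hostClaudeDir}:/mnt/claude-credentials:ro`,
    config.image,
  ];
}
```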

The full docker run includes:

Execute. Inside the container, the entrypoint copies the mounted credentials into the non-root user's home, runs gh auth login, then execs:

timeout 60m runuser -u qauser -- \
  claude --print --dangerously-skip-permissions "/manual-testing-headless"
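Because the run is wrapped in timeout, the worker can tell a hung agent from a crashed one by exit code. A small sketch, assuming GNU timeout (which exits with 124 when the limit is hit) and assuming the container's exit code reaches the worker unchanged; the outcome names are hypothetical.

```typescript
type RunOutcome = "completed" | "timed_out" | "failed";

// Maps the process exit code to a coarse outcome for the queue row.
function classifyExit(exitCode: number): RunOutcome {
  if (exitCode === 0) return "completed";
  // GNU timeout exits with 124 when the time limit is reached.
  if (exitCode === 124) return "timed_out";
  return "failed";
}
```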

The agent reads the playbook from /repos/manual-qa/skills/SKILL.md, parses the deploy's commits to determine scope, reads the changed source files, builds a small test plan (input fuzzing on changed endpoints, role-boundary tests, integration probes, deploy-artifact verification on the shipped JS bundle), executes them, and emits a structured YAML verdict block followed by a markdown report.

Report. The agent's stdout starts with a one-line summary the orchestrator can regex-parse:

ONE-LINE SUMMARY: MANUAL_QA_CONCERNS (1 concerns, 0 blockers)

Followed by the full report with structured sections (Verdict / Diff Scope / Tests Run / Concerns / Blockers / Suggested next steps). The worker captures stdout, persists it to the database, posts a Slack notification with a verdict-colored attachment, and uploads the full markdown report as a thread file.
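The regex parse of that first line might look like the sketch below. The pattern is inferred from the single example above, so the set of verdict names and the tolerance for singular "concern" or "blocker" are assumptions, and the function and type names are hypothetical.

```typescript
interface RunSummary {
  verdict: string;
  concerns: number;
  blockers: number;
}

// Matches e.g. "ONE-LINE SUMMARY: MANUAL_QA_CONCERNS (1 concerns, 0 blockers)".
const SUMMARY_PATTERN =
  /^ONE-LINE SUMMARY: ([A-Z_]+) \((\d+) concerns?, (\d+) blockers?\)\s*$/;

// Parses the first line of the agent's stdout; returns null if it does not match.
function parseSummary(stdout: string): RunSummary | null {
  const firstLine = stdout.split("\n")[0];
  const match = SUMMARY_PATTERN.exec(firstLine);
  if (match === null) return null;
  return {
    verdict: match[1],
    concerns: parseInt(match[2], 10),
    blockers: parseInt(match[3], 10),
  };
}
```

Returning null on a mismatch lets the worker flag a malformed report instead of silently recording a clean run.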

The Slack message is the QA engineer's actual interface to the system. Not the database, not the daily-task UI. A colored bar (amber for concerns, red for blockers, green for clean, gray for skipped) plus a bulleted summary plus an @-mention only when a real issue is found. Most reports get a glance and a thumbs-up. The ones that don't get a real conversation.
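The color and mention rules reduce to a small mapping. In this sketch the verdict kinds, the hex values, and the function name are assumptions; the color and text fields and the <@USER_ID> mention syntax follow Slack's message attachment format.

```typescript
type VerdictKind = "clean" | "concerns" | "blockers" | "skipped";

// Hex values are illustrative stand-ins for green, amber, red, and gray.
const VERDICT_COLORS: Record<VerdictKind, string> = {
  clean: "#2eb67d",
  concerns: "#ecb22e",
  blockers: "#e01e5a",
  skipped: "#9e9e9e",
};

interface SlackAttachment {
  color: string;
  text: string;
}

function buildAttachment(
  kind: VerdictKind,
  bullets: string[],
  mentionUserId: string
): SlackAttachment {
  const lines = bullets.map((bullet) => `• ${bullet}`);
  // Only ping a human when the run found a real issue.
  if (kind === "concerns" || kind === "blockers") {
    lines.unshift(`<@${mentionUserId}>`);
  }
  return { color: VERDICT_COLORS[kind], text: lines.join("\n") };
}
```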

04 / the economics: subscription auth in a container

The single design choice I want to highlight is subscription auth in the container.

The default automation pattern with any AI provider is: provision an API key, set the key as an env var on your worker, pay per token. For Claude that's roughly $3 per million input tokens and $15 per million output tokens (Sonnet-class). For an exploratory testing agent that reads dozens of source files and runs many probes, a single run can consume hundreds of thousands of tokens. Per-run cost: $0.50 to $2.

Multiply by a few deploys a day across two apps and you're at $40 to $120 per month. Not huge, but the cost scales linearly with usage.
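The arithmetic behind those numbers, as a sketch. The token mix and run counts are illustrative assumptions, not measurements: a run with 300,000 input tokens and 20,000 output tokens comes to about $1.20 at the prices quoted above, and three runs a day over 22 working days at $0.50 to $2 per run lands at roughly $33 to $132 a month, the same ballpark as the range above.

```typescript
// Sonnet-class prices quoted above, in USD per million tokens.
const INPUT_USD_PER_MTOK = 3;
const OUTPUT_USD_PER_MTOK = 15;

// Per-token cost of a single run in USD.
function costPerRun(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens * INPUT_USD_PER_MTOK + outputTokens * OUTPUT_USD_PER_MTOK) /
    1e6
  );
}

// Monthly cost grows linearly with the number of runs.
function monthlyCost(
  perRunUsd: number,
  runsPerDay: number,
  daysPerMonth: number
): number {
  return perRunUsd * runsPerDay * daysPerMonth;
}
```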

The alternative I use is to mount the host's ~/.claude directory (which holds an OAuth credentials file from claude auth login against a subscription account) into the container. The container inherits the auth state and is billed against the subscription's flat-rate plan. Cost is fixed regardless of how much testing happens.

DECISION · FLAT-RATE OVER PER-TOKEN

Trade-offs are real. One account, one rate-limit ceiling. If too many agents fire simultaneously, you bottleneck on Anthropic's quota. The account is tied to a real person. If they leave or revoke, everything stops. Single-tenant by design. You can't fan this out to a multi-customer SaaS.

For an internal tool serving one team, those are acceptable. For my use case (one or two deploys per hour, single-concurrency worker), the rate limit is never an issue and the flat-rate cost is the win.

05 / design choices worth stealing

A few patterns I'd repeat on another project:

06 / what this unlocks

The headline isn't "AI does QA now." The headline is QA stops being the bottleneck at deploy time. Engineering ships, the agent provides a same-shift report, and the QA engineer (me) spends the saved time on the work that matters: writing better invariants, deepening the test playbooks, improving the workbench repo itself.

There are second-order effects that surprised me:

07 / what's next

The same containerized-Claude-Code pattern applies to other parts of QA. The next piece I'm writing about is failure triage: an agent that, when a regression run fails, automatically re-runs the failing spec, reads the merged PR's diff, and writes a one-paragraph diagnosis explaining whether the failure is flaky, environment-related, a test bug, or a real product regression. It runs on the same docker-agent infrastructure described above, just with a different slash command and a different scope. The architecture transfers cleanly.

THE TAKEAWAY

If you're a QA engineer reading this and thinking "this is too much engineering for what I do," that was me a year ago. Try writing one slash command that captures one workflow you do all the time. Run it interactively a dozen times until it produces output you trust. Then put it in a container and trigger it from CI. The infrastructure I described in this post took maybe two weekends of work in total. The leverage compounds from there.
