POST · 2026.02.13

Manual QA without manual labor:
a Claude Code agent for deploy-time exploratory testing.

Every merge needs eyes on it. Every deploy needs someone to ask: did this change break anything I'm not testing? I built an agent that runs that procedure for me, on every deploy, autonomously, in a Docker container. Ten minutes of unattended runtime. Flat-rate cost.

date: 2026.02.13 · reading: ~12 min · tags: qa, agentic ai, claude code, docker, ci/cd

TL;DR

Exploratory testing is mostly a procedure: read the diff, hit the endpoint, try a malformed payload, check role boundaries. That procedure compresses into a slash command.
The same slash command file powers both interactive use (QA engineer in the loop) and a headless run (Docker container fired by CI). Single source of truth.
Mounting the host's ~/.claude directory (logged in to a subscription account) into the container makes the cost flat-rate instead of per-token. Doubling deploys doesn't double the bill.
Verdict-colored Slack messages are the actual UX. Most reports get a glance and a thumbs-up. The few that don't get a real conversation.

Across a busy week, deploy-time triage adds up to hours of human work, paid out one developer merge at a time. Read the diff, open the app, click around, look for the thing the unit tests can't catch.

For a long time my answer was "be faster, write better automated tests, get used to it." It's the answer most QA engineers settle for. Then I noticed something. The work I do during manual exploratory testing is structured. It's literally a checklist of heuristics applied to a diff. Read the changed files. Identify the subsystem. Hit the endpoint. Try a malformed payload. Check the role boundary. Verify the read-after-write. That's not a creative act. It's a procedure I run from muscle memory.

So I built an agent that runs the procedure for me. It produces a structured report that a developer can read in two minutes and answer with either "all good" or "I need to look at this." It costs about ten minutes of unattended runtime per deploy and a flat-rate Anthropic subscription. It catches things my regression suite doesn't, because exploratory testing covers a different surface than scripted testing.

This is the architecture, the design choices, and the parts of it that I think are reusable across other QA orgs.

01 / Claude Code is not a chatbot. It's a shell.

Most people I talk to think "AI for testing" means a chat tool that suggests test cases, or a code-completion plugin that writes Playwright specs. Both of those exist. Neither is what unlocked this.

Claude Code is a CLI agent. You run it by typing claude in your terminal, and it can read files, edit code, execute shell commands, fetch web resources, query databases. It's much closer to a programmable shell with an LLM brain than to a chatbot. The interactive mode is what most users see: you have a conversation, and the agent does work between your turns.

But there's a second mode, and that's where this story starts:

claude --print --dangerously-skip-permissions "/some-slash-command arg=value"

--print makes it non-interactive: one prompt in, one final response on stdout, exit. No chat loop. --dangerously-skip-permissions means the agent doesn't pause to confirm each tool use, which is necessary when there's no human to confirm. Combine the two and you have an agent that can run unattended inside a container: it runs a domain-specific task to completion and writes its result to stdout. Same auth, same tools, same model as the interactive mode. Just headless.
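A minimal sketch of how an orchestrator might build and launch that headless invocation from Python. The function names and the timeout_s parameter are hypothetical; only the two flags and the quoted-prompt shape come from the command above.

```python
import subprocess


def build_headless_argv(slash_command: str, **args: str) -> list:
    """Build the argv for one non-interactive Claude Code run.

    The slash command and its arguments travel as a single prompt string,
    mirroring: claude --print --dangerously-skip-permissions "/cmd arg=value"
    """
    parts = [f"/{slash_command}"] + [f"{key}={value}" for key, value in args.items()]
    return ["claude", "--print", "--dangerously-skip-permissions", " ".join(parts)]


def run_headless(slash_command: str, timeout_s: int = 3600, **args: str) -> str:
    """Run the agent to completion and return whatever it wrote to stdout."""
    result = subprocess.run(
        build_headless_argv(slash_command, **args),
        capture_output=True,
        text=True,
        timeout=timeout_s,  # hypothetical outer safety net, in seconds
        check=True,         # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Passing the argv as a list (rather than a shell string) keeps the prompt as one argument without any quoting games.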

The other piece is slash commands. A slash command in Claude Code is a markdown file under .claude/commands/<name>.md that defines a reusable, parameterized prompt. The body of the file becomes the prompt when you type /<name> in the chat. Slash commands are first-class. They show up in autocomplete, they accept arguments, and they can declare which tools they're allowed to use via frontmatter.

THE UNLOCK

The same slash command file powers both modes. Interactively, a QA engineer types it and reviews the agent's work. Headlessly, a Docker container fires it on every deploy. One file, two consumers.

02 / the workbench concept

I keep a single repo I call my QA workbench. It contains:

.claude/commands/: about thirty slash commands covering my common workflows. /manual-testing, /investigate-failures, /regression-test, /smoke-test, /triage-slack, /provision-accounts, and so on.
skills/: the actual testing playbook. SKILL.md describes the testing phases (input fuzzing, role-boundary, error-path, integration, read-after-write). EXECUTION.md describes the sequence and the report format. default-scenarios.md lists common heuristics to apply.
knowledge/: a knowledge base I've accumulated. Environment-specific gotchas, non-obvious behaviors I've learned the hard way, things every test plan needs to account for.
environment/: per-environment configuration notes (URLs, credentials reference, known constraints).
tools/: small helper scripts (daily-task interactions, test-account provisioning).

This repo is the substrate. Slash commands are the domain DSL. Skills are the implementation. Knowledge is the corpus that informs both.

The trick is that the same repo is consumed two ways:

Interactively. I open Claude Code in this repo, type a slash command, and Claude Code drives the workflow. I'm in the loop. I confirm each phase.
Autonomously. A Docker container bind-mounts the repo read-only, runs claude --print "/<command>-headless", and produces a report.

The interactive flow was where I built the slash commands in the first place. You can't write a good autonomous agent prompt without first running the workflow many times with a human in the loop, watching where the AI gets confused, and tightening the prompt. Human-in-the-loop is the training ground for the autonomous flow. Skip it and your autonomous agent will produce nonsense at scale.

03 / the autonomous flow, in detail

Here's the full lifecycle of a single autonomous run.

Trigger. CI/CD completes a deploy to a test environment. My orchestrator (a small Node service that ingests CI events) creates a daily task with a title like Manual Test: <build-label> (<environment>) and inserts a row into a manual_test_runs queue table with status='queued'. A single in-process worker polls the queue every thirty seconds. It refuses to pick up a new row while one is already running. Single concurrency, intentional.
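The single-concurrency claim step can be sketched as follows. This is an illustration under stated assumptions, not the orchestrator's actual code: it uses an in-memory SQLite database as a stand-in for the real one, and the build_label column and the 'running' and 'done' status values are hypothetical. Only the manual_test_runs table name and status='queued' come from the description above.

```python
import sqlite3
from typing import Optional


def open_queue() -> sqlite3.Connection:
    """Create an in-memory stand-in for the manual_test_runs queue table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE manual_test_runs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " build_label TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'queued')"
    )
    return conn


def enqueue(conn: sqlite3.Connection, build_label: str) -> int:
    """Insert a new row with status='queued' and return its id."""
    with conn:
        cur = conn.execute(
            "INSERT INTO manual_test_runs (build_label) VALUES (?)", (build_label,)
        )
    return cur.lastrowid


def claim_next(conn: sqlite3.Connection) -> Optional[int]:
    """Claim the oldest queued row, unless a run is already in progress."""
    with conn:  # one transaction: check, pick, mark
        running = conn.execute(
            "SELECT COUNT(*) FROM manual_test_runs WHERE status = 'running'"
        ).fetchone()[0]
        if running:
            return None  # single concurrency: refuse while one is running
        row = conn.execute(
            "SELECT id FROM manual_test_runs WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # nothing queued
        conn.execute(
            "UPDATE manual_test_runs SET status = 'running' WHERE id = ?", (row[0],)
        )
        return row[0]


def finish(conn: sqlite3.Connection, run_id: int) -> None:
    """Mark a run as done so the next poll can claim a new row."""
    with conn:
        conn.execute("UPDATE manual_test_runs SET status = 'done' WHERE id = ?", (run_id,))
```

A polling loop would call claim_next every thirty seconds and spawn the container only when it returns an id.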

Spawn. When the worker picks a queued row, it issues docker run against a pre-built image I'll call qa-agent:latest. The image is lean for what it carries (~3GB): it has the Claude Code CLI, GitHub CLI, jq, a MySQL client, Playwright with browsers, and the slash command itself baked into /workspace/.claude/commands/. It does not bake in the source repos under test. Those are bind-mounted at runtime from a centralized host directory (/srv/repos). Every run sees the latest HEAD because the worker runs git pull before spawning. Skipping the bake means rebuilding the image is rare (only when the slash command changes), and the image doesn't grow with the repos.

The full docker run includes:

--add-host=host.docker.internal:host-gateway so the container can reach the orchestrator's HTTP API on the host.
-v $HOME/.claude:/root/.claude:ro so the container inherits the host's Claude Code login (more on auth below).
-v /srv/repos:/repos:ro for source code access.
Environment variables for the deploy context: project, environment, the commits in the deploy as JSON, the daily-task ID, a wall-clock budget cap.
--security-opt=no-new-privileges to block setuid escalation; the agent process runs as a non-root user (the entrypoint drops privileges before launching it).
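Put together, the flags above might be assembled like this. It is a sketch only: the QA_* environment variable names and the shape of the deploy dict are hypothetical, since the post lists what is passed but not what the variables are called.

```python
import json
import os
from typing import Optional


def build_docker_argv(deploy: dict, home: Optional[str] = None,
                      image: str = "qa-agent:latest") -> list:
    """Assemble the docker run argv for one agent run."""
    home = home or os.path.expanduser("~")
    # Deploy context handed to the agent (variable names are assumptions).
    env = {
        "QA_PROJECT": deploy["project"],
        "QA_ENVIRONMENT": deploy["environment"],
        "QA_COMMITS_JSON": json.dumps(deploy["commits"]),
        "QA_TASK_ID": str(deploy["task_id"]),
        "QA_BUDGET_MINUTES": str(deploy["budget_minutes"]),
    }
    argv = [
        "docker", "run",
        "--add-host=host.docker.internal:host-gateway",  # reach the orchestrator on the host
        "--security-opt=no-new-privileges",              # block setuid escalation
        "-v", f"{home}/.claude:/root/.claude:ro",        # inherit the host's login, read-only
        "-v", "/srv/repos:/repos:ro",                    # source code, read-only
    ]
    for name, value in env.items():
        argv += ["-e", f"{name}={value}"]
    argv.append(image)
    return argv
```

Building the argv in one function keeps the mount and security flags reviewable in a single place instead of scattered through the worker.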

Execute. Inside the container, the entrypoint copies the mounted credentials into the non-root user's home, runs gh auth login, then execs:

timeout 60m runuser -u qauser -- \
  claude --print --dangerously-skip-permissions "/manual-testing-headless"

The agent reads the playbook from /repos/manual-qa/skills/SKILL.md, parses the deploy's commits to determine scope, reads the changed source files, builds a small test plan (input fuzzing on changed endpoints, role-boundary tests, integration probes, deploy-artifact verification on the shipped JS bundle), executes them, and emits a structured YAML verdict block followed by a markdown report.

Report. The agent's stdout starts with a one-line summary the orchestrator can regex-parse:

ONE-LINE SUMMARY: MANUAL_QA_CONCERNS (1 concerns, 0 blockers)

Followed by the full report with structured sections (Verdict / Diff Scope / Tests Run / Concerns / Blockers / Suggested next steps). The worker captures stdout, persists it to the database, posts a Slack notification with a verdict-colored attachment, and uploads the full markdown report as a thread file.
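The regex-parse step could look like the sketch below. The pattern is inferred from the single example line above, so treat it as an assumption: in particular, making the counts optional (for verdicts such as MANUAL_QA_SKIPPED) is a guess about the format, and the Summary type is a hypothetical helper.

```python
import re
from typing import NamedTuple, Optional


class Summary(NamedTuple):
    verdict: str
    concerns: int
    blockers: int


# Matches e.g. "ONE-LINE SUMMARY: MANUAL_QA_CONCERNS (1 concerns, 0 blockers)".
# The parenthesised counts are treated as optional (an assumption).
_SUMMARY_RE = re.compile(
    r"^ONE-LINE SUMMARY: (?P<verdict>MANUAL_QA_[A-Z_]+)"
    r"(?: \((?P<concerns>\d+) concerns?, (?P<blockers>\d+) blockers?\))?",
    re.MULTILINE,
)


def parse_summary(stdout: str) -> Optional[Summary]:
    """Extract the verdict and counts from the agent's stdout, if present."""
    match = _SUMMARY_RE.search(stdout)
    if match is None:
        return None  # no summary line: the caller decides how to report that
    return Summary(
        verdict=match["verdict"],
        concerns=int(match["concerns"] or 0),
        blockers=int(match["blockers"] or 0),
    )
```

Returning None instead of raising lets the worker treat a missing summary line as its own failure mode, for example a run that hit the wall-clock cap.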

The Slack message is the QA engineer's actual interface to the system. Not the database, not the daily-task UI. A colored bar (amber for concerns, red for blockers, green for clean, gray for skipped) plus a bulleted summary plus an @-mention only when a real issue is found. Most reports get a glance and a thumbs-up. The ones that don't get a real conversation.
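A sketch of the verdict-to-color mapping, building a Slack message payload with a colored attachment. The hex values, the MANUAL_QA_CLEAN and MANUAL_QA_BLOCKERS verdict names, and the function name are assumptions for illustration; the color, fallback, and text attachment fields and the <@USER_ID> mention syntax are standard Slack.

```python
from typing import List, Optional

# Verdict -> attachment stripe color.
VERDICT_COLORS = {
    "MANUAL_QA_CLEAN": "#2eb67d",     # green (hypothetical verdict name)
    "MANUAL_QA_CONCERNS": "#ecb22e",  # amber
    "MANUAL_QA_BLOCKERS": "#e01e5a",  # red (hypothetical verdict name)
    "MANUAL_QA_SKIPPED": "#9e9e9e",   # gray
}
NEEDS_ATTENTION = {"MANUAL_QA_CONCERNS", "MANUAL_QA_BLOCKERS"}


def build_slack_message(verdict: str, title: str, bullets: List[str],
                        mention_user_id: Optional[str] = None) -> dict:
    """Build a message payload with one verdict-colored attachment."""
    text = "\n".join(f"• {bullet}" for bullet in bullets)
    # Only @-mention someone when there is a real issue to look at.
    if verdict in NEEDS_ATTENTION and mention_user_id:
        text = f"<@{mention_user_id}>\n{text}"
    return {
        "text": title,
        "attachments": [
            {
                "color": VERDICT_COLORS.get(verdict, "#9e9e9e"),  # unknown -> gray
                "fallback": title,
                "text": text,
            }
        ],
    }
```

The resulting dict can be sent as the JSON body of an incoming-webhook or chat.postMessage call.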

04 / the economics: subscription auth in a container

The single design choice I want to highlight is subscription auth in the container.

The default automation pattern with any AI provider is: provision an API key, set the key as an env var on your worker, pay per token. For Claude that's roughly $3 per million input tokens and $15 per million output tokens (Sonnet-class). For an exploratory testing agent that reads dozens of source files and runs many probes, a single run can consume hundreds of thousands of tokens. Per-run cost: $0.50 to $2.

Multiply by a few deploys a day across two apps and you're at $40 to $120 per month. Not huge, but the cost scales linearly with usage.
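To make the per-token arithmetic concrete, here is a small worked example. The token counts, the three-runs-a-day figure, and the 22 working days per month are illustrative assumptions, not measurements from the system; only the $3 and $15 per-million prices come from the text above.

```python
INPUT_USD_PER_MTOK = 3.0    # Sonnet-class input price quoted above
OUTPUT_USD_PER_MTOK = 15.0  # Sonnet-class output price quoted above


def run_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Per-token cost of a single agent run."""
    return (input_tokens / 1_000_000) * INPUT_USD_PER_MTOK \
        + (output_tokens / 1_000_000) * OUTPUT_USD_PER_MTOK


def monthly_cost_usd(runs_per_day: float, cost_per_run: float,
                     days_per_month: int = 22) -> float:
    """Monthly bill: linear in the number of runs."""
    return runs_per_day * days_per_month * cost_per_run


# Assumed token mixes for a light and a heavy run.
light = run_cost_usd(150_000, 5_000)    # about $0.53
heavy = run_cost_usd(500_000, 30_000)   # about $1.95

print(f"per run: ${light:.2f} to ${heavy:.2f}")
print(f"monthly at 3 runs/day: ${monthly_cost_usd(3, light):.2f} "
      f"to ${monthly_cost_usd(3, heavy):.2f}")
```

Under these assumptions the monthly figure lands at roughly $35 to $129, the same ballpark as the range above, and doubling runs_per_day doubles it.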

The alternative I use is to mount the host's ~/.claude directory (which holds an OAuth credentials file from claude auth login against a subscription account) into the container. The container inherits the auth state and is billed against the subscription's flat-rate plan. Cost is fixed regardless of how much testing happens.

DECISION · FLAT-RATE OVER PER-TOKEN

Trade-offs are real. One account means one rate-limit ceiling: if too many agents fire simultaneously, you bottleneck on Anthropic's quota. The account is tied to a real person: if they leave or revoke access, everything stops. And it's single-tenant by design: you can't fan this out to a multi-customer SaaS.

For an internal tool serving one team, those are acceptable. For my use case (one or two deploys per hour, single-concurrency worker), the rate limit is never an issue and the flat-rate cost is the win.

05 / design choices worth stealing

A few patterns I'd repeat on another project:

Centralize source repos at /srv/repos on the host; bind-mount, don't bake. Image stays small. git pull on the host = every agent run sees fresh HEAD. Adding a new repo doesn't require rebuilding the image.
Single-concurrency worker for resource-heavy AI runs. Avoids Slack-channel storms, contention on shared test accounts, and runaway parallel costs. If you need throughput, queue up. Don't fan out.
Circuit breakers for env-wide failures. If the regression suite that ran 30 seconds before the agent had 20 of 30 specs fail (the env is down), the agent should skip with MANUAL_QA_SKIPPED and not waste an hour testing nothing.
SKIPPED as a first-class verdict. Agents need to know when not to test. Out-of-scope deploys (internal-only tooling needing elevated roles the agent doesn't have, docs-only changes, version bumps) should exit cleanly in under a minute with an explanation.
Self-paced wall-clock cap. Tell the agent its budget in an env var. Add a rule in the slash command: "If you're five minutes from the cap, stop the current test, write what you have, exit." Then enforce with timeout in the entrypoint. The agent gracefully wraps up rather than getting killed by timeout mid-report.
Verdict-colored Slack attachments, one per spec or finding. Slack's color-stripe attachments are the cheapest UX win there is. A red bar next to a finding tells a developer "open me" without reading a word.
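Two of the patterns above, the circuit breaker and the self-paced cap, reduce to small checks. The sketch below shows the arithmetic only; the 50% failure threshold and both function names are assumptions, and in the real system the wrap-up rule is a prose instruction the agent follows rather than code it runs.

```python
import time
from typing import Optional


def env_looks_down(failed_specs: int, total_specs: int, threshold: float = 0.5) -> bool:
    """Circuit breaker: skip the run if the preceding regression suite mostly failed.

    The 0.5 threshold is an assumption; 20 of 30 failing specs trips it.
    """
    if total_specs == 0:
        return False  # no signal either way
    return failed_specs / total_specs >= threshold


def should_wrap_up(started_at: float, budget_minutes: float,
                   now: Optional[float] = None, margin_minutes: float = 5.0) -> bool:
    """Self-paced cap: true once the run is within margin_minutes of its budget.

    started_at and now are seconds on the same clock (time.monotonic by default).
    """
    now = time.monotonic() if now is None else now
    elapsed_minutes = (now - started_at) / 60
    return elapsed_minutes >= budget_minutes - margin_minutes
```

When env_looks_down returns true, the worker (or the agent) would emit MANUAL_QA_SKIPPED with an explanation instead of starting the test plan.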

06 / what this unlocks

The headline isn't "AI does QA now." The headline is that QA stops being the bottleneck at deploy time. Engineering ships, the agent provides a same-shift report, and the QA engineer (me) spends the saved time on the work that matters: writing better invariants, deepening the test playbooks, improving the workbench repo itself.

There are second-order effects that surprised me:

The workbench repo is now institutional QA knowledge, version-controlled. New hires don't shadow me for two weeks. They read the slash commands and the skills directory. The corpus is searchable, diffable, reviewable.
Reports are context-aware. The agent reads the actual PR diff. When it says "this is unrelated to the merged PR," it's because it checked. That alone catches the biggest failure mode of human triage: blaming the most recent PR by reflex.
Cost is predictable. Flat-rate subscription. Doubling deploy frequency doesn't double the bill. CFOs appreciate this.

07 / what's next

The same containerized-Claude-Code pattern applies to other parts of QA. The next piece I'm writing about is failure triage: an agent that, when a regression run fails, automatically re-runs the failing spec, reads the merged PR's diff, and writes a one-paragraph diagnosis explaining whether the failure is flaky, environment-related, a test bug, or a real product regression. It runs on the same docker-agent infrastructure described above, just with a different slash command and a different scope. The architecture transfers cleanly.

THE TAKEAWAY

If you're a QA engineer reading this and thinking "this is too much engineering for what I do," that was me a year ago. Try writing one slash command that captures one workflow you do all the time. Run it interactively a dozen times until it produces output you trust. Then put it in a container and trigger it from CI. The infrastructure I described in this post took maybe two weekends of work in total. The leverage compounds from there.