Agentic QA
for the AI-drafted SDLC.
A tester that behaves like a senior QA engineer. Point it at a commit, ticket, or spec. It derives test scenarios, executes them across blackbox / DB / API layers, and returns a findings report with evidence. Disposable tests for a world where maintained script libraries can't keep up with shipping speed.
- Node.js / TypeScript
- Playwright
- Anthropic Claude
- PostgreSQL
- REST API
- pgvector
- Slack API
01 / the thesis
Traditional test automation is a script library tightly coupled to implementation. It's a good fit when code moves slowly. When AI drafts specs, writes code, and ships in hours, the coupled library becomes technical debt that compounds daily.
The response isn't to replace automation. It's to build two complementary layers:
- Invariant tests (always-on, stable): guard the rules that shouldn't change, regardless of implementation.
- This agent (on-demand, disposable): re-derives tests from the current spec every run. Nothing to maintain because nothing is persisted.
This case study is about the second layer. It is not a test automation framework. It is not a script runner. It is a tester that can be instructed the same way you would instruct a person: give it a commit, a ticket, or a requirement doc. It will test the feature, find bugs, and report back. It does not fix bugs. It does not write production code. It finds and reports. Everything else belongs to the developer or the QA lead.
02 / end-to-end flow
Input
└── commit ID, requirement doc, ticket, or any reference
↓
Extract Requirements
└── read the diff or reference
└── cross-reference the knowledge base for domain context
↓
Draft Test Scenarios
└── derive scenarios from requirements
└── apply testing heuristics: boundary values, equivalence partitioning, role enforcement, edge cases
└── apply risk weighting: how critical is this area
↓
Execute Tests
└── blackbox: what the user sees, browser-based
└── DB checks: verify state after any state-changing action
└── API tests: verify enforcement at the network layer
└── whitebox: selective code review where risk warrants it
↓
Output
└── findings report with pass/fail per scenario
└── evidence: screenshots, response bodies, DB query results
└── exact replication steps for each finding
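The "Draft Test Scenarios" step above can be sketched as a pure function. This is an illustrative sketch, not the agent's actual API: the types, field names, and the boundary/equivalence heuristic shown here are assumptions about how a numeric constraint from a spec (e.g. "quantity must be 1..10") might be expanded into concrete scenarios.

```typescript
// Hypothetical sketch: derive boundary-value and equivalence-partition
// scenarios from a single numeric constraint found in a spec or diff.
interface NumericConstraint {
  field: string;
  min: number;
  max: number;
}

interface Scenario {
  field: string;
  value: number;
  expectValid: boolean;
}

function deriveBoundaryScenarios(c: NumericConstraint): Scenario[] {
  return [
    { field: c.field, value: c.min - 1, expectValid: false }, // just below range
    { field: c.field, value: c.min, expectValid: true },      // lower boundary
    { field: c.field, value: c.max, expectValid: true },      // upper boundary
    { field: c.field, value: c.max + 1, expectValid: false }, // just above range
    // one representative value from the valid partition
    { field: c.field, value: Math.floor((c.min + c.max) / 2), expectValid: true },
  ];
}
```

The point is that scenarios are cheap to re-derive every run, which is what makes them disposable.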
03 / testing approach
blackbox first
Testing starts from the user's perspective. What does the user see? What can they do? What should happen? This is the primary lens because it catches what a real user would encounter.
then the data layer
The UI can look correct while the data underneath is wrong. Any state-changing action is verified at the DB level. This catches the class of bugs that look fine on screen but surface later as billing disputes, wrong charges, or data corruption.
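A DB-level check of this kind can be sketched as a comparison between what the UI showed and what actually landed in the database. The schema here (an `orders` row with `total_cents` and `status`) is an assumption for illustration; the real agent issues live SQL against the test database.

```typescript
// Illustrative sketch: after a checkout, verify the DB row matches the UI.
// Schema and field names are assumptions, not the application's real schema.
interface OrderRow {
  id: string;
  total_cents: number;
  status: string;
}

interface DbFinding {
  pass: boolean;
  detail: string;
}

function verifyOrderState(uiTotalCents: number, row: OrderRow | undefined): DbFinding {
  if (!row) {
    return { pass: false, detail: "no order row written after checkout" };
  }
  if (row.total_cents !== uiTotalCents) {
    // The "looks fine on screen, surfaces later as a billing dispute" class.
    return { pass: false, detail: `UI showed ${uiTotalCents} but DB stored ${row.total_cents}` };
  }
  if (row.status !== "confirmed") {
    return { pass: false, detail: `unexpected status: ${row.status}` };
  }
  return { pass: true, detail: "DB state matches UI" };
}
```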
then the API layer
Verify enforcement at the network layer, not just the UI. The UI is a courtesy. The backend is the truth. Role enforcement, input validation, and business rules must all hold at the API level regardless of what the UI shows.
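One way to sketch this: replay the same request under each role and compare the observed status codes against an expected matrix. The roles and expected statuses below are illustrative assumptions, not the application's real access model.

```typescript
// Illustrative sketch: check role enforcement at the API layer against an
// expected-status matrix. Roles and codes are assumptions for the example.
type Role = "admin" | "member" | "anonymous";

const expectedStatus: Record<Role, number> = {
  admin: 200,     // allowed
  member: 403,    // forbidden by business rule
  anonymous: 401, // unauthenticated
};

function checkEnforcement(actual: Record<Role, number>): string[] {
  const violations: string[] = [];
  for (const role of Object.keys(expectedStatus) as Role[]) {
    if (actual[role] !== expectedStatus[role]) {
      violations.push(`${role}: expected ${expectedStatus[role]}, got ${actual[role]}`);
    }
  }
  return violations; // empty array means enforcement holds at the network layer
}
```

A member getting a 200 here is a finding even if the UI hides the button.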
whitebox, selectively
Code review to spot missing validation, incorrect logic, or implementation gaps. Applied selectively where risk warrants it. This overlaps with developer responsibility and isn't the primary job.
04 / knowledge as infrastructure
The tester's effectiveness is directly proportional to what it knows about the system. Knowledge is organized into four areas, each with a distinct role:
- Skills: the testing methodology. How to test, phases, heuristics, what to look for. Applies universally across features.
- Domain (formal): deliberate documentation of how the application works. Feature areas, business rules, data flows, user roles. The source of truth.
- Domain (learned): knowledge discovered during actual test runs. Non-obvious behaviors, gotchas, bug history, architectural surprises. Accumulated over time. Each entry represents something that was not obvious from the code or spec.
- Environment: environments, credentials, tools, how to connect. Plus learned gotchas during setup and execution.
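The four areas can be sketched as a discriminated union, which makes the distinct role of each area explicit in the type system. The field names are assumptions; the source of truth here is the formal domain area, as the list above states.

```typescript
// Illustrative sketch of the four knowledge areas. Field names are assumed.
type KnowledgeEntry =
  | { area: "skill"; heuristic: string }                         // methodology
  | { area: "domain-formal"; featureArea: string; rule: string } // source of truth
  | { area: "domain-learned"; observation: string; runId: string } // discovered in runs
  | { area: "environment"; name: string; gotcha: string };       // setup knowledge

function isSourceOfTruth(e: KnowledgeEntry): boolean {
  // Only formal domain docs are authoritative; learned entries are
  // accumulated evidence from runs, not canon.
  return e.area === "domain-formal";
}
```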
05 / the staging gate (the novel bit)
Findings from test runs are noteworthy, but not all findings deserve a permanent place in the knowledge base. False positives, transient environment state, and script infrastructure issues will poison the corpus if ingested blindly.
So findings don't go directly to the KB. They go to a staging queue (POST /knowledge/staging). A human reviews and approves each entry through an orchestrator UI before it's promoted.
False positives are the #1 source of KB noise. A staging gate forces every addition through human review. The KB grows slower but stays trustworthy, and that trustworthiness is load-bearing: the tester's effectiveness depends on it.
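The gate's pre-filter can be sketched as a predicate the agent applies before it ever POSTs to the staging queue; a human still approves everything that passes. The `Finding` fields and the payload shape are assumptions for illustration.

```typescript
// Illustrative sketch of the staging-gate pre-filter. Only findings that
// survive these checks are POSTed to /knowledge/staging for human review.
interface Finding {
  confirmed: boolean;     // investigated and reproduced, not a script bug
  transient: boolean;     // e.g. a 502 during a deploy, a recent restart
  aboutTestCode: boolean; // broken selector, harness issue
  summary: string;
}

function eligibleForStaging(f: Finding): boolean {
  return f.confirmed && !f.transient && !f.aboutTestCode;
}

// Hypothetical shape of the staging request body; entries always land as
// pending, never as approved — promotion is the human's decision.
function stagePayload(f: Finding) {
  return { summary: f.summary, status: "pending_review" as const };
}
```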
06 / guardrails
Agentic systems fail in specific ways. Each failure mode has an explicit rail:
- Production-safety ceiling: never complete a checkout on production if the total exceeds a defined threshold. Use test cards or low-value items; reroute higher totals to staging.
- Never stage unconfirmed findings: every KB entry must be backed by investigation. Confirm it's a real bug (not a script bug, false positive, or environment issue) before staging.
- Never stage transient state: health statuses, uptimes, 502s from deploys, recent restarts. These are ephemeral and will be wrong by the next run.
- Never stage script fixes: broken selectors, test infrastructure issues. Those are fixes to our code, not learnings about the app.
- Never store domain knowledge in Claude memory: the repo is shared across users and sessions. Application knowledge, business rules, and test findings go through the staging pipeline, not auto-memory.
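The first rail, the production-safety ceiling, can be sketched as a single decision function. The threshold value and decision names below are illustrative assumptions, not the agent's actual configuration.

```typescript
// Illustrative sketch of the production-safety ceiling. The threshold is
// an assumed value for the example, not the real configured limit.
const PROD_CHECKOUT_CEILING_CENTS = 500; // e.g. $5.00

type CheckoutDecision = "proceed" | "use_test_card" | "reroute_to_staging";

function checkoutDecision(
  env: "production" | "staging",
  totalCents: number
): CheckoutDecision {
  if (env !== "production") return "proceed";
  if (totalCents > PROD_CHECKOUT_CEILING_CENTS) return "reroute_to_staging";
  return "use_test_card"; // low-value on prod: still never a real card
}
```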
07 / what I learned
- Agentic QA only works with a real knowledge base underneath. Without it, the agent reinvents the same tests every run and misses the context a human would carry across sessions.
- Staging gates are the difference between a KB that improves and one that accumulates noise. Noise compounds fast; trustworthy corpora need discipline.
- Agents fail loudly in ways humans don't. The guardrails are not suggestions; they are load-bearing.
- Disposable tests are a feature, not a bug. Nothing to maintain means the cost of coverage changes from O(features * time) to O(features).
- This agent pairs with invariant tests. Neither is enough alone: invariants miss new features; the agent misses rules it hasn't been asked to check. Together they cover what a full QA function covers.
When the SDLC is AI-drafted, the testing layer has to be two things at once: a stable substrate that doesn't change (invariants), and a re-derivable instrument that doesn't need maintenance (this agent). Treat the knowledge base as infrastructure, gate what enters it, and let the humans decide what ships.