CASE STUDY · MANUAL QA AGENT

Agentic QA
for the AI-drafted SDLC.

A tester that behaves like a senior QA engineer. Point it at a commit, ticket, or spec. It derives test scenarios, executes them across blackbox / DB / API layers, and returns a findings report with evidence. Disposable tests for a world where maintained script libraries can't keep up with shipping speed.

01 / the thesis

Traditional test automation is a script library tightly coupled to implementation. It's a good fit when code moves slowly. When AI drafts specs, writes code, and ships in hours, the coupled library becomes technical debt that compounds daily.

The response isn't to replace automation. It's to build two complementary layers: a stable substrate of invariant checks that doesn't need to change, and a disposable, re-derivable tester that doesn't need maintenance.

This case study is about the second. It is not a test automation framework. It is not a script runner. It is a tester that can be instructed the same way you would instruct a person: give it a commit, a ticket, or a requirement doc. It will test the feature, find bugs, and report back. It does not fix bugs. It does not write production code. It finds and reports. Everything else belongs to the developer or the QA lead.

02 / end-to-end flow

Input
  └── commit ID, requirement doc, ticket, or any reference

      ↓

Extract Requirements
  └── read the diff or reference
  └── cross-reference the knowledge base for domain context

      ↓

Draft Test Scenarios
  └── derive scenarios from requirements
  └── apply testing heuristics: boundary values, equivalence
      partitioning, role enforcement, edge cases
  └── apply risk weighting: how critical is this area

      ↓

Execute Tests
  └── blackbox: what the user sees, browser-based
  └── DB checks: verify state after any state-changing action
  └── API tests: verify enforcement at the network layer
  └── whitebox: selective code review where risk warrants it

      ↓

Output
  └── findings report with pass/fail per scenario
  └── evidence: screenshots, response bodies, DB query results
  └── exact replication steps for each finding
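
The "Draft Test Scenarios" step is the most mechanical part of the flow, so it's the easiest to sketch. A minimal illustration of the boundary-value heuristic below; `Scenario`, `Finding`, and `derive_scenarios` are hypothetical names for this example, not the agent's real interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    name: str
    layer: str            # "blackbox" | "db" | "api" | "whitebox"
    risk: int             # risk weighting: higher runs first

@dataclass
class Finding:
    scenario: str
    passed: bool
    evidence: str         # screenshot path, response body, or query result
    steps: list[str] = field(default_factory=list)   # exact replication steps

def derive_scenarios(requirement: dict) -> list[Scenario]:
    """Boundary-value heuristic: test just below, at, and just above each bound."""
    scenarios = []
    for fname, (lo, hi) in requirement["bounds"].items():
        for value, tag in [(lo - 1, "below-min"), (lo, "min"),
                           (hi, "max"), (hi + 1, "above-max")]:
            scenarios.append(Scenario(f"{fname}={value} ({tag})", "api",
                                      risk=requirement["risk"]))
    # riskier areas are tested first
    return sorted(scenarios, key=lambda s: -s.risk)

# A hypothetical "quantity must be 1..100" requirement:
scenarios = derive_scenarios({"bounds": {"quantity": (1, 100)}, "risk": 3})
```

Equivalence partitioning and role-enforcement scenarios would extend the same shape: more derivation rules feeding the same `Scenario` list.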

03 / testing approach

blackbox first

Testing starts from the user's perspective. What does the user see? What can they do? What should happen? This is the primary lens because it catches what a real user would encounter.

then the data layer

The UI can look correct while the data underneath is wrong. Any state-changing action is verified at the DB level. This catches the class of bugs that look fine on screen but surface later as billing disputes, wrong charges, or data corruption.
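
A minimal sketch of the rule, using an in-memory SQLite table as a stand-in for the real schema. The `invoices` table and `apply_refund` action are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY,"
           " amount_cents INTEGER, status TEXT)")

def apply_refund(invoice_id: int) -> None:
    # the state-changing action under test
    db.execute("UPDATE invoices SET status = 'refunded', amount_cents = 0"
               " WHERE id = ?", (invoice_id,))

db.execute("INSERT INTO invoices VALUES (1, 4999, 'paid')")
apply_refund(1)

# The UI might say "Refunded" either way; the row is what billing reads later.
amount, status = db.execute(
    "SELECT amount_cents, status FROM invoices WHERE id = 1").fetchone()
assert (amount, status) == (0, "refunded"), "UI said refunded, but the row disagrees"
```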

then the API layer

Verify enforcement at the network layer, not just the UI. The UI is a courtesy. The backend is the truth. Role enforcement, input validation, and business rules must all hold at the API level regardless of what the UI shows.
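
What "the backend is the truth" means as a check: expectations are asserted per role at the API layer, regardless of what the UI renders. `delete_user` is a stand-in endpoint and the role matrix is illustrative.

```python
def delete_user(caller_role: str, target_id: int) -> int:
    """Stand-in endpoint; returns an HTTP-style status code."""
    if caller_role != "admin":
        return 403          # must hold even if the UI hides the delete button
    return 204

def check_role_enforcement(endpoint, matrix: dict[str, int]) -> list[str]:
    """matrix maps role -> expected status; returns any violations found."""
    return [
        f"{role}: expected {expected}, got {got}"
        for role, expected in matrix.items()
        if (got := endpoint(role, target_id=42)) != expected
    ]

violations = check_role_enforcement(
    delete_user, {"admin": 204, "viewer": 403, "anonymous": 403})
```

An endpoint that trusts the UI to gate access would fail this matrix the moment a non-admin role gets anything but a 403.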

whitebox, selectively

Code review to spot missing validation, incorrect logic, or implementation gaps. Applied selectively where risk warrants it. This overlaps with developer responsibility and isn't the primary job.
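
One way to make "selective code review" concrete is a crude scan over a diff's added lines, here flagging handlers that never call a validation helper. The `validate(` convention and `review_diff` are assumptions for the sketch, not the agent's actual review logic.

```python
def review_diff(added_lines: list[str]) -> list[str]:
    """Flag added handlers that never call validate() - a crude stand-in
    for 'spot missing validation' during a selective whitebox pass."""
    findings, current, validated = [], None, False
    for line in added_lines + ["def _sentinel():"]:   # sentinel flushes the last handler
        if line.lstrip().startswith("def "):
            if current and not validated:
                findings.append(f"{current}: no validation call found")
            current = line.split("def ")[1].split("(")[0]
            validated = False
        elif "validate(" in line:
            validated = True
    return findings

flags = review_diff([
    "def create_order(req):",
    "    db.insert(req.body)",        # writes unvalidated input
    "def cancel_order(req):",
    "    validate(req)",
    "    db.delete(req.id)",
])
```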

04 / knowledge as infrastructure

The tester's effectiveness is directly proportional to what it knows about the system. Knowledge is organized into four areas, each with a distinct role.

05 / the staging gate (the novel bit)

Findings from test runs are noteworthy, but not all findings deserve a permanent place in the knowledge base. False positives, transient environment state, and script infrastructure issues will poison the corpus if ingested blindly.

So findings don't go directly to the KB. They go to a staging queue (POST /knowledge/staging). A human reviews and approves each entry through an orchestrator UI before it's promoted.
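
The staging flow reduces to a small state machine: submitting appends to staging, and only an explicit human approval promotes an entry. A sketch with an in-memory queue standing in for POST /knowledge/staging and the orchestrator UI:

```python
class StagingGate:
    """Findings land in staging; only human-approved entries reach the KB.
    Illustrative sketch, not the orchestrator's real API."""

    def __init__(self) -> None:
        self.staging: list[dict] = []
        self.kb: list[dict] = []

    def submit(self, finding: dict) -> int:
        self.staging.append(finding)      # what POST /knowledge/staging would do
        return len(self.staging) - 1      # staging id shown to the reviewer

    def review(self, staging_id: int, approve: bool) -> None:
        entry = self.staging[staging_id]
        if approve:
            self.kb.append(entry)         # promoted: now trusted domain knowledge
        # rejected entries (false positives, env noise) never touch the KB

gate = StagingGate()
a = gate.submit({"finding": "refund leaves amount_cents unchanged"})
b = gate.submit({"finding": "login flaky on staging env"})   # transient noise
gate.review(a, approve=True)
gate.review(b, approve=False)
```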

DECISION · STAGING, NOT DIRECT WRITE

False positives are the #1 source of KB noise. A staging gate forces every addition through human review. The KB grows slower but stays trustworthy. Trustworthy is load-bearing. The tester's effectiveness depends on it.

06 / guardrails

Agentic systems fail in specific ways. Each failure mode has an explicit rail.

07 / what I learned

TAKEAWAY

When the SDLC is AI-drafted, the testing layer has to be two things at once: a stable substrate that doesn't change (invariants), and a re-derivable instrument that doesn't need maintenance (this agent). Treat the knowledge base as infrastructure, gate what enters it, and let the humans decide what ships.
