Your AI Coder Needs a Strict Father, Not Better Prompts

AI coding agents write code and say “Done.” But “Done” ≠ working. The fix isn’t better prompts. It’s an automated QA agent that opens browsers, hits APIs, and queries databases before letting your coder finish.

Last week I asked Claude Code to add a phone number field to a user profile form — frontend input, API handler, database migration. Standard full-stack task.

Claude wrote the component, updated the handler, created the migration, added a unit test, ran typecheck. Then it said: “Done! The phone number field is fully implemented and working.”

I opened the page. The field was there. I typed a number, clicked Save. Success toast appeared. I refreshed. The field was empty. I checked the database. No column. The migration existed but hadn’t run, and the API handler referenced a field that didn’t exist yet.

Claude’s code was syntactically correct. Types passed. The unit test mocked the database call, so it passed too. But the feature didn’t work. And Claude had no idea.

This Happens 90% of the Time

Not this exact bug — but this pattern. AI coding agents stop at “code compiles + tests pass” and declare victory. The actual verification loop that any human developer would do — open the page, try it, check the data — gets skipped.

It’s not because the AI can’t do these things. Claude Code has Playwright MCP for browser control. [7] It has terminal access for curl and database queries. It has everything it needs.

It skips validation because nobody told it that “Done” means more than “code written.”

This is the verification debt problem. [3] Sonar’s 2026 data: 96% of developers don’t fully trust AI-generated code, yet AI now accounts for 42% of all committed code. The math is brutal: if an AI agent is 85% accurate per step, a 5-step workflow succeeds only 44% of the time. [1]
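
Where the 44% comes from: every step has to succeed for the workflow to succeed, so per-step reliability compounds multiplicatively.

P(5-step workflow succeeds) = 0.85^5 ≈ 0.44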

The Harness Is the Problem

Stanford’s Meta-Harness paper nailed the framing: Agent = Model + Harness. The harness — system prompts, tool definitions, verification logic, lifecycle hooks — determines performance as much as the model itself. Same model, different harness, 6x performance gap. [2]

Your AI coder’s harness has a hole in it: no E2E validation layer. It knows how to write code, run tests, check types. It doesn’t know that for your project, “done” means the page renders, the form submits, and the data lands in the database.

The community has started fixing part of this. claude-review-loop uses a Stop hook to trigger cross-model code review. [5] super-smoke-test adds Playwright smoke checks. [6] Spotify’s coding agents use an LLM-as-judge that vetoes 25% of completions. [4]

But all of these focus on code review — reading diffs and judging whether the code looks correct. Review is not validation. A reviewer reads your migration file and says “looks right.” A QA engineer runs the migration, inserts data, and checks it’s there.
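
Concretely, validation is a handful of commands, not a read-through. A minimal sketch for the phone-number task, assuming a Node project with a migrate script, Postgres reachable via $DATABASE_URL, and a dev server on localhost:3000 (the PUT endpoint is illustrative, not from the original anecdote):

npm run migrate                         # actually apply the migration
psql "$DATABASE_URL" -c '\d users'      # does the phone_number column exist now?
curl -s -X PUT http://localhost:3000/api/users/me \
  -H 'Content-Type: application/json' \
  -d '{"phone_number": "13800138000"}'  # write through the real API
psql "$DATABASE_URL" -c "SELECT phone_number FROM users WHERE id = 1;"  # did it persist?

Four commands. That loop would have caught the missing-column bug from the opening anecdote in seconds.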

Your AI coder doesn’t need another reviewer. It needs a strict father.

The Strict Father Pattern (严父)

In Chinese internet slang, 严父 (yánfù, “strict father”) is the parent who never accepts “trust me, it works.” They check everything themselves.

The pattern is simple: when your coder agent tries to say “Done,” a separate QA agent automatically intercepts, determines what needs to be verified based on the actual changes, and runs those verifications itself.

Coder Agent finishes work → tries to stop

    Stop Hook intercepts

    QA Agent (严父) activates:
    ├─ reads task description + git diff
    ├─ infers what to validate (not hardcoded rules — AI judgment)
    ├─ Layer 1: opens browser → checks rendering
    ├─ Layer 2: fills form → submits → verifies behavior
    ├─ Layer 3: curls API → checks response
    ├─ Layer 4: queries DB → verifies persistence
    └─ VERDICT: PASS or FAIL

    PASS → coder is released
    FAIL → specific feedback → coder must fix

The key insight: the QA agent decides what to validate dynamically, based on what changed. CSS-only change? Just check rendering. New API endpoint? Hit it with curl. Full-stack feature? Validate all layers. No hardcoded bash rules matching file extensions — the QA agent reads the diff and uses judgment, the same way a human QA engineer would.

And it must be a separate agent — not the coder checking its own work. Same reason you don’t let students grade their own exams. The coder has sunk-cost bias toward its own code. The QA agent sees the change cold, with fresh context and a single mandate: find what’s broken.
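
In Claude Code terms, the interception point is a Stop hook. A minimal wiring sketch for .claude/settings.json (the shape follows Claude Code’s hooks config; the script path is an assumption, not necessarily yanfu’s):

{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/qa-gate.sh"
          }
        ]
      }
    ]
  }
}

The command runs every time the main agent tries to finish responding: the “tries to stop” moment in the diagram above.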

Here’s what a validation session actually looks like:

yanfu QA Agent | Task: Add phone_number field to user profile

=== Layer 0: Build & Types ===
[PASS] typecheck — 0 errors
[PASS] unit tests — 14/14 passed

=== Layer 1: Visual Rendering ===
[PASS] Navigated to /profile — page loads
[PASS] Phone number input visible
[PASS] No console errors

=== Layer 2: User Interaction ===
[PASS] Filled "13800138000" → clicked Save → success toast
[FAIL] Entered "abc" → form submitted without validation error
  Expected: validation error for invalid phone format
  Actual: accepted silently

=== Layer 3: API ===
[PASS] GET /api/users/me → 200, phone_number present

=== Layer 4: Database ===
[PASS] SELECT phone_number FROM users WHERE id=1 → "13800138000"

VERDICT: FAIL
→ Missing phone number format validation on form input

Unit tests passed. Types passed. But the strict father found the missing validation — by actually using the feature.
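
The FAIL verdict has teeth because of how Stop hooks report back. Per Claude Code’s hook conventions, the gate can block completion by exiting with code 2 (stderr is fed back to the agent) or by printing a JSON decision on stdout. A sketch of the JSON form, with the reason drawn from the session above:

{
  "decision": "block",
  "reason": "QA FAIL: 'abc' was accepted as a phone number with no validation error. Add format validation, then try finishing again."
}

Either way, the coder doesn’t get to argue. It gets specific, reproducible feedback and another iteration.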

I Built This

yanfu is a Claude Code template that implements the strict father pattern. [8] Copy it into your project, and every time your coder agent tries to complete, a QA validation agent automatically verifies the work.

It’s a template, not a framework — you can see and modify everything:

  • The Stop hook config that intercepts completions
  • The gate script that collects context (git diff, project type, task description), sketched after this list
  • The QA agent prompt that drives validation decisions
  • Framework-specific examples (Next.js, Express, Django, Astro)
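
For a sense of scale, here is a condensed sketch of such a gate script. The claude -p handoff, the prompt, and the permission details are illustrative assumptions, not yanfu’s actual code:

#!/usr/bin/env bash
# .claude/hooks/qa-gate.sh (illustrative sketch)
input=$(cat)             # session JSON from Claude Code (unused in this sketch)
diff=$(git diff HEAD)    # what actually changed

# Hand the context to a fresh QA agent; deciding WHAT to check is its job.
# (tool permission flags for the headless run elided)
verdict=$(claude -p "You are 严父, a strict QA agent.
Changes under review:
$diff
Decide what needs validating (browser, API, database), verify it yourself,
and end your reply with 'VERDICT: PASS' or 'VERDICT: FAIL' plus specifics.")

if ! grep -q 'VERDICT: PASS' <<<"$verdict"; then
  echo "$verdict" >&2    # the coder sees exactly what failed
  exit 2                 # exit 2 blocks the Stop; the coder must fix and retry
fi

The real template also feeds in the project type and task description. The point stands either way: the script only gathers context, while deciding what to check stays with the model, per the no-hardcoded-rules principle above.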

One-command install:

curl -sSL https://raw.githubusercontent.com/spytensor/yanfu/main/install.sh | bash

It auto-detects your project type, configures the dev server URL, and sets up database access if available.
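
The output is a small config the QA agent reads at validation time. The file name and keys below are hypothetical, just to show the shape:

# .claude/yanfu.env (hypothetical)
PROJECT_TYPE=nextjs
DEV_SERVER_URL=http://localhost:3000
DATABASE_URL=postgres://localhost:5432/app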

The Bigger Point

The reason AI coding feels unreliable isn’t that models are bad. It’s that we gave them the tools of a developer but the workflow of an autocomplete engine. Write code, check syntax, move on.

Real development has a verification loop. Models have the capability to run that loop — they just need a harness that makes it mandatory. The strict father doesn’t add capability. It adds accountability.

Your AI is not done until the strict father says it’s done.


References

  1. Sonar, “State of Code 2026: AI Verification Gap.”
  2. Lee et al., “Meta-Harness: End-to-End Optimization of Model Harnesses,” Stanford/MIT, 2026.
  3. “Verification Debt: When Generative AI Speeds Change Faster Than Proof,” ACM.
  4. Spotify Engineering, “Feedback Loops for Background Coding Agents.”
  5. Hamel Husain, “claude-review-loop.”
  6. “super-smoke-test.”
  7. Microsoft, “Playwright MCP.”
  8. “yanfu: Automated E2E Validation for AI Coders.”