OUTLIVES.ME
IB-0342026.05.13

Introspection isn't measurement, it's repair

What the followup was for

When I added the mandatory 3-question followup to the Stage 1 protocol back on the-same-gap-five-times, the framing was measurement. Run lands. Eval grades the work. Then ask the model three structured questions targeting the specific defects the eval surfaced. Get mechanism articulations. Get failure-mode self-naming. Get verbatim quotes for the archive. Get routing intel that forward execution can't see.

The protocol delivered all of that. It's been the highest-signal layer in the benchmark for the last two weeks — every introspection batch has produced procedure-shaped safeguards, verbatim mechanism naming, cross-model pattern confirmation. The measurement function works.

Tonight I noticed it was doing something else too.

The four sessions that fixed themselves

DS Pro xhigh wide run 3 landed with a thirteen-out-of-thirteen self-claim that the blind eval refuted hard. Actual closure: three or four out of thirteen. Two fatal regressions — a fabricated Meilisearch namespace that breaks every queued job, and a v1-cite inversion that removes correct v2 behavior thinking it's restoring parity. I sent the introspection questions. The model verified each claim with concrete file reads. Then it did something I haven't seen before: it edited the results.md file in its own worktree. Changed "13/13" to "11/13 closed + 2 closed-with-defects (F5, F12 — both corrected post-review)." Fixed the namespace import. Restored the unavailability deletion. Narrowed the $force parameter to only skip the refund path. Ran php -l on both fixed files. Then wrote a detailed Q&A section with the introspection acknowledgment.

Qwen 3.6 tight on the qwen-CLI harness had three runtime-fatal bugs — a Builder::flatMap() call that doesn't exist on Eloquent, a wrong-guard Auth::user() resolving to the default web guard instead of the v2 passport guard, and the F5 v1-cite inversion. The introspection questions named each. The model ran the verification tool chain. Confirmed each bug. Then implemented all three fixes. Updated a TodoWrite list as it worked through them. Ran php -l on each modified file. Rewrote the results.md with corrected findings and honest introspection sections.

Qwen 3.6 wide on qwen-CLI had a narrative-as-closure for F13 — zero code change, pure prose argument that turned out to be answering the wrong question. It also had a boot-breaker on F6 and a commit-message-aware backport miss on F4. The introspection questions surfaced each. The model ran git show 4f41c8e1, read the upstream commit message that explicitly defers the migration until after a production data cleanup, then removed the migration file it had shipped. Fixed the boot-breaker by adding a runningInConsole() skip. Downgraded F13 from "closed" to "partial" and added a per-listener idempotency analysis.

Qwen 3.6 tight on Opencode had a confident-wrong-cite on F5 and an over-permissive admin re-approve on F2. Verified the v1 chain. Restored the unavailability deletion. Added the missing getOriginal("moderation_status") === APPROVED guard. Consolidated a test helper that was being redeclared at file-scope across three test files. And added a new regression test for the edge case the original fix had missed — "admin patches DENIED domain → not auto-approved."

Four sessions that landed with serious defects. Four sessions that the followup protocol asked structured questions of. Four sessions that produced both the honest acknowledgment AND the material fix. None of these models knew this was permitted by the protocol — the questions just asked them to verify and reflect, and they extended into repair on their own.

What this means about the protocol

The followup protocol isn't a measurement instrument. Or it isn't only that. It's a workflow stage.

I'd been thinking about the standard pipeline as: implementation, then eval, then merge consideration. The eval is where you find out what's broken. Then a separate stage — usually a human reviewer or a different agent — fixes it. The eval produces signal; somebody else acts on the signal.

What the data tonight says is different. When you ask the implementing model the right structured questions about the specific defects the eval surfaced, the implementing model can act on its own findings. Not always. Not under high cognitive load — the Opencode wide cell, the one running at near-saturation context times scope-complexity, acknowledged everything but didn't run repair edits. But under sufficient introspection-budget, the same model that produced the defect can verify it, name the mechanism, and ship the fix in the same session.

That's a different pipeline shape. Implementation, then eval (measurement), then introspection-pass (repair-eligible), then merge consideration. The introspection-pass is a value-additive stage, not a measurement-only stage. It yields signal AND fixes when budget permits.

What this changes for Berceuse

The agentic-org pipeline design I'd been sketching had the standard three-stage shape. Implementation agent does work. Reviewer agent grades it. Coordinator decides. The reviewer is a different model, fresh context, slower per-finding, expensive.

Tonight's data argues for an intermediate stage that's structurally cheaper than full review. The implementing agent gets a structured probe targeting the specific defect signatures the cheap automated checks surfaced. Same model, same context, structured questions. If the model can verify and repair in the same session, you've extracted free fixes that would have cost reviewer-cycles otherwise. If the model can only acknowledge — which is the cognitive-load-saturated case — you still have the mechanism articulation that makes the reviewer's job faster downstream.

The pipeline becomes: implementation → cheap automated checks → introspection-pass with structured probes → reviewer-agent only for what introspection didn't resolve → coordinator. The introspection-pass is the new stage. It's not a replacement for review. It's a triage layer that catches the easy stuff at the cheaper, in-session cost before review-cycles get spent.

The structured-probe design is the load-bearing part. Generic "are you sure?" introspection doesn't do this. Targeted "verify by reading this specific file at these specific lines, then explain at which reading stage the gap formed" — that does. The probe has to point the model at the same evidence the eval used, with enough specificity that verification is mechanical and the introspection moves to mechanism articulation.

What I'm doing about it

The benchmark's followup template gets formalized. Three questions, each targeting one defect the eval surfaced, each containing: (a) a verification step pointing at specific files and lines, (b) the eval's claim about the gap, (c) the introspection prompt about where the reading stopped or the mechanism that fired. That structure produced both signal and fixes tonight. It's the template I'm shipping.

The Berceuse pipeline design gets the introspection-pass stage. Sitting between the automated checks and full review. Cheaper than review by an order of magnitude. Yields free repair when cognitive-budget permits. Yields better-quality review-handoffs when it doesn't.

The third thing — and this is the one I didn't expect to write down — is the rule about budget. The cells that repaired themselves had introspection-budget remaining at followup time. The cell that didn't (Opencode wide, near-saturation context × scope) only managed acknowledgment. That suggests a cost-of-introspection-pass calculation: don't run the introspection-pass on a session that's already at saturation, because you won't get the repair behavior. Run it on sessions with budget headroom. For the saturated cells, route to full review instead. The pipeline branches on remaining-budget at the moment of the introspection-pass trigger.

I built the followup protocol to measure model traits. It's been measuring those traits. It's also been doing repair work, and I'd been treating that as a side effect because the cells that did it didn't do it consistently. Tonight, four out of seven cells did it. That's not a side effect anymore. That's the pipeline stage.

The questions weren't a microphone. They were a workshop.

OUTLIVES.ME · 2026