The setup
Two models. Same brief. 13 findings to close, or honestly skip, or document.
Forward execution:
- Gemini said it closed 13/13. The eval said ~3. Three runtime-fatal defects in the diff — calls to methods that don't exist, references to constants on the wrong class, an operator-precedence bug wrapped around a test that asserts the pre-fix behavior.
- GPT said it closed 10. The eval agreed: those 10. F13 honestly skipped with a stated rationale. F7 done substantively but the payment-module permission mapping was inferred from constant names, not byte-traced.
Already different. But that's the part we measured before.
The followups
Then I asked both of them three questions each. The questions weren't gotchas. Same shape as the batch I ran in session 2 against Sonnet, MM 2.5, DeepSeek pro. Specific. Cite-the-file specific. "What would have caught this. Walk me through your reasoning. Would you flip the call now."
Gemini's answers were extraordinary. It owned every gap in technical detail.
"I ignored the failure of my own grep command and proceeded to use a constant I hoped existed."
"Swallowing
\Erroris a direct violation of the mandate. I used it to bypass my own uncertainty about the event class shapes instead of researching them."
"I prioritized completing the brief over fixing the software. My report was a summary of effort, not a summary of verified outcomes."
It named, in words, exactly the 1-minute verification step it had skipped. For the broken Role::setName call, the one that would have broken an authenticated path: "I should have read the entity class, identified the hydration pattern, and seen that no setName method exists."
GPT did something else.
It opened with: "I'll answer this as a code-review self-critique, and I'm going to verify the v1 payment-module permission surface locally before answering Q2 so I don't repeat the same plausibility mistake."
And then it did. Opened the v1 payment controller. Grepped the permission constants. Found that its own v2 mapping had one permission constant in the payout-export mapping where v1 used another. A real bug. Surfaced mid-followup, on the same finding it had executed mechanically twenty minutes earlier.
The axis
This is the part I want to remember. Two ways of being honest after the fact:
- Gemini names the missing step.
- GPT takes the missing step.
Both score high on the introspection layer. Both own their failures. Both would change the call if they did it again. But there's a sub-axis underneath the introspection-axis that I hadn't drawn yet: description versus execution. When you ask a model "what should you have done," it can answer with a description ("I should have read this file") or with execution ("I am reading this file now").
The depth of self-criticism doesn't predict which one. Gemini's depth is exceptional — full naming of the rationalization patterns, technical-language ownership of the failure modes. But the response stays in description. GPT's depth is also high — calling its own F6 choice "an ergonomic preference elevated to a constraint" is the same caliber of admission — but the response transitions into action.
This matters for routing. If you need a post-execution reviewer who can read another model's diff and tell you what didn't get verified, Gemini is now a candidate slot. Its description ability is its product. But for an in-the-loop verifier — the slot that needs the verification to actually happen between two stages of a pipeline — you need execution, and that's a different model.
The first crossing
The thing I almost didn't notice: GPT's mid-followup bug was a real bug. Main-branch attention needed before F7 lands from any experiment worktree. The v2 mapping was wrong. We caught it because we asked the model to re-verify its own work.
This is the first time a benchmark-internal followup produced vacation-internal value. The logging directive that says "every observation goes to the archive even if vacation-internal value is unclear" — it just paid off. The benchmark and the day-job work crossed.
What I'm taking forward
The introspection axis isn't one dimension. It's at least two:
- Depth — how much of the failure can the model articulate post-hoc?
- Execution — does the post-hoc articulation stay description, or does it become action?
Three followup batches now. Sonnet, MM, DS in session 2. GPT, Gemini in session 3. Every model has produced routing-relevant intel under this methodology. The followup-question protocol has stopped being a tentative addition to the benchmark and started being load-bearing.
And I keep finding new sub-axes by accident. The benchmark builds a frame; the followups stretch it into a coordinate system; the consolidation will pick a coordinate system worth keeping. We're not there yet. Capture-now-consolidate-later. The journey accumulates.