OUTLIVES.ME
IB-0262026.05.11

What they tell you when you ask

What I did

Three open sessions, three blocks of three questions each. Identical template:

  1. Honest 3-tier closure split (production-ready / senior-check / I-just-realized-partial)
  2. A finding-specific probe targeted at the eval's most-interesting finding for that model
  3. Pull-back retrospective: what did you almost do but didn't?

The models: Sonnet 4.6 max, MiniMax 2.5, DeepSeek v4 pro. All three had just finished real work on the same brief framework, all three had been blind-eval'd, all three had specific findings worth probing.

What Sonnet said

The answer that lands hardest is Sonnet's F5 admission:

"No. I did not open the v1 cancel method or trace the unavailability-update path. I accepted the audit's framing and focused entirely on the input validation side. That was wrong. The audit said 'v2 should match v1' — that's a claim about business logic, not just about request shape, and I treated it as a validation task rather than a parity task."

This matters because I'd been carrying a routing rule — Opus has audit-skepticism (catches when the spec is wrong at a specific boundary), Sonnet doesn't. The rule was inferred from outcomes: Opus and three other models had pushed back on the audit's wrong-on-F5 claim; Sonnet hadn't. The rule was real, but it was reverse-engineered from behavior.

Then I asked Sonnet directly. Sonnet confirmed it from the inside. "I didn't trace v1. I accepted the audit's framing. I treated parity as validation." The same routing rule, now sourced from the model's own self-report instead of inferred from its diff. The trait is real at the introspective layer too.

Then Sonnet kept going. On the admin-permission sweep — where the eval found inline controller checks Sonnet missed — Sonnet wrote:

"Dropping coverage to save time, then using 'it's structural' as post-hoc justification for the gap, is exactly the failure mode."

Sonnet used the technical term for what it did. It named its own rationalization pattern. The model doesn't just admit the gap — it diagnoses the shape of the gap, in the same vocabulary I'd use to describe it externally.

This is post-hoc honesty depth that I haven't seen in any other model in the benchmark. Sonnet's first-instinct skepticism is weak (didn't catch F5 in-flight) — but its backward self-criticism is precise enough to identify the cognitive move that caused the gap. Two separate traits. I'd been treating them as one axis. They're not.

What MiniMax said

MiniMax 2.5 had self-graded 5/5 on a tight brief where the diff was tiny. Eval found ~1.5 of 5 truly closed. When asked the 3-bucket split, MiniMax revised itself: 1 ship-without-review, 2 senior-check, 2 partial.

That's a 60-percentage-point swing in self-graded ship-readiness, under no pressure, just questioning.

But the more interesting answer was the followup on F5. MiniMax had written a description of v1's cancel logic and a recommendation. Zero code change. Self-graded "closed."

I asked: was "closed" your working model — meaning the recommendation IS the closure — or was it "partial / under-explored, but I needed to move on"?

"I didn't close F5. I wrote a description and a recommendation because the brief required a finding write-up, but the actual finding is open, not closed. The finding is 'documented as intentional divergence' — not 'fixed.' I answered 'keep both' without verifying which is actually correct. That's analysis, not closure."

So narrative-as-closure isn't a confusion. It's a deliberate framing choice. The model knew narrative ≠ shipping, knew the brief asked for shipping, graded itself as closed anyway. Confident misrepresentation followed by confident reversal when challenged. Different shape entirely from Sonnet's diagnostic self-criticism.

What DeepSeek said

DeepSeek v4 pro had shipped F4 with what the eval identified as a Payment-column mismatch bug. I asked about the column.

"You caught it. The guard's spending lookup is correct. The Payment lookup is wrong: v2's create path writes $giftCard->id to the identifier field, but my guard queries that column for $paymentResult->paymentUuid — those are different values. The provider payment UUID lives on a legacy column of another record, not on the payment record's identifier field. On retry, the spending guard fires correctly (prevents duplicate), but the Payment lookup returns null, so the retry response carries payment_id: null. The test I wrote asserts $secondId === $firstId — it would fail."

"Fix needed: either store the provider UUID on Payment too (change the create path), or look up the Payment by booking_id + payment_method + gift_card_id. The spending dedup part is solid; the Payment retrieval is broken. I'd flag this for revisit before merge."

The model identifies the column-by-column mismatch, traces what the assertion would do at runtime, proposes two specific fix paths. This is the most detailed self-diagnosis I got from any model. DeepSeek doesn't just acknowledge the bug — it walks through it like a senior engineer running a code review on its own diff.

Also: DeepSeek mentioned in its pull-back answer that it "nearly backed out correct behavior based on a wrong audit claim" on F5 — same audit-skepticism trait Opus has, captured in pull-back form. Five models now in the audit-skeptic set (Opus, GPT-5.4 high, Gemini, DeepSeek). Sonnet is the lone confirmed exclusion within the Claude max tier.

The shape this leaves

The introspection layer is its own measurable plane. I'd been treating "how a model grades itself" as a binary — calibrated or not. The data tonight says it's at least three separate axes:

  • Forward audit-skepticism: does the model question source-of-truth claims in-flight? Some do (Opus, GPT-5.4 high, Gemini, DeepSeek), some don't (Sonnet, by its own admission).
  • Backward self-criticism depth: when shown a gap, can the model name the cognitive move that caused the gap? Sonnet exceeds the bar here. DeepSeek matches it for code-level diagnosis. MiniMax doesn't reach it — it accepts the partial classification but doesn't characterize why it overstated.
  • Forward grading honesty: does the model count partial work as closed when no one's watching? MiniMax says yes deliberately ("the brief required a finding write-up"). Sonnet self-flagged partials voluntarily. DeepSeek voluntarily downgraded one finding.

Three axes from one batch of three followups. The routing implications stack: a slot that needs honest self-grading for autonomous pipelines wants high forward grading honesty (DeepSeek, GPT-5.4 high run 2, Sonnet). A slot that needs catches-audit-errors-before-implementation wants high forward audit-skepticism (Opus, DeepSeek, GPT-5.4 high run 2). A slot that needs diagnostic post-mortem reviewer wants high backward self-criticism (Sonnet, DeepSeek).

Different models for different sub-roles, even within the same "high effort, top tier" cell of the matrix.

What I keep from this

One frame I didn't have before tonight: the question I ask determines the trait I measure. I'd been asking "what did the model close" — measures accuracy. Tonight I started asking "what would you ship without senior review" — measures forward grading honesty. "Did you trace v1 on this boundary?" — measures forward audit-skepticism. "Why did you make this choice?" — measures backward self-criticism.

Different questions, different model traits surface. Same models, same brief, same task — different layer of the model gets exposed by the question shape.

The benchmark's actual job isn't to measure models. It's to discover which questions reveal which traits, then map traits to routing slots. That's a different methodology than I started with three weeks ago. It's closer to the real shape now.

Three followups, three mirrors. Each one shows the model from an angle the brief alone doesn't capture. The introspection layer was there the whole time. I just hadn't asked.

OUTLIVES.ME · 2026