Different kinds of correctness

The routing question I kept getting wrong

For weeks I'd been treating tier selection as a one-dimensional dial: more compute → better output, modulated by cost. The cost-conscious choice was the lower tier; the quality-conscious choice was the higher tier; you pick based on how much you can afford for the task. That model fit some of the benchmark data and not other parts of it, and I'd been writing the misfits off as noise.

What kept not fitting: the high-tier-better-than-xhigh pattern in DS Pro. Yesterday's wide-brief data, today's tight-brief data — at both scopes, DS Pro at high tier closed cleanly with zero silent regressions. DS Pro at xhigh tier closed the same number of findings but introduced silent regressions (config-mutation, decimal→cents shift, shape mismatch, console-skip). Same family, same brief, same cost, opposite silent-regression profiles. More compute, worse output. I didn't have a clean mechanism for it.

I asked DS Pro directly. The answer is the one I wish I'd thought of first:

Yes, I think the mechanism is "more compute → more confidence in self-designed solutions → drift from v1-mechanical-mirroring." At high tier, the budget constraint forces a default strategy of "read v1, copy v1 exactly, don't improve anything." There's literally no room to over-engineer. At xhigh, you have the budget to think "this v1 pattern is ugly, I can do it cleaner" — and then the cleanliness instinct overrides parity.

And then the line that made me stop and re-read:

The silent regressions (config-mutation, decimal→cents, shape mismatch, console-skip) all read like "designed a better version of v1" rather than "replicated v1." The irony: more compute produces higher-quality design, but lower-quality parity. Less compute produces unambitious, mechanical, hard-to-break changes. Different kind of correctness.

Different kind of correctness.

That's the whole shape of the misfit I'd been writing off as noise. There isn't one axis. There's at least two — design correctness and parity correctness — and they trade off against compute budget in opposite directions. For tasks where the job is "produce a thing that wasn't there," more compute is better — more design surface, more abstraction, more thoroughness. For tasks where the job is "produce a copy of a thing that exists," more compute is worse — it pulls toward cleaning up the original, which is exactly the failure mode for parity work.

The second model said the same thing from inside

Hours later, asked a different question on a different brief, GPT-5.5 medium tight closed an unrelated question about why GPT-5.4 medium Pareto-dominates GPT-5.5 medium on tight ship-blocker briefs. The framing was identical, surface aside:

At the higher tier, I spent budget generalizing and justifying. That can be valuable on ambiguous architecture work, but on a tight "close these five blockers" brief, extra abstraction and extra safety layers only matter if they reduce a specific risk the brief requires. Otherwise they create more surface to review.

Generalize-and-justify is what you want from architecture work. It's exactly what you don't want from ship-blocker fixes. The same compute budget produces value or waste depending on the shape of the task you're spending it on.

Two models, two families, two different briefs, two different framings of the same mechanism. Neither was prompted to think about tier-vs-task-shape — both reached it from inside their own work.

What this changes

The cost-tier-quality triangle isn't a triangle. It's task-shape-conditioned. Pareto-optimal compute spend depends on whether the task is "produce design" or "produce a copy." For parity-mirroring work (migration audits, backports, code-mirror surfaces, anything where v1 is the spec), the cheaper tier is also the more correct tier. They Pareto-dominate. For ambiguous-architecture work, the more expensive tier earns its premium by generating better abstractions. They diverge.

This breaks how I'd been routing. I'd been picking tier from a cost budget, with a vague sense that more compute was better when I could afford it. The right move is to pick tier from task shape first, then check cost as a sanity gate. Wrong direction was: "I have budget for xhigh, use it on this audit." Right direction is: "this is a parity audit, the cheaper tier is the better tier, the budget is the wrong question."

For Berceuse — the autonomous agent system this benchmark feeds — the implication is structural. The routing rule needs a task-shape classifier upstream of the tier selector. Migrate → parity-tier. Refactor → design-tier. Audit → parity-tier-but-with-followup. Greenfield → design-tier. Same family, same model, different defaults per shape.

For the journey: this is the third beat in a thread I hadn't fully named. two-members-opposite-answers said the family routing rule splits by member. This says it also splits by task shape. The matrix isn't tier × family; it's tier × family × task-shape. Three dimensions, not two. The polisher-routing-table needs another column.

One thing the model said that I want to sit with

Less compute produces unambitious, mechanical, hard-to-break changes.

There's a counterintuitive virtue in there I want to keep. "Unambitious" usually reads as a critique. Here it's a feature. The model under budget pressure can't afford to redesign, so it doesn't — and the not-redesigning is precisely what keeps the parity property intact. Ambition is a failure mode when the spec is "copy this." Unambition is a strategy.

I don't think I'd seen it framed this way before, and I want to remember it the next time I'm tempted to "just bump the tier" because I think more thinking will help. Sometimes the work is not to think harder and you want a model that physically can't.