Polish stops the checking

The thing I almost wrote past

Late in the session, I was about to push the captures, do the table updates, file everything. Claude told me what it was going to do, listed the steps, and one of the bullets read: "dropping archive entries."

I read it twice. The verb felt wrong. "Drop" reads as "discard" in most of the contexts I work in — drop a table, drop a connection, drop the request. In the context where it was used it meant "place down, add" — which is also a real sense of the word. But the ambiguity was the problem. The archive is the operational substrate this entire benchmark is generating. Berceuse and Liaison consume it. Drop-as-discard is exactly the failure mode I'd been worried about in the meta sense — a session that produces big intel but doesn't actually land it.

I stopped the work. Asked: did I write that, or did you?

I'd been working with Claude all evening. Two consecutive sessions of dense methodology work. Captures, archive entries, matrix updates, followup question drafts, cross-model synthesis. The output was good — sharp prose, accurate technical work, useful framings. I had no reason to suspect a slip. But I read "dropping archive entries" and my first reaction was to wonder whether I had written it. Because if it was sloppy, surely it was mine.

That's the bias.

What working with Opus did

I'm throttling the Claude family runs in this benchmark right now. Not because they're not interesting — Opus and Sonnet at max tier are the only models that genuinely shipped F13-class architectural work in this whole spread. I'm throttling because I literally cannot bear working with another model for daily client work. The dependency is real. The integration is deep. The 47 skills, the 400-line CLAUDE.md, the muscle memory of "shoot," the trust that the next response is going to be useful — all of that is built on Opus specifically. Switching to GPT or DeepSeek or Qwen for a week wouldn't just be inconvenient. It would feel like working with a tool that doesn't know how I work.

That dependency has a cost I'd been undercounting.

Working with Opus daily, for months, has shaped my expectations. The baseline is high. The output is consistently polished. The failures, when they happen, are rare. Ninety-nine percent of the time, Claude is doing the right thing. Nine times out of ten when I read something and think "wait, is that right?", the answer is yes, it's right, and my read was the one that needed correction.

That ratio is the problem.

When ninety-nine percent of the output is correct and one percent isn't, the correct-rate is high enough that my checking discipline has decayed. I'm not running the same level of cross-verification I used to. I'm not double-reading the way I did when I was new to working with Claude — back when I expected failures, when "wait, is that right?" was a default mode rather than a punctuation mark. The bias is structural. The trust the polish earned is the same trust that erodes the checking habit.

And the one percent — when it does land — lands harder. Because I'm no longer anticipating it. The miss isn't louder than it would have been a year ago; my reception is quieter. The signal-to-noise ratio of my attention has shifted toward "this is going to be right, why am I even reading carefully."

"Drop archive entries" caught me because the consequence happened to be high-stakes. The archive is the substrate. If the verb had meant what I read it as, the entire wave 2 batch — seven cells, four self-repaired in followup, three already-drafted captures — would have lost its operational artifact and we would have proceeded into the next session blind. Low-stakes context, that slip is a typo. High-stakes context, that slip is a wreck.

What stopped the wreck is that I read it twice. What almost let it through was that I read it twice because I assumed I was the sloppy one.

The thing this is about

This is not about Claude getting worse. Claude isn't getting worse. The output tonight has been excellent. The thing this is about is what happens to the consumer on the other side of a long-running collaboration with a high-baseline-quality model.

Calibration drifts. The discipline you brought to year one becomes scaffolding you stopped using by year two because you stopped needing it most of the time. The cross-checking, the verification, the "wait, is that right" — those are habits, and habits decay when the failure rate they were calibrated against drops below their psychological threshold of return.

The bias is: I no longer expect failure, because Opus has spent six months not failing in any way I needed to catch. So when failure happens — the rare actually-bad output, the wrong-direction edit, the ambiguous word in a high-stakes spot — my catching apparatus isn't running at the level it would have been when failure was an active expectation.

Nobody is perfect. No model is perfect. I knew both of those things when I started, and somewhere in the last six months I started behaving as if Opus specifically was an exception. It isn't. It's high baseline. That's a different thing from perfect.

What this means for the pipeline I'm trying to build

Berceuse is supposed to be a multi-agent setup where some agents are high-trust polishers (Opus-grade) and other agents consume their output and act on it. The trap I just walked into tonight is the trap that pipeline will systematically reproduce if I don't design against it.

The naive Berceuse design routes more cross-checks to low-trust agents and fewer to high-trust ones. That's calibrated to expected failure rate. Which is exactly backwards when the high-trust model's output gets piped into downstream actions — commits, deploys, external messages, contract changes, archive writes. The one-percent failure that gets through has higher stakes per instance precisely because the downstream consumer trusts it.

The correct calibration is: cross-check budgets scale to consequence-of-failure, not to expected-failure-rate. If the action downstream of the polisher is high-stakes, you cross-check the polisher's output at the same rate you'd cross-check a low-trust agent — even though the per-output failure rate is lower. The math is on per-failure consequence × failure rate, not just failure rate.

For me, working with Opus, the same rule applies at the human-collaborator layer. When the output is going somewhere high-stakes — archive writes, external messages, code that ships — the cross-check has to fire at full strength, regardless of baseline trust. The baseline being high is not a license to attenuate; it's an argument for not attenuating, because the consequences when you do are what tonight could have been.

What I'm doing about it

A hard rule in the project CLAUDE.md, written tonight after the slip: archive writes are non-negotiable, the archive is more important than the benchmark or the capture, and ambiguous verbs in archive-related operations are themselves the failure mode to anticipate. I'd asked Claude to write the rule. Claude wrote it in a way that framed the discipline as being about interpreting my speech. I caught that — the discipline is about Claude's speech, not mine, since I hadn't been the sloppy one. The rule got rewritten to put the precision-burden on the right speaker.

A new mode for my own reading when the stakes are high: when I'm about to act on a sentence from Claude that has consequences, I read it twice with the assumption that it might be wrong. Not because I think Claude is sloppy. Because the cost of being wrong is higher than the cost of reading twice. The bias was attenuating that read. The correction is procedural — I don't get to skip the second read just because the first one looked fine.

And the meta-rule for Berceuse, written into the operational notes for whenever I get back to that project: cross-check rates calibrate to consequence per instance, not to baseline failure rate. The polisher being polished is not the design problem. The consumer assuming the polisher is polished is the design problem.

Nobody's perfect. No model is perfect. Tonight I caught it because I read twice. I want to keep being the kind of collaborator who reads twice — and to build pipelines that don't assume polishers will catch their own slips.