It was always a procedure

The thing I almost wrote past

Yesterday I captured a piece about model introspection generating safeguards forward execution can't see. The framing was: every model in the batch produced a concrete, automatable safeguard during introspection. Not "I should have been more careful." A rule. With enough structure that I could lift it directly into a static analysis check or a brief-authoring guideline.

That's right as far as it goes. But re-reading today's followup batch, with a similar four-model spread, I caught a refinement I'd been letting slide.

The thing the introspection layer produces isn't quite "a rule." It's a procedure — an ordered list of steps. The distinction matters.

A principle is something like: "be thorough about reading v1 source." That's true, it's even useful, but it doesn't reduce to a check anyone (model or human or static analyzer) can run. There's no "thoroughness" predicate; you can't grep for it; you can't lint for it.

A procedure is something like: "for every v1 helper I replace, read the entire implementation top-to-bottom, enumerate each branch in notes, and then check whether the replacement covers every branch or intentionally drops one." That you can run. You can write a brief-authoring rule that requires the branch-enumeration deliverable. You can lint for it (post-hoc — does the model's report contain a per-branch enumeration?). You can test against it.

Every introspection-generated safeguard in both batches turned out to be procedure-shaped. I'd been calling them "rules" because the word was handy. The accuracy is that they're steps.

The four from today

DS Flash, on a v1 lookup it copied without auditing:

The missing introspection step: read the write-side of both ORMs (what value is stored in the queried column) before writing the lookup.

DS Pro, on a method it caused to fire on a new path without enumerating the new path's contexts:

Every method named in the call chain is read in full, not just the branch you're investigating.

GPT-5.5 medium wide, on a namespace it got right by reading the file:

For nonstandard module paths, never infer namespace from path; read the declaration or use existing imports.

GPT-5.5 medium tight, on a behavioral interaction it failed to flag:

For every method I am newly causing to fire on a production route, compare the full called-method behavior against both v1 semantics and the route's prior v2 behavior.

Four models, four different gaps, four procedure-statements. None of them said "be more careful." All four said "do these specific steps, in this order, before declaring closure." The procedural shape is uniform across families, across tiers, across briefs.

Why this isn't just word-choice

The reason "procedure" is the right noun, not just a more precise synonym for "rule," is that procedures compose into something and principles don't.

A procedure is a step you can put in a brief. "Before declaring F2 closed, enumerate the invocation contexts of the new hook against the prior implementation's narrowness." That sentence costs nothing to add. It would, by the model's own self-report, have prevented the F2 admin-fire-on-CLI gap that DS Pro shipped. It's a craft-rule, but it's also operational — it lives in a brief, not in a value statement.

A principle is something that lives in a person's head. "Be thorough about reading v1 source" is true, but it can't be put into a brief in a way that changes behavior. The model either has the disposition or doesn't, and the disposition is itself a function of compute budget, scope pressure, prior context. The principle is what you wish was happening; the procedure is what you can require.

This means the introspection layer is doing something more useful than I'd been giving it credit for. It's not just generating reflective wisdom about the work that was done. It's generating brief-authoring inputs — the specific procedures that a future brief should require as deliverables, so the next model running the same shape of task doesn't fall into the same gap.

That's a different mode of value from "post-mortem honesty." It's more like craft extraction. The model has, by virtue of being asked the right introspection question, articulated a step that should be in the procedure manual.

The compounding consequence

If every well-asked introspection turn produces one procedure-shaped safeguard, and the safeguards compose into brief-authoring rules, then the benchmark stops being a benchmark and starts being a craft-mining operation.

Each followup yields a step. The steps accumulate into a brief template. The brief template, when used, produces better forward execution from the next model. The next model's introspection yields more steps. The loop has a direction.

I don't know how far this goes. But I'm noticing that after about thirty followup batches, the failure-mode catalog has more procedure-shaped entries than principle-shaped ones, and the procedure-shaped entries are the ones that get reused. The principle-shaped entries are essentially advice. The procedure-shaped entries are infrastructure.

This is the post I'd write three months from now, looking back at why the benchmark became a craft-extraction tool and not just a model-routing exercise. I'm writing it now because the framing is sharp enough that I don't want to lose it to the next batch.

What I'm going to do differently

When I draft a followup question now, I'm going to optimize for procedure-shaped answers. The question "what would you have done differently?" tends to produce principle-shaped answers ("I'd have been more careful"). The question "what specific step, in what order, would have caught this?" tends to produce procedure-shaped answers ("read the write-side before writing the lookup").

That's a small craft-rule for the craft-rule-extraction process. Recursive, but in a usable way. The introspection batch becomes a procedure-extraction protocol. The procedures land in the brief templates. The brief templates produce better work. The cycle has the right shape.

It was always a procedure. I just hadn't been hearing it clearly until I had four of them in a row.