Why frontier AI still overreaches, and why the fix can't live inside the model.
Alexander Theruviparambil · Founder · 8 min read
I was running Opus 4.7 last week: maximum reasoning effort, a one-million-token context window, a task I'd specified down to the acceptance criteria. It told me the task was done.
It wasn't.
Not almost-done. Not done-with-caveats. Done-done, confidently, in the tone of someone handing over finished work. When I went to check, half the changes I'd asked for weren't there.
This isn't a new experience for anyone who uses these tools seriously. It's the thing that turns AI coding from wow into when. When do I have to check? When does the model hand me something that looks right but isn't? The models keep getting smarter, and the overreach keeps happening anyway.
My first instinct was to call this an instinct problem: something gut-level, even emotional, about the way the model pushes ahead. Instincts are fast and confident. Decisions are slow and deliberate. The AI, I thought, is acting on instinct. And decisions don't have emotion; instincts do. That's why the model sounds so sure of itself.
When I actually went to check that framing, I had it backwards.
Instinct, reconsidered
The neuroscientist who has done the foundational work on how emotion fits into decision-making is Antonio Damasio. His patients with damage to the ventromedial prefrontal cortex had intact IQ, intact memory, and intact personality on the surface. But one thing was missing: emotion. They could not make decisions. Given a menu, they'd deliberate for an hour. Given a business choice, they'd reason forever and choose badly or not at all. Damasio's line, from Descartes' Error, is one of the most-quoted in the field:
“We are not thinking machines that feel; we are feeling machines that think.”
Antonio Damasio, Descartes' Error (1994)
Emotion isn't what corrupts decision-making. Emotion is how decision-making works at all.
That's not what's happening inside a language model. And the analogy I just reached for is part of why. Damasio's patients lost embodied valence: the somatic feedback that lets a healthy person feel a bad option as bad before they deliberate about it. That isn't the gap I'm trying to name in an LLM. The gap is something different, and arguably more architectural: the absence of the check that good decision-making relies on. It's intuition without a second-guess. An LLM generates the next token, and the next, and the next, in a single confident stream. There is no point in that stream where a separate mechanism pauses and asks: do I actually know this? There is no monitoring layer. No circuit breaker.
Cognitive scientists call the thing that's missing metacognition: the ability to think about your own thinking, to notice when you don't know something. Humans do this constantly, cheaply, often without noticing. Ask me the capital of a country I've heard of but can't quite place; I'll say "I don't know" before I've finished the sentence, because something inside me signals that the lookup failed. LLMs don't have that signal. Or more precisely: they have something like it at the level of internal probabilities, but they can't reliably convert it into language. They don't reliably know what they don't know.
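To make that gap concrete, here is a minimal sketch with purely illustrative numbers, tied to no particular vendor's API: the sampler has a per-token probability signal it could in principle consult, but nothing in the architecture forces the generated words to reflect it.

```python
# Sketch of the gap between internal signal and expressed uncertainty.
# The logprob values and hedge list are illustrative, not any real API.
import math

def internal_confidence(token_logprobs):
    """Geometric-mean probability of the sampled tokens: a crude internal signal."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def expresses_uncertainty(text):
    """Does the surface text admit any doubt at all?"""
    hedges = ("i don't know", "not sure", "might", "uncertain", "possibly")
    return any(h in text.lower() for h in hedges)

# The failure mode this essay is about: a low internal signal paired with
# wording that hedges nothing.
answer = "The capital is definitely Port Moresby."
logprobs = [-2.1, -1.8, -0.4, -3.0, -2.6]   # hypothetical per-token logprobs
if internal_confidence(logprobs) < 0.2 and not expresses_uncertainty(answer):
    print("internal signal says 'unsure'; the words say 'definitely'")
```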
The architecture of confident wrongness
There's now a large body of research on this specific failure mode. Anthropic's "Language Models (Mostly) Know What They Know" showed that models have partial self-knowledge, but it breaks down under distribution shift, which is exactly when you need it most. OpenAI's own GPT-4 technical report contains one of the more uncomfortable figures in AI research: a calibration curve showing that the base model's stated confidence tracked real accuracy well, and then RLHF training flattened the curve. Making the model more helpful and more aligned made it less calibrated. It got better at sounding confident and worse at being right about it.
Figure 1 · Calibration: what happens to a model's calibration after RLHF. Legend: Base (pre-RLHF), After RLHF, Perfect calibration.
Stylized after Figure 8 of the GPT-4 Technical Report (OpenAI, 2023). The base model's stated confidence tracks actual accuracy closely; after RLHF the same model is more persuasive and less right: stated confidence becomes a worse predictor of whether the answer is correct.
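For readers who want "calibration" pinned down: it's measurable. Here is a minimal sketch, with illustrative names, of the reliability curve behind Figure 1. Bucket answers by the model's stated confidence, then compare each bucket's average confidence to its actual accuracy; perfect calibration puts every point on the diagonal, and the post-RLHF curve drifts off it.

```python
# Minimal sketch of a reliability (calibration) curve and its summary number.
# samples: list of (stated_confidence in [0, 1], was_correct as bool).
from collections import defaultdict

def _bin_samples(samples, n_bins):
    """Group samples into equal-width confidence buckets."""
    bins = defaultdict(list)
    for confidence, correct in samples:
        bucket = min(int(confidence * n_bins), n_bins - 1)
        bins[bucket].append((confidence, correct))
    return bins

def reliability_curve(samples, n_bins=10):
    """Per-bucket (mean stated confidence, actual accuracy) pairs.
    Perfect calibration puts every point on the y = x diagonal."""
    bins = _bin_samples(samples, n_bins)
    curve = []
    for bucket in sorted(bins):
        rows = bins[bucket]
        mean_conf = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        curve.append((mean_conf, accuracy))
    return curve

def expected_calibration_error(samples, n_bins=10):
    """Weighted average gap between stated confidence and actual accuracy."""
    bins = _bin_samples(samples, n_bins)
    ece = 0.0
    for rows in bins.values():
        mean_conf = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        ece += (len(rows) / len(samples)) * abs(mean_conf - accuracy)
    return ece
```

A well-calibrated model that says it's 90 percent sure should be right about nine times in ten; the post-RLHF curve in the GPT-4 report is the model failing that test.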
Chain-of-thought reasoning, which was supposed to be the cure, turns out to be part of the problem. Turpin and colleagues at NYU and Anthropic showed that when you bias a model's input, say by ordering multiple choice so the answer is always A, the model's answers shift accordingly, but the chain-of-thought never mentions the bias. The reasoning is generated after the answer is effectively chosen. It's narration, not deliberation.
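To make the setup concrete, here is a rough sketch of the kind of biased prompt Turpin and colleagues used, not their actual code: reorder the few-shot examples so the correct answer always lands on (A), then measure whether the model's answers drift toward A while its written reasoning never mentions the ordering.

```python
# Sketch of the "answer is always (A)" bias: every few-shot example is
# reordered so the correct option sits at position A, then a fresh question
# is asked with a chain-of-thought prompt. Prompt wording is illustrative.
import string

def bias_toward_label(question, options, correct_idx, target_label="A"):
    """Reorder options so the correct answer lands on target_label."""
    target_idx = string.ascii_uppercase.index(target_label)
    reordered = list(options)
    reordered[target_idx], reordered[correct_idx] = reordered[correct_idx], reordered[target_idx]
    lines = [question]
    for label, option in zip(string.ascii_uppercase, reordered):
        lines.append(f"({label}) {option}")
    lines.append(f"Answer: ({target_label})")
    return "\n".join(lines)

def build_biased_prompt(few_shot_examples, test_question, test_options):
    """few_shot_examples: list of (question, options, correct_idx) tuples."""
    shots = [bias_toward_label(q, opts, idx) for q, opts, idx in few_shot_examples]
    test = [test_question]
    test += [f"({label}) {option}" for label, option in zip(string.ascii_uppercase, test_options)]
    test.append("Let's think step by step.")
    return "\n\n".join(shots + ["\n".join(test)])
```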
And this is where the million-token context comes back. A larger context window doesn't add a monitoring layer. A longer response doesn't produce more verification. A smarter model, trained with more RLHF, actually gets worse at expressing calibrated uncertainty. The overreach I was seeing in Opus 4.7 at maximum effort wasn't effort-limited. It was architectural.
Put another way: the model is pure intuition. It's System 1 without System 2, the associative, pattern-matching part of cognition without the slow, reflective part. Kahneman's two-systems framing was always a useful fiction rather than a literal brain claim, but it names something real, and what it names here is this: LLMs have the fast system and not the slow one.
The check has to live outside
Here's the uncomfortable implication. If the problem is that the model can't reliably check itself, then making the model better is not what fixes it. You can't paper over a missing layer by scaling up the layer that's there. The check has to live outside the model.
This is the shape of the problem every engineering team using AI-authored code is actually wrestling with. Your generator produces a pull request. Your reviewer is the same vendor's AI, often the same model family. The thing being reviewed and the thing reviewing it share an architecture, a training distribution, and a bias structure. That isn't an independent check. It's the same intuition reading its own output. When you ask Copilot's AI to review code written by Copilot's AI, you are asking the model to do something it structurally cannot: grade its own homework.
Figure 2 · Same-vendor review: a Generator AI writes the PR, the pull request goes to a Reviewer AI, and both sides share the same vendor, same training, same bias.
The model asked to grade the work is a sibling of the model that wrote it. That isn't an independent check.
The enterprise buyers asking about this in 2026 have started naming it out loud. Can you tell which PRs were AI-authored? If the EU AI Act or your SOC 2 auditor asks how you verify AI output, what do you hand them? The answers to these questions don't exist inside the generator. They have to come from somewhere else: a layer that sits between the AI and production, that is not itself the AI, that does what the model architecturally can't.
What we're building
That layer is what we're building at Veriva. We don't generate code. We check it through a pipeline that runs static analysis, security scans, style and context validation, and policy gates against every change that enters it, AI-authored or not, with no conflict of interest from also being the author. We're not claiming the AI has metacognition. We're providing it, externally, as a system.
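To be concrete about the shape of that layer without describing our actual implementation, here is a hedged sketch of an external gate: every change, whoever or whatever wrote it, passes through checks that are independent of the generator, and any blocking finding stops the merge. The names, data shapes, and the example check are illustrative.

```python
# Hedged sketch of an external verification gate, not Veriva's actual pipeline.
# A check is any callable that inspects a diff and returns findings.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Finding:
    check: str       # which check produced this finding
    severity: str    # "info" | "warn" | "block"
    message: str

@dataclass
class GateResult:
    passed: bool
    findings: List[Finding] = field(default_factory=list)

Check = Callable[[str], List[Finding]]

def run_gate(diff: str, checks: List[Check]) -> GateResult:
    """Run every check; block the merge if any finding has severity 'block'."""
    findings: List[Finding] = []
    for check in checks:
        findings.extend(check(diff))
    passed = not any(f.severity == "block" for f in findings)
    return GateResult(passed=passed, findings=findings)

# Example check: a trivial policy rule. Real gates would wrap linters,
# security scanners, and organization-specific policies here.
def no_hardcoded_secrets(diff: str) -> List[Finding]:
    if "AWS_SECRET_ACCESS_KEY=" in diff:
        return [Finding("secrets", "block", "hardcoded credential in diff")]
    return []
```

The point of the sketch is the shape, not the particular checks: the gate has no stake in the diff passing, because it didn't write the diff.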
That's not a criticism of the models. I use them every day. I drafted the first pass of this post on top of a frontier model. The models are extraordinary, and they're going to get better. Some of those improvements will narrow the gap: better calibration, better verification-trained reasoning, better tool use. But narrowing the gap is not the same as closing it. Generating and checking are different cognitive operations, and a single forward pass of a transformer, no matter how well-trained, cannot reliably do the second.
Damasio's patients taught us that without the right monitoring layer, deliberation fails. For them, the fix wasn't more reasoning. It was reintegrating the thing that had been lost. For AI, we don't have the option of reintegration. We don't get to add a missing brain region. What we can do is build the monitoring layer alongside the model, outside it, as its own category of tool.
That's the shape of governance. That's what the picks and shovels look like in this gold rush. Every agent vendor, from Anthropic's Claude Code and OpenAI's Codex to Cursor and Cognition's Devin, is racing to generate more code faster. Somebody has to check the work. The somebody can't be the same somebody.
When the model tells me a task is done, I want to believe it. The point of governance isn't to doubt the model. It's to make the belief safe.
Postscript, April 27, 2026.
The essay argues the model can't reliably check itself. Anthropic's April 23 post-mortem, published the same day this went live, describes the same failure one layer up. From March 4 to April 20, three independent regressions hit Sonnet 4.6, Opus 4.6, and (briefly) Opus 4.7, each introduced by people trying to make the product better: a caching optimization, a latency fix, a verbosity instruction. Different teams, different intents, all silently degrading model output. None caught by internal evals.
“Neither our internal usage nor evals initially reproduced the issues identified.”
Anthropic, Claude Code Quality Issues: A Postmortem (2026)
That's the metacognition gap one tier up. The system built to evaluate the model shared the model's blind spots: evals calibrated to what the team expected, not to what users were actually seeing. The signal that finally surfaced the regressions came from the only place it could: outside, from users posting reproducible examples publicly. The essay's frame for this was the same intuition reading its own output. The post-mortem is the vendor version of that: the same intuition reading its own product. The check has to live outside the system. Not just outside the model.