ProgramBench shows why AI coding needs executable acceptance criteria, not just stronger generators.
Alexander Theruviparambil, Founder · 10 min read
Every developer using coding agents knows the moment: the agent says it is done.
The tests are green. The diff looks plausible. The explanation sounds complete. The agent has a tidy summary of what changed, why it changed, and why the task is finished.
Then you inspect the code and realize the task is only half solved.
Not broken in an obvious way. Not useless. That would be easier. The harder failure is code that is mostly right, locally convincing, and operationally unsafe to trust. The implementation handles the common case but misses the weird one. It copies the shape of an existing abstraction but not the reason for it. It adds the check you asked for in one path and forgets the other two. It makes progress, and progress is exactly what makes the mistake harder to see.
That is the part of AI coding that still feels undernamed. The problem is not that models cannot write code. They can. The problem is that software teams do not merge "some progress." They merge behavior, ownership, tests, risk, and policy into a shared system. The agent can tell you the work is done. The question is what proves it.
A new benchmark called ProgramBench is the cleanest public measure of that gap I have seen.
A benchmark for the moment after done
ProgramBench asks a brutal question: given only a compiled program and its documentation, can a software engineering agent rebuild the source code from scratch?
Not fix a bug. Not implement one well-specified feature. Not fill in a function body with tests waiting nearby. Rebuild a working program.
The benchmark includes 200 tasks, from compact command-line tools to widely used software such as FFmpeg, SQLite, PHP, jq, fzf, DuckDB, ripgrep, and zstd. The agent does not get the original source. It gets the executable and documentation. It has to infer behavior, decide on an architecture, write a codebase, and produce something that behaves like the reference program.
That setup matters because it removes a crutch most coding benchmarks still rely on. Many benchmarks prescribe the shape of the work. There is a file to edit, a failing test to satisfy, a small patch to land. ProgramBench moves closer to the work agents are starting to do in the real world: turn an intent into a functioning software system.
The results are stark. No model fully solved a single task. The best model passed at least 95 percent of tests on only 3 percent of tasks. The paper also reports that model-written codebases diverged sharply from human-written ones, tending toward monolithic, single-file implementations with longer functions.
That will be the headline: frontier agents got zero tasks fully correct.
I think that is the least interesting part.
The wrong lesson
The lazy reading is "AI still cannot code."
That is not what ProgramBench shows. The paper shows something more specific and more useful. Agents can make meaningful partial progress on hard software tasks. They can infer behavior. They can write substantial code. They can pass many generated tests. They can get close enough that a human looking quickly might feel impressed.
But close enough is not the same thing as done.
Software is full of cliffs where the last 5 percent matters more than the first 95. A parser that handles most inputs but mishandles one escaping rule is not done. A database clone that passes common cases but violates a transaction edge case is not done. A security-sensitive workflow that works for the happy path but skips authorization during retry is not done.
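Here is a miniature of that first cliff, as a hedged sketch: a hypothetical unescape routine that handles every common input and silently mishandles one rule.

```python
def unescape(s: str) -> str:
    # Handles the common escapes, but applies the rules in the wrong order:
    # the \n rule consumes the backslash that \\ should have protected.
    out = s.replace("\\n", "\n").replace("\\t", "\t")
    return out.replace("\\\\", "\\")

print(repr(unescape(r"hello\nworld")))  # OK: a real newline where \n was
print(repr(unescape(r"C:\\network")))   # wrong: the \\n becomes a newline
                                        # instead of a backslash plus "n"
```

The common paths pass. The output looks right. A quick reviewer would approve it. That is the shape of the failure ProgramBench measures at program scale.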
This is why the benchmark is interesting. It puts pressure on the difference between progress and completion. Agentic software work will live in that gap for a long time.
The current generation of tools is very good at producing artifacts that look like progress: diffs, files, explanations, green tests, commit messages, summaries. What teams need next is the system that can say which artifacts are actually mergeable.
The important move is the harness
ProgramBench's most important contribution is not the leaderboard. It is the evaluation shape.
The benchmark uses the reference executable as the source of truth. It then builds behavioral tests by probing that executable. The candidate program is not judged by whether it looks like the original source. It is judged by whether it behaves like the reference.
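The shape of that check fits in a few lines. A minimal sketch of the idea, not the paper's actual harness; the binary paths and probe set here are hypothetical:

```python
import subprocess

def observe(binary: str, args: list[str], stdin: bytes = b"") -> tuple:
    """Run a binary and capture its observable behavior: exit code and stdout.
    A timeout counts as its own distinct behavior."""
    try:
        p = subprocess.run([binary, *args], input=stdin,
                           capture_output=True, timeout=30)
        return (p.returncode, p.stdout)
    except subprocess.TimeoutExpired:
        return ("timeout", b"")

def agreement(reference: str, candidate: str,
              probes: list[tuple[list[str], bytes]]) -> float:
    """Fraction of probes on which the candidate behaves like the reference."""
    matches = sum(observe(reference, a, s) == observe(candidate, a, s)
                  for a, s in probes)
    return matches / len(probes)

# Hypothetical probes for a jq-like tool: (argv, stdin) pairs.
probes = [
    (["."], b'{"a": 1}'),
    ([".a"], b'{"a": 1}'),
    (["-c", ".[]"], b"[1, 2, 3]"),
]
print(f"{agreement('./reference/jq', './candidate/jq', probes):.0%} behavioral agreement")
```

Nothing in that sketch cares how the candidate was written. The only thing that counts is behavior under probing.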
That is the right abstraction. Code is not trusted because the generator sounds confident. Code is trusted because it survives an independent check against behavior the team actually cares about.
This is also where the benchmark gets philosophically honest. A finite test suite cannot prove full correctness. It can show failures, and it can increase confidence, but it cannot exhaust every input or every operational condition. Passing tests is not the same thing as being correct. It is evidence, not salvation.
That distinction is exactly what production engineering teams need. The future of AI coding will not be a binary world where agents are either useless or autonomous. It will be a world of evidence levels. How much do we know about this change? What behavior was checked? What was not checked? Who or what authored the code? What risks changed? What policy applies before merge?
The scarce primitive is not generation. It is a trustworthy answer to those questions.
Acceptance criteria are becoming infrastructure
When a human developer says a task is done, the organization usually has layers around that claim.
There is code review. There are tests. There is CI. There are linters and type checks. There may be security scanning, dependency review, staging deploys, observability, ownership rules, release gates, and incident history. None of these is perfect. Together, they turn an individual claim into an organizational decision.
AI coding agents compress the first part of that workflow. They can create more code, faster, with less human effort per line. That is useful. It also means the rest of the workflow has to carry more weight.
If the agent produces more code than the team can inspect, the bottleneck shifts. It moves away from typing. It moves away from autocomplete. It moves away from code generation itself.
It moves to acceptance criteria.
What exactly should this change do? How do we know it does that? What behavior must never regress? What dependencies are allowed? What files are sensitive? Which parts of the diff are agent-authored? Which findings are advisory, and which ones block the merge?
Those questions cannot live only inside a prompt. Prompts are private, lossy, and prone to drift. They disappear into chat history. They are not an audit trail. They are not a policy surface. They are not something a teammate, security reviewer, or future incident review can reliably inspect.
Acceptance criteria need to become infrastructure: explicit, executable, versioned, and attached to the place where code becomes real.
For most teams, that place is the pull request.
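What does an executable acceptance criterion look like in practice? The smallest version is a behavioral test that lives in the repository, versioned next to the code it constrains. A hedged sketch using the retry-authorization example from earlier; the billing module and its functions are hypothetical stand-ins:

```python
# tests/test_retry_authorization.py
# Acceptance criterion: authorization must hold on every path, retries included.
import pytest
from billing.retry import submit_with_retry   # hypothetical module
from billing.errors import AuthorizationError

def test_transient_failures_are_retried(monkeypatch):
    attempts = []
    def flaky_submit(request):
        attempts.append(request)
        if len(attempts) == 1:
            raise TimeoutError("transient")    # first attempt fails
        return "ok"
    monkeypatch.setattr("billing.retry.submit", flaky_submit)
    assert submit_with_retry({"user": "alice", "authorized": True}) == "ok"
    assert len(attempts) == 2                  # exactly one retry happened

def test_retry_path_never_skips_authorization(monkeypatch):
    attempts = []
    monkeypatch.setattr("billing.retry.submit",
                        lambda request: attempts.append(request) or "ok")
    with pytest.raises(AuthorizationError):
        submit_with_retry({"user": "mallory", "authorized": False})
    assert attempts == []                      # submit was never reached
```

Unlike a prompt, this survives the chat session. It is reviewable, enforceable by CI, and still there when the next agent, or the next human, touches the retry path.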
The pull request is the trust boundary
The chat window is where the agent feels productive.
The IDE is where the developer feels fast.
The pull request is where the organization takes responsibility.
That is the boundary that matters. Before a PR merges, the team still has choices. It can ask for tests. It can block risky dependencies. It can require a human owner. It can route a security review. It can demand clearer provenance. After merge, the code becomes part of the system. The blast radius changes.
This is why AI code governance belongs around the PR, not only inside the agent runtime. The agent runtime knows what was asked. The IDE knows what was edited. But the PR knows what the organization is about to accept.
ProgramBench evaluates candidate programs against a reference executable. Production teams do not usually have a reference executable for every feature. What they do have is a PR with context: the diff, the tests, the repository history, the dependency graph, the ownership model, the issue, the author, the CI result, and the policy the team says it follows.
That context is enough to build a practical harness.
Not a perfect one. A useful one.
What a practical harness checks
A serious AI code harness should not stop at "does the code compile?"
It should ask:
What part of this change was written by an AI agent, a human, or both?
Which agent or tool produced it?
What behavior does this PR claim to change?
Which tests support that claim?
What important behavior remains untested?
Did the diff introduce dependency, secret-handling, authentication, authorization, or data-loss risk?
Does the code shape match the repository's existing architecture?
Is this a small local patch, or a broad system change disguised as one?
What confidence should the team assign before merge?
Which policy should apply?
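Those answers are more useful as a machine-readable verdict than as prose in a review comment. A minimal sketch of what that verdict could look like; the field names are illustrative, not any real tool's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    check: str      # e.g. "dependency-risk", "untested-behavior"
    severity: str   # "advisory" or "blocking"
    detail: str

@dataclass
class HarnessReport:
    pr_number: int
    authorship: dict[str, float]   # e.g. {"agent": 0.8, "human": 0.2}
    claimed_behavior: str          # what the PR says it changes
    supporting_tests: list[str]    # tests that exercise the claim
    untested_behavior: list[str]   # known gaps, stated explicitly
    findings: list[Finding] = field(default_factory=list)

    def blocks_merge(self) -> bool:
        # Advisory findings inform the reviewer; blocking findings stop the merge.
        return any(f.severity == "blocking" for f in self.findings)
```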
These are not academic questions. They are what engineering managers, CTOs, staff engineers, security leads, and compliance teams will ask once agent-authored code becomes normal.
They are also the questions that same-agent review cannot answer reliably enough by itself. The generator has already optimized for producing the artifact. The verifier has to optimize for distrusting it in useful ways.
That does not mean the verifier should be hostile to AI coding. The opposite is true. The more useful agents become, the more important the verification layer becomes. A team that does not trust the output will slow down or stop using the tools. A team with a strong harness can let agents do more.
Governance is not the enemy of velocity. It is what lets velocity survive contact with production.
The third signal
This is why ProgramBench belongs next to two other signals from the last few weeks.
The first signal is the everyday experience of using frontier models for serious work. The model says a task is done when it is not. That is not just a personality flaw. It is a missing monitoring layer.
The second signal is Anthropic's April 23 Claude Code postmortem. Three product-layer changes degraded Claude Code for weeks. Internal usage and evals did not initially reproduce the issues. The signal that broke through came from outside the system: users with reproducible complaints.
The third signal is ProgramBench. When agents are asked to build whole programs, partial progress is real, but full completion is rare. The interesting lesson is not that the models are doomed. It is that generation alone is an incomplete product. The surrounding harness determines whether the output can be trusted.
Model, vendor system, benchmark. Different levels, same pattern.
The check has to exist outside the thing being checked.
What we are building
That is the direction we are building at Veriva.
Veriva is not trying to be ProgramBench for every software task. Most production work does not have a clean reference executable. Most teams are not asking agents to rebuild SQLite from documentation. They are asking agents to modify real codebases under real constraints.
The production version of the ProgramBench principle is GitHub-native verification.
When a PR arrives, Veriva asks:
Who or what authored this code?
What changed?
What risks entered the repository?
What evidence supports the claimed behavior?
What patterns suggest AI-authored overreach?
What trust score should the team assign?
Should policy allow this merge?
The resulting artifact is not another chat response. It is a trust decision attached to the PR.
That is the difference between an agent saying "done" and a team knowing what it is about to accept.
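One generic way to make that concrete, as an illustration rather than Veriva's actual integration: publish the decision on the PR's head commit as a commit status through GitHub's REST API, so policy can act on it before merge.

```python
import requests

def attach_trust_decision(owner: str, repo: str, sha: str, token: str,
                          state: str, summary: str) -> None:
    """Attach a trust decision to a PR head commit as a GitHub commit status.
    `state` must be "success", "failure", "pending", or "error"."""
    url = f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}"
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        json={"state": state,
              "context": "verification/trust-decision",  # illustrative name
              "description": summary[:140]})             # the API caps this field
    resp.raise_for_status()

# attach_trust_decision("acme", "billing", head_sha, token, "failure",
#                       "Agent-authored auth change lacks retry-path tests")
```

A team can then mark that status context as required in branch protection, which turns the trust decision from a comment into a gate.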
Done needs evidence
The next phase of AI coding will produce more code than human teams can read line by line. That is not a prediction about some distant future. It is already happening in the teams using Claude Code, Cursor, Copilot, Codex, Devin, and related tools every day.
The old bottleneck was writing code.
The new bottleneck is knowing what to trust.
ProgramBench is valuable because it makes that bottleneck visible. It does not ask whether a model can emit plausible code. It asks whether the code behaves correctly under an independent harness. That is the right question. It is also the question every serious engineering team will have to ask before agent-authored code reaches production.
The future of AI coding will not be won by the system that writes the most code.
It will be won by the system that can tell teams, with evidence, which code is safe to trust.