ProgramBench shows why AI coding needs executable acceptance criteria, not just stronger generators.
Alexander Theruviparambil, Founder · 10 min read
Every developer using coding agents knows the moment: the agent says it is done.
The tests are green. The diff looks plausible. The explanation sounds complete. The agent has a tidy summary of what changed, why it changed, and why the task is finished.
Then you inspect the code and realize the task is only half solved.
Not broken in an obvious way. Not useless. That would be easier. The harder failure is code that is mostly right, locally convincing, and operationally unsafe to trust. The implementation handles the common case but misses the weird one. It copies the shape of an existing abstraction but not the reason for it. It adds the check you asked for in one path and forgets the other two. It makes progress, and progress is exactly what makes the mistake harder to see.
That is the part of AI coding that still feels undernamed. The problem is not that models cannot write code. They can. The problem is that software teams do not merge "some progress." They merge behavior, ownership, tests, risk, and policy into a shared system. The agent can tell you the work is done. The question is what proves it.
A new benchmark called ProgramBench is the cleanest public measure of that gap I have seen.
A benchmark for the moment after done
ProgramBench asks a brutal question: given only a compiled program and its documentation, can a software engineering agent rebuild the source code from scratch?
Not fix a bug. Not implement one well-specified feature. Not fill in a function body with tests waiting nearby. Rebuild a working program.
The benchmark includes 200 tasks, from compact command-line tools to widely used software such as FFmpeg, SQLite, PHP, jq, fzf, DuckDB, ripgrep, and zstd. The agent does not get the original source. It gets the executable and documentation. It has to infer behavior, decide on an architecture, write a codebase, and produce something that behaves like the reference program.
That setup matters because it removes a crutch most coding benchmarks still rely on. Many benchmarks prescribe the shape of the work. There is a file to edit, a failing test to satisfy, a small patch to land. ProgramBench moves closer to the work agents are starting to do in the real world: turn an intent into a functioning software system.
The results are stark. No model fully solved a single task. The best model passed at least 95 percent of tests on only 3 percent of tasks. The paper also reports that model-written codebases diverged sharply from human-written ones, tending toward monolithic, single-file implementations with longer functions.
That will be the headline: frontier agents got zero tasks fully correct.
I think that is the least interesting part.
The wrong lesson
The lazy reading is "AI still cannot code."
That is not what ProgramBench shows. The paper shows something more specific and more useful. Agents can make meaningful partial progress on hard software tasks. They can infer behavior. They can write substantial code. They can pass many generated tests. They can get close enough that a human looking quickly might feel impressed.
But close enough is not the same thing as done.
Software is full of cliffs where the last 5 percent matters more than the first 95. A parser that handles most inputs but mishandles one escaping rule is not done. A database clone that passes common cases but violates a transaction edge case is not done. A security-sensitive workflow that works for the happy path but skips authorization during retry is not done.
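Here is a miniature of that first cliff, as a hedged sketch: a hypothetical unescape routine that handles every common input and silently mishandles one rule.

```python
def unescape(s: str) -> str:
    # Handles the common escapes, but applies the rules in the wrong order:
    # the \n rule consumes the backslash that \\ should have protected.
    out = s.replace("\\n", "\n").replace("\\t", "\t")
    return out.replace("\\\\", "\\")

print(repr(unescape(r"hello\nworld")))  # OK: a real newline where \n was
print(repr(unescape(r"C:\\network")))   # wrong: the \\n becomes a newline
                                        # instead of a backslash plus "n"
```

The common paths pass. The output looks right. A quick reviewer would approve it. That is the shape of the failure ProgramBench measures at program scale.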
This is why the benchmark is interesting. It puts pressure on the difference between progress and completion. Agentic software work will live in that gap for a long time.
The current generation of tools is very good at producing artifacts that look like progress: diffs, files, explanations, green tests, commit messages, summaries. What teams need next is the system that can say which artifacts are actually mergeable.
The important move is the harness
ProgramBench's most important contribution is not the leaderboard. It is the evaluation shape.
The benchmark uses the reference executable as the source of truth. It then builds behavioral tests by probing that executable. The candidate program is not judged by whether it looks like the original source. It is judged by whether it behaves like the reference.
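The shape of that check fits in a few lines. A minimal sketch of the idea, not the paper's actual harness; the binary paths and probe set here are hypothetical:

```python
import subprocess

def observe(binary: str, args: list[str], stdin: bytes = b"") -> tuple:
    """Run a binary and capture its observable behavior: exit code and stdout.
    A timeout counts as its own distinct behavior."""
    try:
        p = subprocess.run([binary, *args], input=stdin,
                           capture_output=True, timeout=30)
        return (p.returncode, p.stdout)
    except subprocess.TimeoutExpired:
        return ("timeout", b"")

def agreement(reference: str, candidate: str,
              probes: list[tuple[list[str], bytes]]) -> float:
    """Fraction of probes on which the candidate behaves like the reference."""
    matches = sum(observe(reference, a, s) == observe(candidate, a, s)
                  for a, s in probes)
    return matches / len(probes)

# Hypothetical probes for a jq-like tool: (argv, stdin) pairs.
probes = [
    (["."], b'{"a": 1}'),
    ([".a"], b'{"a": 1}'),
    (["-c", ".[]"], b"[1, 2, 3]"),
]
print(f"{agreement('./reference/jq', './candidate/jq', probes):.0%} behavioral agreement")
```

Nothing in that sketch cares how the candidate was written. The only thing that counts is behavior under probing.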
That is the right abstraction. Code is not trusted because the generator sounds confident. Code is trusted because it survives an independent check against behavior the team actually cares about.
This is also where the benchmark gets philosophically honest. A finite test suite cannot prove full correctness. It can show failures, and it can increase confidence, but it cannot exhaust every input or every operational condition. Passing tests is not the same thing as being correct. It is evidence, not salvation.
That distinction is exactly what production engineering teams need. The future of AI coding will not be a binary world where agents are either useless or autonomous. It will be a world of evidence levels. How much do we know about this change? What behavior was checked? What was not checked? Who or what authored the code? What risks changed? What policy applies before merge?
The scarce primitive is not generation. It is a trustworthy answer to those questions.
Acceptance criteria are becoming infrastructure
When a human developer says a task is done, the organization usually has layers around that claim.
There is code review. There are tests. There is CI. There are linters and type checks. There may be security scanning, dependency review, staging deploys, observability, ownership rules, release gates, and incident history. None of these is perfect. Together, they turn an individual claim into an organizational decision.
AI coding agents compress the first part of that workflow. They can create more code, faster, with less human effort per line. That is useful. It also means the rest of the workflow has to carry more weight.
If the agent produces more code than the team can inspect, the bottleneck shifts. It moves away from typing. It moves away from autocomplete. It moves away from code generation itself.
It moves to acceptance criteria.
What exactly should this change do? How do we know it does that? What behavior must never regress? What dependencies are allowed? What files are sensitive? Which parts of the diff are agent-authored? Which findings are advisory, and which ones block the merge?
Those questions cannot live only inside a prompt. Prompts are private, lossy, and prone to drift. They disappear into chat history. They are not an audit trail. They are not a policy surface. They are not something a teammate, security reviewer, or future incident review can reliably inspect.
Acceptance criteria need to become infrastructure: explicit, executable, versioned, and attached to the place where code becomes real.
For most teams, that place is the pull request.
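What does an executable acceptance criterion look like in practice? The smallest version is a behavioral test that lives in the repository, versioned next to the code it constrains. A hedged sketch using the retry-authorization example from earlier; the billing module and its functions are hypothetical stand-ins:

```python
# tests/test_retry_authorization.py
# Acceptance criterion: authorization must hold on every path, retries included.
import pytest
from billing.retry import submit_with_retry   # hypothetical module
from billing.errors import AuthorizationError

def test_transient_failures_are_retried(monkeypatch):
    attempts = []
    def flaky_submit(request):
        attempts.append(request)
        if len(attempts) == 1:
            raise TimeoutError("transient")    # first attempt fails
        return "ok"
    monkeypatch.setattr("billing.retry.submit", flaky_submit)
    assert submit_with_retry({"user": "alice", "authorized": True}) == "ok"
    assert len(attempts) == 2                  # exactly one retry happened

def test_retry_path_never_skips_authorization(monkeypatch):
    attempts = []
    monkeypatch.setattr("billing.retry.submit",
                        lambda request: attempts.append(request) or "ok")
    with pytest.raises(AuthorizationError):
        submit_with_retry({"user": "mallory", "authorized": False})
    assert attempts == []                      # submit was never reached
```

Unlike a prompt, this survives the chat session. It is reviewable, enforceable by CI, and still there when the next agent, or the next human, touches the retry path.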
The pull request is the trust boundary
The chat window is where the agent feels productive.
The IDE is where the developer feels fast.
The pull request is where the organization takes responsibility.
That is the boundary that matters. Before a PR merges, the team still has choices. It can ask for tests. It can block risky dependencies. It can require a human owner. It can route a security review. It can demand clearer provenance. After merge, the code becomes part of the system. The blast radius changes.
This is why AI code governance belongs around the PR, not only inside the agent runtime. The agent runtime knows what was asked. The IDE knows what was edited. But the PR knows what the organization is about to accept.
ProgramBench evaluates candidate programs against a reference executable. Production teams do not usually have a reference executable for every feature. What they do have is a PR with context: the diff, the tests, the repository history, the dependency graph, the ownership model, the issue, the author, the CI result, and the policy the team says it follows.
That context is enough to build a practical harness.
Not a perfect one. A useful one.
What a practical harness checks
A serious AI code harness should not stop at "does the code compile?"
It should ask:
What part of this change was written by an AI agent, a human, or both?
Which agent or tool produced it?
What behavior does this PR claim to change?
Which tests support that claim?
What important behavior remains untested?
Did the diff introduce dependency, secret-handling, authentication, authorization, or data-loss risk?
Does the code shape match the repository's existing architecture?
Is this a small local patch, or a broad system change disguised as one?
What confidence should the team assign before merge?
Which policy should apply?
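Those answers are more useful as a machine-readable verdict than as prose in a review comment. A minimal sketch of what that verdict could look like; the field names are illustrative, not any real tool's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    check: str      # e.g. "dependency-risk", "untested-behavior"
    severity: str   # "advisory" or "blocking"
    detail: str

@dataclass
class HarnessReport:
    pr_number: int
    authorship: dict[str, float]   # e.g. {"agent": 0.8, "human": 0.2}
    claimed_behavior: str          # what the PR says it changes
    supporting_tests: list[str]    # tests that exercise the claim
    untested_behavior: list[str]   # known gaps, stated explicitly
    findings: list[Finding] = field(default_factory=list)

    def blocks_merge(self) -> bool:
        # Advisory findings inform the reviewer; blocking findings stop the merge.
        return any(f.severity == "blocking" for f in self.findings)
```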
These are not academic questions. They are what engineering managers, CTOs, staff engineers, security leads, and compliance teams will ask once agent-authored code becomes normal.
They are also the questions that same-agent review cannot answer reliably enough by itself. The generator has already optimized for producing the artifact. The verifier has to optimize for distrusting it in useful ways.
That does not mean the verifier should be hostile to AI coding. The opposite is true. The more useful agents become, the more important the verification layer becomes. A team that does not trust the output will slow down or stop using the tools. A team with a strong harness can let agents do more.
Governance is not the enemy of velocity. It is what lets velocity survive contact with production.
The third signal
This is why ProgramBench belongs next to two other signals from the last few weeks.
The first signal is the everyday experience of using frontier models for serious work. The model says a task is done when it is not. That is not just a personality flaw. It is a missing monitoring layer.
The second signal is Anthropic's April 23 Claude Code postmortem. Three product-layer changes degraded Claude Code for weeks. Internal usage and evals did not initially reproduce the issues. The signal that broke through came from outside the system: users with reproducible complaints.
The third signal is ProgramBench. When agents are asked to build whole programs, partial progress is real, but full completion is rare. The interesting lesson is not that the models are doomed. It is that generation alone is an incomplete product. The surrounding harness determines whether the output can be trusted.
Model, vendor system, benchmark. Different levels, same pattern.
The check has to exist outside the thing being checked.
What we are building
That is the direction we are building at Veriva.
Veriva is not trying to be ProgramBench for every software task. Most production work does not have a clean reference executable. Most teams are not asking agents to rebuild SQLite from documentation. They are asking agents to modify real codebases under real constraints.
The production version of the ProgramBench principle is GitHub-native verification.
When a PR arrives, Veriva asks:
Who or what authored this code?
What changed?
What risks entered the repository?
What evidence supports the claimed behavior?
What patterns suggest AI-authored overreach?
What trust score should the team assign?
Should policy allow this merge?
The resulting artifact is not another chat response. It is a trust decision attached to the PR.
That is the difference between an agent saying "done" and a team knowing what it is about to accept.
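One generic way to make that concrete, as an illustration rather than Veriva's actual integration: publish the decision on the PR's head commit as a commit status through GitHub's REST API, so policy can act on it before merge.

```python
import requests

def attach_trust_decision(owner: str, repo: str, sha: str, token: str,
                          state: str, summary: str) -> None:
    """Attach a trust decision to a PR head commit as a GitHub commit status.
    `state` must be "success", "failure", "pending", or "error"."""
    url = f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}"
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        json={"state": state,
              "context": "verification/trust-decision",  # illustrative name
              "description": summary[:140]})             # the API caps this field
    resp.raise_for_status()

# attach_trust_decision("acme", "billing", head_sha, token, "failure",
#                       "Agent-authored auth change lacks retry-path tests")
```

A team can then mark that status context as required in branch protection, which turns the trust decision from a comment into a gate.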
Done needs evidence
The next phase of AI coding will produce more code than human teams can read line by line. That is not a prediction about some distant future. It is already happening in the teams using Claude Code, Cursor, Copilot, Codex, Devin, and related tools every day.
The old bottleneck was writing code.
The new bottleneck is knowing what to trust.
ProgramBench is valuable because it makes that bottleneck visible. It does not ask whether a model can emit plausible code. It asks whether the code behaves correctly under an independent harness. That is the right question. It is also the question every serious engineering team will have to ask before agent-authored code reaches production.
The future of AI coding will not be won by the system that writes the most code.
It will be won by the system that can tell teams, with evidence, which code is safe to trust.