AI · 6 min read · Studio · Senior partner

Eval-first AI: a small manifesto

Every AI engagement we take begins with the eval set, not the model. A short note on why.

When a client comes to us with an AI project, the first artifact we ask for is not a model card or a roadmap. It is a list of two hundred questions the system must get right, written by the people who will be embarrassed if it gets them wrong.

We call this the eval set. It is the entire engagement, compressed. If we cannot agree on what the right answers look like, no model will save us — and if we can, then choosing the model becomes a tractable problem with a measurable answer.

The order matters

Most teams write the eval after the demo works. We invert it. The eval is week one. The demo is week four. By the time we plug in a model, we already know whether the system is good — because good has a definition, and the definition is a file in the repo.
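A minimal sketch of what a definition-in-a-file can look like. The file format, the toy questions, and the exact-match grader here are illustrative assumptions, not our actual harness; real eval sets use domain questions and richer scoring.

```python
# Hypothetical eval set: each case pairs a question with the answer the
# domain experts signed off on. In practice this lives in the repo as a
# data file and is reviewed like any other code.
EVAL_SET = [
    {"question": "2 + 2", "expected": "4"},
    {"question": "capital of France", "expected": "Paris"},
]

def exact_match(predicted: str, expected: str) -> bool:
    """Simplest possible grader: normalized string equality."""
    return predicted.strip().lower() == expected.strip().lower()

def run_eval(model, cases) -> float:
    """Score a model (any callable from question to answer) against the set.

    Returns the fraction of cases answered correctly.
    """
    correct = sum(
        exact_match(model(case["question"]), case["expected"]) for case in cases
    )
    return correct / len(cases)
```

Because `model` is just a callable, the same file scores a prompt, a fine-tune, or a rival vendor's API, which is what makes "choosing the model" a measurable problem.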

A model without an eval is a vibe. A vibe that ships to a regulated environment is a liability.

On a recent clinical AI engagement, this discipline produced a result quieter than any headline number. The clinical co-pilot is good not because the model is exotic — it is mostly off the shelf — but because the eval set was written by attending physicians and is run on every commit. The system is allowed to fail, but it is not allowed to fail silently.
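"Not allowed to fail silently" can be made mechanical. A sketch, assuming a hypothetical `gate` helper wired into CI: the current eval score is compared against a baseline checked into the repo, and a regression aborts the run rather than passing quietly.

```python
def gate(score: float, baseline: float) -> float:
    """Refuse to pass quietly: abort the run if the eval score regresses.

    `baseline` would come from a score file tracked in git; raising
    SystemExit fails the CI job with a visible message.
    """
    if score < baseline:
        raise SystemExit(f"Eval regression: {score:.1%} < baseline {baseline:.1%}")
    return score
```

The point is not the three lines of code; it is that a drop in quality becomes a red build someone must explain, not a vibe someone might notice.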

This is not glamorous work. The eval set is mostly text, the harness is mostly bookkeeping, and the regression dashboard is mostly green. That is the point. The boring scaffolding is what lets the interesting parts of the model be trusted.
