When a client comes to us with an AI project, the first artifact we ask for is not a model card or a roadmap. It is a list of two hundred questions the system must get right, written by the people who will be embarrassed if it gets them wrong.
We call this the eval set. It is the entire engagement, compressed. If we cannot agree on what the right answers look like, no model will save us — and if we can, then choosing the model becomes a tractable problem with a measurable answer.
The order matters
Most teams write the eval after the demo works. We invert it. The eval is week one. The demo is week four. By the time we plug in a model, we already know whether the system is good — because good has a definition, and the definition is a file in the repo.
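For concreteness, here is a minimal sketch of what that file can look like. The JSONL layout, the field names, and the two-hundred-case floor are illustrative assumptions, not any client's actual schema.

```python
# Hypothetical eval-set loader. Path, schema, and the 200-case floor are
# illustrative; the point is that "good" is data, versioned next to the code.
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class EvalCase:
    case_id: str   # stable ID, so one failure can be tracked across commits
    question: str  # what the system is asked
    expected: str  # the answer the authoring expert signed off on
    author: str    # the person embarrassed if the system gets this wrong

def load_eval_set(path: Path) -> list[EvalCase]:
    """One JSON object per line; a malformed line raises instead of skipping."""
    lines = [ln for ln in path.read_text().splitlines() if ln.strip()]
    cases = [EvalCase(**json.loads(ln)) for ln in lines]
    assert len(cases) >= 200, f"eval set shrank to {len(cases)} cases"
    return cases
```

Because it is a plain file in the repo, a changed answer shows up as a reviewable diff, the same as any code change.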
A model without an eval is a vibe. A vibe that ships to a regulated environment is a liability.
On a recent clinical AI engagement, this discipline produced a quieter result than the headline number suggests. The clinical co-pilot is good not because the model is exotic (it is mostly off the shelf) but because the eval set was written by attending physicians and is run on every commit. The system is allowed to fail, but it is not allowed to fail silently.
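Here is a hedged sketch of what "allowed to fail, but not silently" can mean mechanically, reusing EvalCase and load_eval_set from the sketch above. run_model, the grader, and both file paths are assumptions for illustration, not the engagement's actual harness.

```python
# Commit gate: failures the team has already accepted live in a checked-in
# file; any failure not on that list exits non-zero and turns CI red.
import json
import sys
from pathlib import Path

def run_model(question: str) -> str:
    raise NotImplementedError("call the system under test here")

def grade(case: EvalCase, answer: str) -> bool:
    # Deliberately naive exact-match grader; real graders are rubric-based
    # and reviewed by the same experts who wrote the cases.
    return answer.strip().lower() == case.expected.strip().lower()

def main() -> int:
    cases = load_eval_set(Path("evals/clinical.jsonl"))
    known = set(json.loads(Path("evals/known_failures.json").read_text()))

    new_failures = [
        c.case_id
        for c in cases
        if not grade(c, run_model(c.question)) and c.case_id not in known
    ]
    if new_failures:
        # Loud failure: name the cases and break the build.
        print(f"new eval failures: {sorted(new_failures)}", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Run on every commit, an exit code of 1 is the opposite of a silent failure: the build goes red and the failing case IDs are in the log.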
This is not glamorous work. The eval set is mostly text, the harness is mostly bookkeeping, and the regression dashboard is mostly green. That is the point. The boring scaffolding is what lets the interesting parts of the model be trusted.
