Clinical Co-pilot
An evaluation-first clinical co-pilot. Every answer is sourced, scored, and traceable to a citation set. Deployed inside two academic medical centers, used daily.
The state of play.
The founders were attending physicians, and they had a clear constraint: a clinical co-pilot that could not be trusted to cite its sources was worse than no co-pilot at all. The first prototypes worked beautifully in a demo and badly in front of an attending. The team needed an evaluation harness before they needed a better model.
What we built.
We started with the eval set, written by attending physicians at the two academic medical centers piloting the system: two hundred questions, every answer scored against a known reference. We built the retrieval and citation layer against the eval, then a LangGraph agent on top, then the model, in that order. SOC 2 and HIPAA controls were wired in from the first commit.
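
A minimal sketch of what an eval harness of this shape can look like. This is illustrative only: the JSONL file layout, the `answer_question` callable, and the choice to compute F1 over citation identifiers are assumptions for the sketch, not the team's actual code.

```python
"""Sketch of an eval harness: score every answer against a known reference.

Assumptions (not from the case study): the eval set is a JSONL file of
{"question": ..., "reference_citations": [...]} records, and the co-pilot
exposes an answer_question() callable returning a dict with a "citations"
list. F1 is computed over citation identifiers as one reasonable reading
of "scored against a known reference".
"""
import json


def f1(predicted: set[str], reference: set[str]) -> float:
    """Set-level F1 between predicted and reference citation identifiers."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)


def run_eval(eval_path: str, answer_question) -> dict:
    """Run the co-pilot over the whole eval set and aggregate the scores."""
    with open(eval_path) as f:
        cases = [json.loads(line) for line in f]
    scores, cited = [], 0
    for case in cases:
        result = answer_question(case["question"])  # hypothetical co-pilot API
        citations = set(result.get("citations", []))
        if citations:
            cited += 1
        scores.append(f1(citations, set(case["reference_citations"])))
    return {
        "mean_f1": sum(scores) / len(scores),
        "citation_coverage": cited / len(cases),
        "n": len(cases),
    }
```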
What shipped.
Eleven-second average answer time, end to end. F1 of 0.94 against attending consensus on the eval. One hundred percent of answers carry a traceable citation set. The system is in daily clinical use across two academic medical centers, and the eval runs on every commit.
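
One way a per-commit gate over that eval could be wired up, assuming a pytest-style CI job and the hypothetical `run_eval` and `answer_question` helpers from the sketch above; the thresholds simply mirror the shipped numbers and are not the team's actual configuration.

```python
# test_eval_gate.py: illustrative commit gate, not the team's CI config.
from copilot import answer_question   # hypothetical import
from eval_harness import run_eval     # hypothetical import


def test_eval_regression():
    metrics = run_eval("eval_set.jsonl", answer_question)
    assert metrics["mean_f1"] >= 0.94           # F1 against attending consensus
    assert metrics["citation_coverage"] == 1.0  # every answer must carry citations
```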
"They redesigned a category. Our board now uses our app as the reference for what good feels like."
- Lead engineer
- AI systems engineer
- Compliance engineer
- Design lead
