Regulators expect human oversight
FDA's AI/ML SaMD action plan and the EU AI Act both require demonstrable human-in-the-loop validation for high-risk clinical AI. Crowd labelers do not meet the bar.
The volume of clinical AI shipping into hospitals has multiplied 240× in a decade. The evaluation infrastructure has not kept up.
Better Data. Better AI. Better Patient Outcomes.
FDA-authorized AI/ML medical devices as of Dec 2025 — up from 6 in 2015.
New FDA authorizations in 2025 alone — a record year, almost all via 510(k).
FDA's Predetermined Change Control Plan formalises lifecycle oversight of learning AI/ML devices.
Under the EU AI Act, high-risk medical AI is regulated alongside MDR/IVDR, and post-market clinical monitoring is now an obligation.
Frontier LLMs still fabricate diagnoses, medications and contraindications in safety-critical scenarios, even when summary quality looks high to a generalist reviewer.
Safety and accuracy follow different scaling laws in clinical LLMs: a bigger model can be more accurate on average yet more dangerous on the long tail. Generalist labelers cannot tell the difference.
Large-scale simulations of common presentations expose systematic reasoning failures across age, sex and comorbidity — failures invisible in standard multiple-choice evals.
Hallucination rates in medical text summarisation remain material, and summarisation is the very task AI scribes perform every day in production EHRs.
Health systems, payers and pharma procurement teams increasingly demand clinician-validated evaluation datasets before contracting AI vendors.
Domain-expert feedback — not crowd ratings — is the differentiator for safety-critical models.