Generalist annotators cannot catch what a doctor can.
Crowd labelers tend to rate fluent, confident answers as safe. A patient on warfarin asks about ibuprofen and the answer gets a green check, despite the bleeding risk of combining the two. A 78-year-old with confusion and fever gets routed to a GP visit next week, when that presentation calls for same-day assessment. The model looks good in eval. The harm shows up in production.