In brief
- Nearly half of AI chatbot responses to health questions were rated "somewhat" or "highly" problematic in a BMJ Open audit of five major chatbots.
- Grok produced significantly more "highly problematic" responses than statistically expected, while nutrition and athletic performance questions fared worst across all models.
- No chatbot produced a fully accurate reference list.
Nearly half of the health and medical answers provided by today's most popular AI chatbots are wrong, misleading, or dangerously incomplete, and they're delivered with total confidence. That's the headline finding of a new peer-reviewed study published April 14 in BMJ Open.
Researchers from UCLA, the University of Alberta, and Wake Forest tested five chatbots (Gemini, DeepSeek, Meta AI, ChatGPT, and Grok) on 250 health questions covering cancer, vaccines, stem cells, nutrition, and athletic performance. The results: 49.6% of responses were problematic. Thirty percent were "somewhat problematic," and 19.6% were "highly problematic," the kind of answer that could plausibly lead someone toward ineffective or unsafe treatment.
To stress-test the models, the team used an adversarial approach, deliberately phrasing questions to push chatbots toward bad advice. Questions included whether 5G causes cancer, which alternative therapies are better than chemotherapy, and how much raw milk to drink for health benefits.
"By default, chatbots do not access real-time data but instead generate outputs by inferring statistical patterns from their training data and predicting likely word sequences," the authors write. "They do not reason or weigh evidence, nor are they able to make ethical or value-based judgments."
That's the core problem. The chatbots aren't consulting a doctor; they're pattern-matching text. And pattern-matching on the internet, where misinformation spreads faster than corrections, produces exactly this kind of output.
The researchers continue: "This behavioural limitation means that chatbots can reproduce authoritative-sounding but potentially flawed responses." Out of 250 questions, only two prompted a refusal to answer, both from Meta AI, on anabolic steroids and alternative cancer treatments. Every other chatbot kept talking.
Performance varied by topic. Vaccines and cancer fared best, partly because high-quality research on those subjects is well-structured and widely reproduced online. Nutrition had the worst statistical performance of any category in the study, with athletic performance close behind. If you've been asking AI whether the carnivore diet is healthy, the answer you got was most likely not grounded in scientific consensus.

Grok stood out for the wrong reasons. Elon Musk's chatbot was the worst performer of any model tested. Of its 50 responses, 29 (58%) were rated problematic overall, the highest share across all five chatbots. Fifteen of the 50 (30%) were highly problematic, significantly more than expected under a random distribution. The researchers link this directly to Grok's training data: X is a platform known for spreading health misinformation quickly and widely.
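To make the "more than expected" comparison concrete, here is a minimal sketch that recomputes Grok's shares from the reported counts and asks how often 15 or more highly problematic answers would turn up in 50 responses if Grok matched the pooled 19.6% rate. The one-sided binomial tail is an illustrative assumption, not the authors' statistical method, and the pooled rate includes Grok's own responses, so treat the result as a rough guide only.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Figures reported in the article.
grok_responses = 50
grok_problematic = 29
grok_highly_problematic = 15
pooled_highly_rate = 0.196  # share of all responses rated "highly problematic"

share_problematic = grok_problematic / grok_responses
share_highly = grok_highly_problematic / grok_responses

# Assumption: use the pooled rate as the "expected" rate for a single chatbot.
expected_highly = pooled_highly_rate * grok_responses
tail_probability = binomial_upper_tail(
    grok_highly_problematic, grok_responses, pooled_highly_rate
)

print(f"{share_problematic:.0%} problematic, {share_highly:.0%} highly problematic")
print(f"Expected highly problematic at the pooled rate: {expected_highly:.1f}")
print(f"P(15 or more out of 50 at the pooled rate): {tail_probability:.3f}")
```

Under this assumption, roughly 10 highly problematic answers would be expected from Grok, against the 15 observed.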
Citations were a separate disaster. Across all models, the median completeness score for references was just 40%, and not one chatbot produced a fully accurate reference list. Models hallucinated authors, journals, and titles. DeepSeek even acknowledged it: the model told researchers its references were generated from training data patterns "and may not correspond to actual, verifiable sources."
The readability problem compounds everything else. All chatbot responses scored in the "Difficult" range on the Flesch Reading Ease scale, equivalent to college sophomore-to-senior level. That exceeds the American Medical Association's recommendation that patient education materials should not go beyond a sixth-grade reading level.
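For readers unfamiliar with the scale, the Flesch Reading Ease score is computed from average sentence length and average syllables per word, and lower scores mean harder text. The sketch below uses the standard formula, but the syllable counter is a crude vowel-group heuristic and the function names are hypothetical, so it will not reproduce the study's exact scores.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Standard formula: 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


plain = "The cat sat on the mat."
dense = "Immunotherapeutic interventions necessitate comprehensive pharmacological evaluation."

print(f"plain: {flesch_reading_ease(plain):.1f}")
print(f"dense: {flesch_reading_ease(dense):.1f}")
```

Short sentences built from short words score high (easy), while long, polysyllabic sentences push the score down. The "Difficult" band is commonly given as roughly 30 to 50.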
In other words, these chatbots use the same trick politicians and professional debaters tend to: they shoot you so many technical words in so little time that you end up thinking they know more than they do. The harder something is to understand, the easier it is to misinterpret.
The findings echo a February 2026 Oxford study covered by Decrypt that found AI medical advice no better than traditional self-diagnosis methods. They also track with broader concerns about AI chatbots delivering inconsistent guidance depending on how questions are framed.
"As the use of AI chatbots continues to expand, our data highlight a need for public education, professional training, and regulatory oversight to ensure that generative AI supports, rather than erodes, public health," the authors conclude.
The study only tested five free-tier chatbots, and the adversarial prompting method may overstate real-world failure rates. But the authors are direct: the problem isn't the fringe cases. It's that these models are deployed at scale, used by non-experts as search engines, and configured, by design, to almost never say "I don't know."