Dr. Scott Gottlieb is a physician who served as the 23rd Commissioner of the United States Food and Drug Administration. He is a CNBC contributor and serves on the boards of Pfizer and of several health and technology startups. He is also a partner at the venture capital firm New Enterprise Associates. Shani Benezra is a senior fellow at the American Enterprise Institute and a former associate producer of CBS News’ “Face the Nation.”
Many consumers and healthcare providers are turning to chatbots powered by large language models to answer medical questions and inform treatment options. We decided to see if there were significant differences in clinical capabilities between the leading platforms.
To obtain a medical license in the United States, aspiring doctors must pass all three stages of the United States Medical Licensing Examination (USMLE), with the third and final stage widely considered the most challenging. Passing typically requires answering roughly 60% of the questions correctly, and historically, the average score has hovered around 75%.
When we gave the same Step 3 questions to the leading large language models (LLMs), they performed far better than many doctors, achieving scores that significantly exceeded those of many physicians.
But there are some clear differences between these models.
USMLE Step 3, usually taken after the first year of residency, measures whether medical graduates can apply their understanding of clinical science to the unsupervised practice of medicine. It assesses new physicians’ ability to manage patient care across a broad range of medical disciplines and includes multiple-choice questions and computer-based case simulations.
We isolated 50 questions from the 2023 USMLE Step 3 sample test to assess the clinical proficiency of five leading large language models, providing the same set of questions to each platform: ChatGPT, Claude, Google Gemini, Grok and Llama.
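For readers who want to run this kind of side-by-side comparison programmatically, a minimal sketch of the scoring step might look like the following. The question file format, the model list and the ask_model helper are illustrative assumptions, not the exact procedure we used.

```python
# Illustrative sketch only: one way to tally each chatbot's score on a shared
# set of exam questions. The question file, answer key and the ask_model()
# helper are hypothetical placeholders.
import json


def ask_model(model_name: str, question: str) -> str:
    """Placeholder: send the question to the named platform and return the
    single answer letter it selects (e.g. "C"). Wire this to each vendor's
    API or chat interface before use."""
    raise NotImplementedError


def score_models(questions_path: str, models: list[str]) -> dict[str, float]:
    """Return each model's fraction of correct answers on the question set."""
    with open(questions_path) as f:
        # Expected format: [{"stem": "...", "choices": ["A. ...", ...], "answer": "C"}, ...]
        questions = json.load(f)

    scores = {}
    for model in models:
        correct = 0
        for q in questions:
            prompt = q["stem"] + "\n" + "\n".join(q["choices"])
            if ask_model(model, prompt).strip().upper().startswith(q["answer"]):
                correct += 1
        scores[model] = correct / len(questions)
    return scores


# Example (requires a real ask_model implementation and a question file):
# print(score_models("usmle_step3_sample.json",
#                    ["ChatGPT-4o", "Claude 3.5", "Gemini", "Grok", "Llama"]))
```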
Other research has evaluated the medical proficiency of these models, but as far as we know, this is the first time these five leading platforms have been compared side by side. The results can give consumers and providers some insight into where they should turn.
Their scores are as follows:
- ChatGPT-4o (OpenAI) — 49/50 questions correct (98%)
- Claude 3.5 (Anthropic) — 45/50 (90%)
- Gemini Advanced (Google) — 43/50 (86%)
- Grok (xAI) — 42/50 (84%)
- HuggingChat (Llama) — 33/50 (66%)
In our experiment, OpenAI’s ChatGPT-4o performed best, scoring 98%. It provided detailed medical analyses, using language reminiscent of a medical professional. Not only did it give answers with extensive reasoning, it also placed them in the context of the decision-making process, explaining why the alternative answers were less appropriate.
Claude, from Anthropic, came in second with a score of 90%. Its responses used simpler language and a bulleted structure that may be more approachable for patients. Gemini, which scored 86%, gave answers that were less thorough than those of ChatGPT or Claude, making its reasoning harder to decipher, but its answers were succinct and direct.
Grok, the chatbot from Elon Musk’s xAI, scored a respectable 84%, but it provided no descriptive reasoning during our analysis, making it difficult to understand how it arrived at its answers. HuggingChat, an open-source chat platform built on Meta’s Llama, scored the lowest at 66%. It did, however, show good reasoning on the questions it answered correctly, providing succinct responses and links to its sources.
One question that most of the models got wrong concerned a hypothetical 75-year-old woman with a heart condition. The question asked what the most appropriate next step would be as part of her evaluation. Claude was the only model that produced the correct answer.
Another noteworthy question focused on a 20-year-old man who presented with symptoms of a sexually transmitted infection. It asked the physician which of five options was the most appropriate next step. ChatGPT correctly determined that the patient should be scheduled for an HIV serology test in three months, but the model went a step further, recommending a follow-up within a week to make sure the patient’s symptoms were resolving and that the antibiotics covered the infecting strain. To us, this response highlighted the model’s ability to reason more broadly, beyond the multiple-choice options presented by the exam.
These models were not designed for medical reasoning; they are products of the consumer technology space, built to perform tasks such as language translation and content generation. Despite their non-medical origins, they displayed surprising clinical reasoning abilities.
New platforms are being built specifically to solve medical problems. Google recently introduced Med-Gemini, a refined version of its earlier Gemini models that is fine-tuned for medical applications and equipped with web-based search capabilities to enhance clinical reasoning.
As these models develop, their ability to analyze complex medical data, diagnose conditions and recommend treatments will improve. They may offer a level of accuracy and consistency that human providers, hampered by fatigue and error, can sometimes struggle to match, opening the way to a future in which treatment portals could be powered by machines rather than doctors.