Three large language models (GPT-4o, Grok2, and Gemini) successfully passed written assessments administered to UK dental and dental hygiene students, according to a recent study evaluating AI performance in dental education.
The models were tested on their ability to answer and generate questions across multiple examination formats. Researchers evaluated the models using 340 multiple-choice questions (MCQs), 80 short-answer papers (SAPs), and 3 structured oral examinations from the Bachelor of Dental Surgery (BDS) and Dental Hygiene and Therapy (DHT) programs. Each model also generated 140 new exam questions across various dental topics and difficulty levels.
All three models passed the MCQ and SAP assessments. In the BDS cohort, no statistically significant differences were observed in MCQ (P = .71) or SAP performance (P = .07). In the DHT SAPs, however, GPT-4o and Grok2 significantly outperformed Gemini (P = .01 and P < .001, respectively). Grok2 achieved the highest scores in the BDS SAPs, although differences were not statistically significant.
Despite strong performance in answering existing exam questions, the models’ ability to generate new exam content was inconsistent. Common issues included double negatives, outdated or incorrect information, and overly complex answer choices that exceeded undergraduate expectations.
For example, one SAP required students to explain why surgery might not be considered despite strong clinical indications; the question included a lengthy, confusing scenario that relied on unclear logic. MCQ generation errors included multiple arguably correct options and the inclusion of advanced content, such as implant surface specifications, that falls outside the undergraduate curriculum.
LLM-generated SAP mark schemes often lacked sufficient clarity or were disproportionately detailed relative to the mark allocation. Some schemes were too vague for consistent scoring, while others were overly exhaustive and impractical for undergraduate assessments.
Structured oral exam generation was particularly poor. While GPT-4o and Grok2 produced a few usable question stems, none of the models generated complete exams that met acceptable standards. All required substantial human revision.
The researchers evaluated the AI-generated content using Bloom’s taxonomy and Miller’s pyramid. Most questions targeted lower-level cognitive skills, such as recall and comprehension. Some harder questions in subjects like oral surgery or biosciences reached the application level, but none assessed higher-order thinking skills, such as analysis or evaluation.
The authors concluded that while large language models demonstrated the ability to pass undergraduate dental assessments, their reliability in generating usable exam content remains limited. Human oversight is necessary to ensure that exam materials are accurate, appropriate, and aligned with intended learning objectives.
The authors declared no conflicts of interest.
Source: BDJ