State of the Science: Artificial Intelligence in Medical Education

By G.K. Schatzman

Since ChatGPT made its public debut in 2022, the exploding generative artificial intelligence (GenAI) industry has been driving headlines, markets, hopes and fears across sectors. On the floor of Congress, early testimonies about the potential dangers of a technology whose growth outpaces its guardrails have been largely supplanted by calls — even by the selfsame industry leaders — to secure America’s dominance in the global tech race. Organizations are grappling with how to safely leverage new capabilities, and as usual, higher education is thoroughly in the mix.

Current and future physicians at Drexel University College of Medicine are managing the challenge through policy, research initiatives, inspiring student projects and even a proposal for an AI literacy curriculum. From matriculation to graduation and beyond, large language models (LLMs) are quietly shaping medical education in every corner.

Applications & Admissions: The Chatbot Tradeoff

Vanessa Pirrone, PhD, assistant dean of admissions; associate professor, Department of Microbiology & Immunology

Beginning in fall of 2025, incoming medical students have encountered AI in their journey to Drexel — and not just in getting their applications together. In fact, Assistant Dean of Admissions Vanessa Pirrone, PhD, says students who use AI in creating their applications may unwittingly put themselves at a disadvantage.

The Office of Admissions doesn’t use an AI detector; application materials come through the centralized AMCAS platform, and the unreliability of current detection technology is well documented. Still, Pirrone says, students with computer-perfect applications risk “losing their authentic voice.” For Pirrone, when polished text can be produced to spec in a click, imperfections become personal, even precious, like the tell-tale craft marks of a handmade good.

“They’re not showing us who they are,” Pirrone says of applicants who overuse AI. “Oftentimes, we see this really polished but flat application, and for us, it just kind of blends in. We get 16,000 applications every year. You have to find ways to make yourself stand out, and the way that you stand out is by being yourself.”

Human touch is important for Pirrone, who makes a point of getting hands-on with applications and interviews to understand the unique talents and experiences of each incoming class.

“Every year when the students come in, I am so incredibly proud of all of them, because I see where they started,” she says. “Then I get to watch them through all the years and see the amazing things that they do. They’re serving the community. They’re living our mission every day and pushing the envelope.”

This fall, MD program admissions began using a new AI tool, AMP AI, integrated into its admissions management platform through ZAP Solutions. The tool offers mission-integrated insights like its competency analyzer, which ZAP claims “condenses and summarizes free text fields” into a customizable domain score. The tool provides admissions staff with extra metrics on how a candidate aligns with the school’s values. For now, only staff from the Office of Admissions will be training on the technology so that they can coordinate the shift.

Pirrone says the goal isn’t to spend less time on each application, but rather to “make us more consistent and ensure that we’re not losing the reason to say yes. When you go through applications, it’s not about finding the reason to say no. It’s about finding that reason to say, ‘Yes, this is a person that would enrich our university.’”

Human review and judgement still have priority, Pirrone says: “In the end, this is a tool, it’s not a person.” The office also remains committed to holistic review, understanding candidates in the context of their stories.

Learning Patient Care With Llama

Emily Spengler, MD, assistant director, Foundations of Patient Care I; assistant professor, Department of Pediatrics

Emily Feng, MD ’27

Michael Jayasuriya, MD ’27

Once they’ve arrived at Drexel, medical students spend their first two years focusing primarily on their didactic coursework, building the knowledge base necessary for their clinical work. Emily Spengler, MD, practices general outpatient pediatrics at St. Christopher’s Hospital for Children in North Philadelphia and is the assistant director of one of these early courses, Foundations of Patient Care I, where students learn the fundamentals of taking a patient’s history and physical in a relationship-centered, patient-focused manner.

Last year, two of her Foundations students, Emily Feng, MD ’27, and Michael Jayasuriya, MD ’27, approached her with a problem: Practice sessions with standardized patients (hired actors) felt too infrequent.

“The clinical skills portion of our curriculum is less emphasized in the first two years,” Jayasuriya explains. “We’ll get to talk to one standardized patient every month.”

With Jayasuriya’s background in software engineering and Feng’s in computational biology, the two were able to devise a solution. “Emily and I had an idea: Could we make an AI chatbot for us that would serve as a standardized patient? Then, you could just log in whenever you want, chat with the chatbot with your voice and it would speak back to you.”

Spengler’s interest in clinical skills, feedback systems and health literacy made her a match as a mentor for the project. While observed structural clinical examinations, or OSCEs, are a longstanding way of providing feedback to medical students on their patient interactions, Spengler agreed with her students that there is room for more support. “I think one of the problems with health literacy education is that, from medical school on through residency and as doctors, we’re not given much feedback on how clear we are when we talk to patients,” she says. “Patients don’t really tell you when they don’t understand. And once you’re no longer a med student, nobody’s really observing you and telling you, ‘Hey, I don’t think the patient understood that word.’”

“What I thought would be cool is if we incorporated some of the objective, real-time skills that AI is capable of into giving real-time feedback on these skills to students,” Spengler says. And the tool needn’t be limited to students; physicians could opt in, too. “My hope is that by getting this feedback, clinicians are better able to improve their ability to communicate with patients.”

With Spengler’s support and guidance on metric and feedback domains, Feng and Jayasuriya set about designing the Patient Interaction Analysis Tool, or PIAT Learn. For the user, whether student or clinician, it’s simple.

“You talk to the chatbot, it chats back to you,” Feng explains. In this case, “talk” is literal: It both receives and outputs audio for real-time practice, as well as a transcript for later review. “We’ve built in a feature to calculate certain metrics and give feedback immediately after the encounter.”

PIAT Learn provides a recap of the spoken grade level, how often you checked for understanding, and the number of questions asked, pauses offered and conversational “turns” taken — quantifiable ways of thinking about interaction dynamics. Perhaps more ambitiously, it also aims to provide feedback on when you used jargon and even when you showed empathy.

Speaking level was easy, Feng says; those kinds of metrics have already been validated in health literacy literature. The back-end work for coding empathy remains an ongoing challenge. But the progress they made on the jargon identification may provide a promising path.

“Jargon to me might be different from jargon to another person,” Jayasuriya says, and the original algorithms that leveraged word-use frequency packages didn’t always align with reality. “It would pick out words that are infrequently used in the English language but that I think are understandable, and it would miss words that are supposedly frequently used, but that might actually be unfamiliar to non-experts.”

Now, however, they’re approaching the problem through prompt fine-tuning: giving the bot, a version of Meta’s open-source Llama model, a role or persona and having it calibrate its jargon judgments accordingly. Then it flags instances of jargon and suggests substitutions for next time.

Feng and Jayasuriya are continuing to improve PIAT Learn, but they already had a chance to test it with some of Spengler’s residents at St. Christopher’s. At this stage, the experiment was for quality improvement rather than a proper research study, but it has already yielded helpful feedback on the user interface and instruction. Perhaps more importantly, it raised big-picture questions about metrics, surveillance and GenAI technology in the workplace.

A quarter of the residents involved in the user test expressed concerns about the new metrics, Jayasuriya says. Having AI listen in on their patient conversations could make them feel scrutinized. “The biggest concern that they noted was that they thought this was a tool to test them, when our goal was mainly just to provide an educational tool for them,” he says.

Like all of us grappling with increased AI integration, Spengler, Feng and Jayasuriya are working to discover the guardrails that support advances in medical practice while maintaining or even furthering its essential humanity.

“I don’t want an AI system to be grading me and docking my pay,” Jayasuriya says. “But I am still motivated by the idea that I want more ways to improve my clinical skills right now.” The two designers have committed to making the software “copy-left,” an open-source model that stipulates future branches of their work from other programmers also continue to be open-source.

Spengler imagines using a tool like PIAT Learn to create opportunities for side-by-side reflection between trainees and mentors, like watching film with a coach after a game. The potential draw to “perform to a metric rather than keep in mind the humanity of that doctor-patient interaction” is a concern for her, too. “I want my students to be thinking about, ‘Does my patient understand me? Am I communicating clearly?’ But the very first thing I always want them to be thinking about is preserving that doctor-patient relationship and keeping that connection with the patient.”

Creating metrics for the previously unquantifiable also presents an opportunity for soft skills — in Spengler’s opinion, the biggest blind spot of many beginning physicians — to finally get the attention they need.

“I think if this tool is shown to be valid — and we’re not there at all — this can just be one of the many other ways that we’re assessing medical students, just the way that multiple choice tests are,” Spengler says. “When we have valid evaluations of these softer skills, I think students will take them a little more seriously. And how do we know we’re effectively teaching something if we’re not assessing it? The more validity we can create in these assessments, the better.”

Sifting Surveys With ChatGPT

Carolyn Giordano, PhD, associate dean of assessment and evaluation; professor, Department of Family, Community & Preventive Medicine

Multiple-choice questions are popular on tests and surveys for a reason: They take guesswork out of grading and provide instant, clear-cut datasets. Free-response questions allow for a wider variety of expression, but ensuring reliable analysis across hundreds or thousands of responses is a process unto itself. While there are established research methodologies for categorizing and tagging content, their costliness in both resources and time limit their application. Researchers like Carolyn Giordano, PhD, associate dean of assessment and evaluation, hope that might be changing.

In addition to overseeing the exams that students take throughout their time in medical school, and all the course evaluations, surveys and peer evaluations, Giordano works with students interested in researching medical education, which focuses on everything from how medical schools are educating students to how medical systems are educating the public. Over the years, she says, her own interests have made her a sort of “go-to” person for all kinds of social science research.

“I’m kind of a fiddler,” Giordano says. “I naturally wonder about how you can analyze things more efficiently, and thought maybe we can look at AI.”

Over the course of medical school, Drexel students provide a huge amount of feedback to the school itself, in both fixed- and free-response formats.

“We have over 300 students. We have 60 evaluations a year, or more. We get a ton of survey information from student feedback and course feedback, and a human reads that. We read every single word that students tell us,” Giordano says. “Well, we started using Microsoft Copilot to read responses, to ask questions about where different themes showed up.” Using AI, the survey reviewers can draw insights about specific faculty or classes, or what students liked about a certain textbook or set of learning materials.

The tool itself isn’t trained in the methodology and nomenclature Giordano and her team use, she says, and is no replacement for human insight on surveys. “We always read the results. But between the end of the semester and the first day of classes, there’s not a lot of time to make changes. AI really helps with speed.” Now, a survey specialist reads the responses, leveraging AI to decrease turnaround time. In turn, Giordano receives the reports sooner, leaving more time for a secondary read before disseminating the results, which then leaves more time to implement changes.

Simran Shamith, MD ’26

But what about surveys outside the purview of student experience, in medical research? Simran Shamith, MD ’26, has worked with Giordano to leverage AI to create a survey validation tool that works in tandem with focus groups. Two years ago, they used the then-current free version of ChatGPT to validate a survey, examining it to see what made sense, what didn’t, and how people were processing the questions. And while they found the tool couldn’t replace focus groups, only complement them, its speed was on a different order of magnitude.

“We could do it in about six seconds versus one hour of hosting the focus group, hearing different feedback and synthesizing the findings,” Giordano explains.

Shamith thinks the tool can accelerate the survey creation process, allowing for rapid iteration before incurring the time and expense of a focus group for final review. Once again, the large language model excelled at fine-tuning survey language. “It wasn’t just giving me another word for bias,” Shamith explained in one example. “It was giving me ways to describe to students what I’m trying to get out of them when I’m saying ‘bias.’”

Policies, Dilemmas and Taboos

Discerning what generative AI can do and what it can’t, how it should be used and how it shouldn’t, is essential as Drexel contemplates adoption strategies. In fact, it’s a core element of the AI Fluency Framework¹ released last year by Anthropic, an industry-leading public benefit corporation noted for its commitment to more-responsible AI development. With a technology this disruptive and rapidly developing, policy struggles to keep up.

As of writing, Drexel’s Academic Integrity Policy page² mentions use of generative AI in the final bullet points of its sections on cheating and plagiarism, instructing students to follow instructors’ guidance on acceptable use. This follows a November 2023 policy³ from the Office of the Provost, which was up for review in fall 2025, that grants instructors “broad discretion to define the suitable use of Artificial Intelligence Tools in the classroom,” along with “the responsibility … to include in the course syllabus a clearly written description of the permitted use of AI tools.” In turn, the policy outlines students’ responsibilities to adhere to instructors’ policies, cite AI usage appropriately, and bear ultimate responsibility for the work they submit. AI detection tools are discouraged due to their documented unreliability. A separate Information Security page⁴ directs faculty and staff to use only approved GenAI tools in order to ensure the privacy of sensitive data, a list of which can be found on the Provost’s AI at Drexel home page⁵ alongside a new Digital Commons space design for faculty and staff to share insights and practices.

For many instructors, though perhaps fewer today than two years ago, the “ch” in ChatGPT stands for “cheating.” Plagiarism is a leading concern. As Giordano argues, though, it isn’t one the College is unequipped to handle.

“The toothpaste is out of the tube, and we’re not putting that back. The tool exists. But rules against plagiarism exist too, so we can talk about that in an open way. You were never able to plagiarize material and pretend it was your own.”

Misinformation is another concern raised by students and faculty alike. Using generative AI is quicker — much quicker — than consulting traditional reference materials, even if we take “traditional” to mean PubMed instead of a library book.

“Students are humans, and humans do love shortcuts, don’t we?” Giordano says. “One of my worries is that students use AI in place of other reliable materials. I think that they have to use this as one paintbrush in the whole artistic arsenal.”

Unlike databases that require a degree of baseline knowledge to search and synthesize results, large language models are approachable from ignorance. A question asked in plain language will provide a response that will make sense to the user whether or not it is, in fact, accurate. What educators call “productive struggle” — incrementally building durable knowledge foundations by grappling with challenges just beyond your current understanding — can be diminished simply by virtue of the technology’s ease.

For Giordano, a training opportunity can be found in the imperfections of large language models. “When you put a prompt into ChatGPT, it often gives you too much information, not enough information or inaccurate information,” she says. “I think that students are learning that this is also what humans do. When they go to take a patient’s history, the patient is going to talk too much, talk too little, forget information or just make things up. In that way, I think it’s actually training them pretty well to be a future physician.”

Shamith also identified misinformation, for both patients and physicians, as a concern. Still, as a student, she benefits from AI as a study tool. “Other students and I have started asking ChatGPT, ‘I’m going into this case. What are some of the things that I might see? What are the steps that we’re going to do? What are things that I could get quizzed on?’ And honestly, that has been life changing.”

As AI becomes increasingly integrated into our favorite software, we’re all likely using it more often than we realize, from text-editing to suggested email replies. If GenAI is indeed becoming ubiquitous, what are the risks of not understanding it — and who is leading the charge in AI literacy?

Currently at the College, students are pushing the envelope with ideas for new AI initiatives and integrations. Faculty instructors and mentors have guided those initiatives, helping locate them within the larger system of medical education.

“Dr. Giordano and Dr. Spangler are amazing in the sense that they’re able and willing to adapt. The fact that they’re willing to learn from their students is amazing. I don’t think teachers have to know everything,” Shamith says. “It’s an evolving landscape for everyone.”

Even so, it’s paramount that students be prepared for the current state of science, and self-discovery may not always be sufficient.

“I’m not an educator yet, but I think it’s our responsibility, or teachers’ responsibilities, to teach students how to correctly use AI, just like our teachers taught us how to correctly use sources and how to research and use PubMed,” Shamith says. “I think it’s just the next version of all this. Before us, our parents were taught how to use books. Then we were taught how to use the internet. Now, we’re going to be taught how to use AI.”

A Charge to Be Led

Spencer Moavenzadeh, MD/PhD student

Artificial intelligence curriculum is a current passion of Spencer Moavenzadeh, a second-year MD/PhD student with a background in computer science and biomedical engineering. Despite the buzz around generative AI models, and especially large language models like ChatGPT, he emphasizes that these aren’t the only models being used in research.

For example, while Moavenzadeh was working as an ultrasound engineer, he had a project where they optimized three modalities of ultrasound into a single image using a neural network — a machine learning model that works sort of like our brains do, and that predates our current large language models. As he explains it, “They’re all mechanisms by which the algorithm effectively learns to or works its way iteratively to a solution that is optimized.” In AI research at large, machine learning [ML] is closely partnered, and AI/ML applications for research, including medical research, abound, from random forest decision algorithms to merging images for comprehensive analysis.

“As the physician reading this image, I think you would want to know that it is effectively an artificial creation. Yes, it is grounded in three images that are somewhat real, but they are merged together in a deep neural net,” which affects what parts of the data are directly interpretable, Moavenzadeh says. “If there’s an artifact in there, it would help, in my opinion, to know what the basis of the model being used to design it was, to see whether or not you can trust that. I think that is going to be a challenge that a lot of physicians will face in the future, both while interpreting papers and the new state-of-the-art thing that comes out, or while doing their own research.”

That’s why Moavenzadeh is actively working to develop an AI literacy curriculum for the College of Medicine that covers not only the more recent rise of generative AI but also the widespread use of machine learning in all kinds of medical research, a project he also considers a way of fostering his own learning. His proposal is two-pronged: First, create a condensed Foundations of Machine Learning elective course focused on the principles of machine learning and the models currently in use, and second, provide concrete examples of current implementations in medicine. The latter, he imagines, might be a good opportunity for a speaker series.

The medical school curriculum is already busy, but it makes sense to look for time in intersession or elsewhere — because in Moavenzadeh’s view, AI, from large language models like our favorite chatbots to neural networks for deep processing, really is everywhere. “I would say every researcher is probably using it. Almost in every field, and probably every single lab to at least some degree.”

There’s no shortage of big decisions on the horizon, and when it comes to AI itself, none of us control the pace of innovation. Drexel has pioneered programs in physician-patient communication, medical humanities and professional formation — an attitude toward innovation that, along with those who take the initiative, can serve the school well in the AI realm. Fortunately, there are excellent people on the task. As Assistant Dean Pirrone says with a smile full of pride, “Don’t we just have the best group?”

Resources:

anthropic.skilljar.com/ai-fluency-framework-foundations
drexel.edu/studentlife/community-standards/ code-of-conduct/academic-integrity-policy
bit.ly/47Sivyi
drexel.edu/it/security/policies-regulations/ai-guidance
drexel.edu/provost/ai