The Role of Large Language Models in Educational Simulations

Mack Institute Research Assistant Lennart Meincke and Wharton Associate Professor of Management Andrew Carton have released a new paper in our series of working papers on ChatGPT, spearheaded by Mack Faculty Director Christian Terwiesch. The paper, entitled “Beyond Multiple Choice: The Role of Large Language Models in Educational Simulations,” compares LLM-generated feedback on student work with feedback from a human instructor and from traditional Natural Language Processing (NLP) techniques.

The abstract of the paper reads:

This case study explores the potential of Large Language Models (LLMs), specifically GPT-3 and GPT-4, to enhance the educational experience through real-time simulation feedback. Utilizing a custom-built educational simulation for multiple classes at Wharton, we compared the real-time feedback generated by LLMs against that provided by a human instructor. In addition, we compared the LLM results against the first iteration of the simulation, which utilized traditional natural language processing (NLP) techniques. The evaluation was conducted across three distinct cohorts at Wharton – undergraduate students, Daytime MBA students, and Executive MBA students – with multiple iterations and improvements. Our results show that LLMs dramatically improved real-time feedback provided to students when compared to traditional NLP methods, at very low cost. In addition, the leap from GPT-3 to GPT-4 is significant, boosting correlations between model and instructor ratings from 0.33 to 0.77. Students commented on how real-time feedback to their open-ended responses was a major improvement over traditional simulations, which typically involve students responding to multiple choice questions or otherwise making decisions according to a fixed set of options. The simulation was the highest rated out of a dozen exercises in a midterm poll of undergraduates taking a core management class, outperforming other well-received exercises, such as Harvard’s Everest Simulation. We discuss the implications of these findings for educational simulations, the associated risks of deploying LLMs, and the student classroom experience.

Read the full paper here