For many students, asking ChatGPT for homework help has replaced raising a hand in class or going to office hours. As AI becomes a default classroom tutor, educators are grappling with a new question: how do you design these tools so they actually support learning?

A new paper from Angel Tsai-Hsuan Chung (Wharton, PhD Candidate in OID), Botong Zhang (Penn), Prof. Ling-Chieh Kung (National Taiwan University), Prof. Hamsa Bastani (Wharton), and Prof. Osbert Bastani (Computer and Information Science, Penn) explores how small design changes can make AI tutoring more effective by emulating the most effective practices of human instructors. This experiment was funded in part by the Mack Institute.
The researchers built an AI tutoring platform that gives all students access to the same GenAI chatbot and course materials, but varies the sequence in which practice problems are assigned. In a five-month Python course across 10 Taipei high schools, students were randomly assigned to one of two groups: one received a standard sequence of problems progressing from easy to hard, while the other received a personalized sequence, in which an algorithm adjusted problem difficulty based on each student’s performance and interactions with the AI tutor. Because everything else was held constant, this design isolates the impact of personalized homework.
Students who received personalized problem sequences significantly outperformed those following the standard curriculum, improving exam scores by 0.15 standard deviations (SD) — by some estimates, equivalent to six to nine months of additional learning — without increasing instructional time or teacher workload.
We spoke to Bastani and Chung about their experiment and what it can teach us about designing better AI for the classroom.
Q: What inspired this classroom experiment?
BASTANI: Our previous research showed that generative AI can harm learning by encouraging over-reliance. Students end up asking for help too early and undermining their own “productive struggle.” So one of the questions we began exploring was whether we could preserve that productive struggle through personalization.
In other words, AI could offer more than just 24/7 access to a teaching assistant; it could also enable personalized learning. In a traditional classroom, you have to teach to the lower middle, something like the 33rd percentile student. You want to make sure the strongest students are not bored, but you also need to revisit material for those who are struggling. Even then, you often can’t fully reach the students who are having the hardest time.
As a result, it’s difficult to match the pace for everyone. The idea was to personalize practice problems in a way that both creates more productive struggle and better matches each student’s current level of mastery.
CHUNG: The key point is personalization. If you don’t personalize—or if you focus on more advanced students—slower students can’t catch up. But if you move too slowly, the more advanced students aren’t going to learn much either. With AI, you can really tailor instruction to each student’s individual learning trajectory and use that to adapt and help them learn more.
Q: This experiment is based on an educational chatbot you developed to assist students in a Python certification course. How did the chatbot and the experiment work?
CHUNG: There is a single chatbot, with guardrails designed to preserve learning. It follows a clear pedagogical approach—it does not give students answers directly. Instead, students need to show effort, and the chatbot provides guidance to support their learning.
The only difference between the control and treatment groups is the sequence of practice problems.
BASTANI: Students work through a series of problems drawn from a bank of about 40 per module. In the control group, the sequence is fixed and mirrors standard homework, with problems progressing from easy to medium to hard.
In the treatment group, the sequence is personalized. We use data from chatbot interactions—how students engage with the AI—as well as platform data, such as their code submissions and solution attempts, to infer their current level of understanding.
Based on those signals, we dynamically adjust the difficulty of the next problem: students either stay at the same level or move up, depending on whether they’re ready.
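To make the mechanics concrete, here is a minimal sketch of a difficulty-adjustment policy of this general shape. The function name, signals, and thresholds are all illustrative assumptions; the paper's actual policy is learned via LLM-guided reinforcement learning, not hand-coded rules like these.

```python
def next_difficulty(current, solved, hints_used, max_level=3, hint_threshold=2):
    """Pick the next problem's difficulty level from simple mastery signals.

    Illustrative sketch only: the signal names and thresholds here are
    assumptions, not the study's actual (learned) policy.

    current:    the student's current difficulty level (1 = easy)
    solved:     whether the student solved the last problem
    hints_used: how many times the student asked the AI tutor for help
    """
    if solved and hints_used <= hint_threshold:
        # Confident solve with little help: advance to the next level.
        return min(current + 1, max_level)
    # Struggled, over-relied on hints, or failed: stay at this level
    # for more practice before moving up.
    return current
```

In this toy version, a student who solves a level-1 problem unaided moves to level 2, while one who leaned heavily on the chatbot repeats level 1 — capturing the "stay or move up, depending on whether they're ready" behavior described above.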

Q: The students who were given a personalized question sequence performed better on the final exam, possibly gaining as much benefit as an additional six to nine months of learning. How did you measure and interpret those results?
BASTANI: We find that students in the personalized group perform about 0.15 standard deviations better on the final exam. According to some estimates, that corresponds to roughly six to nine months of additional learning, although it’s hard to map that precisely.
What’s important is that the effects are meaningful, and they’re essentially free. We didn’t increase instruction time or teacher workload.
CHUNG: In education, effects are typically noisy and difficult to measure, so even results of this size are meaningful.
What’s also important is that both the control and treatment groups had access to the same AI tutor. Many studies show large gains from introducing AI tools, but in our case, the only thing we changed was how practice problems were assigned.
That’s a relatively small, nuanced intervention, so it was somewhat surprising to see an effect of this magnitude.
Q: Why is it important to study chatbots and generative AI in the classroom?
BASTANI: One of the things that’s happened with generative AI is that, as a professor, I now have no idea what’s going on with most of my students. They don’t come to office hours anymore.
Before, maybe half the students were hesitant to ask questions because they were worried about looking dumb, but the other half would come. Now, when I talk to colleagues, very few of them have students coming to office hours.
Nobody wants to ask questions in person because they can just ask ChatGPT. This has actually made things more difficult. The students who used to come and ask questions don’t anymore, so we have to infer what they’re struggling with from their GenAI chats.
Otherwise, we have no way of getting that information until we give them an exam, and, by then, it’s too late. This is why leveraging these signals is so important—to understand proactively what’s going on with students.
CHUNG: That’s also where “proactive” personalization becomes important. A lot of people think AI is already personalized, because it responds to whatever question you ask. But in education, the challenge is that you often don’t know what question to ask. You don’t know your current level of understanding or where you should push yourself to improve. If you don’t know what to ask, then you won’t benefit from the AI tutor, because you can’t even formulate the right question.
Through interaction with the platform, we can leverage signals from student behavior to provide the kind of personalization that helps push learning forward.
***
Read the full paper, “Effective Personalized AI Tutors via LLM-Guided Reinforcement Learning.”

