From AI Tutors to AI Study Mates
New research reveals how AI can enable real learning — not just productivity gains
Hey folks 👋
A quick question: if I told you that a tool helped students perform better on maths problems while they were using it, but left them measurably worse at the same maths once it was taken away — would you call that tool a learning aid?
This is what happened in a large randomised study of nearly 1,000 high school maths students in Turkey, run by researchers at Wharton and Penn (Bastani et al., 2025). Students were split into three groups: one had access to a standard ChatGPT-style interface running GPT-4 (“GPT Base”), one had access to a version with teacher-designed guardrails that gave hints rather than answers (“GPT Tutor”), and a control group had only their textbook and notes.
During the practice sessions, the AI groups crushed it. The GPT Base group performed a whopping 48% better than the control group. The GPT Tutor group performed 127% better.
Then, the researchers took the AI away and gave everyone an exam.
The students who had used the unguarded GPT Base performed 17% worse on the exam than the control group who had never had AI access at all. The researchers’ explanation was blunt: without guardrails, students had used GPT-4 as a “crutch” during practice, and subsequently performed worse on their own.
TLDR: the AI didn’t help students to learn. It helped them to perform.
This is the starting point of a major new paper from Khosravi and a roster of the most influential researchers in AI and education (May, 2026).
The researchers call the failure of AI tutors to actually enable learning the “learning-performance paradox”. Their key point is this: AI tutors tend to be optimised for task completion and actively undermine the processes through which durable learning happens.
The paper provides the most concrete and compelling case yet for the need to build very specific sorts of AI tools to support human learning. The authors argue that those building these tools must move past two failed patterns — default LLMs and AI tutors — and toward something new: AI learning companions (or, as I think of them, AI study mates).
Here’s the TLDR:
An LLM waits for your question or challenge, then answers or solves it for you — giving you faster work, but less learning.
An AI tutor or “study mode” waits for your question and asks questions back, regardless of what you actually need — delivering frustration, disengagement, and drop-out.
An AI study mate remembers what you’ve worked on, knows where you got stuck, and pushes you toward the thinking you’re avoiding — resulting in capability that persists when the AI is gone.

I read the paper carefully and extracted what we need to know as L&D professionals.
Let’s dive in!
The failure mode nobody is talking about
The paper opens with a sharp claim: AI tools optimised for task completion actively undermine the processes through which durable learning develops.
This observation is grounded in three pieces of evidence:
Bastani et al. (2025) — the high school maths study above. Better performance during, significant harm after.
Corbett & Tangen (2026) — AI dialogue beat textbook refutation at changing students’ beliefs immediately after the intervention. By the two-month follow-up, the advantage had completely disappeared.
Fan et al. (2024) — what the authors call metacognitive laziness: when AI is available, learners abdicate the planning, monitoring and self-evaluation that make learning durable. The work gets done. The capability doesn’t develop.
The point isn’t that AI is inherently bad for learning — it’s that the meta-analyses showing that LLMs improve assignment and performance scores are measuring the wrong thing. They’re measuring performance with the AI present, not learning that persists once it’s gone.
As Yan, Greiff, Lodge and Gašević (2025) put it bluntly in Nature Reviews Psychology: we’re confusing performance gains with learning.
In practice, this means every “AI improved completion rates by 40%” pitch you’ve been getting this year is potentially measuring the exact outcome that doesn’t matter. The outcome that does matter — retention, transfer, capability that persists when the tool is gone — is the one nobody’s reporting.
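To make the distinction concrete, here is a minimal sketch of the two measurements side by side: the uplift while the tool is present, and the effect once it is taken away. The scores below are invented purely for illustration (they are not data from any of the studies above), and the function and variable names are my own.

```python
def relative_change(group_mean: float, control_mean: float) -> float:
    """Percentage difference between a group's mean score and the control group's."""
    return (group_mean - control_mean) / control_mean * 100


# Hypothetical mean scores, invented for illustration only (not study data).
control = {"practice_with_tool": 50.0, "exam_without_tool": 60.0}
ai_group = {"practice_with_tool": 70.0, "exam_without_tool": 54.0}

# Performance: measured while the AI is available.
performance_uplift = relative_change(
    ai_group["practice_with_tool"], control["practice_with_tool"]
)
# Learning: measured after the AI has been removed.
learning_effect = relative_change(
    ai_group["exam_without_tool"], control["exam_without_tool"]
)

print(f"Performance with the tool: {performance_uplift:+.0f}% vs control")
print(f"Learning without the tool: {learning_effect:+.0f}% vs control")
```

A vendor pitch that only reports the first number tells you nothing about the second.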
AI for Work vs AI for Learning
There’s one core argument made in the paper which I think all L&D professionals need to internalise.
The LLMs entering education and training were not built for education or training - they were built for work. The logic that makes them useful at work — optimise the output, minimise cognitive effort, treat each interaction as disposable — is precisely what makes them unsuitable as learning tools.
The authors lay out the significant differences between “AI for Work” and “AI for Learning” across nine dimensions. Here’s what they look like when clustered into three groups:
Look at that list and a clear conclusion emerges: if we want to deliver AI tooling which supports substantive learning, we need to intentionally create a new category of AI tool for “learning at work” which prioritises learning and development over productivity.
Enter the AI study mate
The paper proposes the need for a new sort of AI tool for “learning at work”. Researchers call them AI learning companions. I think of them as study mates — the difference between a tutor who turns up to your session never having met you and a friend who knows where you got stuck last week, what you said about it, and what you’re trying to learn next.
The paper defines them as adaptive, pedagogically informed, LLM-powered agents designed for sustained integration into learning environments and explicitly engineered to prioritise durable learning over short-term performance.
The distinction between tutor and study mate matters. The “tutor” framing — the one that’s dominated AI-in-education since GPT-4 — treats every interaction as transactional and one-off. You arrive, you ask, you get an answer, you leave. The “study mate” framing treats the relationship as developmental and cumulative. The AI knows you, it remembers what you’ve worked through in the past, it adapts what comes next based on what’s already happened.
That’s not a prompt change. It’s a shift in how we build AI tools for learning, built on two core foundations.
1. The pedagogical foundation: how humans learn with AI
The companion’s job is to make the student do the thinking, not to present information well. A companion that explains clearly but lets the learner remain passive is, as the authors put it, “pedagogically no better than a well-produced video”.
Instead, we need to build products which deliver four critical conditions:
deep and interactive learning (generation, retrieval, desirable difficulties);
guided scaffolding (productive struggle, balanced with support);
learning-to-learn (metacognitive calibration, comparing what learners think they know against what they can actually do);
contextual learning (situating the work in authentic disciplinary practice).
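One of the four conditions, metacognitive calibration, is easy to make concrete. The sketch below assumes a tool asks learners to rate their confidence (0 to 1) before each answer and then compares the average rating with how often they were actually right. The function name and the sample numbers are hypothetical, not taken from the paper.

```python
def calibration_gap(confidence: list[float], correct: list[bool]) -> float:
    """Mean stated confidence minus actual accuracy.

    Positive values suggest overconfidence, negative values underconfidence.
    """
    if len(confidence) != len(correct) or not confidence:
        raise ValueError("need equal-length, non-empty inputs")
    mean_confidence = sum(confidence) / len(confidence)
    accuracy = sum(correct) / len(correct)  # True counts as 1, False as 0
    return mean_confidence - accuracy


# Hypothetical learner: fairly sure of every answer, right only half the time.
gap = calibration_gap([0.9, 0.8, 0.9, 0.8], [True, False, False, True])
print(f"Calibration gap: {gap:+.2f}")
```

Surfacing that gap back to the learner is the point: it shows them the difference between what they think they know and what they can actually do.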
2. The adaptive foundation — how AI learns about the human
This is the bit almost no current tool gets right. Most LLM applications are “one-off interactions” which start fresh every time. The AI knows nothing about what the learner has done before, what they’ve struggled with, what they’ve mastered.
To support human learning, we need to build AI products which deliver a four-stage cycle:
Capture the learner’s digital footprints;
Model them across cognitive, affective and behavioural dimensions;
Adapt both the curriculum and the within-task feedback;
Evolve the system itself through closed-loop evaluation.
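Here is a deliberately tiny sketch of that four-stage cycle, to show how the stages fit together. It is my own illustration, not the paper’s architecture: it models only the cognitive dimension (a single mastery estimate per skill), and the class, method names and update rule are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class LearnerModel:
    """Hypothetical learner model: one mastery estimate (0 to 1) per skill."""

    mastery: dict[str, float] = field(default_factory=dict)
    history: list[tuple[str, bool]] = field(default_factory=list)

    def observe(self, skill: str, correct: bool, rate: float = 0.3) -> None:
        # Stage 1 (capture): keep the raw interaction.
        self.history.append((skill, correct))
        # Stage 2 (model): move the estimate toward 1 on success, 0 on failure.
        prior = self.mastery.get(skill, 0.5)
        target = 1.0 if correct else 0.0
        self.mastery[skill] = prior + rate * (target - prior)

    def next_skill(self, skills: list[str]) -> str:
        # Stage 3 (adapt): practise the skill with the lowest estimated mastery.
        return min(skills, key=lambda s: self.mastery.get(s, 0.5))


def model_error(model: LearnerModel, delayed_scores: dict[str, float]) -> float:
    # Stage 4 (evolve): compare estimates with later, unassisted results.
    # A large error is a signal that the model itself needs revising.
    gaps = [abs(model.mastery.get(s, 0.5) - v) for s, v in delayed_scores.items()]
    return sum(gaps) / len(gaps)


model = LearnerModel()
for skill, correct in [("fractions", True), ("ratios", False), ("ratios", False)]:
    model.observe(skill, correct)

print(model.next_skill(["fractions", "ratios"]))
```

The important design choice is that the model persists between interactions. A tool that starts fresh every session has nothing to adapt from and nothing to evaluate against.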
Overall, this is the design-side counterpart to the offloading paradox. The U-curve research told us how learners should use AI; this “companion framework” tells us how AI should be built to make that possible.
What this looks like in the wild
The concept of the “AI study mate” is not entirely new. The paper profiles five existing flagship implementations which put the idea into practice. Three of these stood out to me as especially relevant for us as L&D professionals:
1. RiPPLE (University of Queensland)
Picture a study mate that doesn’t just answer your questions — it watches what you create, what you struggle with, and what your peers think of your work, and uses all of that to decide what to do next.
That’s what RiPPLE is. Students write their own practice questions and worked examples, peer-review each other’s, and then practise from the validated set. A companion sits alongside them through all of it — nudging them to justify their thinking when they’re creating, helping them give better feedback when they’re reviewing, and surfacing past misconceptions when they’re practising.
What makes RiPPLE different from almost every other AI tool in L&D right now is what’s happening under the bonnet. It builds a real, evolving picture of each learner — what they know, what they’ve got wrong before, how they compare to their peers, how engaged they are — and the companion uses that picture to shape every interaction. It doesn’t just respond. It adapts.
Of the five tools the paper looked at, RiPPLE is the only one that does this properly. Which tells you something uncomfortable about the state of the market: across five of the most pedagogically considered tools currently in deployment, the bit that actually makes a study mate a study mate — the part that learns about you over time — is largely missing.
2. Recast (University of Technology Sydney)
Recast isn’t a single companion — it’s a kit for building them. UTS faculty use it to design and deploy their own AI companions for their own courses. Three examples give you a feel for how they are used to support the learning experience:
Reflection Assistants that help students think through what they’ve learned and where they’re stuck before submitting written reflections
Critical Thinking Assistants that use Socratic questioning to surface the assumptions hiding inside a student’s research question
Role-Play Assistants that voice-act as a difficult patient or a tricky client, so Health and Law students can practise hard conversations before they have them for real
3. Khanmigo
The Khan Academy team started out designing Khanmigo as an AI Socratic tutor — never just giving the answer, always asking questions first. When they read the transcripts, they soon saw a problem: students were getting frustrated and dropping out.
Endless questioning when you genuinely don’t know the answer doesn’t build understanding. It builds disengagement. Some students were giving up and walking away from the tool entirely.
So, they started to rebuild the product using study mate principles. Now, Khanmigo encourages students to make an attempt, gives a hint if they’re stuck, and only walks through a worked example after they’ve genuinely tried. It meets the learner where they are, rather than performing the role of tutor regardless of what the learner actually needs.
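The attempt-then-hint-then-worked-example pattern can be sketched as a simple escalation rule. To be clear, this is not Khanmigo’s actual implementation: the function, the labels and the two-hint threshold are all hypothetical, and the sketch assumes it is called at the moment a learner asks for help.

```python
def next_support(attempts_made: int, hints_given: int, max_hints: int = 2) -> str:
    """Choose the lightest form of support that fits the learner's current state."""
    if attempts_made == 0:
        return "prompt_attempt"   # no attempt yet: ask the learner to try first
    if hints_given < max_hints:
        return "give_hint"        # tried and stuck: offer a hint, not the answer
    return "worked_example"       # hints used up: walk through a worked example


print(next_support(attempts_made=0, hints_given=0))
print(next_support(attempts_made=1, hints_given=0))
print(next_support(attempts_made=2, hints_given=2))
```

The contrast with a pure Socratic tutor is the last branch: there is always a route to real help once the learner has genuinely tried.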
The lesson from all three examples is twofold:
First: an AI tutor approach — even a pedagogically sophisticated one like Socratic questioning — can actively harm learning when the design ignores what the learner needs in the moment. Asking questions of someone with no foothold isn’t teaching. It’s gatekeeping.
Second: a study mate approach — adapting to the learner, meeting them where they are, fading support as they get stronger — isn’t just theoretically better. It’s empirically better. The Khan team didn’t get there by reading a paper. They got there by watching students fail and rebuilding the tool around what the learners actually needed.
This is the move every L&D professional building or buying AI tools should be taking note of: read the transcripts, and focus on study mate models, not tutor models of learner-AI interactions.
How current AI tools actually score against the spec
So if we were to build an optimal “AI companion” for learning, what would it look like?
Pulling the paper together with the offloading paradox research and decades of intelligent tutoring system work, here’s the spec. Eight capabilities AI needs to actually enable substantive learning:
Withholds answers architecturally, not just by prompt
Persistent multi-dimensional learner model, across sessions and tools
Trajectory adaptation — adjusts what comes next based on the model
Metacognitive calibration — surfaces the gap between confidence and performance
Faded scaffolding — provides less support over time, not more
Errors as diagnostic data, not failures to correct
Delayed-outcome evaluation — measures retention and transfer, not satisfaction
Open learner model — what the system thinks you know is visible and contestable
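If you want to run a tool you’re evaluating against this spec, a simple checklist scorer is enough. The sketch below assumes each capability is rated 0 (absent), 0.5 (partial) or 1 (present), which is my own rubric. The tool and its ratings are made up for illustration and do not describe any real product.

```python
SPEC = [
    "withholds_answers_architecturally",
    "persistent_learner_model",
    "trajectory_adaptation",
    "metacognitive_calibration",
    "faded_scaffolding",
    "errors_as_diagnostic_data",
    "delayed_outcome_evaluation",
    "open_learner_model",
]


def score_tool(ratings: dict[str, float]) -> float:
    """Total score out of 8; capabilities with no rating count as 0."""
    unknown = set(ratings) - set(SPEC)
    if unknown:
        raise ValueError(f"not in the spec: {sorted(unknown)}")
    return sum(ratings.get(capability, 0.0) for capability in SPEC)


def gaps(ratings: dict[str, float]) -> list[str]:
    """Capabilities the tool does not fully deliver, in spec order."""
    return [c for c in SPEC if ratings.get(c, 0.0) < 1.0]


# Hypothetical tool with made-up ratings.
example_tool = {
    "withholds_answers_architecturally": 0.5,
    "faded_scaffolding": 1.0,
    "errors_as_diagnostic_data": 1.0,
}

print(f"{score_tool(example_tool)} / {len(SPEC)}")
print(gaps(example_tool))
```

The list of gaps is usually more useful than the total: it tells you exactly which questions to take back to the vendor.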
Here’s how the categories of tool currently on the market score against that spec:
The story this tells is uncomfortable.
Default LLMs — what most of your learners actually use in the process of learning — score around 0.5 out of 8. They are productivity tools. Used for learning, they produce exactly the failure mode the paper describes.
Study modes within LLMs — less powerful than you might think. These sit in between at around 3.5 — better than default LLMs, but not by anywhere near enough to support substantive learning.
Intelligent tutoring systems (ITS) — these are AI-powered learning platforms that have been around since the 1980s. They were built to do exactly what we're now talking about: model what a learner knows, adapt the next problem to their current level, fade support as they master a concept, and only let them progress once mastery is demonstrated. Decades of research have shown that they produce learning outcomes close to those of one-to-one human tutoring.
By the standards of the study mate framework, ITSs are by far the closest thing the market has ever produced to a real learning companion. The catch is that they only work in narrow, well-modelled domains — maths, statistics, basic chemistry — where the knowledge can be mapped to a structured skill tree and where there’s a clean right-or-wrong answer to every problem. They don’t yet scale to open-ended capability development, soft skills, professional judgement, or the kind of messy real-world reasoning most workplace learning actually involves.
This is the gap LLM-powered study mates have the potential to close: pairing the conversational flexibility and domain range of LLMs with the learner modelling and pedagogical depth that ITSs have been quietly doing for forty years.
The headline finding: the closest thing we have to a learning-optimised AI is a decades-old technology that doesn’t scale. The newest, most-funded AI tutoring tools are the furthest away from the spec required to actually make learning happen.
The harder question this paper forces us to ask
If you read the paper carefully — and especially if you look at where the real innovation is happening — one pattern is impossible to ignore.
Every meaningful experiment in AI study mates is happening in education, not in workplace learning. RiPPLE is at the University of Queensland. Recast is at UTS. Khanmigo is at Khan Academy. CodeHelp is in computer science classrooms. JeepyTA is in higher ed seminars. The five flagship implementations in the paper are all from K–12 or higher education.
Not one is a corporate L&D tool.
Meanwhile, the AI tools being sold into workplaces — including most of what L&D teams are being pitched — are overwhelmingly built on the AI tutor model which does more to support productivity than substantive learning and development.
This forces an uncomfortable question: do workplaces actually want to build tools for human learning at work? Or do we just want AI tools which help to get the work done?
That isn’t a technology problem. It’s a values problem. Schools and universities have decided that substantive learning is the outcome, and they’re building tools accordingly. Workplaces have mostly decided throughput is the outcome and, so far at least, they’re investing in AI tools accordingly.
If we want that to change, L&D has to stop being the team that buys the tool after procurement has chosen it. We have to be in the conversation about what we’re optimising for in the first place. Capability or completion. Learning or performance. Growth or output.
Those aren’t the same thing. The Bastani study at the start of this post is the proof.
The L&D field has spent two years asking whether to use AI. The question for 2026 is sharper: what are we actually using it for?
So what now?
Here’s what gives me hope.
The research is increasingly unequivocal: AI tools that prioritise learning over performance can be built, and do work. RiPPLE has proven it at scale across 80,000 students. Intelligent tutoring systems like Carnegie Learning and ALEKS have been quietly delivering outcomes close to one-to-one tutoring for decades. The Khan team rebuilt Khanmigo in real time when the data told them to.
The technology exists. The pedagogy exists. The proof exists.
What’s missing in the workplace isn’t capability. It’s intent.
So the call to action for L&D in 2026 isn’t to wait for the perfect study mate to arrive in your inbox. It’s to decide — explicitly, deliberately, in the conversations you’re having about AI in your organisation right now — that learning is the outcome you’re optimising for. Not throughput. Not completion rates. Not “AI literacy” measured in tools-used-per-week. Capability that persists when the AI is gone.
If we make that choice, the tools will follow. If we don’t, we’ll get exactly what we’re being sold: faster work, shrinking workers.
Happy building,
Phil 👋
PS: Want to learn more about how to get optimal value from AI, with help from me? Check out my AI Bootcamp for L&D.




