Scaling Evidence-based Instructional Design Expertise Using AI
Deep-diving a recent study by Carnegie Mellon University
Hey folks! 👋
This week, I've been diving into some fascinating research published by Carnegie Mellon University in 2023 about AI in instructional design.
While most people were talking about using AI to generate content or automate grading, this research pointed to something way more interesting: using AI to scale evidence-based instructional approaches that have traditionally been too resource-intensive to implement broadly.
In this week’s blog post, I’ll summarise the research and share the implications for when and how we, as instructional designers, might work with AI to optimise its impact.
Let's go! 🚀
The Research Deep Dive
Gautam Yadav's team at CMU tackled a crucial question: Can we use Large Language Models (LLMs), specifically GPT-4, to scale evidence-based instructional design expertise? Their goal was to solve one of the most wicked problems in our field: bridging the gap between what educational research tells us works and what we can actually implement in practice.
To explore this question, they conducted two fascinating experiments that really push the boundaries of what's possible with AI in instructional design:
Experiment 1: Scaling Scenario-Based Learning Design
The first experiment tackled an e-learning course called "E-learning Design & Principles." The goal? Help students understand when and how to apply 30 different instructional principles effectively.
The team used a predict-observe-explain (POE) approach, in which students:
Predict what would happen in a teaching scenario
Explain their thinking
Observe what actually happens
Explain what they learned
Here's where it gets interesting. Instead of creating all 30 scenarios from scratch, they used an exemplar within a prompt to rapidly generate them with AI. In practice, this meant that they:
Created one perfect case study of a "Worked Example" activity
Used it as part of the system prompt for GPT-4 to generate worked examples for multiple scenarios
By doing this, the team cut development time by more than half while maintaining quality through a human-in-the-loop process, i.e. having an expert review every output.
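If you'd like to try this exemplar-in-the-system-prompt pattern outside the ChatGPT interface, here's a minimal sketch using OpenAI's Python SDK. To be clear, this is my illustration, not the CMU team's actual setup: the model name, prompt wording and placeholder exemplar are all things you'd swap for your own.

```python
# Minimal sketch: one carefully crafted exemplar in the system prompt ("one-shot").
# Assumes the openai Python SDK (v1+) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# Your single, expert-crafted predict-observe-explain (POE) exemplar.
exemplar_scenario = """<paste your best POE scenario here>"""

system_prompt = (
    "You are an expert instructional designer. Use the following scenario as a "
    "template for structure, tone and quality:\n\n" + exemplar_scenario
)

def generate_scenario(principle: str) -> str:
    """Draft a new POE scenario for a given instructional principle."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; use whichever model you have access to
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Create a new scenario that teaches: {principle}"},
        ],
    )
    return response.choices[0].message.content

# Draft scenarios for several principles, then route every one to an expert reviewer.
for principle in ["Worked Examples", "Spaced Practice", "Immediate Feedback"]:
    print(generate_scenario(principle))
```

The point isn't the code itself: it's that one expensive, human-crafted exemplar does the heavy lifting, and every output still goes through expert review before it reaches learners.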
Experiment 2: Scaling Active Learning Design
In the second experiment, the team’s goal was to develop hands-on programming assignments for a new Learning Analytics course.
Specifically, the team wanted to design "learn-by-doing" programming assignments in Jupyter Notebook to teach predictive modelling in Python.
The team developed a four-step process where they worked with GPT-4 like a collaborative partner. Here's what each step looked like:
First, they worked with GPT-4 to brainstorm some initial exercise ideas:
Can you give me 2 examples of hands-on exercises for "Implement a predictive model using Python" in Classifiers?
Second, once they had exercise ideas, they asked GPT-4 to write the actual code for the exercise, i.e. to flesh out the lesson plan with all the specific details:
Provide a worked example in Python for [previous exercise output]
Third, they tested the code in an online programming environment (Google Colab - think of it like Google Docs but for code). When they found problems, they went back to GPT-4 to solve them. They usually needed 3-5 rounds of back-and-forth to get everything working properly:
This code isn't working because [specific error]. Can you help fix it?
Finally, the team asked GPT-4 to turn the working code into interactive exercises:
Create practice activities with automated checks that will tell students if their code is working correctly
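To make that last step concrete, here's a rough sketch of the kind of learn-by-doing notebook exercise with automated checks this process might produce. The dataset, task and thresholds are mine for illustration, not taken from the CMU course.

```python
# Sketch of a "learn-by-doing" notebook cell with automated checks.
# Illustrative only; assumes scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# --- STUDENT TASK -----------------------------------------------------
# Train a classifier on X_train / y_train and store it in `model`.
model = LogisticRegression(max_iter=1000)   # reference solution shown here
model.fit(X_train, y_train)
# ----------------------------------------------------------------------

# --- AUTOMATED CHECKS -------------------------------------------------
# These asserts tell students immediately whether their code is working.
assert hasattr(model, "predict"), "`model` should be a trained classifier."
accuracy = accuracy_score(y_test, model.predict(X_test))
assert accuracy >= 0.85, f"Accuracy is {accuracy:.2f}; aim for at least 0.85."
print(f"All checks passed! Test accuracy: {accuracy:.2f}")
```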
Interestingly, unlike the first experiment, where showing GPT-4 one good example was enough, the team found that for the programming exercises they needed to show GPT-4 multiple examples to get good results.
This difference shows us something fascinating about how AI handles different types of instructional content. In the first experiment, the team was creating scenario-based questions that followed a clear, consistent structure. Think of it like giving AI a recipe - once it has one good example of predicting, explaining, observing, and reflecting, it can apply that same pattern to different topics.
The programming exercises were different. They involved multiple valid approaches, various coding patterns, and different ways to test whether code is working. Put simply, to get this right the AI has to "understand" and reproduce multiple strategies, and for that it needs additional "instruction" and multiple examples to build a clear picture of what "good" looks like.
This observation is particularly valuable for instructional designers because it shows that different types of learning content and activity need different approaches when working with AI. With generic LLMs like GPT-4, no single approach works in all contexts: sometimes one perfect template is enough; other times, you need to provide multiple examples to get the results you want.
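In prompting terms, this is the difference between one-shot prompting (a single exemplar) and few-shot prompting (several contrasting exemplars). Here's a rough sketch of how the message structure changes; the wording and examples are mine, not the study's.

```python
# One-shot: a single exemplar works when outputs follow one consistent structure
# (as with the POE scenarios in Experiment 1).
one_shot_messages = [
    {"role": "system", "content": "You are an expert instructional designer. "
                                  "Use this template:\n<your one POE exemplar>"},
    {"role": "user", "content": "Create a similar scenario for 'Spaced Practice'."},
]

# Few-shot: several contrasting exemplars help when outputs can legitimately take
# many forms (as with the programming exercises in Experiment 2).
few_shot_messages = [
    {"role": "system", "content": "You are an expert instructional designer who "
                                  "creates learn-by-doing Python exercises."},
    {"role": "user", "content": "Example 1: <a classification exercise>\n"
                                "Example 2: <a regression exercise>\n"
                                "Example 3: <an exercise with automated checks>\n\n"
                                "Now create a new exercise on decision trees "
                                "in the same style."},
]

# Either list can be passed as `messages` to client.chat.completions.create(...).
```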
Implications for Instructional Designers
The researchers' experiences with these two experiments reveal, I think, three key principles for working effectively with AI in instructional design.
Let's look at each one and see how it changes our work:
1. The Power of Templates
Before AI: Creating variations of learning activities (like scenario-based questions or case studies) meant writing each one from scratch. For example, developing 30 different scenarios for teaching instructional principles would take weeks of work, with each scenario requiring careful crafting to maintain consistent quality and learning value.
With AI: As the CMU team showed, we can now create one "perfect" template and use it to generate multiple high-quality variations. The key is getting that first template exactly right - documenting what makes it effective and using that as a foundation.
In Practice: I recently saw this work brilliantly with customer service training. We spent a day crafting one perfect scenario about handling an angry customer - getting the emotional hooks right, creating realistic dialogue, developing carefully crafted multiple-choice options, and writing learning-focused feedback. Then we used this as a template with GPT-4 to generate variations for different situations - technical problems, billing issues, product returns - while maintaining the same pedagogical structure and quality level. What previously took weeks now takes days, with consistent quality across all scenarios.
2. Start Small, Scale Smart
Before AI: Rolling out new course content typically meant choosing between quality and scale. You could either create high-quality content for a small portion of your course or create lower-quality content that covered everything. There wasn't really a middle ground.
With AI: The research shows we can start with one high-quality module, perfect our approach with AI, then scale that success across an entire course. It's like having a trusted colleague who can replicate your best work, once you've shown them exactly what you want.
In Practice: For example, when designing a project management course, start with just the "stakeholder management" module. Create one solid stakeholder analysis exercise with GPT-4, test it with a small group of learners, gather feedback, and refine your prompts. Once you've got that working well, expand to other modules like risk management or scope planning. Each iteration builds on what worked before, helping you develop better "AI instincts" about what kinds of prompts work best and where human expertise is most crucial.
3. Quality Control is Key
Before AI: Quality control often meant choosing between thoroughness and speed. Comprehensive review processes were time-consuming, while rapid reviews risked missing important issues.
With AI: The CMU experiments show we can maintain high standards while working faster by implementing structured review processes that complement AI's capabilities. Just as they used expert review cycles to verify AI-generated content, we can develop systematic approaches to quality control.
In Practice: This might look like a three-stage review process for AI-generated content. For instance, when creating business ethics case studies:
First Review: Check for factual accuracy and realism
Second Review: Verify pedagogical effectiveness (Are learning objectives clear? Is difficulty appropriate?)
Final Review: Ensure inclusive language and accessibility
Create checklists for each stage and document common issues to refine your prompts over time. This way, you're not just checking quality - you're continuously improving your AI collaboration process.
The Big Question: Moving from Generic to Specialised AI?
Something that really struck me from this research is the crucial role of instructional design expertise in getting value from AI. As I’ve mentioned before, while tools like GPT-4 are incredibly powerful, they're also generic. Working with GPT-4 is like having a very smart assistant who knows something about everything but hasn't specifically studied instructional design.
The CMU team's success didn't come just from using GPT-4 - it came from knowing how to "teach" it instructional design principles and how to validate its outputs. This highlights something crucial: to get real value from current AI tools, you need deep instructional design expertise. Only by knowing what good learning design looks like can you write optimised prompts and validate AI’s outputs.
But here's where it gets really exciting: imagine what we could do with AI tools built specifically for instructional design. We're starting to see this emerge in two key trends:
Specialised LLMs for Education: Google's Learn LM, for example, is being built with deep understanding of educational principles baked in. Unlike generic LLMs, these specialised models understand concepts like:
Learning progressions
Knowledge scaffolding
Assessment alignment
Cognitive load theory
AI Copilots for Instructional Design: Tools like Epiphany (coming soon!) are being designed specifically to help instructional designers work faster and better. Think of them as intelligent design colleagues who can:
Suggest appropriate learning strategies based on your objectives
Help apply theoretical frameworks in practical ways
Identify potential issues in learning designs
Provide evidence-based recommendations for improvements
The difference between these specialised tools and generic LLMs is like the difference between having an eager but inexperienced apprentice and having an experienced instructional design colleague. While both can be helpful, specialised AI has a far greater effect on the speed, quality and impact of specialists' work - something we've already seen play out in coding and medicine.
Moving Forward: What Can You Try Today?
To get back to the here-and-now: here are two concrete examples to get you started experimenting with GPT-4.
Example 1: Scaling Multiple-Choice Questions
Let's say you regularly create multiple-choice questions for your courses. Start by selecting your best question - one that really tests understanding rather than just recall. Document what makes it effective: this might be the stem structure, how the distractors work, the quality of the feedback.
Here's what this might look like:
You are an expert instructional designer. Your task is to create multiple-choice questions that test decision-making in project management. You must use the following question as a template [insert question]. Each question must:
Present a realistic scenario
Test application rather than recall
Include plausible distractors based on common misconceptions
Focus on initial steps in problem-solving
Generate 3 new questions following this pattern but for different project management challenges.
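If you want to go beyond a single ChatGPT conversation, a small script can reuse the same template prompt across many topics. Another rough sketch, again assuming OpenAI's Python SDK; the topics, file names and wording are placeholders you'd replace with your own.

```python
# Sketch: drafting multiple-choice questions at scale from one template question.
# Assumes the openai Python SDK (v1+) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

template_question = """<paste your best multiple-choice question here>"""
challenges = ["scope creep", "stakeholder conflict", "missed deadlines"]

for challenge in challenges:
    prompt = (
        "You are an expert instructional designer. Using the following question as "
        f"a template:\n\n{template_question}\n\n"
        "Create one new multiple-choice question that presents a realistic scenario, "
        "tests application rather than recall, and includes plausible distractors "
        f"based on common misconceptions. The topic is: {challenge}."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative
        messages=[{"role": "user", "content": prompt}],
    )
    # Save each draft for expert review before it goes anywhere near learners.
    with open(f"draft_{challenge.replace(' ', '_')}.md", "w") as f:
        f.write(response.choices[0].message.content)
```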
Example 2: Scaling Scenario-Based Learning
Let's say you're creating customer service scenarios. Start with your best scenario - one that effectively teaches a specific skill through a realistic situation. Then, use this to “teach AI” how to reproduce it:
You are an expert instructional designer specialising in customer service training. Using the following scenario as a template [insert scenario], create three new scenarios that:
Present different emotional challenges (e.g., confusion, disappointment, anger)
Include realistic customer dialogue
Focus on different service issues (technical, billing, product)
Maintain the same level of emotional complexity
Include 2-3 response options for each scenario, explaining why each response would be effective or ineffective
For each scenario, include:
1. The initial situation
2. The customer's opening statement
3. Response options with explanations
4. Learning points that connect to customer service best practices
In both cases:
Start small with one type of content you create regularly
Document what makes your best example effective
Be specific in your prompt about what makes it work
Test and refine based on results
The key is to begin with content you know well - this helps you evaluate AI outputs effectively and refine your approach. As you get comfortable with one type of content, you can expand to others, building on what you've learned.
Remember: your instructional design expertise is crucial here. You're not just asking AI to generate content - you're teaching it your best practices and using your expertise to validate and improve its outputs.
Happy designing! 👋
PS: Want to dive deeper into AI and instructional design? Apply for a place on my AI & Learning Design Bootcamp where we explore these concepts and get hands-on to develop our AI skills.