How Good are Claude, ChatGPT & Gemini at Instructional Design?
A test of AI's Instructional Design skills in theory & in practice
This week, I started work on a research project to explore the effectiveness of Large Language Models (LLMs) like ChatGPT, Claude, and Gemini in instructional design.
As research by people like Donald H Taylor and Egle Vinauskaite shows, more instructional designers are using LLMs like ChatGPT, Claude and Gemini to complete learning design tasks than ever before - and the numbers seem to be increasing at a rate of knots.
These models are increasingly popular tools for learning design tasks like writing objectives, selecting instructional strategies and creating lesson plans. With their ability to do all of these things quickly, general-purpose AI models might seem an ideal source of instructional design support.
The question I have is: how well do these generic, all-purpose LLMs handle the nuanced and complex tasks of instructional design? They may be fast, but are AI tools like Claude, ChatGPT, and Gemini actually any good at learning design?
In this week’s post, I share the initial findings from my research which compares how different, commonly used LLMs handle both the theoretical and practical aspects of instructional design.
By examining models across three AI families—Claude, ChatGPT, and Gemini—I’ve started to identify each model's strengths, limitations, and typical pitfalls.
Spoiler: my findings underscore that until we have specialised, fine-tuned AI copilots for instructional design, we should be cautious about relying on general-purpose models and ensure expert oversight in all ID tasks.
Let’s go 🚀
Research Overview: Objectives and Methodology
In this study, I set out to assess seven LLMs which I know from my own research are commonly used by instructional designers: Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku, ChatGPT 4o, ChatGPT o1 Preview, ChatGPT o1 Mini, and Gemini by Google.
My research centres on two research questions:
Theoretical Knowledge of Instructional Design – How well does each LLM understand and articulate core instructional design principles? To what extent do these models reference and align with contemporary instructional design frameworks and demonstrate awareness of both foundational and advanced pedagogical strategies?
Practical Application of Instructional Design – How effectively does each LLM apply its stated theoretical knowledge in practice?
To ensure a systematic analysis, I asked each LLM the following questions, designed to gauge both their understanding of instructional design theory and their ability to apply it practically:
Phase 1: Theoretical Understanding Assessment
To evaluate each LLM’s instructional design knowledge, I asked each one the following four questions:
Process Understanding: “How do you approach instructional design tasks?”
This question aimed to uncover each model’s claimed methodology and foundational approach to instructional design principles.
Strategy Selection: “How do you determine appropriate learning strategies?”
Here, I assessed each model's decision-making process and awareness of various evidence-based instructional strategies.
Theoretical Framework: “What learning theories inform your design choices?”
This question probed the depth of each model's theoretical knowledge, including references to established research, theories and frameworks.
Epistemic Reliability Test: “How do you account for different learning styles?”
This question tested each model’s critical thinking and awareness of current educational research, including its ability to recognise and challenge outdated and debunked theories.
Phase 2: Practical Application Assessment
To evaluate how well each LLM applied its instructional design knowledge in practice, I asked each one to generate four instructional designs:
Password Security Lesson – A technical content lesson with clear success criteria and multiple learning objectives.
Spreadsheet Skills Instruction – A procedural knowledge lesson requiring clear, step-by-step guidance and an emphasis on skill application.
Colour Theory Lesson – A creative content lesson with abstract concepts, engaging multiple learning modalities.
Photography Composition – A lesson focused on visual learning and practical application, requiring attention to diverse learner needs.
I then analysed each model’s responses to assess theoretical accuracy, practical feasibility, and alignment between theory and practice.
Here’s what I found…
Findings: Theory vs. Practice for Each LLM Model
1. The Claude Family (Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku)
A. Theoretical Knowledge
How well did the Claude family perform when asked about its understanding of instructional design principles?
Claude 3.5 Sonnet: Claude 3.5 Sonnet demonstrated a robust foundational understanding of instructional design, referencing principles like learner-centered design, scaffolded learning, and elements of the ADDIE model. It structured its responses with attention to practical ID elements such as engagement, memorability, and assessment.
For example, in a password security lesson, it included an engaging hook by asking, “How long do you think it would take a computer to crack these?” to make the material relevant from the outset.
However, while Claude 3.5 Sonnet showed a solid understanding of foundational ID principles, it missed a number of critical instructional strategies. It also failed to address the need to customise approaches based on specific learner needs, goals, and constraints, a key aspect of effective instructional design.
Claude 3.5 Sonnet’s treatment of learning styles was somewhat nuanced but limited. While it acknowledged learning style preferences, it emphasised that these should not overshadow evidence-based methods. For instance, it suggested accommodating preferences through visual aids or step-by-step breakdowns but avoided endorsing strict adherence to learning style categories.
Sonnet’s discussion of specific theories was superficial at best. While it identified the importance of Cognitive Load Theory, for example, it did not explicitly differentiate between intrinsic and extraneous cognitive load, revealing gaps in its understanding of ID.

Claude 3 Opus and Haiku: These variants displayed a much more limited grasp of instructional theory, covering only a small number of basic principles. These models missed much of what Claude 3.5 Sonnet picked up on, including constructivist approaches, which are critical and foundational to high-quality instructional design.
When it came to learning styles, Claude 3 Opus stated that it was unable to account for learning styles due to its limitations as an AI assistant, while Claude 3 Haiku similarly acknowledged that it lacked the ability to accommodate different learning styles or personalised teaching strategies — I’ll take that as a “pass”….
B. Practical Application
How well did the Claude family apply what it said in theory in practice?
Claude 3.5 Sonnet: Applied its theoretical knowledge relatively well. In a password security course, for example, the model structured the course into timed segments, provided clear objectives, and built complexity gradually by guiding learners to create and assess passwords. However, its designs did not reflect the full range of considerations and understanding that were implied by the model’s theoretical grasp of ID.
Claude 3 Opus and Haiku: Unsurprisingly, these models produced simpler, more generic lesson structures that were effective for foundational tasks but not sufficiently nuanced for complex, differentiated instruction or adaptive learning paths. Again, their designs did not reflect even the limited range of considerations and understanding implied by the models’ theoretical grasp of ID.
2. The ChatGPT Family (ChatGPT 4o, ChatGPT o1 Preview, ChatGPT o1 Mini)
A. Theoretical Knowledge
How well did the ChatGPT family perform when asked about its understanding of instructional design principles?
ChatGPT 4o: ChatGPT 4o displayed a broad understanding of key theoretical knowledge, referencing instructional theories such as Constructivism, Social Learning Theory, and Cognitive Load Theory.
However, ChatGPT 4o often lacked practical understanding of how to adapt theories to specific learner contexts like class size, resources, or time constraints, which are crucial in instructional design.
It also showed a lack of awareness of a number of important instructional theories, limiting its potential for creating optimal, inclusive and accessible designs. It also failed to identify how different strategies might be required depending on the learners, context, topic and goals.
ChatGPT 4o recognised learning styles and actively cautioned that, while it’s important to offer varied teaching methods, rigidly adhering to learning styles could be limiting. This nuanced response perhaps demonstrated some awareness of the learning styles debate but lacked opinionated direction.

ChatGPT o1 Preview and Mini: These models mirrored ChatGPT 4o in theoretical breadth but had even less conceptual depth. They were able to identify a limited number of instructional strategies but did not explain them, suggesting a surface-level understanding.
In the case of learning styles, ChatGPT o1 Mini explicitly mentioned common categories such as visual, auditory, and kinesthetic learning without questioning their validity.
B. Practical Application
How well did the ChatGPT family apply what it said in theory in practice?
ChatGPT 4o: Despite its theoretical depth, ChatGPT 4o’s designs were either very basic or overly complex. It did, however, apply some of its theory in practice. In a colour theory course, for example, it applied Constructivism by prompting students to think about why they’re drawn to specific colours.
However, most of what ChatGPT 4o promised in theory was not delivered in practice. As in its theoretical responses, it struggled to adapt designs to specific learner contexts like class size, resources, or time constraints.

ChatGPT o1 Preview and Mini: These models lacked the depth to create meaningful instructional designs. Their outputs were overly simplistic and required substantial customisation to align with specific instructional needs.
3. The Gemini Family (Gemini)
A. Theoretical Knowledge
How well did Gemini perform when asked about its understanding of instructional design principles?
Gemini: Demonstrated a very limited grasp of instructional theories, often lacking reference to even the most basic ID concepts like differentiation. It showed no awareness of the need to adapt strategies to different learner needs or experience levels.
The model also did not acknowledge any active or experiential learning theories, including Constructivism. Instead, it demonstrated a traditional “chalk and talk” knowledge-transfer understanding of instructional design, which is common but never optimal.
When it came to learning styles, Gemini took a basic stance, acknowledging learning style categories (visual, auditory, reading/writing, kinesthetic) without critically evaluating their validity.
B. Practical Application
How well did Gemini apply what it said in theory in practice?
Gemini: Despite showing no awareness of active and constructivist approaches in theory, Gemini’s practical output was notably active and constructivist. In a photography composition lesson, for example, it covered only basic concepts and did not adapt strategies to different learner needs or experience levels, but - in contrast to the approach it described in theory - the session it designed was active, hands-on and problem-based.
Practical Takeaways for Instructional Designers Using Generic LLMs
So, what does all of this mean for the growing number of instructional designers who use generic LLMs and AI tools in their day to day work?
Here are my top three takeaways so far:
Use Structured Prompts to Guide Generic LLMs: While LLMs can identify some foundational ID strategies, they often need guidance to apply them meaningfully. Provide the model with detailed, structured prompts that specify your instructional goals, desired learner engagement levels, and content format. Explicitly request it to avoid rigid adherence to outdated concepts, like learning styles, and to suggest active learning or constructivist methods where applicable.
Customise AI Outputs for Your Learner Contexts: Generic LLMs may produce content that lacks context-specific nuances, such as adaptations for different learner backgrounds, class sizes, or instructional constraints. Designers should view AI outputs as a starting point, refining and contextualising these to fit the specific needs, constraints, and diversity of their learners.
Mitigate Risks of Outdated Theories: Generic LLMs may reproduce outdated or debunked theories without critical evaluation. Instructional designers should be cautious about accepting suggestions based on learning styles, rote memorisation, or overly simplistic methods. Counterbalance these outputs with evidence-based frameworks, and ensure AI recommendations align with contemporary instructional practices.
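To make the first takeaway concrete, here is a minimal Python sketch of a structured-prompt builder. The function name, fields and wording are my own illustrative choices, not part of the research; the idea is simply that spelling out goals, context, constraints and exclusions (e.g. no learning styles) gives a generic LLM far more to work with than “design me a lesson on X”.

```python
def build_id_prompt(topic, audience, duration_minutes, constraints,
                    strategies=("active learning", "worked examples")):
    """Assemble a structured instructional design prompt for a generic LLM.

    Note: this is an illustrative template, not a tested 'best' prompt.
    """
    lines = [
        f"Design a {duration_minutes}-minute lesson on: {topic}.",
        f"Audience: {audience}.",
        f"Constraints: {constraints}.",
        "Preferred strategies: " + ", ".join(strategies) + ".",
        # Force the model to surface the ID elements it often omits unprompted
        "Include: measurable learning objectives, a timed outline, "
        "at least one hands-on activity, and a short assessment.",
        # Guard against the outdated-theory risk discussed above
        "Do not base the design on learning styles (visual/auditory/"
        "kinesthetic); use evidence-based methods instead.",
    ]
    return "\n".join(lines)

prompt = build_id_prompt(
    topic="Password security basics",
    audience="non-technical office staff, mixed experience levels",
    duration_minutes=45,
    constraints="delivered online, 20 learners, no budget for extra tools",
)
print(prompt)
```

The output string can then be pasted into (or sent via API to) whichever model you use; the point is the structure, not the tooling.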
Conclusion: Generic vs. Specialised AI in Instructional Design
The headline is that across all generic LLMs, AI is limited in both its theoretical understanding and its practical application of instructional design - but why is this?
Here’s the TLDR:
Lack of Industry-Specific Knowledge and Nuance: Without training on specialised domain data, general-purpose LLMs lack the domain-specific knowledge to support optimal decision making. In industries like coding and medicine, we’re seeing the emergence of fine-tuned AI copilots, such as GitHub Copilot and Cursor for coders, and Hippocratic AI for medics. These specialised models are trained on specialised data, perform narrow, industry-specific tasks, and support professional workflows effectively.
Uncritical Use of Outdated Concepts: Generic AI models are trained on large datasets, meaning they reproduce that which is most common rather than that which is optimal. In practice, this means they are at risk of recommending and reproducing outdated and debunked theories, e.g. learning styles.
Superficial Theory Application: Without explicit prompting, generic LLMs often fail to apply what they “know” in theory meaningfully in practice. The result is the creation of either generic or impractical designs that require significant human adjustment.
While general-purpose AI models like Claude, ChatGPT, and Gemini offer a degree of assistance for instructional design, their limitations underscore the risks of relying on generic tools in a specialised field like instructional design.
In industries like coding and medicine, similar risks have led to the emergence of fine-tuned AI copilots, such as Cursor for coders and Hippocratic AI for medics. These models, designed for deep expertise in specific domains, are far more reliable for complex tasks and were born out of a realisation that, without these tools, coders and medics could not leverage the real power and potential of AI to increase both their efficiency and their effectiveness.
Until we see similar specialised AI tools tailored to the nuances of instructional design principles, practices and processes, instructional designers should be mindful of the limitations of generic LLMs.
These tools will undoubtedly make our work faster, but they also risk making us more effective at ineffective practice.
TLDR: In a world where we work more and more with AI, developing deep instructional design expertise has perhaps never been more important.
Happy innovating!
Phil 👋
PS: If you want to get hands-on, hone your instructional design knowledge and learn how to get the most out of both generic and specialised AI tools supported by me, apply for a place on my AI Learning Design Bootcamp.