AI Model Selection for Instructional Design
A data-informed field guide to selecting the right LLM for the right task
Hey folks!
Over the last three years, AI has emerged as a massively popular tool among Instructional Designers. The vast majority of us now use Generative AI every day across the end-to-end Instructional Design workflow, from summarising dense policy documents into neat, structured summaries, to brainstorming design ideas, formulating objectives and drafting entire course storyboards.
But - to quote Spider-Man - with great power comes great responsibility.

From some initial research conducted over the last month, I’ve found that the vast majority of Instructional Designers choose their AI tools based on UX preference and cost.
Meanwhile, in the workplace, most L&D teams are allocated AI tools based on the existing tech stack rather than tool performance. TLDR: If your org already uses Microsoft Office Suite and Teams, you will more than likely be allocated MS Copilot as your AI sidekick for the L&D workflow.
However, research shows very clearly that we need to be much more intentional about which tools and models we use (and which we avoid) for specific tasks within the end-to-end Instructional Design workflow.
In this week’s post I help you understand which models to use for which tasks, what pitfalls to watch for, and how to achieve more value and impact when working with AI.
Let’s go! 🚀
Understanding the AI Model Landscape: Three Tiers of Performance
Selecting the right AI model for the right task is quickly becoming a foundational skill for instructional designers, directly influencing the quality, reliability, and efficiency of your work.
With a rapidly expanding array of AI tools available, it’s easy to be swayed by brand recognition, cost, or convenience. However, not all models are created equal—each has unique strengths, limitations, and ideal use cases.
To help you navigate this landscape, it’s helpful to think in terms of three distinct tiers:
Premium Tier: These are the most advanced, accurate, and costly models, excelling at complex, high-stakes instructional design tasks.
Mid-Range Tier: Offering a balance between performance and affordability, these models are reliable for drafting and iterative work, provided you have strong review processes.
Budget Tier: These models are accessible and fast, making them ideal for brainstorming and creative ideation, but they require careful oversight due to higher error and hallucination rates.
1. Premium Tier: The Heavy(ish) Hitters
Claude Opus 4, GPT-4.5, and Gemini 2.5 Pro represent the current gold standard in AI-assisted Instructional Design.
In a controlled test, these models had the highest and most consistent rates of pedagogical accuracy (86–89%) in their responses, while their advanced reasoning capabilities make them the best choice for high-stakes deliverables where errors of fact could have significant negative consequences for learners and for legal compliance.
The benefits of premium models come at a cost—literally. With token prices ranging from $15 to $75 per million (Claude, GPT-4.5) and $1.25 to $15 per million (Gemini), the approximate cost per interaction (assuming 200 tokens) is about $0.003–$0.015 for premium models and $0.000025–$0.00015 for mid-tier models. This means that working with a Premium model costs roughly 20 to 600 times more per interaction than working with a basic model.
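If you want to sanity-check that maths yourself, here is a minimal sketch of the cost calculation, using the per-million-token prices quoted above and the same ~200-token interaction assumption (illustrative figures from this post, not live vendor pricing):

```python
# A minimal sketch of the cost arithmetic above.

def cost_per_interaction(price_per_million_tokens: float, tokens: int = 200) -> float:
    """Approximate cost of a single interaction of `tokens` tokens."""
    return price_per_million_tokens * tokens / 1_000_000

# Premium tier at $15-$75 per million tokens, assuming ~200 tokens per interaction
print(cost_per_interaction(15))    # 0.003
print(cost_per_interaction(75))    # 0.015

# Budget tier at ~$0.50 per million tokens
print(cost_per_interaction(0.50))  # 0.0001
```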
Even at this tier, no model is infallible; roughly 10% of all responses will be entirely incorrect and research suggests that accuracy drops by a further ~3% when designing more specialised content, e.g. content for learners with Special Educational Needs and Disabilities (SEND).
TLDR: Individuals and organisations must weigh the benefits of increased accuracy against costs and the ongoing need for rigorous oversight and strict fact-checking protocols.
2. Mid-Range Tier: The Workhorses
Mid-range models like Qwen3-32B and Mistral Medium 3 offer a pragmatic balance between cost and performance. They are pretty well-suited to accelerating time-consuming day-to-day tasks like drafting scripts, job aids, and other materials that will undergo review and revision.
With pedagogical accuracy between 70% and 82%, these models are moderately reliable when it comes to specialised Instructional Design tasks but, as with any AI model, should never be left unattended to make Instructional Design decisions.
Mid-tier models are more affordable ($0.40–$2.50 per million tokens) and fast, making them ideal for organisations that need to produce content at scale without breaking the bank.
However, compared with premium models, they are also more prone to numeric errors, jargon drift, and accuracy drops of ~12 percentage points when working on specialised topics.
TLDR: Mid-range models are the reliable workhorses of content production. They excel at speed and volume but depend on strong review processes to ensure quality.
3. Budget Tier: The Idea Generators
GPT-3.5 Turbo and Llama 3 8B are two examples of widely accessible and cost-effective models (free to $0.50 per million tokens). However, these models are especially prone to hallucinations, with up to a 50% error rate in their responses.
Budget tier models also have a much lower rate of pedagogical accuracy, with just 28–52% of pedagogical questions being answered correctly in controlled tests. They also show much more dramatic drops in accuracy (up to 18%) when dealing with specialised topics, e.g. content for learners with Special Educational Needs and Disabilities (SEND).
These models are best reserved for tasks like brainstorming, creative ideation, and generating lists of ideas or metaphors—not for producing final, learner-facing content.
TLDR: Budget models behave like energetic but inexperienced apprentices: great for generating ideas, but not to be trusted with final drafts or with critical fact-finding and fact-checking.
AI’s Four Critical Failure Modes: What the Data Reveals
No matter which model you choose, research shows that certain types of errors and risks will appear with striking regularity when working with AI. Recognising AI’s most common "failure modes" is essential for building effective quality assurance into your AI-powered workflow. Here’s what we know right now about where and how AI models struggle:
1. Chronology Mix-Ups: When Time Gets Twisted
AI models struggle with temporal reasoning - i.e. the ability to understand and process information about time and its influence on events, relationships, and situations.
Even top models like GPT-4 only score about 84% on the TRAM benchmark (10 points below humans), and performance seems to degrade over the course of an interaction. Research suggests that many LLMs, including Gemini, Llama, Mistral, and Qwen, anchor "today" to outdated training data and regularly mis-order events in complex tasks.
In practice, this means that all models will likely miss some critical temporal information, like specific dates, the order of events, or time intervals, and misrepresent time by incorrectly ordering events, confusing past and present, or anchoring "today" to an outdated or incorrect date, such as their training cutoff (e.g., November 2023).
Over the course of an interaction, this degradation shows up as the model gradually losing track of temporal context. It misremembers or misstates the sequence of events, producing inconsistent or inaccurate timelines and outputs that are subtly or overtly incorrect about when things happened or how they relate chronologically.
TLDR: Time-related data in documentation (timelines, historical sequences, processes) can be subtly, and in some cases dramatically, wrong. Chronology errors are easy to miss but can have major consequences for design accuracy. Always triple-check the order of events in any output produced by AI, and be especially cautious if your interactions and/or input documents are lengthy.
2. Over-Generalisation: The Confidence Trap
AI models tend to make broad, confident claims that often omit important caveats or nuances. These over-generalisations can mislead users and propagate misconceptions.
Research shows that LLMs are up to five times more likely than human experts to over-generalise scientific findings, transforming narrow, population-specific research results into broad, universal claims that can mislead users about the scope and applicability of information.
In practice, over-generalisation surfaces as the systematic removal of critical caveats, limitations, and nuances from source material.
All AI models will routinely transform qualified statements like "may reduce symptoms in 60% of patients in this study" into absolute claims like "reduces symptoms." This occurs because AI models are trained on vast datasets where confident, definitive language is often rewarded, leading them to adopt a default strategy of sounding authoritative even when uncertainty is more appropriate.
Prompt phrasing can significantly amplify this problem. When explicitly prompted for "accuracy" or "definitive" answers, LLMs have been shown to actually reduce their factual accuracy by about 7% compared to neutral prompts. Paradoxically, requests for certainty trigger more over-generalisation, not better accuracy. This pattern appears across model families: ChatGPT-4o, Claude, Gemini, and Mistral models all show increased over-generalisation when prompted for confident responses.
Domain-specific research reveals concerning patterns. In medical summarisation, newer models like ChatGPT-4o and LLaMA 3.3 70B over-generalised in 26–73% of cases, even when explicitly prompted for accuracy.
Over-generalisation becomes less pronounced with newer and more sophisticated models, right? No. Counterintuitively, newer, more sophisticated models tend to perform worse at maintaining appropriate scope than earlier versions. This suggests that increased reasoning capabilities may actually encourage models to make broader inferential leaps that exceed their evidence base.
Over the course of extended interactions, over-generalisation compounds as models build upon their own previous over-broad statements, creating a feedback loop where initial over-generalisations become the foundation for even more sweeping claims.
TLDR: AI models tend to make broad, confident claims that often omit important caveats or nuances. These over-generalisations can mislead users and propagate misconceptions.
Always scrutinise absolute language ("always," "never") and ask for explicit evidence or citations. Be cautious of confident statements, especially when prompts request certainty and/or when interactions and documentation are lengthy; in both cases outputs are more likely to be inaccurate or misleading.
3. Dropped Details: The Information Leakage Problem
Numbers, lists, and other key contextual details are systematically lost when we share information with AI, especially as input complexity increases. This phenomenon, known as "information leakage," occurs due to fundamental limitations in how models process, retain and represent certain types of information.
Research shows that both premium and mid-range models may omit up to 35% of list items and mis-copy multi-digit figures more than one-third of the time. Even more concerning, GPT-3.5 demonstrates only 6% accuracy on certain numerical error detection tasks, meaning the vast majority of numeric processing contains errors.
Context window limitations also exacerbate information loss. The context window basically acts a bit like the human brain—as new information enters, older content gets "pushed out" and becomes inaccessible to the model. This creates particular challenges for multi-step instructions, lengthy procedures, or documents requiring comprehensive coverage. As prompts grow longer, even premium models designed for long-context processing can omit critical steps or sections when summarising large documents.
One of the most common use cases of AI that I see among Instructional Designers is uploading documents to extract data and summaries, but this process too comes with serious systematic weaknesses which we often ignore. For example, in one test of a document with ~40,000, Claude was found to have undercounted text elements by almost 50%.
In another test, GPT-4o performed well, achieving 88.7% accuracy in detailed data extraction for systematic reviews and successfully sourcing 415 out of 468 critical data elements (but missing 53…). The gold standard here appears to be NotebookLM, which uses retrieval-augmented generation (RAG) and has been shown in some research to achieve 95% data-locating accuracy.
The good (ish) news is that information dropping follows pretty predictable patterns. Models are most likely to retain information that:
Appears early in source documents.
Is shared within short-form documents & short, focused conversations.
Does not include technical jargon, specialised terminology or numbers.
TLDR: Numbers, lists, and other key contextual details are systematically lost when we share information with AI, especially as input complexity increases. To mitigate risk, use short documents and focused, structured prompting. Always cross-check AI outputs against source documents for completeness, particularly focusing on numerical data, procedural steps, and technical specifications.
4. Hallucinations: The Fabrication Risk
AI models regularly invent facts, sources, and citations with sophisticated plausibility that makes detection difficult, and recent studies suggest hallucinations are getting more — not less — common. OpenAI's latest models, for example, hallucinate at 33-79% higher rates compared to older versions.

Research shows that ChatGPT-4 averages about 0.84 hallucinations per response, and Claude 2 about 1.55. Hallucination rates are even higher for legal or regulatory content. Meanwhile, “compact” low-cost models like GPT-4o mini can hallucinate on up to 48% of prompts, while OpenAI's latest reasoning models (o3, o4-mini) show unprecedented increases, with o3 hallucinating on 51% and o4-mini on 79% of general knowledge tasks.
Citation fabrication represents a particularly sophisticated form of hallucination. Models don't simply generate random errors—they create convincingly realistic fake citations with real author names from relevant fields, properly formatted DOIs, and accurate journal names and formatting. This flavour of hallucination continues to range wildly, from a rate of 18% (GPT-4) to a whopping 91% (Bard).
Using retrieval-augmented models like Notebook LM and Perplexity significantly reduces but doesn’t eliminate hallucination. Research shows "very low to zero hallucination rate" when models are constrained to source material, but even RAG systems can hallucinate when they misinterpret source content or make inferential leaps beyond what the sources actually state.
TLDR: Hallucinations remain perhaps the most challenging and potentially damaging AI failure mode. Increasing model complexity means that hallucination rates don’t appear to be improving over time.
Retrieval-augmented models which “cite their sources” are the safest bet right now, but a culture of skepticism and skills in structured prompting and verification is essential to ensure that we mitigate this risk as much as possible when working with AI.
Maximising Value: Five Practical Tips for Working With AI in Instructional Design
Choosing the right AI model is no longer a technical detail—it's a foundational decision that shapes the quality, accuracy, and credibility of our instructional content. With dozens of models on the market, each with different capabilities and costs, it's essential to move beyond brand loyalty or habit and start thinking strategically about model selection.
So what does this mean in practice? To get the best results from AI—and avoid its most common pitfalls—Instructional Designers should adopt a structured, intentional approach. Below are research-backed, actionable tips for maximising reliability and minimising risk at every stage of your workflow.
1. Choose the Right Model for the Right Task
Premium models (e.g., Claude Opus 4, GPT-4.5, Gemini 2.5 Pro): Use for high-stakes deliverables, compliance, and nuanced instructional design. Always pair with fact-checking and SME review.
Mid-range models (e.g., Qwen3-32B, Mistral Medium 3): Use for drafting and iterative work, but schedule review cycles for accuracy—especially with numbers, jargon, and specialised topics.
Budget models (e.g., GPT-3.5 Turbo, Llama 3 8B): Use for brainstorming and creative ideation only. Never use outputs directly for learner-facing content.
2. Keep Prompts & Documents Short and Focused
Shorter is better: Models retain information best in short, focused prompts and with concise documents. As input length grows, the risk of dropped details and context loss increases.
Break up long tasks: For lengthy documents or multi-step instructions, split your work into smaller, manageable chunks and process them sequentially.
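As a rough illustration of that chunking approach, here is a minimal sketch in Python. The ask_model() function is a hypothetical stand-in for whichever AI tool or API you actually use, and the 500-word chunk size is an arbitrary assumption you should tune to your own model and content:

```python
# A minimal sketch of the "break up long tasks" tip: split a long document
# into small, focused chunks and send them to the model one at a time.

def ask_model(prompt: str) -> str:
    # Hypothetical placeholder: swap in a call to your chosen model/API here.
    raise NotImplementedError

def split_into_chunks(text: str, max_words: int = 500) -> list[str]:
    """Split text into chunks of at most `max_words` words each."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarise_document(text: str) -> list[str]:
    """Summarise a long document chunk by chunk, keeping each prompt short and focused."""
    summaries = []
    for i, chunk in enumerate(split_into_chunks(text), start=1):
        prompt = f"Summarise the following excerpt (part {i}). Quote all numbers and dates exactly:\n\n{chunk}"
        summaries.append(ask_model(prompt))
    return summaries
```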
3. Be Wary of Numerical Data Retrieval & Processing
Numbers are vulnerable: Even top models frequently mis-copy, omit, or alter numbers and lists—especially in longer or more complex prompts.
Always double-check: Cross-verify every number, date, and list item against your source documents. Use checklists to ensure completeness (a minimal checking script is sketched after this list).
Avoid complex calculations: Don’t rely on AI for multi-step arithmetic or extracting large sets of figures from documents; do this manually or with specialised tools.
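If you want a lightweight safety net for the number-checking step, here is a minimal sketch (my own assumption of a workflow, not a tool referenced in the research above) that extracts every figure from your source text and from the AI output, then flags anything the AI dropped:

```python
import re

# A minimal sketch of a numbers cross-check: find every figure in the source
# that is missing from (or was silently dropped by) the AI output.
# Note: this only catches digits; numbers written out as words need a human eye.

NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*%?")

def extract_numbers(text: str) -> set[str]:
    """Return every number-like token (including percentages) in the text."""
    return set(NUMBER_PATTERN.findall(text))

def missing_numbers(source: str, ai_output: str) -> set[str]:
    """Numbers present in the source but absent from the AI output."""
    return extract_numbers(source) - extract_numbers(ai_output)

source = "The course runs for 12 weeks, with 3 assessments worth 25% each."
draft = "The course runs for 12 weeks and includes several assessments."
print(missing_numbers(source, draft))  # {'3', '25%'} (set order may vary)
```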
4. Be Mindful of Prompting
Use few-shot prompting: Research shows that prompting matters. Provide 2–3 high-quality sample answers before your real question to set expectations and style.
Use chain-of-thought prompting: Instruct the model to “explain your reasoning step by step” or “think aloud” before answering. This exposes the model’s logic, making it easier to audit and spot errors. Both techniques help to mitigate (partially) some of the biggest risks of working with AI; a sketch combining them follows this list.
Prompt for specifics: Ask for “list every date and its source” or “flag unsupported claims” to surface hidden errors.
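Here is a minimal sketch of what few-shot plus chain-of-thought prompting can look like in practice. The sample topics and learning objectives are invented for illustration; swap in examples from your own context and paste the resulting prompt into your chosen tool:

```python
# A minimal sketch of a few-shot + chain-of-thought prompt.

FEW_SHOT_EXAMPLES = """\
Example 1
Topic: Fire safety induction
Objective: By the end of this module, learners will be able to identify the three classes of fire extinguisher used on site and select the correct one for a given fire type.

Example 2
Topic: GDPR basics
Objective: By the end of this module, learners will be able to explain when a data breach must be reported and list the steps in the internal reporting procedure.
"""

TASK = "Topic: Manual handling refresher\nObjective:"

prompt = (
    "You are assisting an instructional designer.\n"
    "Here are examples of the style and specificity expected:\n\n"
    f"{FEW_SHOT_EXAMPLES}\n"
    "Before writing the new objective, explain your reasoning step by step, "
    "then give the objective in the same format as the examples.\n\n"
    f"{TASK}"
)

print(prompt)
```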
5. Fact-Check & Audit Every Output
Require citations: Always ask the model to cite sources or quote directly from uploaded documents (a simple verification sketch follows at the end of this list).
Use retrieval-augmented tools: For fact-sensitive or compliance content, rely on tools like NotebookLM, which anchor responses to your uploaded sources and reduce hallucination risk.
Anticipate AI’s Four Critical Failure Modes:
Chronology Mix-Ups: Triple-check event order and timelines, especially in long or complex outputs.
Over-Generalisation: Scrutinise absolute language (“always,” “never”) and prompt for exceptions and evidence. Accuracy drops with specificity: when tailoring content for specific learners, topics, or regions, expect lower accuracy.
Dropped Details: Compare outputs to source documents, focusing on lists, numbers, and procedural steps.
Hallucinations: Never trust auto-generated citations or other factual information without independent verification. Where possible, use AI to execute, not to decide.
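To tie the “require citations” tip and the hallucination warning together, here is a minimal sketch of a citation-first prompt plus a quick programmatic check that every quoted span really appears in your source. The prompt wording and the [QUOTE: "..."] format are my own assumptions, not a standard:

```python
import re

# A minimal sketch of a citation-first prompt plus a verbatim-quote check.

CITATION_PROMPT = (
    "Answer using ONLY the source text below. After every claim, include a "
    'verbatim quote from the source in the form [QUOTE: "..."]. If the source '
    "does not support a claim, write [UNSUPPORTED] instead.\n\n"
    "SOURCE:\n{source}\n\nQUESTION:\n{question}"
)

def unverifiable_quotes(ai_output: str, source: str) -> list[str]:
    """Return quoted spans in the AI output that do not appear verbatim in the source."""
    quotes = re.findall(r'\[QUOTE: "(.*?)"\]', ai_output)
    return [q for q in quotes if q not in source]

source = "The policy was updated in March 2024 and applies to all contractors."
prompt = CITATION_PROMPT.format(source=source, question="Who does the policy apply to?")
# Send `prompt` to your chosen tool, then run the check on its reply:
output = ('The policy applies to contractors [QUOTE: "applies to all contractors"] '
          'and was last updated in 2023 [QUOTE: "updated in 2023"].')
print(unverifiable_quotes(output, source))  # ['updated in 2023']
```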
Using Copilot or ChatGPT (Enterprise)?
These are best considered premium-tier tools which, overall, perform OK within the Instructional Design workflow, provided you prompt them well and use them for tasks within their respective wheelhouses.
Expect a hallucination rate of ~80% and a pedagogical accuracy of 86–89%.
For reliable data extraction, supplement with dedicated RAG tools like Perplexity or just old-skool manual review + human SMEs.
Concluding Thoughts
The Instructional Design role is changing. As AI use in the profession increases month by month, we are increasingly required to become “AI wranglers” — capable of choosing the right AI tools and model, guiding them with clear prompts, spotting their blind spots, and weaving their output into meaningful learning experiences without losing our pedagogical edge.
The days of picking AI tools based on what’s cheapest or built into our workplace tech stacks are behind us. If we want to design learning experiences that are accurate and truly impactful, we need to choose our tools as intentionally as we choose our learning outcomes.
AI can help us move faster for sure—but speed means nothing if it’s pointing us in the wrong direction. That’s why understanding which models to trust (and when to double-check them) has become a core skill in our professional toolkit.
My three top tips for this brave new world of Instructional Design:
Use the best model you can afford for the task at hand
Treat every AI-generated draft like a rough sketch—not a finished product
Build workflows that catch the common pitfalls before they ever reach your learners.
Happy innovating!
Phil 👋
PS: Want to test a range of AI models across your workflow with me? Apply for a place on my AI & Learning Design Bootcamp.