Delegate, Collaborate, or Own: where AI fits in your L&D workflow
Aka, what I learned when I helped to map the ~300 tasks involved in every L&D project to AI’s strengths and limitations.
Hi folks 👋
Last month, the VP of L&D at a Fortune 500 company sent me a message that captured a frustration I’ve heard from dozens of L&D teams this year:
“My team keeps getting told contradictory things about AI. Our CEO wants us to use it for everything. Our compliance team wants us to use it for nothing. We need a way to decide. Can you help?”
The frustration isn’t unique to L&D. Widely reported estimates suggest that around 95% of generative-AI pilots fail to reach successful scaling (Fortune, 2025). BCG found that 74% of companies had yet to see tangible value from their AI investments, with 70% of the barriers attributed to workforce and process issues rather than the technology itself (BCG, 2025). Donald Taylor’s 2025 Global Sentiment Survey shows AI topping L&D’s “hot topics” list for the second year running, but Taylor explicitly notes that interest is high while maturity and understanding are still emerging (Taylor, 2025).
In my work with L&D teams, this shows up consistently as what I think of as scattergun AI use: lots of local AI use, very few strategic, targeted or fully AI-integrated workflows. TL;DR: people are using AI, but the use is fragmented and rarely produces the gains the technology is genuinely capable of delivering.
Fast forward two weeks from that Slack message, and the team and I had written down every discrete task an L&D team does on a typical project: every meeting, every artefact, every analytical move, every conversation. In total, we came up with a whopping list of ~300 tasks. These ranged from drafting reminder emails and transcribing a 90-minute SME interview to building content and deciding what to cut when the project timeline slips.
With all ~300 tasks laid out in front of us, we ran each one through three questions and sorted it into one of three buckets:
🤖 Tasks which are structured and manageable enough to delegate to AI: tasks you would confidently delegate to an apprentice armed with resources, examples and spot checks. AI runs it, humans spot-check.
🤝 Tasks which are substantial enough to need us, but rule-bound enough for AI to help: tasks you can do well but need to be faster on, and tasks which are out of your reach without AI. Humans and AI working together.
🧠 Tasks which are too complex, high-stakes, or politically and culturally situated to involve AI: the work where being in the room — and standing behind the call — is the deliverable. AI doesn’t participate.
The headline finding I shared on LinkedIn last week:
roughly 10% of tasks fall into the delegate bucket
around 80% of tasks fall into the “do it with AI” bucket
only 10% of tasks are complex, nuanced and consequential enough to be justifiably owned only by a human
A small but important caveat before we go further: this distribution will look different in different organisations. Teams in heavily regulated industries, in contexts with low data maturity, or in functions with significant compliance overhead will see a much larger human-only category. The split above reflects one team’s analysis. The diagnostic is what travels.
That LinkedIn post sparked a lot of interesting discussion. I want to use this Substack post to go deeper and walk you through the decision-making behind the breakdown, so that you can run this exercise on your own process.
Let’s dive in!
Task Mapping, aka allocating the jobs to be done
Every task we mapped ran through three sequential tests. If a task failed any one test, it moved away from the delegation end of the spectrum and toward the human-only end.
💰 Question 1: The cost of being wrong
What happens if AI gets this task wrong?
Some mistakes are recoverable — you re-theme the survey, you regenerate the icon, you redraft the paragraph. Some mistakes aren’t — you misclassified a learner segment and built a programme for the wrong audience, you set the wrong mastery threshold and certified people who weren’t ready, you got the attribution wrong in a board deck and now the CEO doesn’t trust your numbers.
Recoverable error → AI can probably help. Catastrophic error → keep it closer to you.
One of the comments on the LinkedIn post (from Somya Sharma) crystallised this for me: the cost-of-error framing is more useful than the usual instinct to offload boring tasks. Most teams approach AI delegation by asking what can I get rid of? That’s the wrong question. The right question is what could I get wrong and not recover from? Those are the tasks that stay yours. Everything else is a candidate for AI involvement.
🧩 Question 2: Pattern or judgement?
Is this task pattern-matching, or does it require judgement?
Pattern-matching: tag this content against Bloom’s taxonomy; copy-edit this paragraph against the house style guide; theme these 400 survey responses by sentiment; identify which questions in this item bank have low discrimination. The rules are knowable and applying them is mechanical. AI is brilliant at them.
Judgement: which pedagogical model fits this audience — 4C/ID or Merrill? Is this attribution claim defensible to a sceptical CFO? Is this SME telling me what they actually do, or what they think they should do? The rules don’t exist, or applying them requires weighing things AI can’t see. AI is a poor substitute here.
Worth flagging a third mode that sits between the two: structured judgement under uncertainty — interpreting messy qualitative data, early-stage needs analysis, drafting recommendations from partial evidence. This isn’t pure pattern-matching, but it isn’t fully irreducible judgement either. It’s where AI is most powerful and most dangerous. We’ll come back to it in the 80% bucket below.
🧬 Question 3: Does this task require human presence?
Some tasks require a human to be physically present with another human, or to experience something themselves. Certifying a facilitator. Walking an SME through a task analysis. Testing a course with assistive tech. Watching a learner struggle with an interaction to understand why.
These aren’t complex in the way Question 2 is asking — they’re embodied. They require being in the room, in a way that goes beyond physical co-presence. AI can transcribe a meeting, summarise it in real time, even generate suggestions during it — and increasingly will. What AI cannot do is hold accountability for the outcome of the room. It cannot bear reputational or relational risk. It cannot experience the consequences of a bad call. The deliverable in these moments isn’t a task output — it’s a human who is on the hook for it. No prompt makes that condition go away.
A governance precondition
All three questions above assume your organisation has cleared AI use for the task in the first place. In regulated industries — financial services, healthcare, government, education in some jurisdictions — that’s a separate prior question that has to be answered before this diagnostic applies. Data privacy, model risk, auditability, IP leakage, and explainability are all real constraints that don’t show up in a cost-of-error / pattern-vs-judgement / human-presence test. They sit upstream of it. If you can’t get permission to use AI on a task, the bucket question is moot.
The Mapping Rule
🤖 A task that passes all three tests — low cost of error, pattern-based, no human presence required — lands in the Delegate bucket.
🧠 A task that fails any one of the three tests definitively — catastrophic error potential, pure judgement, or human presence required — lands in Keep human-only.
🤝 Everything else lands in the middle: Work with AI. This is where roughly 80% of L&D work sits — and it comes in two flavours:
Flavour 1: “I can do this, but slowly.” Tasks I know well, but that take me longer than they should. The move is to delegate by writing strong rules and clear examples, then quickly QA the output.
Flavour 2: “I can’t do this without help.” Tasks that are beyond my unaided reach — statistical analysis, research synthesis, large-scale pattern detection. The move is to let AI extend what I’m capable of, and bring my judgement to what AI produces.
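The Mapping Rule is mechanical enough to write down as a small decision function. This can help when a team wants to sort a long task list consistently. The sketch below is a minimal Python illustration. It assumes each test is scored as a clean pass, a borderline case, or a definitive fail. That three-level scale and the `Task`, `Level` and `map_task` names are hypothetical, not part of the original analysis. The governance precondition sits upstream and is deliberately not modelled.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    DELEGATE = "Delegate"           # AI runs it, humans spot-check
    WITH_AI = "Work with AI"        # humans and AI working together
    HUMAN_ONLY = "Keep human-only"  # AI doesn't participate


class Level(Enum):
    """Hypothetical scale for how a task fares on one of the three tests."""
    PASS = "pass"              # recoverable error / pattern-based / no presence needed
    BORDERLINE = "borderline"  # neither a clean pass nor a definitive fail
    FAIL = "fail"              # catastrophic error / pure judgement / presence required


@dataclass
class Task:
    name: str
    cost_of_error: Level         # Question 1: what happens if AI gets this wrong?
    pattern_vs_judgement: Level  # Question 2: pattern-matching or judgement?
    human_presence: Level        # Question 3: does it need a human in the room?


def map_task(task: Task) -> Bucket:
    tests = (task.cost_of_error, task.pattern_vs_judgement, task.human_presence)
    # A definitive failure on any single test keeps the task human-only.
    if Level.FAIL in tests:
        return Bucket.HUMAN_ONLY
    # Passing all three tests is the only route into the Delegate bucket.
    if all(t is Level.PASS for t in tests):
        return Bucket.DELEGATE
    # Everything else lands in the middle.
    return Bucket.WITH_AI


examples = [
    Task("Transcribing stakeholder interviews", Level.PASS, Level.PASS, Level.PASS),
    Task("Drafting a learning needs analysis report", Level.BORDERLINE, Level.BORDERLINE, Level.PASS),
    Task("Certifying a facilitator in person", Level.FAIL, Level.FAIL, Level.FAIL),
]
for t in examples:
    print(f"{t.name}: {map_task(t).value}")
```

The per-question ratings for the three example tasks are illustrative assumptions. Only the buckets they land in come from the analysis in this post. In practice, most of the value lies in the team’s arguments about whether a test is a pass or a borderline case.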
A note on the mechanism: what these three questions are doing, structurally, is sorting work by the cognition it requires — not by the seniority of the worker, the visibility of the output, or how routine the task feels. Most teams approach AI by asking what feels routine, repetitive, or low-value enough to offload? — and that’s the wrong question.
Running the Diagnostic on Real Tasks
Here’s what the three questions look like applied to a sample of real tasks across the L&D workflow. Each row shows how a task tested against each question and where it landed.
The diagnostic isn’t perfect — some tasks sit right on the boundary, and reasonable people will inevitably disagree on edge cases. But what we find is that the diagnostic gives you a defensible starting point and, more importantly, it forces the right conversation. The team I worked with kept saying the same thing: “we’ve never had a way to talk about this before.”
The 10% We Can Delegate (and why it’s bigger than it looks)
The tasks that landed cleanly in the Delegate bucket are the ones where AI runs the work end-to-end and the human’s job is to spot-check a sample of its outputs at an agreed cadence.
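Spot-checking at an agreed cadence is easy to let slide, so it can help to make the sample itself mechanical. Below is a minimal sketch of one way to do that. It assumes each AI output carries some identifier. The 10% rate and the floor of five items are hypothetical defaults, not figures from the sprint. A fixed seed makes the sample reproducible, so a second reviewer can audit the same items.

```python
import math
import random


def spot_check_sample(output_ids, rate=0.1, minimum=5, seed=None):
    """Return a random subset of AI outputs for a human to review.

    rate and minimum are hypothetical defaults: agree your own with the team.
    """
    ids = list(output_ids)
    if not ids:
        return []
    # Review at least `minimum` items, or `rate` of the batch if that is larger,
    # but never more items than the batch actually contains.
    k = min(len(ids), max(minimum, math.ceil(rate * len(ids))))
    rng = random.Random(seed)
    return sorted(rng.sample(ids, k))


# Example: a batch of 120 drafted reminder emails, identified by number.
to_review = spot_check_sample(range(1, 121), seed=2024)
print(len(to_review), "emails to read:", to_review)
```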
Here are some of the tasks that showed up as “possible to delegate” across the workflow:
🔍 Analysis: scheduling stakeholder interviews. Transcribing them. Producing the first thematic clustering of responses. Pulling existing performance data from BI tools.
✏️ Design: tagging learning objectives against Bloom’s levels. Auditing storyboards for gaps in coverage. Generating distractor options for assessment items.
🛠️ Development: AI voice narration for first drafts. AI-generated closed captions. Copy-edit passes against the style guide. Generating layout-ready job aids from approved content. Image generation for non-brand-critical assets.
🚀 Implementation: drafting launch comms (announcements, reminders, FAQs). Personalising learner communications by segment. Generating manager briefing pack templates.
📊 Evaluation: pulling KPI data via BI tool API. Theming open-ended survey responses. Running item analysis on assessment data. Generating evaluation report appendices.
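Item analysis is a good example of the knowable, mechanical rules that make a task delegable. The sketch below computes two classical statistics for each item:

- Difficulty: the proportion of learners who answered correctly.
- Upper-lower discrimination index: the proportion correct in the top-scoring group minus the proportion correct in the bottom-scoring group.

It assumes dichotomously scored items (1 = correct, 0 = incorrect). The 27% group size and the 0.2 flagging cut-off are commonly cited rules of thumb, used here as assumptions rather than standards.

```python
def item_analysis(responses, group_fraction=0.27):
    """responses: one row per learner, one 0/1 score per item."""
    n_learners = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    # Rank learners by total score (ties keep their original order).
    order = sorted(range(n_learners), key=lambda i: totals[i])
    g = max(1, round(group_fraction * n_learners))
    lower, upper = order[:g], order[-g:]
    results = []
    for j in range(n_items):
        difficulty = sum(row[j] for row in responses) / n_learners
        p_upper = sum(responses[i][j] for i in upper) / g
        p_lower = sum(responses[i][j] for i in lower) / g
        results.append({"item": j, "difficulty": difficulty,
                        "discrimination": p_upper - p_lower})
    return results


def low_discrimination(results, threshold=0.2):
    """Indices of items whose discrimination index falls below the threshold."""
    return [r["item"] for r in results if r["discrimination"] < threshold]


# Tiny illustrative data set: four learners, three items.
scores = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 1, 0],
]
stats = item_analysis(scores, group_fraction=0.5)
print(low_discrimination(stats))  # prints [2]
```

Item 2 is flagged because strong and weak learners answer it correctly equally often. Whether to rewrite or drop it is still a human call.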
The pattern we didn’t expect
When we estimated time savings, the 10% delegate category accounted for roughly 18–22 hours per ID per week. That’s roughly half a working week, recovered from tasks that were never the actual job. They were the friction tax between the ID and the moments where their judgement mattered.
The framing I’d encourage you to keep in mind: these tasks aren’t unimportant. They’re the price the team used to pay for the actually-important work. AI doesn’t make them go away. It moves them into a category where they cost you almost nothing.
The hours you get back from this 10% are the hours you reinvest in the 10% at the other end.
The 80% — aka, the “tricky middle”
This is where most L&D teams get stuck. As one of the LinkedIn commenters (Paolo Perrone) put it, that 80% in the middle is where most teams haven’t figured out the workflow yet. They know they should be using AI, but they don’t know what using AI actually looks like in practice for the work they care about.
As I mentioned above, the 80% breaks into two flavours: faster work and better work.
Faster work is work you already know how to do. You have the expertise. You’ve done this before. AI just removes the keystrokes between you and the finished thing.
Better work is work that previously required specialists or capabilities you didn’t have access to. AI doesn’t just speed you up — it raises your ceiling.
Below are three workflows we documented during the sprint. I’ve structured them so you can extract the workflow itself, not just read about it.
Workflow 1: Drafting a learning needs analysis report (faster work)
Time before AI: 2–3 days
Time with AI: ~half a day
Bucket: 🤝 With AI
Inputs to gather:
Stakeholder interview transcripts (auto-transcribed by AI, themed by AI, validated by you)
Performance data from the relevant business units
Sponsor brief and intake notes
Existing programme documentation
Audience data from the HRIS
Steps:
AI (20 min): Feed the full corpus into Claude or your model of choice. Prompt it to produce a structured first draft against a standard report template (problem framing → performance gap → audience analysis → root cause hypotheses → recommended interventions → success metrics → risks).
AI output: A 4,000-word draft that captures roughly 70% of what the final report needs to say.
Human (2 hours): Edit for narrative — what should land hard, what should be quiet, where emphasis should sit for this stakeholder. Add political nuance — what’s true but doesn’t need to be said, what’s contested and needs handling carefully. Sharpen framing — make competent prose specific to this organisation, this moment, these people.
Human (30 min): Review with one trusted colleague before sending to the sponsor.
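If you run this workflow regularly, it can help to keep the report template in code so that every draft is requested the same way. Below is a minimal Python sketch that assembles the prompt from the seven template sections in the AI step above. The function name is hypothetical. The grounding instructions are an assumption about a sensible prompt, not part of the original workflow: cite a source label after each claim, and write INSUFFICIENT EVIDENCE rather than guess. Sending the prompt is left out. Paste it into Claude or your model of choice, or use your provider’s own SDK.

```python
REPORT_SECTIONS = [
    "Problem framing",
    "Performance gap",
    "Audience analysis",
    "Root cause hypotheses",
    "Recommended interventions",
    "Success metrics",
    "Risks",
]


def build_lna_prompt(sources: dict[str, str]) -> str:
    """Assemble a learning needs analysis drafting prompt.

    sources maps a short label (e.g. "Sponsor brief") to that document's text.
    """
    lines = [
        "Draft a learning needs analysis report from the source material below.",
        "Use only the sources. After each claim, cite the supporting source label in [brackets].",
        "If the sources do not support a section, write INSUFFICIENT EVIDENCE instead of guessing.",
        "",
        "Use these headings, in this order:",
    ]
    lines += [f"{i}. {section}" for i, section in enumerate(REPORT_SECTIONS, start=1)]
    # Wrap each source in labelled tags so citations can be traced back.
    for label, text in sources.items():
        lines += ["", f'<source label="{label}">', text.strip(), "</source>"]
    return "\n".join(lines)


prompt = build_lna_prompt({
    "Sponsor brief": "Regional managers report slow onboarding of new hires.",
    "Interview 1 transcript": "We never actually do it the way the SOP says.",
})
print(prompt)
```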
What AI is doing: Producing a defensible 70% draft from raw inputs.
What you’re doing: The 30% that turns a draft into a deliverable.
What to watch for: AI will produce confident-sounding recommendations that aren’t grounded in the data you gave it. There’s a well-documented phenomenon called automation bias — the tendency to over-trust decision aids and accept their outputs without sufficient scrutiny — that’s been observed in both novice and expert users, and is surprisingly resistant to training or warning labels (Parasuraman & Manzey, 2010). The defence is structural: if something in the draft surprises you, check whether the evidence is actually there before you use it.
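One narrow part of that structural defence can be automated: checking that every figure in the draft actually appears somewhere in the inputs. The sketch below is a crude first pass with clear limits. It only compares numbers exactly as written, so “70%” in the draft and “70 percent” in a transcript will be flagged as a mismatch. It also says nothing about whether a claim without a number is grounded. The function name is hypothetical. Treat the output as a list of things to go and check, not a verdict.

```python
import re

# Integers, decimals and comma-grouped numbers, with an optional percent sign.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")


def ungrounded_numbers(draft: str, sources: list[str]) -> list[str]:
    """Return figures that appear in the draft but in none of the sources."""
    source_figures = set()
    for text in sources:
        source_figures.update(NUMBER.findall(text))
    flagged = []
    for fig in NUMBER.findall(draft):
        if fig not in source_figures and fig not in flagged:
            flagged.append(fig)
    return flagged


draft = "Time-to-competence is 14 weeks, and 38% of new hires leave within a year."
sources = ["Onboarding data: average time-to-competence 14 weeks (n=212)."]
print(ungrounded_numbers(draft, sources))  # prints ['38%']
```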
Workflow 2: Semantic analysis of open-ended survey responses (better work)
Time before AI: Two weeks with a research assistant — or, more honestly, didn’t happen at all because most teams didn’t have the resourcing.
Time with AI: ~30 minutes
Bucket: 🤝 With AI
Inputs to gather:
800 open-ended responses from a recent post-programme survey
The original survey questions
The learning objectives the programme was meant to address
Steps:
AI (5 min): Paste responses into Claude with a structured prompt for thematic clustering: identify major themes, count responses in each, surface sentiment distribution, flag surprising outliers.
AI output: A structured summary with named themes, response counts, representative quotes, and sentiment per theme.
AI (5 min): Follow up — ask for explicit linkage from each theme back to the original learning objectives.
Human (15 min): Validate the themes — are they pedagogically meaningful, or has AI grouped things that don’t belong together? Are there themes that should have surfaced but didn’t?
Human (15 min): Decide what matters — which themes are interesting versus actionable, what to escalate to the sponsor, what to bring into the next iteration of the programme.
What AI is doing: Compressing two weeks of manual coding into 30 minutes of analysis.
What you’re doing: Validating the analysis and deciding what it means.
What to watch for: AI will sometimes cluster things that look similar but mean different things (e.g. “the content was hard” and “the content was hard to apply” — these are very different findings). Read sample quotes from each theme to catch this.
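The validation steps are easier if the AI’s coding comes back in a structured form you can check, rather than only as prose. Assume you ask the model to return one theme label and one sentiment label per response, for example as a CSV you read back in. The sketch below then recomputes the counts and sentiment split per theme itself, and draws a few random quotes from each theme to read. The input shape and function name are hypothetical. Recomputing the counts guards against the prose summary drifting from the underlying coding. The random quotes support the read-the-samples check described above.

```python
import random
from collections import Counter, defaultdict


def summarise_themes(coded, quotes_per_theme=3, seed=0):
    """coded: list of (response_text, theme, sentiment) tuples."""
    by_theme = defaultdict(list)
    for text, theme, sentiment in coded:
        by_theme[theme].append((text, sentiment))
    rng = random.Random(seed)  # fixed seed so reviewers see the same quotes
    summary = {}
    # Largest themes first.
    for theme, items in sorted(by_theme.items(), key=lambda kv: -len(kv[1])):
        k = min(quotes_per_theme, len(items))
        summary[theme] = {
            "count": len(items),
            "sentiment": dict(Counter(s for _, s in items)),
            "sample_quotes": [text for text, _ in rng.sample(items, k)],
        }
    return summary


coded = [
    ("The content was hard", "Content difficulty", "negative"),
    ("The content was hard to apply", "Content difficulty", "negative"),
    ("Loved the role plays", "Practice activities", "positive"),
]
for theme, info in summarise_themes(coded).items():
    print(theme, info)
```

Reading the sample quotes for “Content difficulty” immediately shows the lumping problem above: two different findings have been coded as one theme.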
Workflow 3: Forecasting KPI impact with regression analysis (better work)
Time before AI: Needed a data scientist or external consultant — most teams settled for benchmark claims like “similar programmes typically deliver a 10–15% improvement.”
Time with AI: ~1 hour
Bucket: 🤝 With AI
Inputs to gather:
Historical performance data on the relevant KPI (12–24 months)
Programme reach and intensity metrics
Comparable interventions and their measured impact
Steps:
AI (15 min): Give Claude the data with a structured prompt. Ask for a regression analysis estimating likely effect sizes with confidence intervals.
AI output: A working statistical analysis with all assumptions visible, plus a list of additional data that would tighten the prediction.
Human (45 min): Interpret — what does this mean for the sponsor’s expectations? What confidence level is defensible to share externally? Where would a sceptical CFO push back, and how would I respond?
What AI is doing: Statistical analysis that used to need a specialist.
What you’re doing: Interpretation and translation — turning numbers into a defensible position.
What to watch for: Treat AI-generated analysis as provisional modelling, not production-grade analytics. Models can silently make incorrect statistical assumptions, mishandle time series vs cross-sectional data, or report confidence intervals that are too narrow if your historical data is noisier than it looks. This workflow is suitable for forming a defensible position with the sponsor and shaping internal decisions — for high-stakes external reporting, your numbers should still be validated by someone with formal statistical training.
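To make “all assumptions visible” concrete, here is roughly what the core of such an analysis looks like. It is an ordinary least squares fit of the KPI against one measure of programme reach, with a confidence interval on the slope. This is a minimal sketch, not production analytics. It assumes a single predictor, a straight-line relationship and independent errors. Monthly KPI series are often autocorrelated, which breaks that last assumption and makes the interval too narrow: exactly the failure mode described above. The t critical value is passed in rather than computed, so the code needs only the standard library. The function name and data are hypothetical.

```python
import math


def fit_with_slope_ci(x, y, t_crit):
    """Ordinary least squares fit of y = a + b*x, with a confidence interval for b.

    t_crit is the two-sided t critical value for n - 2 degrees of freedom,
    e.g. about 2.074 for 24 monthly points (22 df) at 95% confidence.
    Assumes a linear relationship and independent, similarly sized errors.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must vary to estimate a slope")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    residual_var = sum(r * r for r in residuals) / (n - 2)
    se_slope = math.sqrt(residual_var / sxx)
    return {
        "intercept": intercept,
        "slope": slope,
        "slope_ci": (slope - t_crit * se_slope, slope + t_crit * se_slope),
    }


# Hypothetical data: cohorts completed per month (x) vs a KPI score (y).
cohorts = [0, 1, 2, 3, 4]
kpi = [61, 63, 62, 65, 64]
fit = fit_with_slope_ci(cohorts, kpi, t_crit=3.182)  # 95%, 3 degrees of freedom
print(fit)
```

With only five made-up points, the estimated slope is positive, but the interval runs from below zero to well above it. The data can’t rule out no effect, and translating that into something a sponsor can act on is the human step. Python 3.10+ also provides `statistics.linear_regression`, which returns the slope and intercept but no interval.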
The pattern across all three workflows
AI does the labour. The human does the direction and the interpretation. AI produces drafts, surfaces patterns, runs analyses. You decide what’s true, what’s relevant, what the recommendation should be.
This connects directly to research I covered in an earlier post on cognitive offloading (Hardman, 2026) — the productive use of AI in learning happens when humans delegate enough substantive work that they free up cognitive capacity, and then deliberately invest that capacity in the work AI can’t do. The same logic applies to professional practice. AI takes the labour; you do the higher-order thinking. The workflow problem is that most teams haven’t structured this trade explicitly.
The ceiling-raising effect
There’s a bigger consequence here that one of the LinkedIn commenters (Sandeep Bobhate) named well. For twenty years, the gap between elite L&D functions and everyone else was largely a resourcing gap. The big internal L&D teams at firms like McKinsey or Deloitte could afford the data scientists, the researchers, the literature reviews. Smaller teams, typical in-house functions, NGOs and organisations across the global majority were locked out of the same depth of analysis.
The empirical evidence on this is striking. A landmark field study by Brynjolfsson, Li and Raymond, published in The Quarterly Journal of Economics in 2025, examined the rollout of a generative-AI assistant for over 5,000 customer support agents (Brynjolfsson, Li & Raymond, 2025). Access to the tool increased productivity by about 14–15% on average — but that average obscures a much more interesting finding. Novice and low-skilled workers saw productivity gains of around 34%, while the most experienced workers saw only modest speed improvements and sometimes slight declines in quality. The authors interpret this as the AI system effectively diffusing the best practices of high performers to less-experienced workers, narrowing the performance gap.
In other words, AI raises the floor faster than it raises the ceiling. A team of two with AI can now run analyses that used to need a team of ten. The differentiator between L&D teams stops being who has the resources and becomes who asks the better questions. And question-asking is a far more democratically distributed skill than budget.
The 10% that Should Stay Human-Only
Of everything we sorted in this sprint, the 10% human-only bucket was the one the team kept coming back to. Not because it was the largest — it was the smallest — but because it kept revealing the things that actually defined the work.
Four moments landed in this bucket consistently across the project lifecycle:
👉 The sponsor conversation where the real problem surfaces. The “we don’t actually need training, do we?” moment, as Andrew Barry put it in a comment on the LinkedIn post. He ran a cohort last year where every single L&D person named that exact conversation as the hardest part of their job. Not the design. Not the build. The room where the sponsor talks themselves out of the work — or, more often, the room where you have to talk them into the right work even though they came in asking for something else.
That conversation turns on the ID’s ability to hold their ground without sounding defensive. It’s a consulting skill, not an L&D skill. And no amount of AI assistance changes the dynamic of being accountable for the diagnosis in that room. (For the underlying analytical work — distinguishing training problems from performance, process, and culture problems — Mager & Pipe’s Analyzing Performance Problems and Gilbert’s Behavior Engineering Model remain the canonical references (Mager & Pipe, 1970; Gilbert, 1978).)
👉 The SME validation moment. You’ve spent two weeks doing task analysis. You sit down with the SME to walk them through it. Mid-sentence, they stop you and say, “we never actually do it that way.” That moment — the interpretation of what they meant, the follow-up question that surfaces the actual practice — is irreducibly human. AI can transcribe the conversation. It can theme the responses. It can’t be the person noticing what isn’t being said.
👉 The pedagogical judgement call. Whole-task or part-task progression? Scaffolded practice or productive failure? These decisions don’t have right answers. They have better answers given this audience, this content, this organisation, this moment. AI can lay out the trade-offs. The call is yours, and it’s yours because you’re the one who can hold all the contextual variables AI can’t see — and because you’re the one whose name is on the design.
👉 Certifying a facilitator in person. You watch someone facilitate. You feel the room. You know whether they can hold a difficult group, recover from a wrong turn, handle pushback without losing the cohort. AI can observe the session, transcribe it, even surface patterns from the data. What it cannot do is vouch for the facilitator. The certification is an embodied professional judgement that has to be made — and signed — by someone who’s done the work themselves.
The asymmetry of error
What ties these four moments together — the second pattern that surprised me in the sprint — is the asymmetry of error.
You can redraft a report. You can re-theme a survey. You can regenerate an image. You cannot un-certify a facilitator who shouldn’t have been certified. You cannot un-tell the sponsor that training is the answer when it wasn’t. You cannot un-make a pedagogical call once the programme is built around it.
In decision-making terms, these are asymmetric decisions: the cost of a false positive is not the same as the cost of a false negative. Jeff Bezos popularised a closely related distinction, based on reversibility, as Type 1 versus Type 2 decisions. Type 1 decisions are “one-way doors” that are hard to reverse and demand careful deliberation. Type 2 decisions are “two-way doors” that can be easily walked back and reward speed and experimentation (Bezos, 2015). The 80% bucket in the framework above is full of Type 2 decisions. The 10% human-only bucket is full of Type 1 decisions. The mistake most teams make is treating them as if they belonged in the same category.
The 80% is where you save time. The 10% is where you justify your role.
The comment that captured this most clearly came from Tomasz Sobierajski: these are tasks where the human presence is the actual mechanism, and no amount of speed elsewhere makes up for getting them wrong. The word “mechanism” matters. In the 10% category, you aren’t doing tasks that produce value; you are the mechanism by which value gets produced. That’s a different kind of work, and it’s the work AI structurally can’t replace.
The implication for how L&D teams should be allocating attention: most teams I see spend too much time in the 80% and too little in the 10%. Partly because the 80% is where the visible deliverables sit — the reports, the decks, the modules, the dashboards. Partly because the 10% is uncomfortable — the conversations are hard, the judgement calls are exposing, the embodied presence is exhausting. But the leverage is the other way around. The 80% is what AI multiplies. The 10% is what nobody can multiply for you.
The Bigger Question: is ADDIE itself obsolete?
As a few of my LinkedIn followers commented, asking how we use AI to “augment” and “enhance” ADDIE is asking the wrong question altogether.
ADDIE was codified in the 1970s through US military training research, and it was a good answer to the problem it was solving. Training had to be reliable, reproducible, and scalable at a moment when expert designers were rare and production was expensive. ADDIE imposed sequence on what had been an artisanal process: first you analyse, then you design, then you develop, then you implement, then you evaluate. The order was the value. It meant that teams without a master designer could still produce competent training, because the framework carried the expertise.
Underneath the five phases sat four binding constraints that the framework was implicitly built to manage:
Expertise was scarce. You couldn’t have a learning scientist on every project. The framework substituted for one.
Production was expensive. Every minute of video, every printed workbook, every facilitator certification cost real money and time. You had to design carefully because rework was costly.
Feedback was slow. You launched the programme, waited months for Kirkpatrick L3 data, and only then learned whether it worked. The framework had to front-load decisions because back-loading them was impossible.
Learners were held captive. People took the training because their employer told them to. You didn’t need to design for engagement the way a consumer product would; you needed to design for completion and compliance.
For fifty years, this was a defensible architecture. The constraints were real, the framework was a reasonable response, and most of the variation in L&D outcomes was explained by how well a team executed against the framework, not whether the framework itself was right.
Three of the four constraints are collapsing
The picture in 2026 is unrecognisable from the picture in 1976.
👉 Expertise isn’t scarce anymore. Any L&D practitioner can now access the equivalent of a senior learning scientist as a thinking partner: on demand, at near-zero cost, with broad knowledge of the field’s frameworks, empirical studies and methodological debates. The expertise that ADDIE was designed to ration is now abundant. The consequence: the framework’s central job (encoding what a senior designer would do, so juniors could approximate it) can now be done without the framework.
👉 Production isn’t expensive anymore. Narration, video, item generation, captions, layout, copy-editing, asset production — much of it is now near-free and near-instant. A storyboard can become a working module in a morning. (High-fidelity simulations, enterprise integrations, and compliance-heavy builds still carry real cost — but the baseline production tax has collapsed.) The consequence: the architectural premium ADDIE placed on getting it right before you build is now misplaced. You can ship, watch, and iterate faster than you can document.
👉 Feedback isn’t slow anymore. Telemetry is continuous. Learner behaviour is observable in real time. Sentiment is measurable across thousands of learners simultaneously. The consequence: the assumption that evaluation is something you do after the programme is finished is now an artefact, not a necessity. Evaluation can run continuously, feeding back into design before the next cohort even launches.
Only the fourth constraint still binds us: most learners are still held captive. People take corporate training because their employer told them to, not because they chose to. That hasn’t changed, and it remains the single biggest design challenge in the field.
What replaces ADDIE probably isn’t another linear model
When three of four constraints collapse, the framework built to manage them stops being load-bearing. ADDIE becomes a historical artefact — useful for understanding how the field got here, but not the right architecture for designing the work going forward.
What replaces it almost certainly isn’t another linear model. Linear models exist to amortise expensive production over slow feedback loops. When production is cheap and feedback is continuous, you don’t need a sequence — you need parallel cycles.
The model I’m calling the Four Loops maps the enduring jobs of L&D onto four cycles that run simultaneously, feeding each other:
The diagnosis loop — continuously sensing what the business needs, what learners are struggling with, where performance is breaking down. Always on, not a phase.
The design loop — continuously testing pedagogical choices, scaffolding decisions, model selection. Iterative, evidence-driven, never finished.
The production loop — continuously building, shipping, revising. The unit of work is the smallest releasable improvement, not the finished programme.
The evidence loop — continuously measuring outcomes, surfacing what’s working, escalating what isn’t. Real-time, multi-channel, decision-feeding.
This is structurally how product teams work — and it’s the model L&D will increasingly resemble as the field catches up to the technology it now has access to.
But all of this is for another day. In 2026, ADDIE is still the language and mindset most L&D teams operate in, and the work in this post is about making the best of where we actually are, not where we might be.
The Four Loops or something like it will come — but for most teams reading this, the right next move isn’t to overhaul the framework; it’s to get clearer about which parts of the work AI belongs in, and which parts it doesn’t. That’s where the leverage and gains sit today, both for the business and the learning professional.
Closing Thoughts
The most interesting thing about doing this exercise — really doing it, task by task, across 300 of them — is that it changes how you think about your role.
When you can name precisely which work you’d hand to AI without hesitation, which work you’d want AI alongside you for, and which work you’d defend as yours under any circumstances — you’ve done more than build a workflow. You’ve articulated what your job actually is, and what you want it to be.
Most L&D professionals can’t articulate that right now. Not because they don’t know, but because the field hasn’t given them the language. The 10/80/10 framework is one attempt at that language. The Four Loops, when it lands, will be the next.
Whichever framework you use, the underlying move is the same: take the time to know what you actually do. Then decide, deliberately and intentionally, if, where and how AI fits.
And remember:
AI takes the work that didn’t deserve you. It buys you back the time to grow into the work that does. And it gives you back to the moments where your judgement, your expertise, and your presence are the whole point.
Phil 👋
PS: If you want to do work like this with me — alongside a cohort of L&D professionals applying this analysis to their own projects — check out my AI Bootcamp for L&D.