How to Get Consistent, On-Brand Course Images from Any AI Image Tool
A 3-step workflow that works every time — whatever AI tool you're using
Hey folks! 👋
Here’s something I keep seeing when I work with learning designers.
They open an AI image tool. Type “two colleagues having a difficult conversation in an office.” Get four images back — a bit generic, a bit stock-photo-ish, but fine. Pick one. Move on.
Six slides later, they generate another image for the same course. Different lighting. Different aesthetic. Completely different visual world. The course now looks like four people designed it on four different days.
Or the other version: they try harder. Twenty minutes crafting a detailed prompt — lighting, body language, colour palette, camera angle. They get something that kind of works, but can’t reproduce it for the next scene. So they spend another twenty minutes on that one. And the next.
I’ve watched this play out dozens of times. The frustration is consistent. So is the cause.
This is not a tool problem — it’s a process problem.
Most designers try to describe their way to an image. That’s the wrong approach. The goal is to show the tool the world it should be working in, then give it the minimum it needs to place your subject inside that world.
Every long, over-specified prompt is a sign that your visual inputs aren’t doing enough work.
The fix is a 3-step process that gives you superpowers in AI image generation:
Write a visual brief — answer six questions that close the creative and pedagogical gaps before you generate a single image.
Build a mood board — gather images that capture the lighting, energy, and environment of your learner’s world. Select the 3 that look like they were shot by the same photographer on the same day and upload them individually as style references.
Create character anchors — your style references fix the visual world; your character references fix the people inside it. For each named character, generate a head-and-shoulders image on a neutral background, facing forward. This is your master reference. Attach it alongside your style references every time you generate a scene featuring that character — and the tool stops making casting decisions on your behalf.
In this blog post, I walk you through each stage so you can create images with AI like a learning design pro.
Let’s go! 🚀
Wait, Isn’t AI Image Generation Getting Better & Easier?
Before we dive into the pro-workflow, this question is worth addressing directly.
AI image generation tools are improving fast.
If you type “two colleagues having a difficult conversation in an office” and then follow up in the same chat with “the same two colleagues, different scene,” most decent AI generation tools will use the conversation context to keep the characters roughly consistent. At first glance, it looks like we got what we asked for:

But look carefully at what the tool did in the second image (enlarged version below), without being asked:
A document labelled “URGENT REVIEW” has appeared on the desk. The tool has decided what the conversation is about.
She’s seated, he’s standing over her with arms crossed. That’s a specific power dynamic — and it may not be the one your scenario needs.
Her clenched fists signal frustration or distress. “Difficult conversation” could mean a dozen things; the tool picked one.
They’re in an acoustic pod booth — a semi-private setting that implies the conversation is sensitive enough to need it. Plausible. But not your call to make.

None of this is wrong, exactly — it’s just not what you asked for. The tool filled every gap you left open with its own assumptions — and in a real course, those assumptions carry meaning. A learner will read the power dynamic, the props, the setting, and draw conclusions you didn’t intend and can’t control.
This is what prompting without a system looks like. The tool isn’t malfunctioning — it’s doing exactly what it’s designed to do. The question you need to ask is whether you are the one making the creative decisions, or whether you’ve handed them over without realising it.
The key takeaway here is that AI image generation tools will make decisions for you unless you build a system that makes them first. The better the tools get at filling gaps, the more important it is that you’re the one deciding what goes in them.
That’s what this 3-step workflow is for. Here’s how it works, step by step.
Step 1: Write a Visual Brief
A visual brief doesn’t just help you stay consistent — it also shrinks the “creative decision-making space” of an AI tool, giving you more control over the output. Every field you fill in is a gap you’ve closed — a choice the tool can no longer make without you.
By building a structured visual brief for each project, you take control over creative decisions that directly affect both output quality and how your learners experience the course:
Who the learner is shapes what the images need to show — and how much cognitive work a learner has to do to place themselves in the scenario. The closer the visual world is to theirs, the less distance there is between the training and the job.
Named characters with descriptions become the prompts you paste into every scene — and give learners consistent people to follow, so working memory goes on the content, not on figuring out who they’re looking at.
The single most important thing becomes the tiebreaker when you’re not sure which of four outputs to pick — and the design principle that stops visual decisions getting made on aesthetic grounds alone.
To build a visual brief, answer these six questions before you open any AI image tool:
Who is the learner? Not “employees” — specific. Frontline retail staff, 20s–40s, UK-based, fast-paced customer-facing roles.
What is the tone? 3–5 adjectives + one negative. Warm, candid, direct — not corporate, not lecture-y.
What does their world look like? Name the physical environment. Office, warehouse, hospital ward.
Who are my characters? Name and describe them. Priya: team leader, mid-30s, South Asian, approachable. Jake: new starter, early 20s, slightly uncertain.
What moments do I need? Priya giving Jake difficult feedback. Jake making an error alone. Team reacting well in a huddle.
The single most important thing: If this image set does one thing, what is it? Make this learner feel seen in their own world. This becomes the tiebreaker for every ambiguous decision.
Example Visual Brief
Example Outputs
TLDR: A visual brief turns a generation session from a guessing game into a directed process. Every image you produce after writing one is a decision you made — not one the tool made for you.
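If you keep project files alongside your course assets, the brief can also live as structured data, so an unanswered question is easy to spot before you generate anything. The sketch below is one possible way to do that, not part of any tool: the `VisualBrief` class, its field names, and the `open_gaps` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class VisualBrief:
    """The six questions from Step 1 (field names are illustrative)."""
    learner: str
    tone: str
    world: str
    characters: dict[str, str]  # character name -> description
    moments: list[str]
    most_important_thing: str


def open_gaps(brief: VisualBrief) -> list[str]:
    """Return the fields left empty, i.e. the gaps the tool would fill for you."""
    return [f.name for f in fields(brief) if not getattr(brief, f.name)]


brief = VisualBrief(
    learner="Frontline retail staff, 20s-40s, UK-based, customer-facing roles",
    tone="Warm, candid, direct; not corporate, not lecture-y",
    world="",  # deliberately unanswered to show the check
    characters={"Priya": "team leader, mid-30s, South Asian, approachable"},
    moments=["Priya giving difficult feedback"],
    most_important_thing="Make this learner feel seen in their own world",
)

print(open_gaps(brief))  # ['world']
```

An empty list from `open_gaps` means every question has an answer. It does not judge whether the answers are specific enough; that part is still your call.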
Step 2: Build a Mood Board
Whether you’re working with AI or humans, communicating visuals using text is tough. Ask five learning designers to create a "professional workplace conversation" and you'll get five completely different images, and so will the AI.
Uploading images sidesteps the language <> image translation problem entirely. By uploading images rather than describing visuals, you’re working in line with the tool’s native input system rather than trying to describe your way across to it.
A mood board is how you define the visual world of your course before generation begins, so the tool is sampling from your aesthetic rather than its own defaults.
If your org or client has a brand guidelines document or an existing image library, start there. Brand photography is often the most useful raw material you have — it shows the tool the exact visual world the client already lives in.
A quick note on sourcing images for your mood board:
Pinterest and Same Energy are best for building your mood board — discovery tools, not licensed sources.
For free assets, Unsplash has the highest aesthetic quality. Pexels and Pixabay have broader libraries, but not all images are free.
For final client-facing courseware, a paid library like Getty or Adobe Stock gives you cleaner commercial rights.
Increasingly, designers skip external stock entirely, generating style references in Midjourney or GPT-4o and using those outputs as the mood board that steers everything that follows.
The most common workflow I see is: Pinterest or Unsplash to find the feel → AI to build the style system → AI for the final course imagery.
Pull the 8–10 images that best capture the lighting, the people, and the environment, and use those as your first style references. Tone-of-voice guidelines are less directly useful, but the adjectives you use to describe your brand (warm, bold, human, precise) can translate directly into the tone field of your visual brief.
As you go, gather your images in Pinterest or a similar platform (Milanote or Miro works well if Pinterest is blocked at your organisation).
Then, take 10–15 minutes to select ~3 images that look most like they belong together — same lighting temperature, same camera energy, same colour palette.
You're not picking your three favourite images here: you're picking the three that 1. capture your intended look & feel and 2. share the most visual DNA.
For each image you collect, ask:
Does the lighting match the tone of the course? E.g. warm and natural for a human, relatable feel. Cool and clinical for healthcare or compliance. Bright and high-contrast for energy and urgency.
Do the people have the right energy? E.g. candid and absorbed in their work, or posed and aware of the camera? Real and imperfect, or polished and performative?
Does the setting look like the learners’ real world? E.g. not just “an office” — their office. Not “a hospital” — their ward. The more specific the environment, the less distance there is between the image and the learner’s reality.
If any answer is no, don’t keep the image hoping it’ll work. Search specifically for what’s missing, then replace it.
Those shared qualities are what the tool will compound when you upload the images as individual references (more on this in the Pro Tip below).
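If it helps to make "shares the most visual DNA" concrete, one rough way is to jot a few tags against each image and see which three have the most tags in common. The sketch below assumes you write the tags by hand. The file names, tags, and the `pick_style_references` helper are all made up for illustration, and your eye remains the final judge.

```python
from itertools import combinations

# Hypothetical hand-written tags for each mood board image.
mood_board = {
    "till_chat.jpg": {"warm light", "candid", "navy palette", "shop floor"},
    "stockroom.jpg": {"warm light", "candid", "navy palette", "stockroom"},
    "huddle.jpg": {"warm light", "candid", "navy palette", "shop floor"},
    "poster.jpg": {"cool light", "posed", "bright palette", "studio"},
    "window.jpg": {"cool light", "posed", "navy palette", "shop floor"},
}


def pick_style_references(board, k=3):
    """Return the k images whose tags overlap the most, plus the shared tags."""
    def shared(names):
        return set.intersection(*(board[name] for name in names))

    best = max(combinations(sorted(board), k), key=lambda names: len(shared(names)))
    return list(best), shared(best)


references, shared_tags = pick_style_references(mood_board)
print(references)   # ['huddle.jpg', 'stockroom.jpg', 'till_chat.jpg']
print(sorted(shared_tags))  # ['candid', 'navy palette', 'warm light']
```

The tags that survive the intersection are the qualities all three images have in common. Anything that appears in only one image, such as the stockroom setting here, drops out.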
🔥 Pro Tip: Search for Energy & Aesthetics, Not Topics
The most common mistake I see here is that designers search for their course topic instead of the aesthetic they want. If you search “safeguarding” you get ugly and generic posters and leaflets.
The key? Search for energy and aesthetics, not just topics:

🔥 Pro Tip: Upload your style references individually — not as one combined image
You have your three visual anchor images. How you upload them matters as much as which ones you chose.
The instinct is to combine them — paste all three into one image and upload that. Don’t. When the tool receives a single combined image, it treats the whole thing as one reference signal and averages across everything at once. The shared qualities you selected for get diluted by everything else in the collage. The output looks vaguely related to your references but not distinctly like any of them.
When you upload your anchor images individually, something different happens. The tool applies each reference as a separate signal. The qualities your images share — the lighting, the energy, the colour palette, the environment — compound and reinforce each other. The qualities they don’t share cancel out. You get more of what you wanted, not a blurred mean of everything.
Here’s what that looks like in practice. All three generations below used the same prompt: “Two retail staff colleagues in conversation on a shop floor.”
👉 Generation 1 — two visual references, uploaded as one combined source image
Nano Banana picked up the broad themes of a retail environment (shelving, signage, etc.) but lost everything else, including my protagonists Priya and Jake.
Why? Because the two source images were combined into a single file. This means the AI receives them as one visual input and averages across both simultaneously.
The output splits in two as a result: a wide establishing shot with no characters on the left; a group scene with the wrong people on the right.
The navy uniforms survived, but Priya, Jake, the till interaction, the close framing, the documentary energy — all dissolved.
What the AI actually received was one image containing two different compositions, two different settings, and two different groups of people. With no way to know which elements were primary, it sampled everything equally and produced something that belongs fully to none of it.
👉 Generation 2 — two visual references, uploaded as two separate images
Nano Banana reproduced Priya, Jake, the store, the lighting, and the documentary energy.
Why? Because the two source images arrived as separate signals. The AI processed each one independently and, instead of averaging across everything, looked for what they had in common.
The shared elements therefore compounded: the same store environment, the same navy uniforms, the same till, the same documentary framing, the same cast of characters. The elements that differed — the break room from one image, the group composition from another — cancelled out, because they weren’t consistent across both references.
What the tool actually received was two clear, distinct windows into the same visual world. With each reference processed on its own terms, it could identify the signal that ran through both — and build from that. The result isn’t an average — it’s a distillation.
Step 3: Create Character Anchors
By the end of Step 2, you have defined your visual world — the right environment, the right lighting, the right aesthetic. But definition and consistency are different things. Without a character reference, the tool makes new casting decisions with every generation. And as you'll see, even the environment can drift when there's no consistent human anchor to hold it in place.
A learner who sees Priya in a general retail store on slide 3 and then sees someone who might-be-Priya in a supermarket on slide 12 has to do cognitive work to reconnect them, work that should be going on the learning content. Character and environment consistency isn’t a nice-to-have: it’s a cognitive load issue.
From my experience, this is also the step that most IDs hit a wall on first and can’t solve by instinct. Here’s how to do it like a pro:
Step 3a: Generate images of your character(s) on a neutral background
Don’t pull your character reference from a scene you’ve already generated. A busy background, other people, strong props, or dramatic lighting will all bleed into subsequent generations — the tool will try to reproduce the context, not just the person.
Instead, generate each character fresh, in isolation:
Neutral white or grey background
Soft, even lighting
Frontal or near-frontal view
No other people, no strong props, no distracting environment
To do this, use your visual brief description as the prompt — then upload an existing image of your character if you already have one, to seed the output. So, for Priya this would be:
South Asian woman, late 30s, dark curly hair tied back, navy branded uniform, friendly and direct expression, neutral grey background, soft even lighting, facing forward, head and shoulders
Generate several versions and pick the one that best matches your brief. This is your master reference.
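If you have several characters, it can help to keep the neutral-background wording in one place so every anchor is generated with the same setup. This is a minimal sketch, assuming you store each character description as plain text. The `character_anchor_prompt` helper and the `ANCHOR_SETUP` constant are hypothetical names, and the output is just text to paste into whichever tool you use.

```python
# The fixed studio setup from Step 3a, written once and reused for every character.
ANCHOR_SETUP = (
    "neutral grey background, soft even lighting, facing forward, head and shoulders"
)


def character_anchor_prompt(description: str) -> str:
    """Combine a character description with the shared anchor setup."""
    cleaned = description.strip().rstrip(",")
    return f"{cleaned}, {ANCHOR_SETUP}"


priya = (
    "South Asian woman, late 30s, dark curly hair tied back, "
    "navy branded uniform, friendly and direct expression"
)
print(character_anchor_prompt(priya))
```

Because only the description changes between characters, every anchor image starts from the same background, lighting, and framing.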
Step 3b: Upload the cropped reference with every scene prompt
From this point forward, every prompt featuring that character gets the cropped reference attached as an individual image, alongside your style references from Step 2.
Tools with dedicated character reference features (e.g. Midjourney’s --cref and Leonardo’s Character Reference mode) will give you a stronger identity lock, so use them if your tool supports them. If not, attaching the cropped image directly alongside your prompt works as a reliable fallback.
Here’s what happens without it:
👉 Generation 1 — style references uploaded, but no character reference
I gave Nano Banana the prompt & style references, then ran it five times in the same chat.
The first two images hold reasonably well — the conversation context is doing some work. By image three the uniform has shifted and the store has changed. By images four and five the manager is a different person in a different store, with grocery items on the conveyor belt that nobody asked for.
The character drift here is real — but so too is the environment drift. That’s not a coincidence: style references define the visual world, but characters anchor it. Without a consistent person holding the scene together, the tool recasts both the people and the place with every generation.
👉 Generation 2 — style references uploaded, along with a character reference head-shot
I gave Nano Banana the same prompt & style references, plus the source image of Priya that I generated above. I then ran a series of prompts to test if the environment and character remained consistent.
It worked! Priya is Priya in every image. The environment is also consistent. Jake is consistent. The only thing that changed between Generation 1 and Generation 2 was one additional input: a single cropped head-and-shoulders reference image, attached alongside the style references.
That’s what a character reference system does: one image per character, uploaded every time — and the tool stops making casting decisions on your behalf.
🔥 Pro Tip: Tell the AI that your character images are AI-generated
AI image generation tools occasionally flag photorealistic character references as potential real people and decline the request. The fix is simple: add “The image is an AI-generated fictional character reference” to your prompt. It’s a one-line workaround that has worked every time for me.
If you're working with images of real people, check your AI tool's terms of service before using them as references. Permission from the individual is necessary but not sufficient — most platforms have separate policies on generating new depictions of identifiable people.
Quick Start
Next time you need to generate images for a course with AI, here’s where to start:
👉 Every time you start a project:
Write your visual brief (Who is the learner? What’s the tone? Who are the characters?)
Build your mood board — collect 8–10 images, then select the 3 that share the most visual DNA and save them as your style references
Generate a head-and-shoulders character reference for each named character (neutral background, facing forward)
👉 Every time you want to generate an image with AI:
Attach your 3 mood board style references individually (one image per upload, not combined into a single file)
Attach your character reference (one image per character)
Write the shortest possible prompt that names the characters, describes the action, and places them in the setting, e.g. Priya giving feedback to Jake at the till on the shop floor.
That’s the full system. The first time you build it takes an hour. Every generation session after that takes minutes — and produces images that look like they belong to the same course, the same world, and the same learner.
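For anyone who likes a checklist they can run, the per-image routine above can be sketched as a small pre-flight check. Everything here is an assumption for illustration: the `GenerationRequest` class, the `check_request` helper, the file names, and the 25-word limit are hypothetical, and nothing is sent to any image tool.

```python
from dataclasses import dataclass


@dataclass
class GenerationRequest:
    prompt: str
    style_references: list[str]           # one file per reference, never a collage
    character_references: dict[str, str]  # character name -> anchor image file


def check_request(req: GenerationRequest, cast: list[str], max_words: int = 25) -> list[str]:
    """Return a list of problems to fix before generating (empty means ready)."""
    problems = []
    if not 2 <= len(req.style_references) <= 3:
        problems.append("attach two or three style references, one file each")
    for name in cast:
        if name in req.prompt and name not in req.character_references:
            problems.append(f"missing character reference for {name}")
    if len(req.prompt.split()) > max_words:  # hypothetical threshold
        problems.append("prompt is long; let the references carry the look")
    return problems


request = GenerationRequest(
    prompt="Priya giving feedback to Jake at the till on the shop floor.",
    style_references=["style_1.jpg", "style_2.jpg", "style_3.jpg"],
    character_references={"Priya": "priya_anchor.png"},
)
print(check_request(request, cast=["Priya", "Jake"]))
# ['missing character reference for Jake']
```

The check only catches what can be counted: the number of references, missing anchors, and prompt length. Whether the output fits the brief is still a human decision.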
Concluding Thoughts
Each step in this system does a specific job — but the reason it works isn’t any single step: it’s what happens when you run them together.
The visual brief closes the content gaps. The mood board closes the aesthetic gaps. Individual uploads ensure those aesthetic signals compound rather than blur. The character reference closes the casting gap. By the time all four are in place, you’ve made almost every creative decision the tool would otherwise make for you. What’s left is the minimum it actually needs: the scene.
Once the system is built, generating a new image for a learning experience is straightforward and has just three elements:
A short prompt describing the scene and the moment — who is present, what they’re doing, where. The prompt places the subject. The system handles everything else.
Two or three style reference images, attached individually — these define the visual world. The lighting, the colour palette, the camera energy. They tell the tool how the scene should look.
A character reference, attached individually — this locks your character’s identity so the tool places the same person into every scene rather than casting someone new each time.
No lighting instructions. No camera angle. No colour palette description. All of that is already handled by what you built in Steps 1–3. The brief, the mood board, the references: these are your creative decisions, made once, then used again and again.
That’s the compounding effect in practice. Your brief is reusable. Your mood board is reusable. Your style and character references are reusable. The work you do before you start generating pays dividends across every session that follows — across thirty images, six modules, multiple designers working on the same course.
Bigger picture, what does this experiment tell us about where AI is actually taking our profession?
The conversation in L&D tends to frame AI image generation as an efficiency play — faster turnaround, lower cost, less dependency on external suppliers. And yes, it’s all of those things. But speed without a system just means you produce inconsistent, tool-directed images faster. That’s not better courseware — it’s more of the same, quicker but more generic and less impactful for learners.
What this workflow points to is something more interesting: AI doesn’t just change how fast we work, it changes where the design work lives. The creative decisions that used to happen implicitly — through photographer briefings, stock library curation, art direction — now have to happen explicitly, before you open the tool. The brief, the mood board, the reference system: these aren’t workarounds. They’re the design process, made visible.
The designers I see getting the best results from AI image generation aren’t the ones who’ve mastered prompting. They’re the ones who’ve mastered front-loading: they do the thinking before the generating, and validate the system until they trust it to run with only spot checks.
That’s not a faster version of the old workflow — it’s a genuinely different one. And in my view, it’s a better one — because it puts the creative decisions back where they belong: with the designer, not the tool.
Happy experimenting & innovating!
Phil 👋
PS: Want to master AI‑augmented instructional design? Apply for a place on my AI & Learning Design Bootcamp, where we get hands-on and try and test methods just like this.









