Part 2: Designing the Study and Building the Modules
Part 2 of a three-part series on integrating generative AI into an eLearning module for personalized feedback, a project with Jennifer Chien later published in Training Industry. [Read Part 1 here.]
In Part 1, I shared how this project came together: a goal to get published, a hot pot dinner with Jennifer Chien, and a question worth chasing. Can generative AI give learners real-time, personalized feedback, even on a topic as sensitive as suicide prevention?
This post is about the harder part: designing the study, building the modules, and discovering that the single most difficult element was one I had completely underestimated.
What we set out to measure
We weren't writing a formal academic paper, but we still wanted real data. We focused on two primary things: engagement and effectiveness.
Engagement: does typing your own answer and getting personalized feedback pull a learner in more than selecting from a list, even when the scenario itself is well written? Effectiveness: do learners actually take something away from it?
We also gathered data on satisfaction, on attitudes toward the subject matter, and on specific qualities of the AI feedback itself: clarity, sensitivity, personalization, and how actionable it was. When you're already collecting data, it's worth collecting enough to actually learn something.
Two modules, two versions
The design centered on a single scenario rendered two ways.
The traditional version: a question, a set of answer options, and pre-written feedback based on what the learner selected. This represents how interactive eLearning feedback usually works.
The AI version: the same scenario, but instead of choosing from options, the learner types their own response into an open field. A generative AI tool then reads what they wrote and returns immediate, personalized feedback. A "try again" button lets learners clear the interaction and experiment with different approaches, as many times as they want.
The scenario itself: how to start a conversation with a coworker you're concerned might be at risk of suicide. The learner reads brief introductory material, then is dropped into the moment and asked what they would say.
We added a parallel leadership styles module as a lower-risk comparison. If AI feedback only worked for safe, low-stakes content, the leadership module would show it. If it could hold up for suicide prevention too, that would be the more meaningful finding.
The part I underestimated: the prompt
I assumed the technical integration would be the hard part. It wasn't. Jennifer handled that side skillfully, using JavaScript and a hosted server to connect the AI tool (ChatGPT) to Articulate Storyline.
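For readers who want a rough picture of that kind of connection, here is a minimal sketch. It is illustrative only and is not the project's actual code. GetPlayer, GetVar, and SetVar are Storyline's standard JavaScript hooks. Everything else is an assumption made for the example: the endpoint URL, the JSON shape with a feedback field, and the Storyline variable names LearnerResponse and AIFeedback.

```javascript
// Hypothetical endpoint on the hosted server; the real URL is not given in this post.
const FEEDBACK_URL = "https://example.com/feedback";

// Build the JSON body sent to the server. Trimming removes stray whitespace.
function buildFeedbackRequest(learnerResponse) {
  return JSON.stringify({ response: String(learnerResponse).trim() });
}

// Read the learner's text from Storyline, ask the server for feedback,
// and write the result back into a Storyline variable.
// `player` is the object returned by Storyline's GetPlayer();
// `fetchFn` defaults to the browser's fetch and can be swapped out for testing.
async function requestFeedback(player, fetchFn = fetch) {
  const learnerResponse = player.GetVar("LearnerResponse"); // hypothetical variable name
  try {
    const reply = await fetchFn(FEEDBACK_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: buildFeedbackRequest(learnerResponse),
    });
    if (!reply.ok) throw new Error("Server returned " + reply.status);
    const data = await reply.json(); // assumed shape: { feedback: "..." }
    player.SetVar("AIFeedback", data.feedback); // hypothetical variable name
  } catch (err) {
    // Show a fixed message rather than leaving the learner with an empty screen.
    player.SetVar("AIFeedback", "Feedback is unavailable right now. Please try again.");
  }
}
```

In a Storyline "Execute JavaScript" trigger, a call like requestFeedback(GetPlayer()) would run this when the learner submits. Routing the request through a hosted server, rather than calling the AI service directly from the module, is a common way to keep the API key out of the published course; that is general practice, not a description of this project's setup.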
The hard part was the prompt, and the question behind it.
My original question was nearly closed-ended. Lay out the warning signs of suicide risk, present a scenario, and ask the learner to identify which signs the coworker was showing. The problem became obvious quickly: that's a multiple-choice question wearing a costume. It could have been handled with pre-written feedback and answer options. It did not use what AI actually makes possible.
To make the integration worth anything, the question had to invite a genuine, open response, the kind of answer that varies meaningfully from person to person and that pre-written feedback could never anticipate. That meant rewriting the question to ask the learner how they would actually start the conversation, in their own words.
Then came the prompt itself. Early versions produced feedback that was inconsistent, sometimes generic, occasionally misaligned with what the learner had written. In some cases the AI responded "positively" to a poor answer. In others it referred to the learner in the third person, which broke the conversational tone we wanted. On a topic this sensitive, inconsistency isn't just sloppy. It could be a serious risk.
Getting it right took iteration and expert review. We refined the prompt repeatedly, applying prompt engineering practices and running versions past a suicide prevention subject matter expert to check that the guidance was accurate and appropriately handled. This wasn't a one-and-done. It was a loop: test, review, refine, test again.
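Part of a loop like this can be automated. The sketch below is a hypothetical helper, not something from the project: it scans saved AI feedback for one of the mechanical problems described above, feedback that talks about the learner instead of to them. The phrase list and the sample format (objects with an id and a feedback field) are assumptions. A check like this cannot judge whether the guidance is accurate or appropriately sensitive; that still takes expert review.

```javascript
// Phrases suggesting the feedback talks about the learner instead of to them.
// This list is illustrative, not exhaustive.
const THIRD_PERSON_PATTERNS = [/\bthe learner\b/i, /\bthe user\b/i, /\bthe respondent\b/i];

// Return a list of mechanical tone problems found in one piece of AI feedback.
function findToneIssues(feedback) {
  const issues = [];
  if (THIRD_PERSON_PATTERNS.some((pattern) => pattern.test(feedback))) {
    issues.push("refers to the learner in the third person");
  }
  if (!/\byou(r|rs)?\b/i.test(feedback)) {
    issues.push("never addresses the learner as 'you'");
  }
  return issues;
}

// Run the check over saved outputs from one prompt version
// and keep only the samples that need a closer look.
function reviewPromptVersion(samples) {
  return samples
    .map((sample) => ({ ...sample, issues: findToneIssues(sample.feedback) }))
    .filter((sample) => sample.issues.length > 0);
}
```

Running reviewPromptVersion over a batch of saved outputs after each prompt change gives a short list of samples to reread before the next round of review, so obvious regressions are caught before an expert's time is spent on them.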
Why the output format kept changing
One honest thread in this project: the final output format moved around a lot.
We first looked at a conference, then a peer-reviewed research article, before landing on a case study published in an industry magazine. Each shift taught us something about scope and fit. A full academic study didn't match how either of us works best; a case study let us keep rigorous measurement while writing in a voice and format that suited us and our field. Naming that honestly matters, because real projects rarely move in a straight line, and pretending otherwise doesn't help anyone learning from them.
By the end of this stage, we had two working modules, a refined prompt vetted by an expert, and a survey ready to go. The next step was the one that would tell us whether any of it actually worked: putting it in front of reviewers.
That's Part 3: what 17 experts said, what the data showed, and the most useful thing we learned, which turned out to be a limitation.
Part 3: The results, and what the experts taught us. [Coming next.]