The Question That Can't Wait: Launching the AI Item Authoring Research Sprint

AI tools are already authoring test items in classrooms and examination halls worldwide. But are we truly understanding what happens when algorithms replace—or augment—human item writers?

Published: 11/27/2025

The Question That Can't Wait: Launching the AI Item Authoring Research Sprint

Join the Research Sprint: https://research.testcommunity.network/

This isn't an abstract, future-focused question. Organisations are making decisions today about AI adoption that will shape assessment practices for years to come. Students are encountering AI-authored items in high-stakes exams. Legal precedents are being set. Security vulnerabilities are emerging.

That's why we brought together four leading experts from across the globe to launch the Test Community Network's first Research Sprint on AI item authoring—and the conversation that unfolded was nothing short of essential viewing for anyone working in the assessment space.

The Expert Panel: Diverse Perspectives, Shared Urgency

Joining from Patagonia, Chile, Sergio Araneda (Research Scientist at Caveon) brought his expertise in test security and psychometrics, framed by a background in applied mathematics and engineering. From Middle Earth—quite literally down the road from Hobbiton—Karl Hartley (Director of AI Learning Strategy at Epic Learning) shared fresh insights from New Zealand's largest research into generative assessments: a chunky 114-page report packed with data and real-world findings.

Miia Nummela-Price, a Fractional General Counsel specialising in the intersection of law, education and technology, provided the legal compass we all need as we navigate uncharted regulatory waters. And from Manchester, Neil Wilkinson (Director of Product and Innovation at Certiverse) brought years of high-stakes assessment creation experience from Pearson, now focused on making complex exam creation processes straightforward and accessible.

Together, they explored a landscape that's evolving faster than any of us anticipated.

The Legal Landscape: Still Being Written

Miia opened with a reality check that should give pause to any organisation rushing headlong into AI adoption: the legal frameworks governing AI-generated content are still developing. "These conversations will continue for the foreseeable future," she noted. "I'm not sure if I'm going to be here to witness the end position and very finite answer to all of these."

The key questions keeping legal teams awake at night?

Who owns AI-generated assessment items? Is it your organisation, or does the AI provider retain rights?
What happens when AI-authored items prove flawed? Where does liability sit in a world of algorithmic authoring?
How do copyright laws apply when content is generated rather than created?
What contractual protections do you need in vendor agreements?

Knowing how case law develops—with hundreds of years of precedent in other areas of law—Miia emphasised that we're witnessing the early stages of a legal evolution. The organisations that approach this with curiosity, rigorous contract review, and a willingness to adapt will be best positioned as frameworks solidify.

Quality Assurance: The Human Remains Essential

One of the most powerful themes to emerge was the critical, ongoing role of Subject Matter Experts (SMEs). As Neil put it bluntly: "You still need to treat your SMEs really well... If you lose all your SMEs because they think they're about to be replaced by AI, then all of the things that we've talked about in terms of creating better items will just fall apart."

This isn't about AI replacing humans—it's about augmentation.

Karl's research revealed a game-changing approach: conducting deep discovery sessions with SMEs before engaging AI tools. "We were able to generate content closer to the SMEs' expectations and sped up review times," he explained. By grounding the AI's knowledge in the SME's expertise from the start, organisations achieved better alignment and faster workflows.

Several community members shared their experiences in the chat:

"AI is good for ideation and as an assistant, but it's no replacement for an actual SME with experience. They're particularly bad at coming up with realistic distractors because they don't 'understand' context."

"The issue with AI as a SME is that it doesn't have—and really can't gain—experience. A good SME draws on their experience to create items that reflect the actual job role."

The consensus? AI excels as an advisor, editor, and ideation partner. But the lived experience, contextual understanding, and professional judgement that SMEs bring cannot be replicated by algorithms—at least not yet.

The Distractor Dilemma

An interesting debate emerged around one of the trickiest aspects of multiple-choice item writing: creating effective distractors.

"SMEs also not great at writing good distractors," one participant noted—a refreshingly honest admission. But as another quickly countered: "Good distractors can be difficult to create, but AI hallucination isn't the same as a 'good' distractor, either."

Here's where AI's limitations become clear. A good distractor represents a common misconception, a plausible error in reasoning, or a partial understanding of the concept. It requires understanding how people think incorrectly about a topic—something that comes from teaching experience, not pattern recognition in training data.

AI hallucinations might produce wrong answers, but they're often nonsensically wrong rather than instructively wrong. The difference matters enormously for the educational value of an assessment.

Difficulty Calibration: A Persistent Challenge

When asked if difficulty level of AI-created items is ever an issue, Neil's response was unequivocal: "Yes! AI is definitely better at creating easier items, and more knowledge-based items. BUT. So are SMEs—it's a real skill creating harder items."

Karl's research confirmed this challenge: "My research had issues with getting the difficulty right and it will often vary throughout the assessment."

Interestingly, one of Karl's breakthrough findings involved getting AI to create model answers and justify them. This approach "gave much better results"—suggesting that forcing the AI to articulate its reasoning improves the quality of both the question and the answer options.

But Neil also reminded us that humans face this same challenge. "It's often a fun game to play when reviewing content to 'guess the writer'—people often write in the same style about similar content." The consistency issue isn't unique to AI; it's a fundamental challenge in content creation.

Security Vulnerabilities: New Risks to Consider

Sergio brought his test security expertise to bear on a crucial question: what new vulnerabilities does AI authoring introduce?

His key insight? We need to become more rational about our expectations of Large Language Models (LLMs). "I think right now, in general, we have been using LLMs and crossing fingers that it's going to work," he observed. "But I think there is a way in which we are a little bit more rational about this belief."

His recommendation: dive into the benchmarks used to train the model and understand exactly how the LLM was built. This gives you a proper foundation for expectations.

For example, if you're asking an LLM to estimate item difficulty, you need to ask: was the model trained for this task? If not, why would we believe it can do it accurately? Understanding what tasks the model has been trained on—and how similar they are to your use case—is fundamental to responsible deployment.

The community raised important questions about procurement: "How do you make good judgements at the procurement stage when evaluating supplier models? What's the differential of models and how do you establish how 'good' their models are?"

These aren't just technical questions—they're strategic ones that will determine whether your AI implementation succeeds or fails.

Practical Implementation: What Actually Works

The Death of Prompt Engineering (Maybe)

Neil offered a provocative take: "I feel like prompting might be on the way out, at least temporarily." Rather than crafting massive, verbose prompts trying to accomplish everything in one go, he's seeing better results from breaking tasks down into smaller, individual components.

"We've had a lot of joy using AI to essentially take on a review role. You can use AI or a human to write an item, but then have a separate review where you're doing specific things, and that seems to work really effectively."

This modular approach—chaining together different models for different jobs—offers flexibility and better outcomes than the "one prompt to rule them all" strategy.

Temperature Settings and Non-Determinism

Karl shared his team's evolution: "We used Temperature 0.4, but as more advanced models came along, moved it to 0.7 when we created personalised assessments—modern AI follows instructions better."

The goal? "You want creativity, but don't want hallucinations."

He offered a practical tip: "Ask an LLM a dad joke and see how many different ones you get." This simple test reveals the non-deterministic nature of the system—a critical characteristic that too many users overlook.

As Karl emphasised: "Test your prompts, run them ten times. And if they're all different, your prompt is not very good. Test, test, test."

Defining Quality: A Fundamental Shift

Perhaps the most transformative insight came from Karl's reflection on the entire quality assurance process: "I recommend really spending time defining what good looks like. We've got so used to correcting items, not defining it."

Instead of focusing on fixing bad questions, he advocates for articulating what makes a question excellent. This shifts the entire paradigm from reactive correction to proactive creation—and it resonated strongly with participants.

As one noted: "I like this idea—articulate why this question is good rather than trying to fix the bad ones. I will try this."

This seemingly simple shift has profound implications. When you can clearly define quality, you can:

Train AI systems more effectively
Give SMEs better guidance
Create more consistent review criteria
Build better validation processes

The Regulatory Question

An important question emerged: "Do any regulators/accreditation organisations block the use of AI to create items?"

While the panel didn't provide specific examples, Miia's earlier points about evolving legal frameworks suggest this is a landscape in flux. Some organisations may have policies in place; others are still developing them. The California Bar exam incident mentioned in the chat suggests that perception and trust issues exist even where formal blocks don't.

This underscores the importance of transparency, documentation, and quality assurance processes that can withstand scrutiny—whether from regulators, accreditors, or the public.

Key Takeaways: The Emerging Consensus

As the conversation drew to a close, several key principles emerged:

1. AI-Forward, Not AI-First
Tim referenced Play 3 of his AI Adoption Playbook: be AI-forward, not AI-first. The human must remain in the loop, but that doesn't mean resisting AI—it means thoughtfully integrating it.

2. SMEs Remain Gold
Multiple panelists and participants emphasised this point. Karl's declaration—"SMEs are gold!"—captured the sentiment perfectly. Make them more effective, not obsolete.

3. Test Everything
From non-deterministic outputs to difficulty calibration to security vulnerabilities, the testing mindset must be relentless.

4. Define Quality First
Before you start correcting AI outputs, define what excellence looks like. This clarity transforms every downstream process.

5. Stay Adaptable
As Neil noted, we might look back on today's advice in twelve months and find it completely outdated. Build with flexibility in mind.

6. Be Rationally Optimistic
Understand what LLMs can actually do based on their training, not what we hope they can do. Set expectations accordingly.

What the Community Needs

Participants were clear about what outputs would help them most:

Guides and playbooks with practical best practices
Model evaluation frameworks for procurement decisions
Quality definitions and assessment criteria
Case studies from real implementations
Technical guidance on working with LLMs
Process maps showing where AI helps most (and least)

This is exactly what the Research Sprint aims to deliver.

Introducing the Research Sprint: Your Opportunity to Shape the Future

Here's where you come in.

This webinar wasn't just a one-off conversation—it was the launchpad for something bigger. The Test Community Network Research Sprint on AI item authoring is now open, and we need your voice in it.

What Is a Research Sprint?

Think of it as a collaborative, asynchronous knowledge-building exercise. It's a dedicated platform where the assessment community comes together to:

Share experiences using AI to create items to date
Explore technical perspectives of working with LLMs
Document educational processes and how AI can help at all stages
Ask questions you're grappling with
Learn from peers facing similar challenges
Co-create practical guidance that the whole community can use

Unlike a traditional webinar that ends when the session closes, the Research Sprint continues. You contribute when it suits you. You engage with the insights others share. The platform uses AI to anonymously aggregate insights and identify patterns, creating a living knowledge base that grows richer with each contribution.

Who Should Get Involved?

Everyone.

Whether you're:

Just starting to explore AI authoring and have questions
Already implementing AI tools and have lessons to share
Researching this space and want to connect with practitioners
Concerned about the implications and want to understand risks
Excited about the possibilities and want to push boundaries
Evaluating vendor solutions and need decision frameworks
Working on policy or governance and need real-world context

Your perspective matters. Your questions matter. Your experiences—successful or otherwise—matter.

What Will We Explore Together?

The sprint is intentionally broad because AI touches every stage of the item development lifecycle:

Technical dimensions:

Which models work best for which tasks?
How do you configure temperature and other parameters?
What prompting strategies actually deliver results?
How do you chain models together effectively?
What validation approaches catch AI errors?

Educational and psychometric dimensions:

How do you maintain construct validity?
What review processes work when AI is the author?
How do you calibrate difficulty?
Where does human expertise remain non-negotiable?
How do you train SMEs to work effectively with AI?

Legal and ethical dimensions:

What should contracts with AI vendors include?
How do you handle copyright and ownership?
What disclosure obligations exist?
How do you ensure fairness and avoid bias?
What governance frameworks are needed?

Practical implementation dimensions:

How do you build business cases for AI adoption?
What change management approaches work?
How do you measure ROI?
What procurement criteria matter most?
How do you scale responsibly?

The Output: Practical Guidance Built By the Community, For the Community

At the end of the sprint, we'll synthesise all the contributions into practical guidance that serves the assessment community. Not academic papers locked behind paywalls. Not vendor marketing materials. But real, actionable, community-validated guidance.

Imagine having access to:

A comprehensive playbook for AI item authoring implementation
Model evaluation frameworks based on real procurement experiences
Quality definition templates you can adapt for your context
Risk assessment checklists covering legal, security, and educational dimensions
Case studies showing what worked (and what didn't)
Technical guidance on LLM selection and configuration

This is what we're building together.

The Urgency Is Real

Let's return to where we started: AI tools are already authoring test items in classrooms and examination halls worldwide.

The question isn't whether to engage with this technology—that ship has sailed. The question is whether we'll engage thoughtfully, collaboratively, and with the rigour this moment demands.

Every day that passes without clear guidance is another day where organisations make decisions in isolation, potentially repeating mistakes others have already made. Another day where students encounter items of uncertain quality. Another day where legal and security vulnerabilities go unaddressed.

We have an opportunity—right now—to shape this future together. To pool our knowledge, share our experiences, and build something greater than any one organisation or expert could create alone.

The Research Sprint is that opportunity.

Join Us

Don't let this moment pass.

Whether you have answers or questions, successes or failures, excitement or concerns—your contribution matters.

Join the Research Sprint: https://research.testcommunity.network/

Listen to the full webinar: Spotify

Download the AI Adoption Playbook: https://testcommunity.network/landing/aiplaybook

Explore Karl's Research: https://concove.ac.nz/discovery-hub/ai-generated-assessments-for-vocational-education-and-training/

The assessment community is at a crossroads. Let's navigate it together.

Let's build the guidance we all need.

Let's ensure that when we look back on this moment, we can say we got it right.

The Research Sprint is live. The conversation has started.

Will you add your voice?

Special thanks to our expert panelists—Sergio Araneda, Karl Hartley, Miia Nummela-Price, and Neil Wilkinson—and to everyone who participated in the webinar discussion. Thanks also to Stuart Martin for his support and to Vretta for sponsoring this episode.