
We Built a Model That Writes Better Multiple Choice Questions. Here’s the Evidence.

Written by Will Cummings | Dec 26, 2025 10:14:15 PM

Writing good multiple-choice questions (MCQs) is well studied, and well known to be difficult. Decades of item-writing research document common flaws: cueing, implausible distractors, ambiguity, and questions that reward test-taking strategies rather than understanding. Out-of-the-box LLMs make many of the same errors as human question writers, so we built a fine-tuned question-writing model designed specifically to avoid these structural errors.

Background

Previous research has identified a small, recurring set of item-writing flaws (IWFs) that systematically weaken MCQ validity (https://doi.org/10.1007/s10459-004-4019-5, arXiv:2503.10533). Common examples include:

  • Cueing effects, where the correct answer stands out (e.g., it is longer, more specific, or more precise than distractors)

  • Implausible or heterogeneous distractors that can be eliminated without understanding the content

  • Absolute terms (“always,” “never”) that unintentionally signal correctness or incorrectness

  • Grammatical or semantic mismatches between the stem and answer choices

  • Construct-irrelevant difficulty, where surface features—not reasoning—drive item difficulty

These flaws matter because they allow students to answer correctly using test-taking strategies rather than the targeted knowledge or skill. Even a single cueing flaw can meaningfully distort item difficulty and discrimination, which is why standard item-writing guidelines emphasize avoiding them altogether.
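To make one of these flaws concrete, here is a minimal sketch of how the absolute-terms flaw could be flagged automatically. The term list and the sample options are illustrative assumptions on our part, not the actual rubric code used in this analysis:

```python
import re

# Illustrative only: flag answer choices containing absolute terms,
# one of the item-writing flaws listed above.
ABSOLUTE_TERMS = {"always", "never", "all", "none", "only", "every"}

def flag_absolute_terms(options: list[str]) -> list[str]:
    """Return the options that contain an absolute term as a whole word."""
    flagged = []
    for option in options:
        words = set(re.findall(r"[a-z']+", option.lower()))
        if words & ABSOLUTE_TERMS:
            flagged.append(option)
    return flagged

if __name__ == "__main__":
    options = [
        "Photosynthesis always occurs in the dark.",
        "Photosynthesis converts light energy into chemical energy.",
        "Plants never use carbon dioxide.",
        "Chlorophyll absorbs mostly green light.",
    ]
    print(flag_absolute_terms(options))
    # Flags the first and third options.
```

Checks like this only catch surface-level flaws; subtler ones, such as implausible distractors or construct-irrelevant difficulty, require judgment rather than pattern matching.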

How AI Models Stack Up on IWFs

We fine-tuned a model to avoid some of these flaws when writing multiple-choice questions. We compared it with two other commercial models of similar baseline capability and price, evaluating each model's questions against a rubric that identifies IWFs.

To compare question quality across models, we evaluated both average scores and quality thresholds that align with what a teacher would care about in real use.

The first chart shows average quality (zero-to-one, higher is better). Our model scores higher overall, but averages only tell part of the story.

The second chart looks at quality thresholds: how often each approach produces questions that meet higher standards (≥75%, ≥90%, and flawless). As thresholds rise, baseline approaches drop off quickly. Our model maintains quality much more consistently.
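For readers who want the two views of the data spelled out, here is a minimal sketch of how an average score and the threshold pass rates can be computed from per-question rubric scores. The scores in the example are placeholders for illustration, not our evaluation data:

```python
from statistics import mean

def summarize(scores: list[float]) -> dict[str, float]:
    """Summarize per-question quality scores (each in [0, 1])."""
    n = len(scores)
    return {
        "average": mean(scores),
        "pct_ge_75": sum(s >= 0.75 for s in scores) / n,
        "pct_ge_90": sum(s >= 0.90 for s in scores) / n,
        "pct_flawless": sum(s == 1.0 for s in scores) / n,
    }

if __name__ == "__main__":
    # Placeholder scores for illustration only.
    scores = [1.0, 0.9, 0.8, 1.0, 0.6, 0.95, 1.0, 0.7]
    print(summarize(scores))
```

The threshold view is what separates the models: two systems can have similar averages while one produces far more items that clear a high bar.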

That consistency matters. In practice, teachers need reliably sound items, not occasional successes.

We can isolate and correct specific item-writing flaws. For example, LLM-based systems are inclined to write conspicuously long correct answers. These charts show how our model significantly reduces the rate of this flaw (lower is better):
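As a rough illustration of how this particular flaw can be detected automatically, here is a minimal sketch in Python. The 1.5x length ratio and the sample item are our own illustrative assumptions, not QuestionWell's production check:

```python
def has_long_answer_cue(correct: str, distractors: list[str], ratio: float = 1.5) -> bool:
    """Flag an item if the correct answer is much longer (in words) than
    the average distractor; the 1.5x ratio is an illustrative threshold."""
    correct_len = len(correct.split())
    avg_distractor_len = sum(len(d.split()) for d in distractors) / len(distractors)
    return correct_len > ratio * avg_distractor_len

if __name__ == "__main__":
    item = {
        "correct": "Because the cell membrane is selectively permeable, "
                   "only certain molecules can pass through it without assistance.",
        "distractors": [
            "Because the cell is too small.",
            "Because water repels all molecules.",
            "Because the nucleus blocks transport.",
        ],
    }
    print(has_long_answer_cue(item["correct"], item["distractors"]))  # True
```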

Limitations. This analysis is based on a relatively small sample (50 outputs per model) of English-language MCQs evaluated against our internal IWF rubric.

QuestionWell's proprietary in-house AI platform gives us fine-grained control over model outputs, along with ongoing quality control to measure improvements and guard against regressions. This lets us push the state of the art and continuously deliver higher-quality outputs for our customers. This new model will be available to ALL QuestionWell users in the coming weeks, at no extra charge.