
Item writing flaws contaminated one of the most widely used AI datasets

Written by Will Cummings | Jan 22, 2026 9:41:52 PM

Much like how teachers evaluate human students, AI researchers put models through their paces using suites of tests called benchmarks. Benchmarks can be complex, like SWE-bench, which grades models on how well they fix bugs in a large piece of software, or as simple as question/answer pairs. There are benchmarks for math, science, abstract reasoning, playing chess, and all sorts of other things.

As discussed in my previous blog post, We Built a Model That Writes Better Multiple Choice Questions (if you want more background on item writing flaws, start there):

We noticed that the LLMs we were using were making many of the same mistakes that humans made while writing multiple choice questions based on a text.

As an example of how AI datasets can be poisoned by item writing flaws, consider MMLU-Pro, a public benchmark that poses multiple choice questions to the LLM across a number of domains. Eric Tramel, a research scientist at NVIDIA, recently discovered this dataset was poisoned by an obvious cueing error: correct answers are consistently preceded by a space!
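To see how easy this cue is to detect, here's a minimal sketch that counts how often the option carrying a leading space is also the correct one. It assumes the Hugging Face datasets library and the TIGER-Lab/MMLU-Pro release, whose rows (as I understand the schema) include an options list and an answer_index field:

```python
# Sketch: how often is the option with a leading space the correct answer?
# Assumes the TIGER-Lab/MMLU-Pro dataset on Hugging Face, with "options"
# (list of answer strings) and "answer_index" (index of the correct option).
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

hits = total = 0
for row in ds:
    # Indices of options that start with a space
    spaced = [i for i, opt in enumerate(row["options"]) if opt.startswith(" ")]
    if len(spaced) == 1:  # only count questions where the cue is unambiguous
        total += 1
        if spaced[0] == row["answer_index"]:
            hits += 1

print(f"Leading-space option is correct {hits}/{total} times")
```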

We can see that selecting the answer preceded by a space doubles the number of correct answers compared to random selection in Math, Physics, and Chemistry. Another user, Peter Barnett, a researcher at MIRI (the Machine Intelligence Research Institute), points out another cueing problem that may be more familiar to educators: the longest answer is correct.
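A companion sketch for the length cue scores the "always pick the longest option" baseline per domain (same dataset assumptions as above; I'm taking category to be the domain field in that release):

```python
# Sketch: score the "always pick the longest option" baseline on MMLU-Pro,
# broken out by category. Same dataset/schema assumptions as the previous snippet.
from collections import defaultdict

from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

correct = defaultdict(int)
total = defaultdict(int)
for row in ds:
    # Guess the index of the longest option string
    guess = max(range(len(row["options"])), key=lambda i: len(row["options"][i]))
    total[row["category"]] += 1
    correct[row["category"]] += int(guess == row["answer_index"])

for cat in sorted(total):
    print(f"{cat:25s} {correct[cat] / total[cat]:.1%}")
```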

Naively choosing the longest answer every time provides a similar boost in performance, this time across all domains in the benchmark. If these types of errors persist even in popular and well regarded public benchmarks, it's likely they're also pervasive in the training data. Research such as Changing Answer Order Can Decrease MMLU Accuracy (arXiv:2406.19470) has shown that LLMs learn shortcuts to cheat on benchmarks like this. The benchmark cannot differentiate between a model that truly knows the correct answer to an important biology question and one that has merely learned to select the conspicuously long answer.
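One cheap way to probe for this kind of position sensitivity, in the spirit of that paper though not its exact protocol, is to permute each question's options and remap the correct index before evaluating:

```python
# Sketch: shuffle a question's options and remap the correct answer index.
# A hypothetical helper for probing order sensitivity, not the paper's
# exact evaluation protocol.
import random

def shuffle_options(options, answer_index, rng=random):
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct option moved to wherever its old index landed
    new_answer = order.index(answer_index)
    return shuffled, new_answer
```

If a model's score moves materially when the options are re-ordered like this, it's leaning on the presentation of the question rather than its content.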

Sound familiar? This is the same problem teachers face when evaluating students with multiple choice questions.

Don't worry, we're on top of it. Learn how we fix cueing errors like "longest answer" in our models.