Reliability, Validity and Fairness
Mary Pitoniak on the Enduring Pillars of Good Assessment
AI is the hot topic at every assessment conference this year. Personalised testing, automated scoring, generative item writing — the possibilities feel endless. But in a conversation that kept coming back to first principles, veteran psychometrician Mary Pitoniak offered a gentle reminder: the technology changes, but the questions we need to ask about a test do not.
Mary recently retired after thirty-nine years in assessment, twenty-four of them at ETS, and has now founded Pitoniak Educational Measurement to continue training and consulting. Alongside her long-time collaborator Linda Cook, she was also named one of the first women editors of Educational Measurement, now in its fifth edition — work that just won the National Council on Measurement in Education's Award for Exceptional Achievement in Educational Measurement.
"Reliability is consistency of measure. Validity, do scores reflect what you're trying to measure? Fairness, is there a level playing field for all test takers?"
— Mary Pitoniak, President, Pitoniak Educational Measurement
That triad — reliability, validity and fairness — is the backbone of the entire field. Mary is quick to acknowledge that condensing ideas still contested by the field's top theorists into plain terms is harder than defending a doctoral dissertation. But her plain-English summary lands. Reliability, she explains, sets the ceiling: an unreliable test cannot be valid, because scores that fluctuate at random cannot consistently reflect anything. And fairness isn't just a nice-to-have bolted on at the end — it is woven through the entire question of whether the scores mean what we say they mean.
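Mary's gloss of reliability as "consistency of measure" has a standard quantitative counterpart in internal-consistency indices such as Cronbach's alpha, which compares the variance of individual items to the variance of the total score. A rough sketch, not from the episode, using entirely made-up scores:

```python
# Cronbach's alpha: a common internal-consistency (reliability) index.
# The scores below are invented purely for illustration.
from statistics import pvariance

def cronbach_alpha(items):
    """items: one list of scores per item, each list covering the same
    test takers in the same order."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each person's total
    item_variance = sum(pvariance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_variance / pvariance(totals))

# Five hypothetical test takers, three items scored 0-5.
items = [
    [4, 2, 5, 1, 3],
    [4, 3, 5, 2, 3],
    [5, 2, 4, 1, 4],
]
print(round(cronbach_alpha(items), 3))
```

Values near 1.0 indicate that the items hang together consistently; low values suggest the test could not support valid score interpretations no matter how the items read.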
One of the most useful corrections Mary offers is about the language we all slip into. It's tempting to ask whether a test is "valid". But that's not quite the right question.
"Are the interpretations of the test score for the proposed use appropriate? So that's really what you should say, but we all lapse on that."
— Mary Pitoniak, President, Pitoniak Educational Measurement
The distinction matters. A test that works beautifully for one purpose can be wholly inappropriate for another. And fairness is bound up with this too: a vocabulary-dense word problem used to assess arithmetic may end up measuring a child's English rather than their mathematics. The subtle cases — not the obvious ones — are where bias creeps in. Methods such as differential item functioning (DIF) analysis have been developed precisely to flag items that may unfairly advantage or disadvantage particular groups.
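One widely used DIF screen is the Mantel-Haenszel procedure, which compares a reference and a focal group's odds of answering an item correctly after matching test takers on overall ability. A minimal sketch — the method is standard, but the counts below are hypothetical:

```python
# Mantel-Haenszel common odds ratio for screening one item for DIF.
# Groups are first matched on overall test score (the strata); the
# counts used here are invented for illustration.

def mantel_haenszel_odds_ratio(strata):
    """strata: list of (ref_correct, ref_wrong, foc_correct, foc_wrong)
    tuples, one per matched ability level."""
    num = den = 0.0
    for a, b, c, d in strata:   # a, b = reference group; c, d = focal group
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den

# Three ability strata for one item (hypothetical counts).
strata = [
    (20, 10, 18, 12),   # low scorers
    (30, 5, 27, 8),     # mid scorers
    (40, 2, 38, 4),     # high scorers
]
ratio = mantel_haenszel_odds_ratio(strata)
# A ratio near 1.0 suggests no DIF; here it is above 1, meaning the
# item favours the reference group even among equally able test takers.
```

The point of matching on ability first is that a genuine difficulty difference is not bias; DIF looks for items that behave differently for equally able groups.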
The conversation turned naturally to personalisation. Could AI finally deliver the long-promised dream of assessments tailored to each test taker — their language, their cultural context, their prior knowledge? Mary sees real potential, and real roots for the idea in the sociocultural reckoning that followed the murder of George Floyd in the US, which prompted harder questions about who tests have historically been built for.
"Standardisation of tests is really standardisation to the dominant culture."
— Mary Pitoniak, President, Pitoniak Educational Measurement
But enthusiasm has to be tempered by stakes. A personalised formative assessment used to guide classroom instruction is one thing. A personalised licensure exam is quite another. Would you feel comfortable learning that your doctor had been assessed on test questions customised to their background? Did those questions cover the full domain? What happens when the doctor practises in an environment very different from the one they were tested for? Mary's answer, pragmatic as ever, is that personalisation is not yet ready for prime time in high-stakes contexts — though the research should absolutely continue.
The same calibrated scepticism runs through her view of generative AI more broadly. ETS has used automated scoring for decades, and AI in assessment is not new. What is new is abundant access — and with it, a rush to apply the technology to every problem in sight.
"It's like giving the keys to a Maserati to someone who barely knows how to drive."
— Mary Pitoniak, President, Pitoniak Educational Measurement
Her worry is not the technology itself but the combination of powerful tools and organisations without psychometric expertise. A vendor offering to build you an assessment with no psychometricians on staff and no test development process can cheerfully administer a very unfair test that affects people's lives. The remedy isn't to pull up the drawbridge on innovation — it's to keep anchoring back to fairness, validity and reliability, and to ask the harder, more boring questions about why we're doing any of this in the first place.
Mary's next chapter is all about meeting people where they are. Through Pitoniak Educational Measurement she is continuing to offer training, standard setting and process evaluation for testing organisations around the world — translating deep technical expertise into something accessible, which, as Tim notes at the end of the episode, is a gift the industry quietly needs more of.
Read the book: Educational Measurement (Fifth Edition) — Edited by Linda L. Cook and Mary J. Pitoniak (Oxford University Press, open access)
Connect with Mary Pitoniak on LinkedIn: linkedin.com/in/mary-pitoniak-3560a691