Multiple choice tests: why you shouldn’t panic

Many undergraduate students in the social and life sciences go through 4 or more years of university education utterly convinced that multiple choice exams are Satan’s favorite testing format. Drawn up by diabolical, sadistic demons (sometimes termed “professors”), questions on multiple choice exams are invariably ambiguous, unfair, and out for (the student’s) blood. Personally, I have my own vivid and unpleasant memories of the teeth-gnashing, expletive-laden tirades I went through not so very long ago whenever I received an exam back with questions marked wrong that I felt I should have received credit for. But now that I’m an older and marginally wiser graduate student with several statistics and research methods classes under my belt, I appreciate what I couldn’t back then: there’s nothing wrong with multiple choice exams (most of the time!). Multiple choice exams are fine. They’re better than fine–they’re great. The problem isn’t the exams; it’s that no one ever bothers to explain the logic of the format to students at a point in time when it actually matters (e.g., at the beginning of the semester, before the first exam).

Now that I’m in the position of having to grade students’ multiple choice exams and explain their mistakes to them during office hours, I often find myself wishing I had a concise explanation as to why they really shouldn’t feel bad about getting Question Number 26 wrong, and why it’s still a perfectly good question even if they felt the wording was ambiguous. There are plenty of guides about how to take multiple choice exams floating around on the web, but what I’m after is a Damage Control Guide explaining how to defuse tension associated with students’ perceptions that they got screwed over on the last test. So rather than wait around indefinitely, I thought I’d write one, in the hopes others might find it useful.

The overarching point students need to understand and accept about multiple choice exams is that they are almost always made up of mostly bad questions, and that this is in fact mostly a good thing. By ‘mostly’ bad I mean that almost any question on a multiple choice exam is going to be ambiguous to some degree. Wording that seems crystal clear to one student is going to seem horribly vague to another; a question to which one student thinks B is unambiguously the right answer may confuse and anger another student, who thinks B, C, and D are all perfectly acceptable answers based on what the textbook says. Ideally, of course, such ambiguity shouldn’t be so pervasive as to completely paralyze and perplex the majority of students taking a test. However, some measure of ambiguity and even outright error is unavoidable.

It also turns out not to be a very big deal. It can be demonstrated mathematically that even a multiple choice test made up of mostly bad questions can still provide a very good measure of students’ knowledge of the tested material, provided that (a) there’s at least a weak correlation between students’ scores on individual questions and their overall knowledge, and (b) there are enough questions on the exam.

In practice, both of these numbers can usually be surprisingly modest. The reliability of a measure (or multiple choice test) is most commonly estimated using Cronbach’s alpha, which, in one form, allows us to compute a reliability coefficient as a function of two quantities: the number of items (or questions) on the test, and the average correlation between items. The formula is as follows:

$$\alpha = \frac{N \cdot r}{1 + (N - 1) \cdot r}$$

Here N is the number of items and r is the average inter-item correlation. Given this formula, it’s easy to estimate the reliability of a hypothetical test. For example, a test with 30 questions and an average inter-item correlation of only .2 (equivalent to an average of only 4% shared variance between items!) will have a reliability coefficient of .88. In general, anything over .85 or so is considered good, so even by creating a test with only 30 questions and weakly inter-correlated items, an instructor can end up with a very reliable test. Given that grades are typically derived from more than one test, the reliability of students’ overall grades will generally increase further. Moreover, if you were to increase the number of items on a given test to 90, reliability jumps to .96, or near perfect.
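
For anyone who wants to check these numbers, here is a minimal Python sketch of the calculation. The function names (`standardized_alpha`, `pearson_r`, `mean_inter_item_r`) and the toy response data are my own illustrative choices, not part of any standard package. It assumes each question is scored 1 (correct) or 0 (wrong), and that the average inter-item correlation is simply the mean Pearson correlation over every distinct pair of questions.

```python
from itertools import combinations
from math import sqrt


def standardized_alpha(n_items, mean_r):
    """Reliability from the number of items and the average inter-item correlation."""
    return (n_items * mean_r) / (1 + (n_items - 1) * mean_r)


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of item scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if every student answered an item identically.
    return cov / sqrt(var_x * var_y)


def mean_inter_item_r(items):
    """Average Pearson r over every distinct pair of items.

    items[i][s] is student s's score (1 = correct, 0 = wrong) on question i.
    """
    pairs = list(combinations(items, 2))
    return sum(pearson_r(x, y) for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    # The two cases worked through in the text.
    print(round(standardized_alpha(30, 0.2), 2))  # 0.88
    print(round(standardized_alpha(90, 0.2), 2))  # 0.96

    # Hypothetical toy data: 3 questions answered by 6 students.
    toy_items = [
        [1, 1, 0, 1, 0, 0],
        [1, 0, 0, 1, 1, 0],
        [1, 1, 0, 0, 1, 0],
    ]
    r = mean_inter_item_r(toy_items)
    print(r, standardized_alpha(len(toy_items), r))
```

With real exam data you would pass in one list of 0/1 scores per question; an item that every student got right (or wrong) has no variance and would need to be dropped before computing correlations.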

Note that because an average inter-item correlation of .2 is pretty low, the above calculation essentially gives instructors a free pass to have several bad questions on each exam. The net effect of poorly wording a question is to reduce its ability to correlate with other questions, because whether or not a student gets a bad question right depends on chance rather than knowledge. So stronger students are no more likely to get a bad question right than are weaker students. Just how many bad questions one can afford to have on a test depends on how inter-correlated the good questions are; but it’s easy to see that even on a test of 30 questions with an average inter-item correlation of .2, having 4 or 5 questions that are completely uncorrelated with the rest of the test would have relatively little impact on the overall reliability of the test. And since reliability increases as a function of number of items, any concern about the drop can easily be offset by adding another 10 or 20 items.
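
To put a rough number on “relatively little impact”, here is a small sketch under one particular set of assumptions, which are mine rather than anything the formula dictates: the good questions correlate .2 with one another, and each bad question correlates exactly 0 with every other question. Under those assumptions the bad items simply dilute the average inter-item correlation. The helper name `diluted_mean_r` is hypothetical.

```python
def standardized_alpha(n_items, mean_r):
    """Reliability from the number of items and the average inter-item correlation."""
    return (n_items * mean_r) / (1 + (n_items - 1) * mean_r)


def diluted_mean_r(n_good, n_bad, r_good):
    """Average inter-item r when bad items correlate 0 with everything.

    Assumption: only good-good pairs contribute (each at r_good); every pair
    involving a bad item contributes 0 to the average.
    """
    n_total = n_good + n_bad
    good_pairs = n_good * (n_good - 1) // 2
    all_pairs = n_total * (n_total - 1) // 2
    return r_good * good_pairs / all_pairs


if __name__ == "__main__":
    # 30 questions, 5 of them pure noise.
    r_30 = diluted_mean_r(25, 5, 0.2)
    print(round(standardized_alpha(30, r_30), 2))  # 0.83

    # Same 5 noisy questions, plus 10 additional good ones (40 in total).
    r_40 = diluted_mean_r(35, 5, 0.2)
    print(round(standardized_alpha(40, r_40), 2))  # 0.88
```

So in this simplified scenario, five worthless questions pull reliability from .88 down to about .83, and adding ten more decent questions brings it back to about .88.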

Of course, all of this may initially seem like mumbo-jumbo to an irate student who feels they were mortally wronged by ambiguous wording on one or two questions. But it’s useful to explain nonetheless, because students who understand the logic will not only complain less, making your life easier, but will also have a more pleasant college experience, since they won’t spend four or more years feeling persecuted by malevolent instructors.

Having said all of this, there are a couple of important caveats, and one shouldn’t just conclude that any reasonably well-thought-out multiple choice test is acceptable for class use. First, bad exam questions (even when there are only a few) do present a genuine problem for a small minority of students, namely those whose performance is at ceiling. If you’re a student who would have performed perfectly on a test made up of clear, relevant, and unambiguously worded questions, the inclusion of bad items can only hurt you, since you have nowhere to go but down. In contrast, students who score lower in the distribution, say, around 75%, have little to complain about, since it’s entirely possible for their score to increase due to the inclusion of bad questions. Students who score near the bottom would actually experience a beneficial effect, with noise generally increasing their scores. But since the distribution of scores is almost always top-heavy in academic settings (more people pass than fail!), the overall net effect of unreliability is to shift the distribution of scores slightly downwards. In most cases this isn’t a problem since most instructors implicitly account for this (e.g., by making some exams ‘easy’ in order to shift scores upwards), but it’s worth keeping in mind anyway. Even if the reliability of your test is very high, it may still make sense to throw out the worst questions in order to prevent a systematic slip in the distribution.

A second and more important concern is that establishing that a test is reliable doesn’t necessarily mean it’s a valid measure of students’ learning. A reliable test is simply one that measures the same thing consistently. Nothing about the reliability coefficient tells you what that thing is. There are lots of things you could measure consistently in student populations that have little or nothing to do with the course material you’re teaching. For example, if you like to write extremely tricky multiple choice questions that require students to perform rigorous exercises in logic (e.g., “the answer can’t be A, because only one of these answers is right, and A entails that B is true as well”), you may well end up with highly reliable tests. However, these tests may not be valid measures of students’ knowledge of, say, organic chemistry or developmental psychology, because in effect, by turning your exams into an exercise in logic, you’ve loaded the ability to reason abstractly into your questions. In other words, what determines whether students do well on your exams may turn out to be their general level of fluid intelligence, and not the degree to which they’ve studied and assimilated the material. So while an unreliable test is always a lousy test, a reliable test may still be a lousy test. The ability to easily calculate Cronbach’s alpha isn’t an excuse to stop worrying about what your exams are testing for. But it does let you establish that wording problems or ambiguity on some questions don’t have much of an impact on your overall ability to measure students’ performance.

2 Responses to “Multiple choice tests: why you shouldn’t panic”

  1. Computational Neuroscience (and Programming) Blog » Blog Archive » Examens à choix multiples/Multiple Choice Tests

    […] An interesting post on multiple choice exams. […]

  2. Bob Lucas

    Dear Professor,

    I loved your article; you actually wrote it in plain English! My only question is, and this is a selfish one, how does one actually calculate a value for inter-item correlation?

    Weak on statistics,

    Bob Lucas
