<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Small Gray Matters &#187; tutorials</title>
	<atom:link href="http://www.smallgraymatters.com/category/tutorials/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.smallgraymatters.com</link>
	<description>of brains and their minds</description>
	<lastBuildDate>Fri, 18 Sep 2009 01:27:48 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>why base rates matter</title>
		<link>http://www.smallgraymatters.com/2009/09/13/why-base-rates-matter/</link>
		<comments>http://www.smallgraymatters.com/2009/09/13/why-base-rates-matter/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 02:51:34 +0000</pubDate>
		<dc:creator>small and gray</dc:creator>
				<category><![CDATA[statistics]]></category>
		<category><![CDATA[tutorials]]></category>
		<category><![CDATA[adhd]]></category>
		<category><![CDATA[base rates]]></category>
		<category><![CDATA[cancer]]></category>
		<category><![CDATA[cell phones]]></category>
		<category><![CDATA[death]]></category>
		<category><![CDATA[driving]]></category>
		<category><![CDATA[stimulants]]></category>

		<guid isPermaLink="false">http://www.smallgraymatters.com/?p=56</guid>
		<description><![CDATA[Here are three recent scientific findings you may or may not have heard about:
1. The use of stimulant medications commonly prescribed for ADHD is associated with a nearly 8-fold increase in the likelihood of dying suddenly among children aged 7 &#8211; 19.
2. Gum disease increases the risk of head and neck cancer quite dramatically: for [...]]]></description>
			<content:encoded><![CDATA[<p>Here are three recent scientific findings you may or may not have heard about:</p>
<p>1. The use of stimulant medications commonly prescribed for ADHD is associated with <a href="http://ajp.psychiatryonline.org/cgi/content/full/166/9/992">a nearly 8-fold increase in the likelihood of dying suddenly</a> among children aged 7 &#8211; 19.</p>
<p>2. Gum disease <a href="http://cebp.aacrjournals.org/content/18/9/2406.full">increases the risk of head and neck cancer</a> quite dramatically: for every millimeter of alveolar bone loss (i.e., loss of the bone that surrounds the roots of your teeth), there is a 400% increase in the risk of cancer (note: article requires paid access).</p>
<p>3. People who talk on a cell phone while driving are <a href="http://www.vtti.vt.edu/PDF/7-22-09-VTTI-Press_Release_Cell_phones_and_Driver_Distraction.pdf">1.3 times more likely to have an accident</a> than people who drive without any distractions.</p>
<p>At a cursory glance, all three of these stories seem like pretty bad news. And they are. But one of them is actually much worse than the others. Your job is to decide which one; take a moment to think about it, then read on.</p>
<p>If you&#8217;re like most people, you probably picked either the first or the second story. After all, it&#8217;s pretty terrible to think of children dying suddenly, or of getting cancer of the head and neck. Sudden death implies death for certain, and cancer implies death with a high probability. Most of us generally don&#8217;t see death as a good thing, so we want to avoid those outcomes. Car accidents aren&#8217;t anyone&#8217;s idea of a good time, of course; but at least most car accidents aren&#8217;t fatal. And then there&#8217;s the matter of the differing odds to consider: in the first story, the negative outcome is 8 times as likely, and in the second, it&#8217;s 4 times as likely, but in the third story, it&#8217;s only 1.3 times as likely. Surely then, it&#8217;s more important to avoid taking stimulant drugs and to brush and floss regularly than to worry about talking on a cell phone!</p>
<p>Well, as you might have guessed from the fact that I started the previous paragraph with &#8220;if you&#8217;re like most people&#8230;&#8221;, the truth is actually somewhat counterintuitive. The fact of the matter is that, even if the above stories are completely true (and as far as I know, they are, pending further research), turning off your cell phone when you drive is probably a much, much better way to minimize your chance of dying early than swearing off stimulants or practicing great oral hygiene (though the latter is still important!). The reason is that the information I gave you in the three stories above neglects what&#8217;s probably the most important piece of information to consider: the base rate (or frequency) of each event occurring.</p>
<p>Let&#8217;s add some context to each of the three stories. Take the first one. It&#8217;s true (at least based on one preliminary study) that kids who take stimulant medications are much more likely to die suddenly than kids who don&#8217;t. But the critical thing to consider is the base rate of sudden death. You probably won&#8217;t be surprised to hear that the odds of dying suddenly are <em>incredibly</em> low when you&#8217;re 7 &#8211; 19 years old. It&#8217;s unclear exactly how low they are, but consider that the study that reported this finding scoured state databases between the years of 1985 and 1996 and still only came up with 564 cases of sudden death. That&#8217;s a tiny, tiny, tiny fraction of the number of kids who make it past 19 years of age in good health. Suppose we say that the probability of sudden death for a kid in this age range is 0.0001% per year. An eightfold increase would mean that the average kid goes from a one in a million chance to just under a one in a hundred thousand chance of dying per year. And of course, it&#8217;s not average kids for whom stimulant medications are prescribed; usually, there&#8217;s a condition (e.g., ADHD) that the drugs are intended to alleviate. When you weigh the increase in the negligible likelihood of sudden death against the very <a href="http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1868385">sizeable benefits conferred by stimulant medications</a>, it&#8217;s clear that this finding isn&#8217;t really cause for alarm. As <a href="http://psychcentral.com/blog/archives/2009/09/01/adhd-stimulants-children-and-sudden-death/">John Grohol notes</a>, &#8220;The finding is of greater interest in trying to understand why it’s occurring at all, not for anyone to make a treatment decision based upon it.&#8221;</p>
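<p>To make that arithmetic concrete, here is a minimal Python sketch. It uses the purely illustrative base rate supposed above (0.0001% per year) together with the roughly eightfold relative risk; the function name <code>absolute_risk</code> is made up for this example.</p>
<pre>
```python
def absolute_risk(base_rate, relative_risk):
    """Absolute yearly risk once a relative risk is applied to a base rate."""
    return base_rate * relative_risk


# Hypothetical base rate from the text: 0.0001% per year, i.e. one in a million.
base_rate = 0.0001 / 100
exposed = absolute_risk(base_rate, 8)

print(f"baseline: {base_rate * 1e6:.1f} per million per year")
print(f"exposed:  {exposed * 1e6:.1f} per million per year")
print(f"extra:    {(exposed - base_rate) * 1e6:.1f} per million per year")
```
</pre>
<p>Under these assumed numbers, an eightfold relative risk works out to about seven extra cases per million children per year.</p>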
<p>What about the second story? Well, you can probably already see where this is going. Head and neck cancer is quite rare, accounting for fewer than 50,000 new cases per year in the United States. In other words, approximately one in every 6000 people will develop head and neck cancer in any given year. This of course includes both people who have good oral hygiene and people who don&#8217;t, so the reality is that, even if you have terrible oral hygiene and rampant gum disease, you&#8217;re very unlikely to ever develop head and neck cancer. Conversely, there are other factors that pose even greater risks for head and neck cancer than gum disease (e.g., smoking). This isn&#8217;t to say that you shouldn&#8217;t brush your teeth, of course; there are plenty of other good reasons to take good care of your gums. It&#8217;s just to say that you shouldn&#8217;t lose any sleep over the prospect of developing head and neck cancer because of your gums. In the grand scheme of things, there are any number of other things you should be much more concerned about.</p>
<p>One of the things you should be much more concerned about, actually, is your risk of having a car accident while talking on your cell phone. Unlike sudden death in children and head and neck cancers in adults, the odds of dying in a car accident are not very small. Worldwide, <a href="http://en.wikipedia.org/wiki/Causes_of_death">approximately 2% of deaths</a> every year are caused by road accidents. And that&#8217;s to say nothing about serious injuries sustained in non-fatal accidents. Put simply, a 1.3-fold increase in the likelihood of enduring car accidents is not trivial. If we do a back-of-the-envelope calculation and assume that the odds of <em>dying</em> in a car accident increase by the same proportion (i.e., that drivers on cell phones don&#8217;t have more serious accidents than drivers off cellphones&#8211;which is debatable), it turns out that you can reduce your overall odds of dying in any given year by about 0.6% just by not talking on your cell phone while driving. Admittedly, that&#8217;s a very loose estimate that&#8217;s based on questionable data and many simplifying assumptions. And it&#8217;s not like it&#8217;s a dramatic reduction by any stretch (which only goes to further illustrate the importance of considering base rates). But the point is, there are probably relatively few lifestyle changes you could make this year that would require so little effort for such a large benefit. So take your ADHD meds, brush your teeth regularly, and don&#8217;t talk on your cell phone while driving.</p>
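<p>For the curious, here is one simple reading of the back-of-the-envelope calculation, sketched in Python. It treats the 2% figure as the share of yearly deaths due to road accidents and asks how much extra a 1.3-fold risk would add. Whether this matches the exact calculation behind the 0.6% figure is an assumption, and the function name <code>excess_death_share</code> is made up for illustration.</p>
<pre>
```python
def excess_death_share(road_share, relative_risk):
    """Extra share of yearly deaths implied by multiplying road-accident risk by relative_risk."""
    return road_share * (relative_risk - 1)


road_share = 0.02    # about 2% of deaths are due to road accidents (figure from the text)
relative_risk = 1.3  # accident risk while on the phone (figure from the text)

extra = excess_death_share(road_share, relative_risk)
print(f"{extra * 100:.1f}% of yearly deaths")
```
</pre>
<p>The same small multiplier applied to a much larger base rate yields a far bigger absolute change than the eightfold multiplier applied to a one-in-a-million event.</p>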
<p>For a nice overview of empirical data on the base rate fallacy, see this <a href="http://www.bbsonline.org/Preprints/OldArchive/bbs.koehler.html">article in BBS</a>. For more blogospheric bloviation on base rates, see <a href="http://www.spaceandgames.com/?p=59">here</a>, <a href="http://news.bbc.co.uk/2/hi/uk_news/magazine/8153539.stm">here</a>, and <a href="http://michaelgr.com/2007/11/24/cognitive-bias-base-rate-fallacy/">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.smallgraymatters.com/2009/09/13/why-base-rates-matter/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A primer on power</title>
		<link>http://www.smallgraymatters.com/2006/12/04/a-primer-on-power/</link>
		<comments>http://www.smallgraymatters.com/2006/12/04/a-primer-on-power/#comments</comments>
		<pubDate>Tue, 05 Dec 2006 05:06:23 +0000</pubDate>
		<dc:creator>small and gray</dc:creator>
				<category><![CDATA[methodology]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[tutorials]]></category>

		<guid isPermaLink="false">http://www.smallgraymatters.com/2006/12/04/a-primer-on-power/</guid>
		<description><![CDATA[I&#8217;d like to title this post “a power primer,” but that’s the title of a 1992 Psychological Bulletin article by Jacob Cohen (the god of power analysis, now deceased). So instead I’ve titled it “a primer on power.” By changing a few words around I’ve very cleverly gone from academic plagiarism to paying homage. (And [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;d like to title this post “a power primer,” but that’s the title of <a href="http://www.education.wisc.edu/elpa/academics/syllabi/2006/06Spring/825Borman/Cohen1992.pdf">a 1992 Psychological Bulletin article by Jacob Cohen</a> (the god of power analysis, now deceased). So instead I’ve titled it “a primer on power.” By changing a few words around I’ve very cleverly gone from academic plagiarism to paying homage. (And it really is one: I think Cohen’s article, and his lengthier works on power, should be required reading for behavioral scientists of all stripes).</p>
<p>Power is one of the most misunderstood and/or underappreciated concepts in scientific research. Simply put, it refers to the probability of detecting an effect in your sample when it is in fact present in the population (i.e., when it’s ‘real’). If your study has, say, 90% power to detect a difference in the length of socks worn by basketball players as compared to soccer players, that means that <em>if there really is a difference</em> between basketball and soccer players’ sock length, there’s a 9 in 10 chance on average that you’ll be able to detect it in your sample.</p>
<p>In general, power is a good thing, and you want to have as much of it as you can. In an ideal world, scientific experiments would have 100% power to detect effects. Unfortunately, that doesn’t happen in the real world, because to have 100% power (i.e., complete certainty), you’d need to sample the entire population of interest, which isn’t very practical (that’s a lot of players, and twice as many socks). In practice, researchers’ sample sizes are constrained by resource considerations. And so, as a result, is power. Any time you conduct an experiment with a finite sample, you’re taking the risk that you might miss an effect even if it really does exist, simply because of blind (mis)fortune. And in general, the smaller your sample, the greater the probability of you missing an effect. This idea is intuitive enough to most people: it seems pretty obvious that if you want to know whether men are taller than women, you don’t want to base your judgment on the difference in height between just one man and one woman. If you did, you&#8217;d run the risk that you just happened to pick a particularly short man and/or a particularly tall woman. The more men and women you measure, the more the random variations from the mean average out, and the smaller the odds of mistakenly concluding that there’s no gender difference in height.</p>
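<p>A quick simulation makes the point about sample size tangible. The Python sketch below is only illustrative: the 5-inch mean difference and 3-inch standard deviation are made-up numbers, the test is a simple two-sided z-test with a known standard deviation, and <code>simulated_power</code> is a hypothetical helper name.</p>
<pre>
```python
import random
import statistics


def simulated_power(n_per_group, mean_diff, sd, n_sims=2000, seed=0):
    """Fraction of simulated studies in which a two-sided z-test (5% level) detects the difference."""
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
    std_err = sd * (2 / n_per_group) ** 0.5  # standard error of the difference in means
    hits = 0
    for _ in range(n_sims):
        # Heights are expressed relative to the women's mean, so women are centered on 0.
        men = [rng.gauss(mean_diff, sd) for _ in range(n_per_group)]
        women = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        z = (statistics.fmean(men) - statistics.fmean(women)) / std_err
        if abs(z) > z_crit:
            hits += 1
    return hits / n_sims


for n in (1, 5, 10, 30):
    print(n, simulated_power(n, mean_diff=5.0, sd=3.0))
```
</pre>
<p>With one man and one woman the difference is missed most of the time, even though it is real in every simulated study; with a few dozen of each it is almost never missed.</p>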
<p>Where confusion starts to set in (and the impetus for this post) is that the intimate link between sample size and power often leads people (including many scientists) to suppose that there’s a single ‘right’ sample size for all research studies of a particular kind. It’s not uncommon to hear people say things like, “we can&#8217;t trust that study because it&#8217;s based on only 50 people! They need at least 300 to be able to say anything meaningful about the general population!” (Actually this sort of statement also betrays another kind of confusion that relates to the difference between Type I and Type II errors, but that’s a separate issue). The problem is that statistical power depends not only on sample size, but also on two other numbers: the size of the effect, and the stipulated false positive rate (also referred to as alpha, or the Type I error rate).</p>
<p>The importance of the first of these—effect size—is easy to see intuitively. Suppose that the average height difference between men and women was 2 feet rather than several inches. How hard would it be to detect that difference and conclude it exists? Not very. A group of curious alien taxonomists wouldn’t need to abduct very many humans before they figured the gender difference out, simply because the vast majority of men would be taller than the vast majority of women, and the difference would hit the aliens right between the antennae. On the other hand, if the mean height difference was only 1/10th of an inch, our aliens would need to abduct a lot of humans and measure them very carefully before they’d be in a good position to claim that a height difference exists. Simply put, if the effect you’re looking for is large, it takes fewer subjects in order to detect it. Or, more formally, one’s power to detect an effect increases with the magnitude of the effect, when holding sample size constant.</p>
<p>The second parameter, false positive rate, is somewhat less intuitive. The basic idea is that, since sampling is random and error necessarily creeps in, on rare occasions, researchers are going to end up concluding that an effect exists in the population even though it doesn’t really. Just how often such errors occur is typically a matter of stipulation: scientists will decide that they can accept a false positive occurring, say, 1 out of every 20 times, and adjust their statistical tests accordingly. Conventionally, the false positive rate is set to 5% (and significance tests are therefore conducted at p < .05). Because the convention is so strong, it’s often easy to overlook the false positive rate in power calculations and just default to the standard 5% level. Nonetheless, there is a relationship: the more conservative your statistical test (i.e., the smaller the false positive rate you're willing to accept is), the lower your power gets. In less technical terms, it's kind of like saying that if you only want to be <em>fairly</em> sure that an effect holds true, you don&#8217;t need to look very hard. But if you want to be <em>really</em> sure, you need to double and triple-check to make sure. And double and triple-checking requires more observations (i.e., more subjects.)</p>
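<p>The way sample size, effect size, and false positive rate jointly determine power can be sketched with a standard normal approximation. The Python snippet below assumes a two-sided, two-group comparison with equal group sizes; <code>approx_power</code> is a hypothetical helper, and the specific inputs are arbitrary illustrations rather than values from any real study.</p>
<pre>
```python
from statistics import NormalDist


def approx_power(n_per_group, effect_size_d, alpha=0.05):
    """Approximate power of a two-sided, two-group z-test (normal approximation)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size_d * (n_per_group / 2) ** 0.5
    # Probability of landing beyond either critical value when the effect is real.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


print(round(approx_power(20, 0.5), 2))               # baseline
print(round(approx_power(80, 0.5), 2))               # larger sample
print(round(approx_power(20, 1.0), 2))               # larger effect
print(round(approx_power(20, 0.5, alpha=0.001), 2))  # stricter false positive rate
```
</pre>
<p>Power goes up with a larger sample or a larger effect, and goes down when the false positive rate is made stricter.</p>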
<p>Given that power depends only on these three parameters (sample size, effect size, and false positive rate), how much power is enough? It’s widely accepted that a reasonable level of power is 80-85%. I say ‘widely accepted’ because when people stop to think about what level of power they find acceptable, their answer tends to be in that ballpark (i.e., 4 times out of 5, your experiment will detect the effect you want if it really exists). But that’s not to say that most studies actually <em>have</em> that level of power in practice. One of the most remarkable findings (and one that’s been demonstrated over and over again) made by statisticians interested in power is that an absurdly large proportion of studies in many disciplines simply don’t have the necessary power to detect the effects they hypothesize. In the article I linked to at the beginning, Jacob Cohen points out that an analysis he conducted in 1960 indicated that the average social psychology study had only 48% power to detect moderate-sized effects. In Cohen’s words, “the chance of obtaining a significant result was about that of tossing a head with a fair coin” (p. 155). And that’s on <em>average</em>; presumably there are a good number of studies that have set out to identify effects they have <em>no real chance of detecting even if they’re actually present in the population</em>.</p>
<p>Cohen then went on to note that other statisticians conducting similar reviews have shown no improvement in the average level of power in the decades since. For anyone actively involved in research—or even to casual consumers of science—this should raise red flags all over the place. There really is no excuse for failing to do a simple power calculation before beginning to collect data. It’s not as though power calculation is a tedious process: all you have to do is plug two or three numbers into an online worksheet, and poof, you get your answer instantly. And yet many, maybe even most, scientists fail to do so.</p>
<p>In fairness, doing a power calculation isn’t quite <em>that</em> easy, because you rarely know the exact size of the effect you’re seeking. If you did, you probably wouldn’t need to do the study in the first place! While it’s easy to decide you’d like your study to have, say, 80% power, it’s not so easy to come up with a reasonable estimate of effect size.</p>
<p>Suppose for example that we want to know if there’s a correlation between people’s mood and the amount of television they watch daily. Let’s stipulate our power has to be around 80% (we don’t want to do our study if we don’t think there’s at least a 4 in 5 chance of detecting an effect), and we’ll test our hypothesis at the conventional level of p < .05. How many subjects do we need to collect data from? Well, depends. If the correlation between mood and television watching in the general population is <em>large</em> (canonically, around r = .5), we’re only going to need 29 people to have an 80% chance of detecting it. If it’s <em>medium</em> (say, r = .3), we’re going to have to round up 85 people. But if it’s only a <em>small</em> effect (say, r = .1, or an overlap of only 1% of the total variance in each measure), we’re faced with the daunting prospect of chasing down 785 subjects! Note that in all 3 of these cases, we’re assuming that there <em>really is a correlation between mood and television-watching</em>. The only difference is how strong that effect is.</p>
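<p>Sample sizes like these can be approximated with a few lines of Python. The sketch below uses the Fisher z-transformation, a common textbook approximation for correlations; it is not necessarily the method behind the figures quoted above, and <code>approx_n_for_correlation</code> is a made-up name.</p>
<pre>
```python
import math
from statistics import NormalDist


def approx_n_for_correlation(r, power=0.80, alpha=0.05):
    """Approximate sample size needed to detect a population correlation r (two-sided test)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    # Fisher's z has a standard error of roughly 1 / sqrt(n - 3).
    return ((z_alpha + z_power) / math.atanh(r)) ** 2 + 3


for r in (0.5, 0.3, 0.1):
    print(r, round(approx_n_for_correlation(r)))
```
</pre>
<p>This approximation gives roughly 29, 85, and 783 subjects for large, medium, and small correlations, in the same ballpark as the numbers above; exact answers differ slightly depending on the method used.</p>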
<p>Of course, power calculations don’t always have to mean bad news. For example, in my area of research (functional neuroimaging), power calculations are often quite comforting. It’s an interesting quirk that people often criticize imaging studies for having small samples, when in fact imaging studies probably don’t have lower power on average than other kinds of studies (at least for standard experimental, within-subject analyses). The knee-jerk reaction is understandable though, because many psychologists (particularly in social or personality psychology) are used to working with samples in the hundreds. If that’s your background, it’s no surprise that when you come across neuroimaging studies that used samples of only 15 subjects (a pretty standard size), you’re going to think something’s horribly wrong.</p>
<p>In fact, there’s nothing wrong, because it turns out (fortuitously!) that effect sizes in functional neuroimaging studies tend to be huge. It’s not uncommon to see effect sizes around d = 2 (d is a standardized measure of effect size popularized by Cohen; it’s measured in standard deviations, so a d of 2 means the difference in neural activation between two experimental conditions is around 2 standard deviations). Effects that large are unheard of in most other disciplines. Consider that Cohen himself considered anything above d = 0.8 a ‘large’ effect (this is just a heuristic of course—the meaning of ‘large’ differs considerably across research areas!).</p>
<p>A quick power calculation reveals that a study with 12 subjects has essentially 100% power to detect an effect size of 2 at p < .05. Basically, if the population effect really is that big, you’re not going to miss it. In fact, with only 2 subjects, you’d still have an 88% shot of detecting it. This explains why early neuroimaging studies that often had only 3 or 4 subjects were able to obtain replicable results. In the early days, when little was known about the relationship between specific cognitive tasks and neural activity in humans, researchers used very broad experimental task contrasts specifically intended to elicit very large, very obvious changes in activation (e.g., comparing activation during a working memory task to a passive resting state). The effects were (not surprisingly, in hindsight) enormous. As time goes on and our knowledge of the functional neuroanatomy of cognition builds up, hypotheses become more subtle, and effect sizes diminish, requiring larger samples.</p>
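<p>The 12-subject case is easy to check by simulation. The Python sketch below assumes a one-sample (within-subject) t-test on difference scores with a true effect of d = 2, which is one plausible reading of the scenario rather than a confirmed description of the calculation above; <code>simulated_t_power</code> is a hypothetical helper.</p>
<pre>
```python
import random
import statistics


def simulated_t_power(n, d, t_crit, n_sims=2000, seed=1):
    """Fraction of simulated samples whose one-sample t statistic exceeds t_crit in absolute value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Difference scores in standard-deviation units: true mean d, true sd 1.
        sample = [rng.gauss(d, 1.0) for _ in range(n)]
        t = statistics.fmean(sample) / (statistics.stdev(sample) / n ** 0.5)
        if abs(t) > t_crit:
            hits += 1
    return hits / n_sims


# 2.201 is the two-sided 5% critical value of t with 11 degrees of freedom.
print(simulated_t_power(n=12, d=2.0, t_crit=2.201))
```
</pre>
<p>Under these assumptions, virtually every simulated 12-subject study detects the effect.</p>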
<p>Of course, imaging studies usually don’t test for effects at p < .05, for reasons I won’t go into here (mainly the need to correct for multiple comparisons). Still, even at p < .001, a study with 15 subjects has 70% power. That’s not great, but it’s a comparable level to what you’ll find in many behavioral studies. Bump the sample up to 20 subjects, and power is now 92%, which is more than acceptable.</p>
<p>Hopefully, these examples make clear the importance of (a) conducting power calculations <em>before</em> starting to collect data, and (b) having some reasonable notion as to what the population effect size might be (e.g., based on related effects that have already been identified). Even if you&#8217;re never going to collect any data yourself, and just want to be an informed consumer of scientific literature, it pays to know something about power. Remember: effect size matters. The fact that a study only has 10 people doesn&#8217;t necessarily mean it&#8217;s too small to provide meaningful data. Conversely, a study can have thousands of subjects and still be underpowered.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.smallgraymatters.com/2006/12/04/a-primer-on-power/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multiple choice tests: why you shouldn&#8217;t panic</title>
		<link>http://www.smallgraymatters.com/2006/08/26/multiple-choice-tests-why-you-shouldnt-panic/</link>
		<comments>http://www.smallgraymatters.com/2006/08/26/multiple-choice-tests-why-you-shouldnt-panic/#comments</comments>
		<pubDate>Sat, 26 Aug 2006 16:32:32 +0000</pubDate>
		<dc:creator>small and gray</dc:creator>
				<category><![CDATA[academics]]></category>
		<category><![CDATA[tutorials]]></category>

		<guid isPermaLink="false">http://www.smallgraymatters.com/2006/08/26/multiple-choice-tests-why-you-shouldnt-panic/</guid>
		<description><![CDATA[Many undergraduate students in the social and life sciences go through 4 or more years of university education utterly convinced that multiple choice exams are Satan’s favorite testing format. Drawn up by diabolical, sadistic demons (sometimes termed “professors”), questions on multiple choice exams are invariably ambiguous, unfair, and out for (the student’s) blood.  Personally, [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal">Many undergraduate students in the social and life sciences go through 4 or more years of university education utterly convinced that multiple choice exams are Satan’s favorite testing format. Drawn up by diabolical, sadistic demons (sometimes termed “professors”), questions on multiple choice exams are invariably ambiguous, unfair, and out for (the student’s) blood.  Personally, I have my own vivid and unpleasant memories of the teeth-gnashing, expletive-laden tirades I went through not so very long ago whenever I received an exam back with questions marked wrong that I felt I should have received credit for. But now that I’m an older and marginally wiser graduate student with several statistics and research methods classes under my belt, I appreciate what I couldn’t back then: <em>there’s nothing wrong with multiple choice exams </em>(most of the time!). Multiple choice exams are fine. They’re better than fine&#8211;they’re great. The problem isn’t the exams; it’s that no one ever bothers to explain the logic of the format to students at a point in time when it actually matters (e.g., at the beginning of the semester, before the first exam).</p>
<p class="MsoNormal">Now that I’m in the position of having to grade students’ multiple choice exams and explain their mistakes to them during office hours, I often find myself wishing I had a concise explanation as to why they really shouldn’t feel bad about getting Question Number 26 wrong, and why it’s still a perfectly good question even if they felt the wording was ambiguous. There are <a href="http://www.google.com/search?hl=en&#038;lr=&#038;q=multiple+choice+strategies&#038;btnG=Search">plenty of guides</a> about <a href="http://www.studygs.net/tsttak3.htm">how to <em>take</em></a> multiple choice exams floating around on the web, but what I’m after is a Damage Control Guide explaining how to defuse tension associated with students’ perceptions that they got screwed over on the last test. So rather than wait around indefinitely, I thought I’d write one, in the hopes others might find it useful.</p>
<p class="MsoNormal">The overarching point students need to understand and accept about multiple choice exams is that they are almost always made up of <em>mostly bad </em>questions, and that this is in fact <em>mostly a good thing</em>. By ‘mostly’ bad I mean that almost any question on a multiple choice exam is going to be ambiguous to some degree. Wording that seems crystal clear to one student is going to seem horribly vague to another; a question to which one student thinks B is unambiguously the right answer may confuse and anger another student, who thinks B, C, and D are all perfectly acceptable answers based on what the textbook says. Ideally, of course, such ambiguity shouldn’t be <em>so </em>pervasive as to completely paralyze and perplex the majority of students taking a test. However, some measure of ambiguity and even outright error is unavoidable.</p>
<p class="MsoNormal">It also turns out not to be a very big deal. It can be demonstrated mathematically that even a multiple choice test made up of mostly bad questions can still provide a very good measure of students’ knowledge of the tested material, provided that (a) there’s at least a weak correlation between students’ scores on individual questions and their overall knowledge, and (b) there are enough questions on the exam.</p>
<p class="MsoNormal">In practice, both of these numbers can usually be surprisingly modest. The reliability of a measure (or multiple choice test) is most commonly estimated using Cronbach’s alpha, which, in one form, allows us to compute a reliability coefficient as a function of two quantities: the number of items (or questions) on the test, and the average correlation between items. The formula is as follows:</p>
<p class="MsoNormal"><a href="http://en.wikipedia.org/wiki/Cronbach's_alpha"><img title="Cronbach's alpha formula" alt="Cronbach's alpha formula" src="http://upload.wikimedia.org/math/4/6/6/4668d9e129a9651a64fa0031b8ce7b2c.png" /></a></p>
<p class="MsoNormal">Where N is the number of items and r is the average inter-item correlation. Given this formula, it’s easy to estimate the reliability of a hypothetical test. For example, a test with 30 questions and an average inter-item correlation of only .2 (equivalent to an average of only 4% shared variance between items!) will have a reliability coefficient of .88. In general, anything over .85 or so is considered good, so even by creating a test with only 30 questions and weakly inter-correlated items, you can see that an instructor can end up with a very reliable test. Given that grades are typically derived from more than one test, the reliability of students’ overall grades will generally increase further. Moreover, if you were to increase the number of items on a given test to 90, reliability jumps to .96, or near perfect.</p>
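<p class="MsoNormal">For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the standardized form of Cronbach’s alpha described above, with N items and an average inter-item correlation r. The function name <code>standardized_alpha</code> is just a label chosen for this example.</p>
<pre>
```python
def standardized_alpha(n_items, mean_r):
    """Standardized Cronbach's alpha from item count and average inter-item correlation."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)


print(round(standardized_alpha(30, 0.2), 2))  # 0.88
print(round(standardized_alpha(90, 0.2), 2))  # 0.96
```
</pre>
<p class="MsoNormal">Both values match the figures quoted in the text for 30 and 90 questions.</p>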
<p class="MsoNormal">Note that because an average inter-item correlation of .2 is pretty low, the above calculation essentially gives instructors a free pass to have several bad questions on each exam. The net effect of poorly wording a question is to reduce its ability to correlate with other questions, because whether or not a student gets a bad question right depends on chance rather than knowledge. So smarter students are no more likely to get a bad question right than are poor students. Just how many bad questions one can afford to have on a test depends on how inter-correlated the <em>good </em>questions are; but it’s clear that even on a test of 30 questions with an average inter-item correlation of .2, having 4 or 5 questions that are completely uncorrelated with the rest of the test would have relatively little impact on the overall reliability of the test. And since reliability increases as a function of number of items, any concern about the drop can easily be offset by adding another 10 or 20 items.</p>
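<p class="MsoNormal">To put rough numbers on this, the Python sketch below models one hypothetical scenario: the good questions correlate .2 with one another, and the bad questions correlate exactly zero with everything. Those modeling choices, and the helper name <code>alpha_with_bad_items</code>, are assumptions made for illustration.</p>
<pre>
```python
from math import comb


def standardized_alpha(n_items, mean_r):
    """Standardized Cronbach's alpha from item count and average inter-item correlation."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)


def alpha_with_bad_items(n_good, n_bad, good_r):
    """Alpha when bad items are uncorrelated with every other item on the test."""
    n_items = n_good + n_bad
    # Only good-good pairs correlate; every pair involving a bad item contributes zero.
    mean_r = comb(n_good, 2) * good_r / comb(n_items, 2)
    return standardized_alpha(n_items, mean_r)


print(round(alpha_with_bad_items(30, 0, 0.2), 2))  # 30 good questions
print(round(alpha_with_bad_items(25, 5, 0.2), 2))  # 25 good, 5 bad
print(round(alpha_with_bad_items(35, 5, 0.2), 2))  # 10 more good questions added
```
</pre>
<p class="MsoNormal">Under these assumptions, swapping 5 good questions for 5 useless ones lowers reliability from about .88 to about .83, and adding 10 more good questions brings it back to about .88.</p>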
<p class="MsoNormal">Of course, all of this may initially seem like mumbo-jumbo to an irate student who feels they were mortally wronged by ambiguous wording on one or two questions. But it’s useful to explain nonetheless, because students who understand the logic will not only complain less, making your life easier, but will also have a more pleasant college experience, since they won’t spend four or more years feeling persecuted by malevolent instructors.</p>
<p class="MsoNormal">Having said all of this, there are a couple of important caveats, and one shouldn’t just conclude that <em>any</em> reasonably well-thought out multiple choice test is acceptable for class use. First, bad exam questions (even when there are only a few) do present a genuine problem for a small minority of students, namely those whose performance is at ceiling. If you’re a student who would have performed perfectly on a test made up of clear, relevant, and unambiguously-worded questions, the inclusion of bad items can only hurt you, since you have nowhere to go but down. In contrast, students who score lower in the distribution, say, around 75%, have little to complain about, since it’s entirely possible for their score to <em>increase</em> due to the inclusion of bad questions. Students who score near the bottom would actually experience a beneficial effect, with noise generally increasing their scores. But since the distribution of scores is almost always top-heavy in academic settings (more people pass than fail!), the overall net effect of unreliability is to shift the distribution of scores slightly downwards. In most cases this isn’t a problem since most instructors implicitly account for this (e.g., by making some exams ‘easy’ in order to shift scores upwards), but it’s worth keeping in mind anyway. Even if the reliability of your test is very high, it may still make sense to throw out the worst questions in order to prevent a systematic slip in the distribution.</p>
<p class="MsoNormal">A second and more important concern is that establishing that a test is reliable doesn’t necessarily mean it’s a <em>valid</em> measure of students’ learning. A reliable test is simply one that measures the same thing consistently. Nothing about the reliability coefficient tells you <em>what </em>that thing is. There are lots of things you could measure consistently in student populations that have little or nothing to do with the course material you’re teaching. For example, if you like to write extremely tricky multiple choice questions that require students to perform rigorous exercises in logic (e.g., “the answer can’t be A, because only one of these answers is right, and A entails that B is true as well”), you may well end up with highly reliable tests. However, these tests may not be valid measures of students’ knowledge of, say, organic chemistry or developmental psychology, because in effect, by turning your exams into an exercise in logic, you’ve loaded the ability to reason abstractly into your questions. In other words, what determines whether students do well on your exams may turn out to be their general level of fluid intelligence, and not the degree to which they’ve studied and assimilated the material. So while an <em>un</em>reliable test is <em>always </em>a lousy test, a reliable test <em>may </em>still be a lousy test. The ability to easily calculate Cronbach’s alpha isn’t an excuse to stop worrying about <em>what</em> your exams are testing for. But it does let you establish that wording problems or ambiguity on some questions don&#8217;t have much of an impact on your overall ability to measure students&#8217; performance.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.smallgraymatters.com/2006/08/26/multiple-choice-tests-why-you-shouldnt-panic/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
