Thursday, June 14, 2012

Bad for Men, Bad for Women, But Good for Humans as a Whole?

Suppose a company is testing a new drug. It is less effective than the placebo at treating men. Also, it is less effective than the placebo at treating women. Yet the company claims that overall, it is actually more effective than the placebo! Can this happen, and if so, how? Today we are going to look at a probability "paradox", sometimes called "Simpson's Paradox". As you will see, it is not really a paradox - math does work, after all - it simply gives what initially seems to be a counter-intuitive answer. By the end of this post, you will understand what is going on and how to protect against it in real-world situations.

Of course, one possible explanation is that someone made a mistake, unintentionally or not, but that's not the mathematically interesting explanation. Instead, consider the following table of data. In this (hypothetical, of course) experiment, we test all four combinations of gender (men versus women) and treatment regime (drug versus placebo). In each of the four cases, we count how many people were helped by the treatment, and how many were not. These counts are reported in the eight data cells in the middle of the table, rows (a.) through (d.):

|    | Who   | What    | Helped | Not Helped | Percent Helped |
|----|-------|---------|--------|------------|----------------|
| a. | Men   | Drug    | 40     | 60         | 40%            |
| b. | Men   | Placebo | 90     | 110        | 45%            |
| c. | Women | Drug    | 150    | 50         | 75%            |
| d. | Women | Placebo | 80     | 20         | 80%            |
| e. | Both  | Drug    | 190    | 110        | 63%            |
| f. | Both  | Placebo | 170    | 130        | 57%            |
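If you would like to check the numbers yourself, here is a short Python sketch (my own addition, not part of the original post) that recomputes the totals in rows (e.) and (f.) and the percentages in the last column from the eight data cells:

```python
# Counts from rows (a.) through (d.): (helped, not helped) for each group.
counts = {
    ("Men", "Drug"):      (40, 60),
    ("Men", "Placebo"):   (90, 110),
    ("Women", "Drug"):    (150, 50),
    ("Women", "Placebo"): (80, 20),
}

def pct_helped(helped, not_helped):
    return 100 * helped / (helped + not_helped)

# Per-group percentages (rows a. through d.).
for (who, what), (h, n) in counts.items():
    print(f"{who:5s} {what:7s}: {pct_helped(h, n):.0f}% helped")

# Pooled totals (rows e. and f.): add the men's and women's counts.
for what in ("Drug", "Placebo"):
    h = sum(counts[(who, what)][0] for who in ("Men", "Women"))
    n = sum(counts[(who, what)][1] for who in ("Men", "Women"))
    print(f"Both  {what:7s}: {pct_helped(h, n):.0f}% helped")
```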
For convenience in interpreting the results, we have added some additional rows and columns that are computed from the data. These are not additional data points, just useful totals and ratios. The last two rows simply total up the men and women: the counts in row (e.) come from adding rows (a.) and (c.), while (f.) is the sum of (b.) and (d.). The last column converts the counts into the percent who were helped by the treatment. For instance, the final value in the table is 170/(170+130) = 57%.

Looking at the final column, we see that, as claimed, the drug is less effective than the placebo at treating men: (a.) 40% of men were helped by the drug, compared with (b.) 45% who got better on their own just from the placebo. Also, the drug is less effective than the placebo at treating women: (c.) 75% were helped by the drug, compared to (d.) 80% who got better on their own just from the placebo. This is evidence that the drug may actually cause harm: whether you are a man or a woman, your odds of recovery are better if you take the placebo instead of the drug!

However, if we ignore gender and simply look at the two bottom rows, which total everything up, we see the counter-intuitive part: overall, across the entire study, the drug appears to be more effective than the placebo! In particular, (e.) 63% of patients taking the drug got better, compared with only (f.) 57% of those taking the placebo.

So, we have answered the original question: yes, this can happen, and here is an example. But what does this mean? Is the drug helpful or is it harmful? Can we trust the results of scientific experiments? Why did we get such a counter-intuitive conclusion?

Take a moment to think, and see if you can tell why this happened. In particular, examine the eight data cells carefully, since they are the only bits of data in the table - everything else is computed from them. Again, we are assuming the numbers are correct - ignore the possibility that someone misreported something. All eight numbers are perfectly plausible outcomes in the real world.

The issue is subtle. It concerns the design of the experiment. We are trained to believe the results of scientific experiments, because we have been taught that a well constructed study should give results that are *representative* of the impact you would see in the real world. The italicized word *representative* is a clue.

In the real world, the population is approximately 50% men and 50% women. For the moment, let's suppose both men and women are equally likely to get the disease studied in this experiment.

Row (f.) summarizes the results for the placebo. The 300 people who received it were not 50% men and 50% women. Instead, they consisted of 200 men and 100 women, or 67% men and 33% women. That is not representative of the real world, so the resulting average success rate of 57% in the last column is not typical of what you would see in the real world. In fact, if we take a 50-50 average of the placebo success rates for men (45%) and women (80%), we would expect to see a placebo success rate around 62% in the real world: higher than reported in (f.).
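As a quick check of that arithmetic, here is a tiny Python sketch (my own illustration, assuming the 50-50 population split described above):

```python
# Placebo success rates by gender, from rows (b.) and (d.), in percent.
placebo_men, placebo_women = 45, 80

# The study's placebo group was 200 men and 100 women (67% / 33%)...
study_mix = (200 * placebo_men + 100 * placebo_women) / 300
# ...but the assumed real-world population is closer to 50% / 50%.
real_world = 0.5 * placebo_men + 0.5 * placebo_women

print(study_mix)   # about 56.7 - the 57% reported in row (f.)
print(real_world)  # 62.5 - the "around 62%" figure above
```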
Row (e.) summarizes the results for the drug. Again, the 300 people who received it were not 50% men and 50% women. They consisted of 100 men and 200 women, or 33% men and 67% women. Again, this is not representative of the real world, so the resulting average success rate of 63% in the last column is not typical of what you would see in the real world. In fact, if we take a 50-50 average of the drug success rates for men (40%) and women (75%), we would expect to see a drug success rate around 57% in the real world: lower than reported in (e.).

Thus, the correct conclusion to draw from the data is that in the real world, when applied to patients of all genders, one would expect the drug to be about 57% effective, while the placebo would be about 62% effective. Now the results agree: from every (valid) perspective, the placebo is better than the drug. And that is no longer counter-intuitive.

The previous paragraphs assume men and women are equally likely to have whatever disease the drug is intended to treat. If in fact the disease is more (or less) common in men, we would weight the men more (or less). But - and this is the key point - we would have to do so in both the drug and placebo cases. If a representative result is 10% men and 90% women, so be it - but we need to use that same 10-90 weighting in both (e.) and (f.) - we do not get to use different weights in different cases.

Using the same weights consistently will make the placebo score better than the drug. Since the placebo is better for men and women separately, it will also be better for any weighted average of their results, provided we use consistent weighting. In fact, since the placebo outperforms the drug by 5 percentage points for both men and women separately, it will also outperform by 5 points when aggregated up to the total population, regardless of the weight we put on men versus women. All we have to do is be consistent and use the same weights when aggregating the placebo results as we used to aggregate the drug results.
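To see this concretely, here is a small Python sketch (again my own illustration, not from the original study) that tries several different population mixes and shows that the placebo's 5-point advantage survives any consistent choice of weights:

```python
# Per-group success rates from rows (a.) through (d.), in percent.
drug    = {"men": 40, "women": 75}
placebo = {"men": 45, "women": 80}

# Try a range of possible population mixes (fraction of men).
for w_men in (0.1, 0.33, 0.5, 0.67, 0.9):
    w_women = 1 - w_men
    drug_rate    = w_men * drug["men"]    + w_women * drug["women"]
    placebo_rate = w_men * placebo["men"] + w_women * placebo["women"]
    print(f"men weight {w_men:.2f}: drug {drug_rate:.1f}%, "
          f"placebo {placebo_rate:.1f}%, gap {placebo_rate - drug_rate:.1f} points")
```

The gap printed on every line is 5.0 points: the mix changes both averages, but never their order.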
An even simpler way to proceed is to report the results in rows (a.) through (d.), and never even try to aggregate men and women together. Patients know their own gender, so they can look in the appropriate data row to see what works or doesn't work for them. Rows (e.) and (f.) are not needed for communicating the results! This approach has the advantage that we do not need to know the incidence rates in the population - though as remarked above, regardless of what those incidence rates are, it will not change the conclusion that the placebo is better than the drug.

The (hypothetical) company based its (hypothetical) claim of effectiveness on the last two rows in the table, which show a higher percentage of study participants helped by the drug than by the placebo. Their error was to think that this aggregated percentage would apply to the real world. It does not.

Women in this experiment get better outcomes than men: they recover at higher rates regardless of which kind of treatment they receive. Row (e.) weights women 67%, twice as high as men, which pushes up the recovery rate. Row (f.) weights women 33%, half as much as men, which pushes down the recovery rate. That's how (e.) manages to get a higher percentage helped than (f.).

Real clinical trials conducted by reputable companies strive to avoid this sort of problem by carefully designing the experiment to include a representative number of patients from all the different categories (gender, race, income, etc.) they think might be relevant - or by weighting the results properly - or by stating the results in terms of the individual rows of data, rather than trying to aggregate them at all.

Nonetheless, this example is sobering. One take-away is to be cautious about aggregating results, especially from non-representative samples. A shift in the mix can distort results! It is better to look at the results for men and women separately than to try to combine them. Yet people prefer combining results - both because one number is simpler to understand than two, and because by aggregating, the sample size appears to be larger, which makes the results seem (superficially) more accurate.

The other take-away is the need to think carefully about what factors are relevant. In this example, recovery rates depended on gender, which is a factor in our study, so we can identify and fix the aggregation problem by looking at the results for men and women separately. But suppose the real driving factor was race - or income - or exposure to a particular environmental contaminant - or inheritance of a particular gene. In any of these cases, the results in the table would already be aggregated, and we would have no way (from the data at hand) to remedy the problem by splitting the results by race, income, environment, or genetics, because we did not measure those factors when we collected our data!

Worse yet, that means we might be unaware that we are even in a Simpson's Paradox situation! If we only have the last two rows in the table to look at - for instance, because no one bothered to track gender - then we have no way to tell that the percentages are in fact directionally wrong!

But since we can never be sure in advance what the complete list of relevant factors is, how can we protect against this sort of problem? There turns out to be a good solution: randomize the treatments. In a good clinical trial design, patients are randomly placed in either the drug or placebo bucket. That generally solves the problem. In our example, in a study with 300 men and 300 women, randomized assignment would put approximately 150 men and 150 women in the placebo group, and another 150 men and 150 women in the drug group. While the exact numbers might vary, getting something as lopsided as 100 men and 200 women is extremely unlikely.

Of course, even randomization does not solve the problem if the sample size is too small: with just 6 participants, random chance could easily make the "mistake" of assigning all the women to the placebo and all the men to the drug. Or vice versa. With an obvious factor like gender, you can double-check to ensure that random chance did not, in fact, produce a bad assignment of people. But in general, you don't know what all the factors might be. That means randomization must be coupled with a large sample size.
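To get a feel for why randomization helps, here is a rough Python simulation (my own sketch, not part of the original post): shuffling 300 men and 300 women into two groups of 300 almost always produces groups with close to 150 of each, rather than a lopsided 200/100 split.

```python
import random

random.seed(0)  # fixed seed only so this illustration is reproducible

# 300 men and 300 women, randomly split into two groups of 300.
people = ["man"] * 300 + ["woman"] * 300
random.shuffle(people)
placebo_group, drug_group = people[:300], people[300:]

for name, group in (("placebo", placebo_group), ("drug", drug_group)):
    print(f"{name}: {group.count('man')} men, {group.count('woman')} women")
```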
In clinical trials, it is also a good idea not to tell either the doctor or the patient whether they have the drug or the placebo, since otherwise the knowledge may influence their subjective reporting of the improvement in their symptoms. The same is true in market research: it is best not to tell participants the goal of the study or their role within it, lest it color their responses.

Similar issues arise in many other contexts. Any time you are unable to perform a controlled, randomized experiment, you are at risk that the patterns you observe are not causal, but are due to spurious correlations underlying the way the data was collected. This applies to regression analyses, too, not just contingency tables - leaving out a causal factor can seriously distort the conclusions in a regression.

Mix-shifting can distort all sorts of averages. Consider a company with two products. Suppose the company raises prices on both products. It will sell fewer units, so its total profit might decline, depending on the elasticity. But surely its overall profit margin, namely profit divided by revenue, will increase? Not necessarily, if the mix shifts away from the product with the higher margin. But this post is already long, so that example will need to wait for another day - or see if you can construct such an example yourself.

If you liked this article, you may also like When is a change statistically significant?, or possibly Family Genetics. Or check out the Contents page for a complete list of past topics.

Please post questions, comments and other suggestions using the box below, or email me directly at the address given by the clues at the end of the Welcome post. Remember that you can sign up for email alerts about new posts by entering your address in the widget on the sidebar. If you prefer, you can follow @ingThruMath on Twitter, where I will tweet about each new post to this blog. The Contents page has a complete list of previous articles in historical order. Now that there are starting to be a lot of articles, you may also want to use the 'Topic', 'Search' or 'Archive' widgets in the side-bar to find other articles of related interest.

I hope you enjoyed this discussion. You can click the "M" button below to email this post to a friend, or the "t" button to Tweet it, or the "f" button to share it on Facebook, and so on. See you next time!