Monday, March 17, 2008

Eugene Volokh has a nice analysis of statements like, "10% of All X's Account for 25% of All Y's":
Consider a boundary case: Say that each police officer has a 10% chance of having a complaint this year. Then, on average, 10% of all officers will have 100% of this year's complaints. Likewise, say that each police officer has a 1% chance of having a complaint each year for 10 years, and the probabilities are independent from year to year (since complaints are entirely random, and all the officers are equally prone to them). Then, on average, about 9.6% (1 - 0.99^10) of all police officers will have 100% of the complaints over the 10 years, since 0.99^10 of the officers will have no complaints.
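The second boundary case is easy to check numerically. The sketch below just evaluates 1 - 0.99^10 under the stated assumptions (independent years, identical officers); the variable names are illustrative and not from the original post. The result comes out a little under 10%.

```python
# Chance that an officer gets at least one complaint over 10 independent
# years, given a 1% complaint probability per year (numbers from the text).
p_per_year = 0.01
years = 10

# Probability of a completely clean record over all the years.
p_no_complaints = (1 - p_per_year) ** years
# Everyone else has at least one complaint, hence 100% of the complaints.
p_at_least_one = 1 - p_no_complaints

print(f"share with no complaints:  {p_no_complaints:.4f}")
print(f"share with >= 1 complaint: {p_at_least_one:.4f}")
```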

Or consider a less boundary case, where the math is still easily intuitive. Say that you have 100 honest coins, each equally likely to turn up heads or tails. You toss each coin twice. On average,

* 25 of the coins will come up heads twice, accounting for 50 heads.
* 50 of the coins will come up heads once and tails once, accounting for 50 heads.
* 25 of the coins will come up tails twice, accounting for 0 heads.

This means that 25% of the coins account for 50% of the heads -- but because of randomness, not because some particular coins are more likely to turn up heads than others.
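The coin tally above can be reproduced by enumerating the four equally likely outcomes of two tosses. This is only a sketch of the expected-value arithmetic, not a simulation, and the names (`n_coins`, `expected_coins`, and so on) are made up for the illustration.

```python
from collections import Counter
from itertools import product

n_coins = 100

# The four equally likely outcomes of tossing one fair coin twice.
outcomes = list(product("HT", repeat=2))
# Map "number of heads" -> how many of the four outcomes produce it.
heads_dist = Counter(o.count("H") for o in outcomes)

# Expected number of coins showing each head count, and the heads they contribute.
expected_coins = {k: n_coins * v / len(outcomes) for k, v in heads_dist.items()}
expected_heads = {k: k * c for k, c in expected_coins.items()}
total_heads = sum(expected_heads.values())

share_coins_two = expected_coins[2] / n_coins
share_heads_two = expected_heads[2] / total_heads

for k in (2, 1, 0):
    print(f"{k} heads: {expected_coins[k]:.0f} coins, {expected_heads[k]:.0f} heads")
print(f"{share_coins_two:.0%} of coins account for {share_heads_two:.0%} of heads")
```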

Likewise, we see the same in slightly more complicated models. Say that each police officer has a 10% chance of having a complaint each year, and we're looking at results over 10 years. Then 7% of all officers will have 3 or more complaints (that's SUM (10-choose-i x 0.1^i x 0.9^(10-i)) as i goes from 3 to 10). But those 7% will account for 22.5% of all complaints (that's SUM (10-choose-i x 0.1^i x 0.9^(10-i) x i) as i goes from 3 to 10). And again this is so even though each officer is equally likely to get a complaint in any year.
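The two sums in this paragraph can be evaluated directly. The sketch below assumes the same model as the text (10 independent years, a 10% complaint chance per year, identical officers); `binom_pmf` is a small helper written for this example rather than a library function.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k complaints in n independent years."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_years = 10
p = 0.1
threshold = 3  # "3 or more complaints"

# Share of officers with at least `threshold` complaints.
share_officers = sum(binom_pmf(i, n_years, p) for i in range(threshold, n_years + 1))

# Expected complaints per officer contributed by that group,
# divided by the overall expected complaints per officer (n * p).
complaints_from_group = sum(
    i * binom_pmf(i, n_years, p) for i in range(threshold, n_years + 1)
)
share_complaints = complaints_from_group / (n_years * p)

print(f"officers with >= {threshold} complaints: {share_officers:.1%}")
print(f"their share of all complaints:  {share_complaints:.1%}")
```

The second sum is divided by the expected number of complaints per officer, which is 10 x 0.1 = 1 in this model, so the sum itself is already the share of complaints.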

Now of course it seems very likely that in fact some officers are more prone to complaints than others. My point is simply that this conclusion can't flow from our observation of the 10%/25% disparity, or 7%/22.5% disparity, or even a 20%/80% disparity. We can reasonably believe it for other reasons (such as our knowledge of human nature), but not because of that disparity, because that disparity is entirely consistent with a model in which all officers are equally prone to complaints.
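To see that the disparity really does arise with no built-in differences between officers, one can simulate the equal-propensity model. This is a rough Monte Carlo sketch: the force size of 50,000 and the fixed seed are arbitrary choices made for the example, and every simulated officer is given exactly the same 10% yearly complaint chance.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

n_officers = 50_000  # hypothetical force size, large to reduce noise
n_years = 10
p_complaint = 0.1    # identical for every officer, by construction
threshold = 3

# Complaint count for each officer over the whole period.
counts = [
    sum(random.random() < p_complaint for _ in range(n_years))
    for _ in range(n_officers)
]

# Officers who happened to reach the threshold, purely by chance.
flagged = [c for c in counts if c >= threshold]
share_officers = len(flagged) / n_officers
share_complaints = sum(flagged) / sum(counts)

print(f"{share_officers:.1%} of officers account for "
      f"{share_complaints:.1%} of complaints")
```

The simulated shares should land close to the 7% and 22.5% figures, even though no officer in the model is more complaint-prone than any other.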

...But often we hear just a "10% of all X's account for 25% of all Y's" report, or some such, and are asked to infer from there that those 10% have a disproportionate propensity to Y. And that inference is not sound, because these numbers can easily be reached even if everyone's propensity is equal.