Friday, December 10, 2010

Statistics: You're Doing it Wrong.

A lot of people don't really understand statistics, and it leads to all sorts of trouble. I get very tired of people quoting numbers and percentages that they've read in the paper while clearly not having a clue as to what they're actually talking about. Papers won't actually explain things properly either because that would make the story a whole lot less dramatic.  I'm hoping to clear some of that up here with a couple of examples of commonly misunderstood statistical talk.

1) Taking drug x increases your risk of cancer by 80%
Hearing this, a lot of people will immediately stop taking drug x, thinking that their chances of getting cancer from this stuff are now hugely amplified.
In fact, they might not be. Let's assume that your probability of getting cancer at some point in your life is 1%.
80% of 1% is 0.8%. So, an increase of 80% means your probability of getting cancer is now 1.8%. Proportionally, this is significant. If your probability is already high, then you're in strife. But in real terms, if you are not already in a high risk group, you don't have all that much to worry about.
I assume this is also why some people use the term "percentage points": if a probability has gone from 20% to 30%, it has gone up by 10 percentage points. It has also gone up by 50%.
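
To make the difference between a relative increase and a percentage-point increase concrete, here is a minimal Python sketch of the arithmetic above. The function names are my own, chosen for illustration, and the 1% baseline is the same assumed figure used in the example, not a real risk estimate.

```python
def apply_relative_increase(baseline, relative_increase):
    """Return the new probability after a relative increase (0.80 means +80%)."""
    return baseline * (1 + relative_increase)


def describe_change(old, new):
    """Return (percentage-point change, relative change in percent)."""
    points = (new - old) * 100
    relative = (new - old) / old * 100
    return points, relative


# Assumed baseline from the example: a 1% lifetime risk, increased by 80%.
new_risk = apply_relative_increase(0.01, 0.80)
points, relative = describe_change(0.01, new_risk)
print(f"New risk: {new_risk:.1%}")
print(f"Change: {points:.1f} percentage points, or {relative:.0f}% relative")

# The 20% -> 30% case: same calculation, very different headline numbers.
points, relative = describe_change(0.20, 0.30)
print(f"Change: {points:.0f} percentage points, or {relative:.0f}% relative")
```

The same change can be reported either way, so it is worth checking which one a headline is quoting.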
2) How "averages" work in general
The "mean" is the biggest culprit here, and it is the one most people take to be "normal". It is calculated by adding up all the data values and then dividing the total by how many values there are.
When it comes to averages, it is unwise to only look at this one average. For example:
10 people are in a room. 9 of them will earn a salary of $40k this year. The 10th person is a CEO who will earn $4 million. The average (mean) salary of the room is $436k per year, which is more than 10 times what most of the people in the room earn.
The data is skewed, so the mean is not an accurate representation of what a typical person in the room earns. This is why it is important to note the median (middle value) and mode (most frequent value), and to also look at your minimum and maximum data points so that you have a better idea of how the data is actually spread out.
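
As a quick check of the salary example, here is a small sketch using Python's standard `statistics` module. The variable names are just illustrative; the figures are the ones from the room described above.

```python
from statistics import mean, median, mode

# Nine people on $40k and one CEO on $4 million.
salaries = [40_000] * 9 + [4_000_000]

mean_salary = mean(salaries)      # sum of all salaries divided by 10
median_salary = median(salaries)  # middle value once the salaries are sorted
mode_salary = mode(salaries)      # most frequent value

print(f"Mean:   ${mean_salary:,.0f}")
print(f"Median: ${median_salary:,.0f}")
print(f"Mode:   ${mode_salary:,.0f}")
print(f"Range:  ${min(salaries):,} to ${max(salaries):,}")
```

The mean comes out at $436k while the median and mode both sit at $40k, which is the gap the example is pointing at.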

The useful thing about the median is that it divides the population into two equal halves. So if you had people getting scores ranging from 0 to 100 and your median was 10, this means that half the people got less than 10 as their score. That's not very good...
There are also things called "quartiles", which show you the quarters of your data. If the lower quartile for the scores above was 7, it means that a quarter of your people scored less than 7 and a quarter scored between 7 and 10. If your upper quartile is 95, then a quarter of your test subjects scored 95 or more (which is actually pretty good) and a quarter got between 10 and 95 (which is a heck of a range).
Medians and quartiles are a really good way of showing how data is spread out.

As far as simple representations of data go, I really like box and whisker plots. I find them extremely useful: when you line up all the scores in order from smallest to largest, they show you the highest value, the lowest value, the middle value (the median), and the two quartiles in between.
Snooty McSmugbox. The Box with Whiskers!

For example, let's say I have test scores for a class of 31:
Highest score: 90%
Lowest score: 10%
6 kids got exactly 30%
3 kids got exactly 34%
7 kids got 62%
13 kids got 74%

Here is the BW plot:

 Each section represents a quarter of my class. Half my kids got 62% or more, which is pretty good. More than half my kids got 50% or over (remember that the median splits the class into two equal halves), which is also not bad.
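
The five numbers behind a box and whisker plot for the class above can be worked out with a short Python sketch. This assumes Python 3.8 or later for `statistics.quantiles`, and it uses that function's default "exclusive" method. Quartile conventions differ between textbooks and tools, so other software may give slightly different quartiles for other data sets.

```python
from statistics import quantiles

# The class of 31 from the example: lowest score, the four groups, highest score.
scores = [10] + [30] * 6 + [34] * 3 + [62] * 7 + [74] * 13 + [90]

# quantiles(..., n=4) returns the three cut points: lower quartile, median, upper quartile.
lower_quartile, median_score, upper_quartile = quantiles(scores, n=4)

five_number_summary = (
    min(scores),
    lower_quartile,
    median_score,
    upper_quartile,
    max(scores),
)
print(five_number_summary)
```

With this convention the box runs from 34% to 74%, with the median line at 62% and the whiskers reaching out to 10% and 90%.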

I think that they're pretty cool :-)


  1. I was part of a study where they asked "how many alcoholic drinks, on average," I consume each week.

    Okay, which average? Mean or median? Because honestly, those are different answers. Gaaah!

  2. No love for the midhinge? :-)

    One point I'd add: if your graph shows a "hockey stick", more often than not, you should use a logarithmic scale on one of your axes. Maybe both.