The mean in statistics
Yeah it seems simple, I mean (no pun intended) the mean is just the average! Yet as with so many different things in statistics there’s more to the mean than meets the eye! We’re going to go into why the mean is important, why it’s our best guess, why it may not always be your best option, and why we work so hard to find the mean sometimes! It seems simple, but I promise today we’re answering a lot of the big “why’s” in statistics, so let’s go!
Sometimes it’s the simple things that trip us up. More often than not in statistics, we deal with what’s called the normal (or Gaussian) distribution, which is shaped like a bell. It’s probably the most well known distribution because it’s also the most common. It can be completely described with just two pieces of information, the mean and the standard deviation (or variance, which is just the square of the standard deviation). You can think of the variance as a measurement of how varied your data are: if they are very close to a single value the variance is low, and if the values are all over the place the variance is high. Variance is a measure of uncertainty, so what makes the mean so special? While we’ve gone over a lot of complex stuff in this series, I think we should also take a step back and go into what makes the basics so important.
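If you want to see the "low variance versus high variance" idea with actual numbers, here is a minimal sketch using Python's built-in statistics module. The two little datasets are made up purely for illustration; both have a mean of 10, but one is bunched up and the other is all over the place.

```python
from statistics import pstdev, pvariance

# Two made-up datasets with the same mean (10) but very different spread.
tight = [9.9, 10.0, 10.1]
spread_out = [2.0, 10.0, 18.0]

# Population variance: the average squared distance from the mean.
tight_variance = pvariance(tight)
spread_out_variance = pvariance(spread_out)

# The standard deviation is the square root of the variance,
# so squaring it should give the variance back.
tight_sd_squared = pstdev(tight) ** 2

print("variance (tight):     ", tight_variance)
print("variance (spread out):", spread_out_variance)
print("sd squared (tight):   ", tight_sd_squared)
```

The tight dataset gives a variance close to zero and the spread out one gives a much larger number, even though the mean is identical.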
The mean is probably one of the easiest things to calculate. If you sum all the numbers and divide by the number of values you have, you’ve found the mean. For example, if I wanted to find the mean of 1, 2, 3, 4, 5, 6, the mean would be (1+2+3+4+5+6)/6 = 21/6 = 3.5, which is why we’re ending the post here. Congratulations, you got it all.
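For anyone who wants to double check that arithmetic, here is a quick sketch in Python. It computes the mean by hand (sum divided by count) and again with the standard library's statistics.mean, just to show they agree.

```python
from statistics import mean

values = [1, 2, 3, 4, 5, 6]

# By hand: add everything up and divide by how many values there are.
manual_mean = sum(values) / len(values)  # 21 / 6

# Same thing using the standard library.
library_mean = mean(values)

print(manual_mean, library_mean)  # 3.5 3.5
```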
Okay, okay, there’s a little bit more to it than that. The mean is our “best guess” of the actual value we’re after… when dealing with a normal distribution or when we approximate the normal distribution (like when we use the central limit theorem or the t-test). We haven’t covered it yet, but there’s something called maximum likelihood estimation and it’s used a lot in state space modeling (a course I took last year and absolutely destroyed the curve. Yes I’m bragging a bit, it was a good class). When we do maximum likelihood estimation for a normal distribution, simply put, we’re looking for the mean. Let’s look at the normal curve to see why this is.
My favorite little drawing of the normal distribution, think of the blue line as a probability line, so as the line gets close to zero (the tails) we have a lower probability. See where it peaks? Yep, it’s the mean, so the maximum likelihood, which is the number you’re most likely to get, is the mean! Again, for the normal, but it pops up soooo often it’s hard not to use the normal for just about every case you come across. Don’t worry, we’ll actually get into maximum likelihood because I was thoroughly confused about it when I looked at the formula until I realized it was just a way to find the mean.
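To make the "maximum likelihood is just the mean" claim a bit more concrete, here is a rough sketch, not a proper estimation routine. It simulates some normally distributed data, tries a grid of candidate means, and keeps the one with the highest log-likelihood. The true mean of 10, the standard deviation of 2 (treated as known), the sample size, and the grid are all assumptions I picked for the illustration.

```python
import math
import random
from statistics import mean


def normal_log_likelihood(data, mu, sigma):
    """Log-likelihood of the data under a Normal(mu, sigma) model."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return (-n * math.log(sigma * math.sqrt(2 * math.pi))
            - squared_error / (2 * sigma ** 2))


rng = random.Random(0)  # fixed seed so the run is repeatable
data = [rng.gauss(10.0, 2.0) for _ in range(200)]  # simulated measurements

# Candidate values for the mean: 5.00, 5.01, ..., 15.00
candidates = [5 + i * 0.01 for i in range(1001)]

# Keep the candidate that makes the observed data most likely.
best_mu = max(candidates, key=lambda mu: normal_log_likelihood(data, mu, 2.0))
sample_mean = mean(data)

print("grid-search maximum likelihood mean:", best_mu)
print("plain old sample mean:              ", sample_mean)
```

The two numbers should agree to within the grid spacing, because for a normal model the log-likelihood is largest where the squared distances to the data are smallest, and that point is the sample mean.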
So why is the mean our best guess (again… not to beat a dead horse, buuuut for the NORMAL DISTRIBUTION ONLY)? Well imagine we’re transported to a universe that is deterministic. Because we’re now in this universe we can readily predict the future, seriously. There’s no randomness that we have to fight with, so we can take a temperature and KNOW exactly what the temperature is to the precision of the thermometer. We would know (not estimate) that a coin would flip to heads or tails BEFORE it was flipped, in fact, we could do the entire experiment without even touching the coin because we would know what was going to come out of it. In short, statistics would just be basic math.
However, we live in a probabilistic universe! Even at the tiniest units of measurements we know that atoms are probabilistic. So there’s always some randomness associated with everything we do, which is why we do experiments and crunch numbers only to get estimates and not exact values! It’s so sad. What we have here is some error associated with our measurements, no matter how we take that measurement, or how many times we take a measurement, there will always be some error associated with it. The mean is just the closest thing we can get (only normal…) to getting the ACTUAL value.
One more example: if I had an alien technology that could heat a box and knew I would be at 70.000000000 degrees perfectly (yes, that’s a lot of decimal places, it was a very exact alien technology), and we took 20 measurements of the temperature, we would find that our mean would be close to 70 (probably not exactly, mind you), but there would be variance in our measurements. Some would be 69.1, some would be 70.5, and we may even get further away from the actual value and find 65 or 75. In short, the mean is the best we can do to measure the true value, but it isn’t perfect: it’s susceptible to outliers.
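Here is a small simulation of that alien box, as a sketch under assumptions: I'm pretending the thermometer error is normally distributed with a standard deviation of 1.5 degrees, which is a number I made up. The seed is fixed only so the run is repeatable.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed so the run is repeatable

true_temperature = 70.0     # what the alien box is really at
measurement_noise_sd = 1.5  # assumed thermometer error, made up for this sketch

# Take 20 noisy readings of the same true temperature.
readings = [rng.gauss(true_temperature, measurement_noise_sd) for _ in range(20)]

sample_mean = mean(readings)
sample_sd = stdev(readings)

print("lowest reading: ", round(min(readings), 2))
print("highest reading:", round(max(readings), 2))
print("mean of the 20: ", round(sample_mean, 2))
print("sample sd:      ", round(sample_sd, 2))
```

Individual readings wander away from 70, but the mean of all 20 should land much closer to the true value than a typical single reading does.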
Say all my data centered around 70 but I had one value that was 700. That would skew my mean substantially, and suddenly it’s not a good estimate for the true value. The 700 measurement would be called an outlier (a common rule of thumb is that a value more than two, or more conservatively three, standard deviations from the mean is an outlier). However, in most cases we can account for that by dropping the fairly obvious erroneous measurement.
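To see how much damage one bad reading can do, here is a sketch with made-up temperature readings near 70 plus a single 700. It then applies the two standard deviation rule of thumb to flag and drop the outlier. The readings and the cutoff of two standard deviations are illustrative choices, not a recommendation for every dataset.

```python
from statistics import mean, stdev

# Made-up readings that all sit close to 70.
readings = [69.1, 70.5, 69.8, 70.2, 70.0, 69.6, 70.4, 69.9, 70.3]
with_outlier = readings + [700.0]  # one obviously bad measurement

mean_clean = mean(readings)
mean_with_outlier = mean(with_outlier)

# Rule of thumb: flag anything more than 2 standard deviations from the mean.
center = mean(with_outlier)
spread = stdev(with_outlier)
flagged = [x for x in with_outlier if abs(x - center) > 2 * spread]
kept = [x for x in with_outlier if abs(x - center) <= 2 * spread]

print("mean without the 700:  ", round(mean_clean, 2))
print("mean with the 700:     ", round(mean_with_outlier, 2))
print("flagged as outliers:   ", flagged)
print("mean after dropping it:", round(mean(kept), 2))
```

One value out of ten is enough to drag the mean from about 70 up past 130, and dropping the flagged value brings it right back.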
I keep driving the point home that the thing we’re measuring needs to be normally distributed for the mean to be the most likely estimate, and that’s true, which in a roundabout way brings us to the meat of the post. We’re here to answer the “why!” Why are there limitations to the tests we do, why do those tests work, basically why is statistics the way it is? While we can’t answer that last question (I mean that’s like asking the meaning of life), this whole post, while simple, has answered some very important whys.
You may have noticed two things in all the stuff we’ve covered. Almost all the tests we do rely on the mean, and they all mysteriously require us to have a normally distributed population for the test we’re applying to be valid. Why? Well, because of what we just covered: the mean, when the population is normally distributed, is the closest we can get to the “actual” value of the thing we’re interested in. We don’t need to measure every car to determine the average weight of a car, we can just measure a sample of the population and estimate the true (population) mean from that sample data.
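The car example can be sketched in a few lines. Everything here is hypothetical: I'm inventing a "population" of 100,000 car weights that are normally distributed around 1500 kg with a standard deviation of 300 kg, then weighing only 200 of them to see how close the sample mean gets to the population mean.

```python
import random
from statistics import fmean

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 car weights in kg (numbers are made up).
population = [rng.gauss(1500.0, 300.0) for _ in range(100_000)]
population_mean = fmean(population)

# We only "weigh" 200 cars chosen at random, not every car.
sample = rng.sample(population, 200)
sample_mean = fmean(sample)

print("population mean (normally unknown):", round(population_mean, 1))
print("sample mean (our estimate):        ", round(sample_mean, 1))
print("difference:                        ", round(sample_mean - population_mean, 1))
```

In real life we never get to see the population mean, which is the whole point: the sample mean is what we use to estimate it.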
That’s the heart of statistics: we’re relying on the fact that the mean is our best guess in a normal distribution, and that can be used to tell us if one group is different from another, if a coin is fair, even if a mind control device is working. It all comes down to using the fact that the universe is probabilistic in our favor. That’s really the big why, and we answered it by something as simple as talking about the mean. To me that’s probably the oddest thing about statistics: we never really learn just how important something as simple as a mean can be, even though pretty much all parametric statistics rely on it!
It’s also why we didn’t just start with the mean, it helps to see some of the tests we use in statistics before we highlight why they all seem to have the same limitations. I could’ve explained this up front, but I don’t think it would have connected the same way if we went that route. While not everything we do in statistics requires a normal distribution, everything we’ve done so far (and it’s a lot, make no mistake) requires a normal distribution and now we (finally) know why!