The Bonferroni correction in statistics

Well, we’re doing it: today we’re talking about the Bonferroni correction, which is just one of many different ways to correct your analysis when you’re doing multiple comparisons. There are a lot of reasons you may want to do multiple comparisons, and your privacy is our main concern, so we won’t ask why. Instead, we’re going to talk about how to adjust your alpha (your chance of making a type 1 error) so you don’t end up making a mistake.
I’m not big on statistics, not going to lie. What I mean is that it’s something I should know, but not something that comes easy to me. That’s what my “in statistics” series is all about. Math in general never came easy, but I realized it was mostly because I didn’t understand why things worked. I knew that they worked, but my stupid brain couldn’t figure out how they came about and thus couldn’t recognize when to use the damn things! Once I figured out the why, things got a lot easier for me. If you fall into that same group, then this is for you. I’m not here to just give you the how, but the why. Sometimes once you understand the why, you can figure out the how all on your own (but we dig into that too).
The Bonferroni correction (the full name is the Bonferroni correction for multiple comparisons) was something introduced to me early in my career, and I didn’t understand why it was used. I was told when to use it, but that didn’t help me understand why it even came about. It turns out the answer is pretty straightforward once you dig into it a little bit. It has to do with the chances you’ll make a type 1 error.
As we previously mentioned, type 1 errors are false positives. If we’ve made a type 1 error, we found significance in our observed value when there actually wasn’t any. We discussed one possible way this could happen: if we loosened our threshold for what we considered significant, we could essentially find “significance” in any value we got! However, just like any error, there is more than one way to make it!
When we are computing statistics, we sometimes make what are called multiple comparisons. I’ll give an example first. Say we perform an experiment testing memory, so we give our subjects five different memory tests. We are now making multiple comparisons: we are testing one thing (memory) by five different methods. When we collect this data, we have five different ways for something to come out significant. That raises our chances of ending up with a false positive, because we’ve increased the odds that one of our tests will be an outlier result (IE subjects did better on one test by chance).
The simple explanation, without an example, is that the more comparisons you do, the more likely you are to find SOMETHING significant. Statistically speaking, the more times you “look”, or rather the more comparisons you do, the more chances you have of finding something significant just by chance. It’s like picking answers on a test blindly: there is a chance (albeit a small one) that you could blindly select all the correct answers. Take the test once and your odds are low, but take it 10,000 times and suddenly you have a good chance that at least one of those attempts scores higher than someone who actually knew the material. A quick simulation makes this concrete (see the sketch below).
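If you like seeing this rather than taking my word for it, here’s a minimal sketch in Python (the variable names like n_tests and n_experiments are mine, nothing standard). It runs several t-tests comparing two groups drawn from the same distribution, so any “significant” result is by definition a false positive, and counts how often at least one test crosses α = 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_tests = 5             # comparisons per experiment (our five memory tests)
n_experiments = 10_000  # how many times we repeat the whole experiment

false_positive_any = 0
for _ in range(n_experiments):
    # Both groups come from the SAME distribution, so the null is true
    # and every significant p-value is a type 1 error.
    p_values = [
        stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(n_tests)
    ]
    if min(p_values) < alpha:
        false_positive_any += 1

print(f"At least one false positive in {false_positive_any / n_experiments:.1%} of experiments")
# With n_tests = 5 this lands near the ~22.6% the formula below predicts.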
This led to the creation of the Bonferroni correction method. It’s a simple way to control for multiple comparisons, but it does have its limitations. As with almost all the things we’ve covered, this method is named after the person who came up with it, Carlo Emilio Bonferroni. Of the methods I know, it is the strictest, so we don’t actually use it all that often. However, if you do use it and find something significant, then you can rest assured you didn’t under-correct, at least.
In normal hypothesis testing we will set our significance level at ~5%, which is to say we accept a 5% chance that a result we call significant is actually significant by chance. This gives us α = 0.05 (where our computed p-value needs to be less than 0.05 to count as significant), but when we do multiple comparisons, we have to adjust this, and that’s where Bonferroni’s method comes in. Using our memory testing example, we have 5 tests all checking memory. The probability of making at least one type 1 error across all of them is actually
α_actual = 1-(1-α)^n
Plugging in our values (α = 0.05 and n = 5), we find:
α_actual = 1-(1-0.05)^5 = 0.2262190625
Which says that with our alpha (α) set at 0.05 and 5 comparisons, our actual chance of finding a significant result is α_actual = 0.226, or a 22.6% chance of making a type 1 error. And that’s just for five comparisons; if we instead had ten tests, this number would jump to α_actual = 0.401, a 40.1% chance of making a type 1 error. At twenty tests, we would have α_actual = 0.641, a 64.1% chance of making a type 1 error. So you can see that as we increase the number of tests we do, unless we adjust our significance threshold, we are more and more likely to make a type 1 error. Fear not, fellow stats person, Bonferroni came to the rescue with the Bonferroni-corrected significance threshold. This is easy to find too, it is just:
α_corrected = α/n
where α is the significance threshold (in our example we used 0.05) and n is the number of tests being performed (again, in our example we had 5 tests). So if we plug in our values of α = 0.05 and n = 5, we see that in order to correct for our multiple comparisons we need to adjust our threshold to α = 0.01. On the extreme end, if we have 20 tests, our threshold would drop to α = 0.0025.
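If you want to check these numbers yourself, here’s a small sketch (plain Python, no libraries needed) that computes both the uncorrected chance of at least one type 1 error and the Bonferroni-corrected threshold for a few test counts:

```python
alpha = 0.05

for n in (5, 10, 20):
    # Chance of at least one type 1 error if we don't correct:
    # alpha_actual = 1 - (1 - alpha)^n
    alpha_actual = 1 - (1 - alpha) ** n
    # Bonferroni-corrected per-test threshold: alpha / n
    alpha_corrected = alpha / n
    print(f"n = {n:2d}: uncorrected risk = {alpha_actual:.4f}, "
          f"corrected threshold = {alpha_corrected:.4f}")

# n =  5: uncorrected risk = 0.2262, corrected threshold = 0.0100
# n = 10: uncorrected risk = 0.4013, corrected threshold = 0.0050
# n = 20: uncorrected risk = 0.6415, corrected threshold = 0.0025
```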
For some of you who are more experienced, you may already see the limitation: it is possible to have hundreds if not thousands of multiple comparisons, and that would make the corrected α incredibly low, almost-impossible-to-find-significance low. Thankfully there are other methods with their own strengths and weaknesses, like the Tukey HSD method or the Holm-Bonferroni method (a quick taste of that below). That’s a whole other topic for another time though!
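In practice you usually don’t apply the correction by hand. If you work in Python, the statsmodels library bundles Bonferroni, Holm, and others behind one function. A minimal sketch, assuming statsmodels is installed and using made-up p-values from our hypothetical five memory tests:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from our five memory tests
p_values = [0.008, 0.011, 0.035, 0.250, 0.600]

for method in ("bonferroni", "holm"):
    reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, "->", list(reject))

# Bonferroni only keeps p = 0.008 (it's below 0.05/5 = 0.01), while Holm,
# which is a bit less strict, also keeps p = 0.011.
```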
But enough about us, what about you?