Statistical significance in a testing world

Statistically significant is a phrase that many people throw around to add a little gravitas to their arguments, but I often wonder how many of these people actually understand what this phrase means. Some of the more frequent explanations I hear are that it means the p-value is small, or it signifies an important outcome to take note of, or it is an outcome that is most likely not occurring by chance. There is some truth in each of these responses, but having a fundamental understanding of the mechanics that lie beneath this phrase is essential for anyone responsible for conducting A/B tests, multivariate tests or pretty much any situation where analytics are being leveraged to defend an idea or hypothesis.

Last month, my blog post focused on the concept of hypothesis testing in statistics: namely, what is meant by the null and alternative hypotheses, and why it is important to define these before any data have been collected. This month I will focus on how to interpret the results of your study after your hypotheses have been defined, the data collected and the results analyzed.

Let’s consider the following situation:

You have a friend who claims he has the ability to correctly predict the outcome of a coin toss more than 50% of the time. As a rational person you, of course, are skeptical, but you also realize that the only way to settle this assertion is to put your friend to the test and begin flipping coins. Wisely, however, you remember reading my last blog post on Marketing Forward about the importance of defining your hypotheses upfront before any data has been collected. So, you write the following on a piece of paper:

Null Hypothesis: My friend can only correctly predict the outcome of a coin flip 50% of the time

Alternative Hypothesis: My friend can correctly predict the outcome of a coin flip more than 50% of the time

Your friend agrees with your hypotheses, and then the two of you begin a discussion around choosing an alpha level. The alpha level (α) is very important when it comes to statistical significance because it is the dividing line between what will be deemed statistically significant and what will not. In a lot of ways this can make the phrase statistically significant seem rather arbitrary, which is why it is important to always choose your alpha level before any statistics are calculated. A typical alpha level is 5% or 0.05. What this means is that if a result at least as extreme as yours would happen by chance less than 5% of the time when the null hypothesis is true, then the result is defined as being statistically significant. Similarly, if an alpha level of 1% or 0.01 is chosen (a stricter test), then statistical significance can only be claimed if such a result would happen by chance less than 1% of the time. But because your friend is so confident in his supernatural ability to correctly predict the outcome of a coin flip, he is willing to meet that higher burden of proof and allows you to set the alpha level at 1%.

A visual way to understand the process described above would be as follows:

[Figure: significance level]

We all know there is natural variability involved with flipping a coin, but done repeatedly, we expect to arrive at a proportion of heads or tails that is very close to 50%. However, if your friend can truly predict the outcome of a coin flip at a rate significantly better than 50%, we are willing to reject the null hypothesis and conclude he does have some sort of supernatural gift or ability. But, we are only willing to give him this statistically significant designation if he lands in the top 1% (because we chose an α-level of 0.01) of the distribution shown above.

The exact value of this top 1% (rejection region) is dependent upon the number of coin flips your friend must try to predict. Let’s assume the two of you agree on 100 coin flips. In this case, the above graph can be updated with the following numbers.

[Figure: significance level, 100 flips]

This graph adds two new points of reference. The first is at the center of the distribution where P = 50% (‘P’ here is notation for proportion). This corresponds to what is assumed under the null hypothesis. Remember that in the null hypothesis we are stating that our friend has no predictive ability, and so his chances of predicting correctly should be centered at 50%. The other reference point is the 99th percentile of the distribution, which is the dividing line between a statistically significant result and a result that is not statistically significant. Under the normal approximation for 100 flips, this cutoff sits at roughly P = 61.6% (a value of 62.9% would be the 99.5th percentile, which is the cutoff for a two-sided test at the same alpha level).
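If you would like to check the dividing line yourself, here is a minimal sketch in Python. It assumes the normal approximation to the binomial distribution (a standard error of 5 percentage points for 100 flips at P = 50%); the function name cutoff and the constants are my own, chosen for illustration. It computes the one-sided 99th percentile, along with the 99.5th percentile that a two-sided 1% test would use, for comparison.

```python
from statistics import NormalDist

N_FLIPS = 100   # number of coin flips agreed upon
P_NULL = 0.5    # success rate assumed under the null hypothesis


def cutoff(percentile: float, n: int = N_FLIPS, p0: float = P_NULL) -> float:
    """Success rate at the given percentile of the null distribution.

    Uses the normal approximation: mean p0, standard error sqrt(p0*(1-p0)/n).
    """
    standard_error = (p0 * (1 - p0) / n) ** 0.5
    return NormalDist(mu=p0, sigma=standard_error).inv_cdf(percentile)


one_sided = cutoff(0.99)    # top 1% of the distribution (one-sided test)
two_sided = cutoff(0.995)   # top 0.5%, the upper cutoff of a two-sided 1% test

print(f"One-sided cutoff at alpha = 0.01: {one_sided:.1%}")
print(f"Two-sided upper cutoff at alpha = 0.01: {two_sided:.1%}")
```

Because our alternative hypothesis only looks in one direction (better than 50%), the one-sided cutoff is the one that applies to your friend.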

Well, after 100 flips of the coin your friend records an impressive (but not statistically significant) result of a 60% success rate. In this situation a 60% success rate corresponds to a p-value of .0228 or 2.28% (using the normal approximation to the binomial distribution). The p-value is always defined as the probability of getting a result as extreme or more extreme than what is observed, assuming the null hypothesis is true. In other words, if your friend’s true ability to correctly predict the outcome of a coin toss is only 50%, there is still a 2.28% chance that he would post a success rate of 60% or better purely by chance. And since this percentage is higher than our alpha level of 1%, he does not fall into the rejection region and we fail to reject the null hypothesis. Put simply, we do not see enough statistical evidence to believe his claim of supernatural coin flip prediction ability.
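For readers who want to reproduce the p-value, the sketch below computes it two ways. The first uses the normal approximation, which gives the 2.28% quoted above. The second sums the exact binomial probabilities of 60 or more correct calls out of 100, which comes out a little higher (a bit under 3%). The function names are hypothetical, picked for this example only, and the conclusion is the same either way.

```python
from math import comb
from statistics import NormalDist


def p_value_normal(successes: int, n: int, p0: float = 0.5) -> float:
    """One-sided p-value from the normal approximation to the binomial."""
    standard_error = (p0 * (1 - p0) / n) ** 0.5
    z = (successes / n - p0) / standard_error
    return 1 - NormalDist().cdf(z)


def p_value_exact(successes: int, n: int, p0: float = 0.5) -> float:
    """One-sided p-value: exact probability of `successes` or more out of n."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


approx = p_value_normal(60, 100)
exact = p_value_exact(60, 100)

print(f"Normal approximation: {approx:.4f}")
print(f"Exact binomial:       {exact:.4f}")
```

Both values land between 1% and 5%, so neither changes the outcome of the test at either alpha level discussed in this post.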

An interesting note to mention, however, is that if the α-level had been set at 0.05 or 5%, the p-value of 2.28% would be less than the threshold required for statistical significance (5%) and we would have rejected the null hypothesis and concluded that our friend does possess special predictive powers. This highlights the need to always specify an alpha level before proceeding with any statistical calculations, as practitioners may be tempted to adjust their original alpha levels after results have been calculated in order to reach a statistically significant result.
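The decision rule itself is a single comparison, which is exactly why the alpha level has to be fixed in advance. Here is a small sketch of that rule applied to the same p-value at both alpha levels; the function name decide is my own and purely illustrative.

```python
def decide(p_value: float, alpha: float) -> str:
    """Apply the significance rule: reject the null only if p_value < alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


P_VALUE = 0.0228  # the p-value from the coin flip example

# Same data, same p-value, two different alpha levels
for alpha in (0.01, 0.05):
    print(f"alpha = {alpha:.2f}: {decide(P_VALUE, alpha)}")
```

Nothing about the data changes between the two lines of output. Only the threshold moves, and the conclusion flips with it.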

To summarize, statistically significant is a phrase that has a specific meaning and interpretation. Anyone without a firm understanding of what is meant by the terms alpha level, p-value and null and alternative hypotheses should use this phrase a little less liberally until they have built up that background knowledge. Gaining this understanding will aid you when interpreting and defending the results of any statistical tests you have performed.