Statistical hypothesis testing

In a couple of recent blog posts we highlighted some of the best practices to keep in mind when implementing an A/B test, as well as a common misuse of A/B testing that too many practitioners unfortunately employ when attempting to research a hypothesis.

In this post we will take a few steps back and discuss at a higher level the reasoning behind statistical hypothesis testing. Having a better understanding of the mechanics of what hypothesis testing is and how it translates directly to your A/B or multivariate testing endeavors will aid you in both the design of your study and the interpretation of your results.

Suppose you currently send out a weekly newsletter to your email subscribers every Friday at 9AM. Your manager is curious if open rates from this newsletter would increase significantly from your current level of 22% if you changed the delivery time to every Friday at 1PM. To address your manager’s question you randomly select 2000 subscribers from your list and email them this week’s newsletter at the adjusted time. After waiting the necessary amount of time for the data to accumulate, you see that the subscribers who received the newsletter at 1PM had an open rate of 24%.

Can you now conclude that emailing all newsletter subscribers at 1PM would significantly increase open rates from the historical rate of 22%? Because the result is based on a sample, there is a possibility that the observed open rate (24%) may have occurred just by the luck of the draw. If your entire subscriber population were actually sent the newsletter at 1PM, how likely is it that your open rate would still be 24% or better?

Hypothesis testing uses data from a sample (2000 subscribers) to judge whether or not a statement about a population (your entire newsletter subscriber list) may be true. Many of the questions that researchers ask can be expressed as questions about which of two statements is true for a population. In the example just given, we are essentially asking, ‘Does mailing my newsletter on Friday at 1PM significantly increase my open rate from its current level of 22%?’ This question can be answered with either a ‘no’ or a ‘yes’, and each possible answer makes a specific statement about the situation.

Hypothesis 1: Mailing at 1PM does not change my current open rate of 22%

Hypothesis 2: Mailing at 1PM increases my open rate above its current level of 22%

In the language of statistics, the two possible answers to the question we just encountered are called the null hypothesis (Hypothesis 1) and the alternative hypothesis (Hypothesis 2). The null hypothesis is a statement that nothing is happening, or that the status quo is intact. In most situations, the researcher hopes to disprove or reject the null hypothesis. The alternative hypothesis is a statement that something is happening. In most situations, this hypothesis is what the researcher hopes to support.

The logic of statistical hypothesis testing is similar to the ‘presumed innocent until proven guilty’ principle of the U.S. judicial system. In hypothesis testing, we assume that the null hypothesis is true until the sample data provide strong enough evidence against it. The ‘something is happening’ hypothesis (alternative hypothesis) is chosen only when the data show us that we can reject the ‘nothing is happening’ hypothesis (null hypothesis).

The hypothesis testing method is a somewhat indirect strategy for making a decision. We are not able to determine the chance that a hypothesis statement is either true or false. We can only assess whether or not the observed data are consistent with an assumption that the null hypothesis is true about the population, within the reasonable bounds of sampling variability. If the sample data collected were unlikely to materialize just by chance when the null hypothesis is true, we reject the statement made in the null hypothesis.
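To make ‘sampling variability’ concrete, here is a minimal simulation sketch in Python for the newsletter example. It assumes the null hypothesis is true (every subscriber opens the email independently with probability 22%) and repeatedly draws samples of 2000 subscribers. The function name, the number of simulated samples and the fixed seed are illustrative choices, not part of the original example.

```python
import random

def simulate_open_rates(n_subscribers, true_rate, n_samples, seed=0):
    """Simulate the open rate of many random samples, assuming the null is true."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rates = []
    for _ in range(n_samples):
        # Each subscriber opens the email independently with probability true_rate.
        opens = sum(rng.random() < true_rate for _ in range(n_subscribers))
        rates.append(opens / n_subscribers)
    return rates

# Assumed setup: 2000 subscribers per sample, 22% true open rate, 2000 simulated samples.
rates = simulate_open_rates(n_subscribers=2000, true_rate=0.22, n_samples=2000)

# How often does chance alone produce an open rate of 24% or more?
share_at_least_24 = sum(r >= 0.24 for r in rates) / len(rates)
print(f"Share of simulated samples with an open rate of 24% or more: {share_at_least_24:.3f}")
```

If only a small share of the simulated samples reaches 24% or more, an observed 24% is hard to explain by the luck of the draw alone, which is exactly the kind of evidence that leads us to reject the null hypothesis.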

When we do a hypothesis test, the objective is to decide if the null hypothesis should be rejected in favor of the alternative hypothesis. The process is as follows:

  • Define your null and alternative hypotheses.
  • Compute the data summary that is used to evaluate the two hypotheses, called the test statistic.
  • Compute the probability that we would have observed a test statistic as extreme as the one we did, or something more extreme, if the null hypothesis were true. This probability is called the p-value.
  • Reject the null hypothesis in favor of the alternative hypothesis if the p-value is smaller than a designated level of significance, denoted by the Greek letter alpha (α), and usually set by researchers at .05, less commonly at .10 or .01. If the p-value < alpha then we have achieved statistical significance (more on this next time).
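The steps above can be sketched for the newsletter example as follows. This is one possible approach, not the only one: it assumes a one-sided z-test for a single proportion, which relies on the normal approximation to the binomial distribution (reasonable for a sample of 2000). The function name and the choice of α = .05 are illustrative assumptions.

```python
import math

def one_sided_proportion_z_test(sample_rate, null_rate, n):
    """One-sided z-test for a single proportion (alternative: true rate > null_rate)."""
    # Standard error of the sample proportion, computed under the null hypothesis.
    standard_error = math.sqrt(null_rate * (1 - null_rate) / n)
    # Test statistic: how many standard errors the sample rate lies above the null rate.
    z = (sample_rate - null_rate) / standard_error
    # p-value: upper-tail probability of a standard normal variable beyond z.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

alpha = 0.05  # assumed level of significance

# Step 1: null hypothesis is an open rate of 22%; alternative is an open rate above 22%.
# Steps 2 and 3: compute the test statistic and the p-value from the sample of 2000.
z, p_value = one_sided_proportion_z_test(sample_rate=0.24, null_rate=0.22, n=2000)

# Step 4: compare the p-value with alpha.
reject_null = p_value < alpha
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject null: {reject_null}")
```

Under these assumptions the test statistic is about 2.16 and the p-value is about .015, which is below α = .05, so the null hypothesis would be rejected for this sample.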

All of this may sound rather academic and unnecessary, but it is always important to formalize what it is that you are testing. If your hypotheses are unclear or undefined then you are essentially saying you don’t know the question that you are seeking an answer to and any kind of data or results that flow from your study become void of meaning.