Sampling techniques in statistical experiments
With data and analytics proving to be a vital asset across business verticals today, it is becoming increasingly necessary for personnel to strengthen their analytical know-how. As a statistician, I get very excited to see the newfound interest businesses are taking in data analysis. When conducted properly, statistical analyses can provide a world of insight and greatly improve upon the status quo. However, statistics is also a subject filled with delicate nuances that, without ample training and care, can easily lead to misguided conclusions.
That being said, I have decided to devote my next few blog entries to the importance of designing statistical experiments in a thoughtful manner, while also highlighting some of the common traps and mistakes that should be avoided when taking on an analytic engagement.
Improper random sampling
We’ll start by discussing the importance of proper sampling techniques when designing statistical experiments. One of the fundamental reasons statistics exists as an area of research is that it offers a way to draw insights from a small subset of data and apply them to a much broader context. For example, if I wanted to know the true average height of all males between 18 and 55 years of age, I would literally have to find every 18- to 55-year-old male on the planet, measure his height and calculate the average. Obviously, this is an unrealistic scenario. However, statistics provides a framework in which we can instead take a random sample from the population and come up with an estimate that is likely to be very close to the true average height of all males ages 18 to 55, whatever that value may be. How close it is depends largely on the size of the sample and on how much heights vary in the population.
Issues often arise, though, when people or companies collect and use samples that are not representative of the greater population of interest. For example, if the sample group were made up of all players and coaches in the NBA between the ages of 18 and 55, the average height computed from that sample would be considerably higher than the true average of the general population. This is an extreme case of sampling gone awry, and one that I imagine (hope) very few people would ever make.
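To make the contrast concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the population is synthetic, the height parameters (mean 175 cm, standard deviation 7 cm) are made up, and the "tallest 500" group is only a rough stand-in for the NBA example above.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 heights in cm with made-up parameters.
population = [random.gauss(175, 7) for _ in range(100_000)]
true_mean = statistics.mean(population)

# Simple random sample: every member has the same chance of being picked.
random_sample = random.sample(population, 500)
random_estimate = statistics.mean(random_sample)

# Biased sample: only the 500 tallest people (a stand-in for the NBA case).
biased_sample = sorted(population)[-500:]
biased_estimate = statistics.mean(biased_sample)

print(f"true mean:              {true_mean:.1f} cm")
print(f"random-sample estimate: {random_estimate:.1f} cm")
print(f"biased-sample estimate: {biased_estimate:.1f} cm")
```

Both samples contain the same number of people. Only the way they were selected differs, and that alone is enough to pull the second estimate far away from the true average.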
However, there are certainly more subtle ways that bias can work its way into a sampling methodology. The Chicago Tribune’s prematurely printed headline ‘Dewey Defeats Truman’ for its article on the 1948 election serves as a famous example of botched sampling. Truman actually ended up winning, and the explanation usually given for the Tribune’s inaccurate statistics is that the sampling method was limited to contacting people by phone.
In those days, people who had telephones in their homes were not a representative subset of the entire American population, since those individuals tended to be wealthier and to have stable addresses. Moreover, the sample the Tribune based its headline on was over two weeks old (pretty stale in terms of election data).
Although maybe not as extreme, this situation is quite similar to the one many marketers find themselves in when not enough thought is given beforehand to how sample data will be collected. The main takeaway is that for conclusions drawn from sample data to be accurate, the sample needs to be representative of the population under consideration. This is very important for us, as marketers, to keep in mind, seeing as we are constantly interested in learning more about our customers: who they are, what they are willing to pay for a product, what they like or dislike about a product or service, etc. But before we send out any survey or test a new ad on a sample group, we need to make sure that those who have been selected for their opinion or response reflect what the overall customer base really looks like. If this type of thinking and planning is not done beforehand, results are more likely to be biased and are oftentimes rendered useless.
So before any data collection begins, make sure to consider the effects that variables like time (seasonal, weekly or intraday variability), place (regional, rural/urban, home/work variability) or audience can have when taking a sample. Without a truly representative sample, the results derived from test data cannot be reliably generalized to the rest of a population or customer base.
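One simple sanity check along these lines is to compare the makeup of a sample with the makeup of the customer base before trusting any results. The sketch below is only an illustration: the helper names (`shares`, `composition_gap`), the urban/suburban/rural segments and all of the counts are hypothetical, and a real check would cover whichever time, place and audience variables matter for the question at hand.

```python
from collections import Counter


def shares(labels):
    """Return the fraction of items that fall under each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def composition_gap(population_labels, sample_labels):
    """Sample share minus population share, per segment."""
    pop = shares(population_labels)
    samp = shares(sample_labels)
    segments = sorted(set(pop) | set(samp))
    return {s: samp.get(s, 0.0) - pop.get(s, 0.0) for s in segments}


# Hypothetical customer base and survey sample, labeled by where customers live.
customer_base = ["urban"] * 600 + ["suburban"] * 300 + ["rural"] * 100
survey_sample = ["urban"] * 80 + ["suburban"] * 18 + ["rural"] * 2

gaps = composition_gap(customer_base, survey_sample)
for segment, gap in gaps.items():
    # A positive gap means the segment is over-represented in the sample.
    print(f"{segment:9s} {gap:+.0%}")
```

Large gaps in either direction suggest the sample does not look like the customer base on that variable, and that results may not generalize without collecting a better sample or adjusting for the imbalance.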
Stay tuned for additional posts that highlight other common data analysis traps to avoid, along with tips to ensure your analyses and conclusions are accurate and actionable.