Marketers rely on data and analytics for the insights they need to stay relevant in today’s digital world. That makes it important for marketers to master analytical skills and to define processes that scale as data keeps arriving over time. In my last blog post I focused on the importance of using correct sampling techniques when designing a study and gave some examples of what can go wrong when forethought is not applied. In this post I’ll focus on the problem of overfitting in predictive modeling, and on how many people don’t realize the importance and function of always having at least one hold-out data set when building predictive models.
The main goal in predictive modeling is to be just that, predictive: to accurately predict the future based on historical data. The quality of predictions becomes severely hampered, though, when practitioners fail to incorporate hold-out data sets when building their models. A hold-out data set (also known as a validation data set) is data that is not seen or used when fitting a predictive model. It is purposefully left out of the model building phase in order to quantify how well the resulting model will generalize to new data. When a model is built and judged using all available data, with no hold-out data set to check it against, it is prone to “overfitting”: the model picks up on every little nuance, including random noise, in the only data it sees. The parameter estimates then describe just the data being analyzed, and the model performs poorly when applied to newly introduced data. Take a look at the graphic below.
The blue line represents the model performance for the training data (the data that is being used to build the model). The error rate of this data continues to decrease as the model becomes increasingly complex. However, every model will reach a point where added complexity is not desirable because it is only fitting to the intricacies of the data that it is seeing, while at the same time losing the ability to generalize results to data that is not yet seen. This is why any predictive model should be tested against at least one set of validation data.
As mentioned, validation data is not used to fit the model’s parameter values. It is held out and used only to evaluate how well the model generalizes to new data. The red line in the graphic represents the performance of the model on validation data. As you can see, there comes a point where added model complexity no longer helps, but instead hurts performance on the hold-out data. Overfitting begins after this point: the increasingly complex model picks up patterns that exist only in the data it sees, and those patterns are not representative of the behavior of the full population. Models built beyond this level of complexity will not generalize well to new data points and will produce poor predictions. The minimum of the validation error, not the minimum of the training error, should determine the stopping point for model complexity.
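The two curves described above can be reproduced with a small sketch. Everything in it is an assumption made for illustration: the data are synthetic (a noisy sine curve), the model is a simple piecewise-constant fit, and "complexity" is taken to mean the number of bins. It is not the model behind the graphic, only one way to see the training error keep falling while the stopping point is chosen from the validation error.

```python
import math
import random


def make_data(n, rng):
    """Synthetic data (an assumption): y = sin(2*pi*x) plus random noise."""
    data = []
    for _ in range(n):
        x = rng.random()  # x is in [0, 1)
        data.append((x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)))
    return data


def fit_bins(train, n_bins):
    """Piecewise-constant model: the mean training y within each bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in train:
        b = int(x * n_bins)
        sums[b] += y
        counts[b] += 1
    overall = sum(y for _, y in train) / len(train)
    # Bins that contain no training points fall back to the overall mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]


def mse(model, data):
    """Mean squared error of the binned model on a data set."""
    n_bins = len(model)
    return sum((y - model[int(x * n_bins)]) ** 2 for x, y in data) / len(data)


rng = random.Random(42)
train = make_data(200, rng)       # used to fit the model
validation = make_data(100, rng)  # held out, never used for fitting

# Doubling the bin count keeps the bins nested, so each step can only
# lower (or keep) the training error.
complexities = [1, 2, 4, 8, 16, 32, 64, 128]
train_errors, validation_errors = [], []
for n_bins in complexities:
    model = fit_bins(train, n_bins)
    train_errors.append(mse(model, train))
    validation_errors.append(mse(model, validation))

# Stop where the validation error is lowest, not the training error.
best = min(range(len(complexities)), key=lambda i: validation_errors[i])
best_complexity = complexities[best]

for n_bins, tr, va in zip(complexities, train_errors, validation_errors):
    print(f"bins={n_bins:4d}  train MSE={tr:.3f}  validation MSE={va:.3f}")
print("complexity chosen on validation data:", best_complexity)
```

In runs like this the training error typically keeps shrinking all the way to the most complex model, while the validation error bottoms out earlier. The exact numbers depend on the random seed and the noise level assumed here.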
The inclusion of a validation data set is crucial when building a predictive model in order for your model to be more accurate and stable over time. The two graphs below highlight this point. The graph on the left does a good job at capturing the general nature of the relationship between the X and Y variables, while the graph on the right is trying too hard to capture every subtle change in the relationship between the two variables. This means that when new data points are fed into the model, the model on the left will outperform the model on the right, as the model on the right will not generalize well to data it has not seen before — its functional form is too customized.
In a marketing context, the above situation could easily play itself out in any kind of response modeling. For example, let’s say an email marketer analyzes all of the email history for their subscribers with the goal of predicting whether an individual subscriber will open an email, based upon a variety of subscriber-level attributes. The marketer may be able to build a model that appears fairly accurate, correctly classifying openers vs. non-openers 70% of the time. The problem is that this 70% is the performance of the model on the same data that was used to build it, and the company is fooled into thinking that this success rate will carry over to new users not seen by the model. The next month, they get an entirely new list of email addresses and score it with the model they built previously in order to determine who they should send an email to. They are surprised to find that the model does not perform as well on this new data (data the model had not seen before), with only a 50% success rate in predicting openers vs. non-openers.
In retrospect, the company realizes they should have split their initial data set (all email history) into two parts, a training data set and a validation data set. The training data set should have contained around 65% of all records, while the validation data set should have contained the other 35%. The model could have then been built on the training data while simultaneously factoring in the model’s performance on the validation (hold out) data. In this case, the optimal model is not chosen based on where the training data set’s error is minimized, but instead on where the validation data set’s error is minimized. The result is a model that accurately predicts openers vs. non-openers 65% of the time. Yes, this is a lower classification rate, but it is a much more stable and accurate picture of the company’s true ability to determine openers vs. non-openers. Therefore the next month, when the company receives an entirely new list of emails, they’ll find that their ability to predict openers vs. non-openers will be fairly close to this 65% rate.
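The workflow the company should have followed can be sketched roughly as below. The subscriber records, the `engaged` attribute, the open probabilities, and both candidate models are hypothetical and invented for this illustration; only the 65/35 split and the idea of choosing the model on validation performance come from the example above.

```python
import random


def split_train_validation(records, train_fraction=0.65, seed=0):
    """Shuffle the records and split them into (training, validation)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, records):
    """Share of records whose predicted label matches the observed one."""
    hits = sum(predict(features) == opened for features, opened in records)
    return hits / len(records)


def fit_memorizer(train):
    """Overly complex model: memorizes every training record exactly."""
    seen = dict(train)
    majority = sum(opened for _, opened in train) * 2 >= len(train)
    # Subscribers it has never seen get the majority training outcome.
    return lambda features: seen.get(features, majority)


def fit_simple_rule(train):
    """Simple model: majority outcome for each value of the engaged flag."""
    rule = {}
    for flag in (True, False):
        outcomes = [opened for (_, engaged), opened in train if engaged == flag]
        rule[flag] = sum(outcomes) * 2 >= len(outcomes)
    return lambda features: rule[features[1]]


# Hypothetical email history: ((subscriber_id, engaged), opened).
rng = random.Random(7)
records = []
for subscriber_id in range(1000):
    engaged = rng.random() < 0.5
    opened = rng.random() < (0.7 if engaged else 0.3)
    records.append(((subscriber_id, engaged), opened))

train, validation = split_train_validation(records)
models = {
    "memorizer": fit_memorizer(train),
    "simple_rule": fit_simple_rule(train),
}

for name, predict in models.items():
    print(f"{name:12s} train={accuracy(predict, train):.2f} "
          f"validation={accuracy(predict, validation):.2f}")

# Choose the model on validation accuracy, not training accuracy.
chosen = max(models, key=lambda name: accuracy(models[name], validation))
print("model chosen on validation data:", chosen)
```

The memorizing model scores perfectly on the training records because it has simply stored them, which is exactly the kind of flattering number that fooled the company. Its validation accuracy is the more honest estimate of how it would do on next month's list.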
In conclusion, any direct response marketer interested in building models to predict customer behavior should always be wary of the effects of overfitting and remember to validate model performance against a hold-out set of data. Stay tuned for additional posts on more common data analysis traps to avoid, along with tips to ensure your analyses and conclusions are always accurate and actionable, or visit us online at http://www.experian.com/marketing-services/analytical-services.html for more information.
Learn more about the author, Adam Sugano