Backtesting EmaiLO, our email rating system’s newest model

Thoughts on our latest EmaiLO rating model and the trade-offs we make

Last month, we introduced EmaiLO, our variation on the ELO rating system designed specifically for email marketing. Since then, we’ve continued backtesting the formula implementation and playing with various scenarios and use cases, some of which we’ll write about in the near future. We’ve also made some tweaks (one is relatively major, but I’ll write about that at a later date). Overall, we remain excited about the opportunities the email rating model presents. Rather than discuss the finer details of the model’s construction, I want to use this column to write about the backtested results. Since I’ve read Predictably Irrational and know that it’s better to rip off the bandage, let’s just jump into it (that means math!).

Backtesting EmaiLO

I’ll be the first to proclaim that I’m not a statistician – I’m a marketer. I think this is an important distinction, particularly for someone who is as deep in data as I am. Whether you think like a marketer or a statistician (or a rational human being, for that matter), there comes a time when you must make a decision about trade-offs. Is the time I spend on Project A worth more than what I’d spend on Project B? Is this model extremely predictive but too specific? If I spend more time working on my yard, when will I catch up on The Night Of? The world is full of trade-offs, and this is true of any model, analysis, or business decision. The distinction between me being a marketer and me being a statistician comes down to which trade-offs I’m willing to make.

In the case of EmaiLO, the trade-off is pretty obvious – we trade precision for simplicity. How much of one do we trade for the other? That’s where the rubber meets the road, and it’s a question a statistician and a marketer are likely to answer very differently. To me, both should make their decision based on one thing – information. At what point does additional precision limit actionable information? At what point does a desire for simplicity reduce the precision of information? As a marketer, I am willing to give up a little bit of statistical rigor if I can gain more information (or at least a similar amount).

Ultimately, EmaiLO is not a predictive model, though it has predictive ability. This is because, all things considered, we’re more interested in using the model to describe a program’s in-the-moment strength. We also want the model to be broad enough that it doesn’t need to be tweaked for each program it describes. And finally, we want a relatively limited input set to limit overfitting and make the task of keeping an ongoing database easy. All of this stems from a conscious decision to trade off model precision for simplicity while retaining the most informative elements of the model.

That being said – we still need accuracy in our model, and to measure it, we backtest. Backtesting, put simply, is running historical data through the model to see how close the predictions are to the actual outcomes. Generally, I use a measure called Root Mean Squared Error (RMSE), which aggregates the magnitudes of the prediction errors across observations into a single number (the lower the RMSE, the closer the predictions are to the actuals). The generalized formula is:
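
In its standard form, RMSE over n backtested observations can be written as follows. The symbols are my own labeling for this sketch: $\hat{y}_i$ is the model’s prediction for observation $i$ (an open rate, for example) and $y_i$ is the actual outcome.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}
$$

Because each error is squared before averaging, large misses count for more than small ones, and the result is in the same units as the KPI itself.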


Running our model predictions against the actuals gives us the following RMSE values:


Since the RMSE effectively works as a standard deviation of our error, roughly 68% of our open rate predictions should fall within 3.6% of the actual outcomes (assuming the errors are approximately normally distributed and centered on zero). To be honest, this isn’t a great model fit…but remember that EmaiLO intentionally doesn’t model autocorrelation (the tendency for historical outcomes to be highly correlated with future outcomes) directly, but rather uses past outcomes to help calibrate future outputs. Ignoring our early predictions gets us on a better track, reducing our average error across all KPIs by about 17%.
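
For anyone who wants to try this on their own program, here is a minimal sketch of the calculation in Python. The function names and the open rates are invented for illustration (they are not EmaiLO’s data or code); the second function simply counts how many predictions land within one RMSE of the actual, which is a quick way to sanity-check the 68% rule of thumb.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def share_within(predicted, actual, tolerance):
    """Fraction of predictions whose absolute error is at most `tolerance`."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(predicted)

# Hypothetical open rates (as fractions), made up for this example only.
predicted_open = [0.20, 0.22, 0.18, 0.25, 0.21]
actual_open = [0.22, 0.21, 0.14, 0.24, 0.26]

error = rmse(predicted_open, actual_open)
coverage = share_within(predicted_open, actual_open, error)
print(f"RMSE: {error:.4f}, share within one RMSE: {coverage:.0%}")
```

With only a handful of observations, the share within one RMSE can sit well away from 68%; the rule of thumb only becomes reasonable with a larger sample and roughly bell-shaped errors.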


The RMSEs above are calculated against our entire dataset – including some major outliers. Removing them gets us even more predictive power…not bad for a model intended to describe, not predict. Visualizing the relationship shows pretty strong agreement between EmaiLO’s underlying prediction and the observed values – a simple linear regression of actual click rate on the EmaiLO prediction explains over 68% of the variation in actual click rate.
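
The “variation explained” figure is the R² of that regression. As a rough sketch (the function name and the numbers are placeholders I made up, not the backtest data), it can be computed from an ordinary least-squares line like this:

```python
def r_squared(x, y):
    """R^2 of an ordinary least-squares line fitted as y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares: what the fitted line fails to explain.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares: variation of y around its own mean.
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical click rates (as fractions), made up for this example only.
predicted_click = [0.020, 0.035, 0.028, 0.041, 0.030]
actual_click = [0.024, 0.033, 0.025, 0.045, 0.034]

print(f"R^2: {r_squared(predicted_click, actual_click):.2f}")
```

An R² of 0.68 would mean the fitted line accounts for 68% of the variation in actual click rate, with the remaining 32% left in the residuals.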


But EmaiLO is more than just its underlying predictive elements – as a ratings system, it’s meant not only to adjust itself when performance varies from expectation, but also to do that better over time. And it’s quite effective at it! Visualizing blended EmaiLO ratings at different timeframes illustrates this nicely. Comparing each program’s EmaiLO rating at its second observation with its rating at the third shows a high degree of correlation (this means EmaiLO starts doing its job pretty quickly, as an appropriately calibrated model shouldn’t have to adjust too much).


But the model really shines once you start at the third observed instance or later (n of 3+).


Over 98% of the variation in EmaiLO rating at n+1 is explained by the rating at n. As I described in the introductory post, this is exactly what we want to see. The strong relationship from one observation to the next implies that our EmaiLO ratings are pretty accurate, and when they aren’t, they adjust pretty quickly. This is important to ensure that our email ratings – and therefore, our rankings – are accurate reflections of program performance and overall email marketing health.
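
To make the n versus n+1 comparison concrete, here is a small sketch of how the pairs could be built and scored. Everything in it is assumed for illustration: the program names, the rating histories, and the helper functions are hypothetical, and the R² is taken as the squared Pearson correlation, which equals the R² of a simple linear regression.

```python
import math

def consecutive_pairs(ratings, start=3):
    """Pairs of (rating at n, rating at n+1) for n >= start, counting from 1."""
    first = start - 1  # convert the 1-based observation number to a list index
    return [(ratings[i], ratings[i + 1]) for i in range(first, len(ratings) - 1)]

def pearson_r(pairs):
    """Pearson correlation between the first and second element of each pair."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical rating histories per program, made up for this example only.
histories = {
    "program_a": [1500, 1540, 1522, 1528, 1531],
    "program_b": [1500, 1450, 1471, 1466, 1469],
    "program_c": [1500, 1610, 1588, 1593, 1590],
}

pairs = [pair for ratings in histories.values() for pair in consecutive_pairs(ratings)]
r2 = pearson_r(pairs) ** 2
print(f"{len(pairs)} pairs, R^2 of rating at n+1 on rating at n: {r2:.3f}")
```

Skipping the first two observations mirrors the idea above: the early ratings are still settling, so the relationship is measured only once each program has some history behind it.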

Overall, the blend of predictive elements, descriptive elements, and appropriate calibration in a very basic model (the only inputs are open rate, click rate, session rate, and a few scaling parameters) is the reason I find EmaiLO (and other ELO-based systems) so interesting. I hope you think so too!

Questions, comments, concerns?  Shoot a tweet to @davisj2007 and get a conversation started.