
Email Marketing Data – Can You Trust It?

13 October 2015, Tim Watson



Statistical Significance

The problem with split testing in email marketing is that having seemingly beyond-question hard facts in the form of metrics gives a false sense of security about the robustness of the numbers and the decisions made because of them.

But in reality the wrong conclusions may be drawn, with the potential for making decisions that decrease performance.

Let me explain with a quick tale of woe: a real-world A/B subject line split test published by a brand on their blog.

Subject line A: “Hey [recipient’s first name], Here’s the Social Media Status you Need to Know for 2014” (open rate 35%, click rate 3%)
Subject line B: “Social Media Status you Need to Know for 2014” (open rate 37%, click rate 4%)

Subject line B gave a 5.7% increase in open rate and a big 33% improvement in click rate.

The conclusion appears obvious. Don’t use name personalisation, right?

That’s the conclusion the brand in question drew, and they stopped using name personalisation.

In reality the test proved nothing. The problem was the test sample size was insufficient, just 100 people in each test cell.

That meant the 35% to 37% open rate increase was just two more people opening, and the 3% to 4% click rate increase was just one extra person clicking.

That’s like tossing a coin four times, seeing heads three times, and concluding the coin is biased towards heads. A much larger sample size is needed to give robust results.
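To put a number on “much larger”, here’s a minimal sketch (in Python, not from the original post) of a standard two-proportion power calculation; the 35% vs 37% open rates come from the example above, while the 95% confidence and 80% power thresholds are conventional assumptions.

```python
# Rough estimate of the sample size needed per test cell to reliably detect
# a 35% vs 37% open rate difference (illustrative assumptions: 95% confidence,
# 80% power). Uses only the Python standard library.
from math import sqrt
from statistics import NormalDist

def sample_size_per_cell(p1, p2, alpha=0.05, power=0.80):
    """Approximate recipients needed per cell for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2                           # average of the two rates
    return ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
             + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
            / (p1 - p2) ** 2)

print(round(sample_size_per_cell(0.35, 0.37)))  # roughly 9,000 people per cell, not 100
```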

Separating normal random variation in real-world experiments from a true difference needs a bit of statistics. Statistical significance is the key weapon: it tells us whether the difference seen between two samples is down to random variation or not.

This particular brand convinced themselves they shouldn’t personalise, though in many tests I’ve seen personalisation improve performance. Perhaps they were right not to, but the point is they’re now making future decisions based on potentially wrong facts.

Thankfully you don’t need to be a statistician to check your results: just plug the numbers into a specialist online calculator, such as this free A/B split test significance calculator, for a simple check of test results.
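For the curious, here’s a minimal sketch of the kind of check such a calculator performs, a two-sided two-proportion z-test in Python, fed with the 100-people-per-cell figures from the example above.

```python
# Two-sided two-proportion z-test: is the observed difference between two rates
# bigger than random variation would plausibly produce?
from math import sqrt
from statistics import NormalDist

def ab_test_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for the difference between two observed rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)       # pooled rate assuming no real difference
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error of the difference
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(ab_test_p_value(35, 100, 37, 100))  # ~0.77, nowhere near the 0.05 needed for 95% confidence
```

A p-value of roughly 0.77 means a difference of this size would show up most of the time even if the two subject lines performed identically.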

But before you go away happy, that’s not enough. Statistical significance doesn’t give you the whole picture either.

Statistical significance answers one particular question but there are several questions it doesn’t address.

The Impact of New

Say you change your ‘from’ name or creative design and run a split test of the new design against the old one.

You check your result for statistical significance and the new design wins. Hurrah.

But we’re dealing with people. Unlike tossing a coin, where the previous result has no impact on the next, people’s behaviour changes based on previous experience.

An audience may have become bored and numb to a particular design. Changing the design re-engages the audience, not because the new design is fundamentally better, but because it’s new and different to previous experience.

Over time, as the audience becomes used to the new design, the initial uplift may drop off; performance may fall back to the same as the original control, or worse.

To be certain the original difference seen is not just down to novelty, the test needs to be repeated after the audience has experienced the changed version a few times.

Re-testing after six campaigns mitigates this issue.

Accuracy of Uplift

Imagine a test with a sample size of 500 and a click rate increase from 4% to 7%.

That 75% uplift is statistically significant at the 95% level.

Statistical significance tells you that it’s 95% certain that there is a real difference between your control and treatment versions. But it doesn’t mean that there is a 95% chance that the uplift is 75%.

Repeat the test and you may see an uplift of 40%. This is important when making ROI decisions or revenue predictions based on test results.

The effect is particularly acute on smaller sample sizes, where huge uplifts can be seen that once rolled out to the full database become much smaller gains.
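To illustrate just how wide the plausible range can be, here’s a rough sketch (not from the original post) that puts a confidence interval around the uplift in the 500-per-cell example above; dividing the interval for the rate difference by the control rate is a simplifying assumption rather than an exact method.

```python
# Approximate confidence interval for the relative uplift, treating the control
# rate as fixed (a simplification). Example: 4% vs 7% click rates, 500 per cell.
from math import sqrt
from statistics import NormalDist

def uplift_range(p_control, p_treatment, n_per_cell, confidence=0.95):
    """Rough confidence interval for (treatment - control) / control."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se_diff = sqrt(p_control * (1 - p_control) / n_per_cell
                   + p_treatment * (1 - p_treatment) / n_per_cell)
    diff = p_treatment - p_control
    return (diff - z * se_diff) / p_control, (diff + z * se_diff) / p_control

low, high = uplift_range(0.04, 0.07, 500)
print(f"uplift plausibly anywhere from {low:+.0%} to {high:+.0%}")  # roughly +5% to +145%
```

The observed 75% uplift sits inside a range running from roughly +5% to +145%, which is why a single test result is a shaky basis for a revenue forecast.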

Re-testing with larger sample sizes or repeating the test over several campaigns mitigates this issue.

Blind Spots

These are the gremlins that can make a mess of your well-controlled test: the external factors of which you may have little or no knowledge.

  • Technology failures. Perhaps something went wrong with the random sampling process or the tracking of results, causing one cell to be skewed? Did that subject line change cause one version to deliver to junk? It does happen.
  • Your activity in your other channels, or that of your competitors. What messages and images could your email audience see through other marketing activity? Could that change how they perceive your subject line test?
  • Did the test get executed exactly as you intended? With more complex tests an error in test plan execution could mean the results aren’t telling you what you think they are.

Repeating tests over several campaigns and reviewing the test process as well as the results mitigates this issue.

So in summary: ensure sufficient sample sizes, check results for statistical significance, repeat tests over time and review the robustness of execution. If everything still lines up, you’ve got a real winner on your hands.

  • Nikki Parker

    Nice post. Noticed Subject Line A says ‘status’ compared with B that says ‘stats’. Is that how they went out too? If so, that could also be a factor.

  • Thanks for spotting the error Nikki, you’ve got sharp eyes. It should be stats, I’ll get the post corrected.

  • Nikki Parker

    Thanks for the reply. I’m nicknamed ‘Eagle Eyes Parker’ here. I don’t even mean to spot these things, they just jump out at me. 😐

  • Pete Austin

    Re: Statistical significance is the key weapon.

    Pretty sure you mean “practical significance”, for example see here:
    http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/introductory-concepts/p-value-and-significance-level/practical-significance/

    But the general advice is totally correct. If you’re just A:B testing for one email then being a bit sloppy doesn’t matter. But if you’re going to take long-term action based on a test, then you’d better be methodical and make sure the results stack up.