Why we moved away from Conversion Rate as a primary metric
The impact of violating assumptions of independence on AB Tests and changing the primary metric for your company
Changing the primary metric for any company is not straightforward. Why, you may ask? There are dozens of reasons, but here are some that come to mind:
- The company has been using that metric for a long time
- Forecasts have been made from historical values
- Team OKRs include targets against it
- People understand the metric and the levers that drive it
- Deviations from the expected values can easily be spotted
- Causes for change to the metric can be quickly identified
This short list alone would cause anyone to break out into a sweat! So imagine my surprise when, in my first month at Gousto, the Head of Product for our Menu Tribe suggested that our primary metric, Menu Conversion Rate (MCVR), was likely the wrong metric for our experiments because it violated assumptions of statistical independence.
Violated the who with the what now???
When you use conversion as a metric for your experiments, you do so, either consciously or subconsciously, under the assumption that you have a sample of independent observations.
This basically means that the outcome of one event has no impact on the outcome of another event.
Example: Imagine rolling a die two times. The outcome of the first roll has no impact on the outcome of the second roll, so we can say these events are independent.
Any company with a “sticky” product and a business model that relies on customers coming back week in and week out is at risk of violating this assumption. This is not a “best practice” type of decision that you can choose to ignore. Most inferential statistical tests used to analyse the results of an AB test (e.g. a two-sample t-test) assume that the observations are independent of each other.
Maybe it isn’t clear yet, or you’ve dismissed this as something that has no impact on your tests. Let me assure you, even slight dependence can lead to heavily biased results.
Why we were violating assumptions of independence (AOI)
We’re a subscription-based service where customers come back week in and week out to order their meals, so naturally we were using weekly MCVR as our primary KPI. We didn’t look at a daily or session-based MCVR because most customers order once a week, so even if that happened over multiple sessions, it made sense to count them only once that week. The following week, customers (some new and some from the previous week) would come back and place an order, and we would count them again as unique. We would do this over and over again for the duration of an experiment, typically 3–5 weeks. At the end of the experiment, we would sum the unique weekly visitors and the weekly orders across all weeks and divide total orders by total visitors to give us our MCVR*.
*Note: The alternative way to calculate MCVR would be to take the unique users and the unique orders for the entire duration of the experiment and divide the latter by the former. This approach is not without its limitations either. Whilst the observations are independent, you ignore order frequency, but that is beyond the scope of this post.
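The difference between the two calculations is easier to see on a toy example. The sketch below is illustrative only: the visit log, the function names and the reading of "unique orders" as "users who ordered at least once" are all assumptions, not Gousto's actual data or code.

```python
# Hypothetical visit log: (week, user_id, placed_order).
# User "a" comes back and orders every week.
visits = [
    (1, "a", True), (1, "b", False), (1, "c", True),
    (2, "a", True), (2, "b", False), (2, "d", False),
    (3, "a", True), (3, "c", False), (3, "d", True),
]


def weekly_summed_mcvr(visits):
    """Count each user once per week, then pool all weeks together."""
    weekly_visitors = {(week, user) for week, user, _ in visits}
    weekly_orderers = {(week, user) for week, user, ordered in visits if ordered}
    return len(weekly_orderers) / len(weekly_visitors)


def experiment_level_mcvr(visits):
    """Count each user once for the whole experiment (assumed reading of
    'unique orders': users who ordered at least once)."""
    users = {user for _, user, _ in visits}
    orderers = {user for _, user, ordered in visits if ordered}
    return len(orderers) / len(users)


print(weekly_summed_mcvr(visits))     # 5 orders / 9 user-weeks
print(experiment_level_mcvr(visits))  # 3 ordering users / 4 users = 0.75
```

In the weekly version, user "a" contributes three of the nine observations, and those three are clearly not independent of each other. In the experiment-level version each user appears once, but the fact that "a" ordered three times is lost.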
This seems like a reasonable approach at first, and it might have been if we didn’t have a sticky product and users from week 1, 2 etc. were never seen again (at least for the duration of the test). But as users came back week after week, our observations were not independent. Why, you may ask?
Well, think about it: if we developed a feature, reduced our pricing or did something else that convinced you to make a purchase, that thing has already worked on you. Counting you again and again means our observations of you are not independent, and we’re biasing our results. To put it another way, what customers do in a given week will affect what they do the following week. Our primary KPI of MCVR was not valid.
What’s the impact of this and is it actually serious?
It increases your false-positive rate, FPR (or type 1 error rate) so yes, it’s serious! Frequentist statistics already relies on accepting that some percentage of all winning results will be false based on the significance level you set — usually 5% (α=0.05). Running tests with observations that are not independent will result in further inflation of your type 1 error rate.
The false-positive rate is the probability of rejecting the null (control) hypothesis given that it is true. If you ran 100 experiments, all with a significance level of 5%, and had 10 winners, 5 of them might have been false positives*. The question is, which 5? Isn’t probability a bitch!
I knew from papers I had read that our false-positive rate was higher than the threshold we had set. I just didn’t know by how much.
*This is a simplified view of the world. In reality, they could all be false positives, or none of them, or somewhere in between.
Discovering our true false-positive rate
If we were going to change the company’s primary metric for experiments, we needed to prove that it was the wrong metric and that we were making bad decisions off the back of it.
The way we proved it was to take some historical data and run A/A-test simulations on it to show what our true false-positive rate was. Using a random and recent data set, I simulated 30,000 A/A tests for 1, 3 and 5-week “experiments”.
For 1-week simulations, I expected the false-positive rate to be 5%, as we’re only counting users once (independent) and I set the significance level to 5%. Beyond that, my only expectation was that we would see more than 5% “winning tests” because we were counting users who came back, over and over again.
Our 1-week false-positive rate was as I’d expected, 5%, but I discovered that we had false-positive rates as high as 10% for tests that ran for 3 weeks and nearly 12% for 5-week tests (figure 1 below).
To put this into context: the industry average for winning tests is somewhere between 20–25%. At a 5% false-positive rate, up to roughly 20% of those winners are at risk of being false; at nearly 12%, that rises to roughly 50%.
Figure 1 below starts off a bit “choppy” but as the number of simulations begins to increase the false positive rate starts to stabilise and we can see that on our random data set, even without running any experiments, we’re going to find “winners”.
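If you want to see the effect for yourself, here is a minimal sketch of this kind of A/A simulation. It is not the code or data we used: it generates synthetic users instead of resampling historical data, it assumes a deliberately extreme "sticky" population (half the users order 90% of weeks, half order 10% of weeks, and everyone visits every week), and it uses a pooled two-proportion z-test rather than a t-test. All function names and parameters are made up for illustration, so the exact rates it produces will not match the ones above.

```python
import math
import random


def two_proportion_p_value(orders_a, visits_a, orders_b, visits_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (orders_a + orders_b) / (visits_a + visits_b)
    variance = pooled * (1 - pooled) * (1 / visits_a + 1 / visits_b)
    if variance == 0:
        return 1.0
    z = (orders_a / visits_a - orders_b / visits_b) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_false_positive_rate(weeks, n_sims=1000, users_per_arm=200,
                                    alpha=0.05, seed=0):
    """Share of A/A tests declared 'significant' when every user-week is
    (wrongly, for weeks > 1) treated as an independent observation."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        arm_orders = []
        for _arm in range(2):  # both arms get the identical experience
            orders = 0
            for _user in range(users_per_arm):
                # Assumed sticky behaviour: a user's propensity to order is
                # fixed for the whole experiment, so their weeks are correlated.
                propensity = 0.9 if rng.random() < 0.5 else 0.1
                orders += sum(rng.random() < propensity for _ in range(weeks))
            arm_orders.append(orders)
        visits = users_per_arm * weeks  # one counted visit per user per week
        p_value = two_proportion_p_value(arm_orders[0], visits,
                                         arm_orders[1], visits)
        if p_value < alpha:
            false_positives += 1
    return false_positives / n_sims


if __name__ == "__main__":
    for weeks in (1, 3, 5):
        rate = simulate_aa_false_positive_rate(weeks)
        print(f"{weeks}-week A/A tests: false-positive rate = {rate:.3f}")
```

With one week of data each user is counted once, so the rate should sit close to the 5% significance level. With more weeks the same users are counted repeatedly, the test underestimates the variance, and the rate climbs well above 5% even though there is no difference between the arms.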
Armed with this newfound perspective of the world, we began our hunt for a new metric.
The New Metric — Average Orders Per User (AOPU)
I mentioned earlier that we were ignoring order frequency and just focussing on conversion (the incorrect dependent version), so the goal was to find some type of hybrid (compound) metric that looked at unique conversion and order frequency. I had an idea of what this metric would be, AOPU; I just needed to help everyone understand it. What followed was a painful couple of months of writing internal papers explaining why we were moving away from conversion, what the new metric was, how it was calculated and why it was better. The TL;DR version of my paper was:
AOPU = Total Orders / Unique Users = Conversion x Order Frequency
This proposed metric allows us to monitor sensitivities and changes in conversion and order frequency without violating assumptions of independence, giving us a more well-rounded and statistically sound metric.
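As a quick sanity check of the identity above, here is a small sketch on made-up numbers. It assumes that conversion means the share of unique users who ordered at least once during the experiment, and that order frequency means orders per converting user; the data and function names are hypothetical.

```python
import math

# Hypothetical experiment arm: total orders placed by each unique user.
orders_per_user = {"a": 3, "b": 0, "c": 1, "d": 1}


def aopu(orders_per_user):
    """Average Orders Per User: total orders / unique users."""
    return sum(orders_per_user.values()) / len(orders_per_user)


def conversion(orders_per_user):
    """Assumed definition: share of unique users with at least one order."""
    converted = sum(1 for n in orders_per_user.values() if n > 0)
    return converted / len(orders_per_user)


def order_frequency(orders_per_user):
    """Assumed definition: orders per converting user."""
    converted = sum(1 for n in orders_per_user.values() if n > 0)
    return sum(orders_per_user.values()) / converted


print(aopu(orders_per_user))  # 5 orders / 4 users = 1.25
print(math.isclose(
    aopu(orders_per_user),
    conversion(orders_per_user) * order_frequency(orders_per_user),
))  # True
```

Each user contributes exactly one observation (their order count), which is what keeps the observations independent while still capturing how often people order.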
We’ve been using AOPU successfully for nearly 8 months now. Conversations in meetings, OKRs, analysis, and experiments are all based on this new metric. Discussions around menu conversion are a thing of the past.
This was quite possibly one of the scariest pieces of work I’ve ever done in my life but I have to admit, it has been one of the most rewarding.
We’re not the only subscription-based business in the world that relies on users coming back, again and again. You don’t even have to have a subscription model to have a “sticky” product. Supermarkets, Amazon, Spotify, Slack, Netflix etc… all have customers who come back regularly. Whatever your business model, if you have customers coming back regularly and you’re looking at daily/weekly conversion rate like we were, run some A/A tests on your own data and see what your own FPR looks like. Who knows, you might even need to change your own primary metric.
You can follow me on Twitter or with minimal effort you can follow me and CRAP Talks on Medium.