Part 2: “Minimum Detectable Effect” as a function of sample size
April 12, 2021
Making MDE the subject of the equation
In part 1, I discussed why we need to stop treating MDE as an input variable and instead treat it as an output variable: a function of the available sample size. In this post, I'll share what the rearranged formulas look like, along with a hacky Microsoft Excel solution.
As a reminder, these are the formulas for sample size where:
– α: the selected level of significance
– β: the selected power
– σ: the standard deviation
– μ1 or p1: the baseline mean or proportion
– μ2 or p2: the proposed/expected new mean or proportion (this is where our MDE currently sits)
– r: ratio of group sizes (usually 1, assuming equal sample sizes)
– ni: sample size per variant
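Using the symbols above, the standard two-sided sample size formulas are commonly stated as follows (this is a reconstruction in general notation; the post's own images show the exact forms used):

$$n_i = \left(1 + \frac{1}{r}\right)\frac{\sigma^2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\left(\mu_1 - \mu_2\right)^2} \qquad \text{(means)}$$

$$n_i = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2\left[p_1\left(1 - p_1\right) + \dfrac{p_2\left(1 - p_2\right)}{r}\right]}{\left(p_1 - p_2\right)^2} \qquad \text{(proportions)}$$

Here $z_{1-\alpha/2}$ and $z_{1-\beta}$ are the usual standard normal quantiles (e.g. 1.96 and 0.84 for α = 0.05 and 80% power).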
Minimum detectable effect as a function of sample size
Assumptions: in the formulas above, you'll notice the variable r, the ratio of test group sizes. For the remainder of this post, we'll assume equal sample sizes, so r = 1.
MDE for means (e.g. AOV, ARPU, Order Frequency)
Starting with MDE for means because it’s easier to rearrange.
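Rearranging the means formula for the effect gives MDE = (z₁₋α/2 + z₁₋β) · σ · √((1 + 1/r)/n). A minimal Python sketch of that rearrangement, using only the standard library (the function name and defaults are my own, not from the original post):

```python
from math import sqrt
from statistics import NormalDist

def mde_mean(n, sigma, alpha=0.05, power=0.8, r=1):
    """Minimum detectable (absolute) difference in means, given n
    observations per variant. Assumes a two-sided test."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    z_beta = z(power)           # e.g. 0.84 for 80% power
    return (z_alpha + z_beta) * sigma * sqrt((1 + 1 / r) / n)
```

For example, with a standard deviation of 10 and 1,000 users per variant, `mde_mean(1000, 10)` returns roughly 1.25: the smallest difference in means the test is powered to detect.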
MDE for proportions (e.g. conversion rate, sign up rate)
Making the MDE the subject of the formula for proportions was bloody difficult, and full credit goes to Mateusz Raczyński in my team for solving it. In fact, it was so complicated that he had to split elements of the formula into manageable intermediate variables just to make it digestible, see below.
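If you'd rather not wrestle with the closed-form rearrangement, the same answer can be found numerically: the required sample size shrinks as the uplift grows, so you can bisect on the uplift until the formula spits out your available n. This sketch is my own numerical stand-in, not Mateusz's closed-form solution:

```python
from statistics import NormalDist

def n_per_variant(p1, p2, alpha=0.05, power=0.8):
    """Standard per-variant sample size for comparing two proportions (r = 1)."""
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return z_sum ** 2 * variance / (p1 - p2) ** 2

def mde_proportion(n, p1, alpha=0.05, power=0.8, tol=1e-10):
    """Smallest absolute uplift on p1 detectable with n users per variant,
    found by bisection (required n is strictly decreasing in the uplift)."""
    lo, hi = 1e-12, 1 - p1 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if n_per_variant(p1, p1 + mid, alpha, power) > n:
            lo = mid  # n is too small to detect an uplift this small
        else:
            hi = mid
    return (lo + hi) / 2
```

The bisection converges in a few dozen iterations, and unlike Goal Seek it can be run in a loop for every week of traffic.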
Using these formulas you'll be able to create an output similar to the table below, where for each week you can see what the minimum detectable effect is and how much smaller it gets as the sample size increases.
I touched on why this is a better way to think about sample sizes/MDE in part 1 of this two-part post, but hopefully the image above makes it clearer. The beauty of this approach is that experiment owners no longer need to guess what impact a test feature will have on the primary metric. All they need to do is make a judgement call on how long they can afford to run an experiment. If they believe the impact of a new feature is high, they can choose to run a test for a shorter period of time; if they're unsure about the magnitude of impact, they might choose to run the test for longer and ensure it is powered to detect smaller changes.
Fixed horizon hypothesis testing rules still apply
Please keep in mind that the same rules of frequentist hypothesis testing still apply. The table does not give you permission to keep moving the goalposts if you don't reach significance at the chosen sample size. You pick a sample size and you run the test; if you don't get a significant result, move on (iterate, scrap, analyse, etc... whatever moving on means for you).
If the rearranged formulas above give you a headache (the formula for proportions MDE definitely gives me a headache), then you can try a Microsoft Excel hack (or something similar in R or Python) which is what I used to do and the reason we ended up here.
The Excel secret is to use the built-in “Goal Seek” tool to work the Sample Size formulas backwards.
Start by creating a formula for sample size in Excel cells (image 1 below shows an example of calculating sample size for means).
Next, go to Data > What-If Analysis > Goal Seek.
In the dialogue box, set "Set cell" to the "Total Sample" cell and "To value" to your available sample size for one week. The "By changing cell" should be your MDE cell. Note: it doesn't matter what values are in there to begin with, as we're going to tell Excel to calculate the MDE based on the sample size we want it to Goal Seek for.
Excel will then cycle through a range of values until it converges on the target value (sorry about the poor quality gif). You can repeat this process for week 2, week 3... week n to create a table similar to the one above.
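The same week-by-week table can be scripted directly, skipping Goal Seek entirely. A minimal Python sketch using the rearranged means formula (the weekly traffic and σ values here are made-up illustrative numbers):

```python
from math import sqrt
from statistics import NormalDist

def mde_mean(n, sigma, alpha=0.05, power=0.8):
    """MDE for a difference in means with n per variant (two-sided, r = 1)."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * sigma * sqrt(2 / n)

weekly_traffic = 5000  # hypothetical users per variant per week
sigma = 12.0           # hypothetical standard deviation of the metric (e.g. AOV)

for week in range(1, 7):
    n = weekly_traffic * week
    print(f"week {week}: n per variant = {n:>6}, MDE = {mde_mean(n, sigma):.3f}")
```

Each extra week of traffic shrinks the MDE, which is exactly the trade-off the table above asks experiment owners to make.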
I hope you found this two-part blog post useful. My goal was to help people get away from the trial and error method of dealing with the MDE. We may not have historical tests or research in which we can make robust assumptions about effect sizes, but we have the luxury of knowing what traffic is available to us and how much of it we want to use to validate features. Balancing statistical knowledge with sensible decision making will ensure that you run tests for an appropriate amount of time.
You can follow me on Twitter or with minimal effort you can follow me and CRAP Talks on Medium.