Part 2: “Minimum Detectable Effect” as a function of sample size
Making MDE the subject of the equation
In part 1, I discussed why we need to stop treating the MDE as an input variable and instead treat it as an output: a function of the available sample size. In this post, I'll share what the rearranged formula looks like, along with a hacky Microsoft Excel solution.
As a reminder, these are the formulas for sample size where:
– α: the selected level of significance
– β: the selected power
– σ: the standard deviation
– μ1 or p1: the baseline mean or proportion
– μ2 or p2: the proposed/expected new mean or proportion (this is where our MDE currently sits)
– r: ratio of groups (usually 1 assuming equal sample size)
– ni: sample size per variant
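In case the formula images don't come through, these are the standard textbook forms for a two-sided test that match the variables listed above. Note this is the unpooled-variance version for proportions; pooled variants also exist, so treat these as a reconstruction rather than the exact images.

```latex
% Sample size per variant for comparing two means:
n_i = \left(1 + \frac{1}{r}\right)
      \frac{\sigma^2 \,(z_{1-\alpha/2} + z_{1-\beta})^2}{(\mu_1 - \mu_2)^2}

% Sample size per variant for comparing two proportions (q = 1 - p):
n_i = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2
      \left(p_1 q_1 + \dfrac{p_2 q_2}{r}\right)}{(p_1 - p_2)^2}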
Minimum detectable effect as a function of sample size
Assumptions: in the formulas above, you'll notice the variable r, the ratio of the test groups' sizes. For the remainder of this post, we'll assume equal sample sizes, i.e. r = 1.
MDE for means (e.g. AOV, ARPU, Order Frequency)
Starting with MDE for means because it’s easier to rearrange.
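In case the rearranged formula image doesn't render, here's a minimal Python sketch of the r = 1 case (the function name and defaults are mine, not from the post): with equal groups, the MDE is simply (z of 1−α/2 plus z of 1−β) times σ times √(2/n).

```python
from math import sqrt
from statistics import NormalDist

def mde_means(n, sigma, alpha=0.05, power=0.8):
    """MDE for a two-sided, two-sample test of means with equal group
    sizes (r = 1), given n observations per variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    return (z_alpha + z_beta) * sigma * sqrt(2 / n)
```

Plugging the result back into the usual sample-size formula should round-trip to the n you started with, which is a handy sanity check.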
MDE for proportions (e.g. conversion rate, sign up rate)
Making the MDE the subject of the formula for proportions was bloody difficult, and I'm giving full credit to Mateusz Raczyński in my team for solving it. In fact, it was so complicated that he had to split elements of the formula into manageable intermediate variables just to make it digestible, see below.
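I can't reproduce Mateusz's closed form from the images here, but as a cross-check you can invert the sample-size formula numerically. This sketch (all names are mine) bisects on the uplift until the required sample size matches what's available, assuming r = 1 and the unpooled-variance formula:

```python
from statistics import NormalDist

def n_proportions(p1, p2, alpha=0.05, power=0.8):
    # Required sample size per variant (r = 1, unpooled variance).
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

def mde_proportions(n, p1, alpha=0.05, power=0.8, tol=1e-10):
    # Bisect on upward uplifts: the required n shrinks as p2 moves away
    # from p1, so we look for the smallest detectable p2 > p1 given n.
    lo, hi = p1 + 1e-9, min(1 - 1e-9, p1 + 0.5)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if n_proportions(p1, mid, alpha, power) > n:
            lo = mid  # uplift too small to detect with n; search higher
        else:
            hi = mid
    return hi - p1  # absolute MDE
```

For example, `mde_proportions(39237, 0.5)` comes out at roughly a 0.01 absolute (2% relative) uplift.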
Using these formulas, you'll be able to create an output similar to the table below, where for each week you can see what the minimum detectable effect is and how much smaller it gets as the sample size increases.
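As a sketch of how such a week-by-week table can be generated for a means metric with equal groups (the weekly traffic figure and σ below are made-up numbers, not from the post):

```python
from math import sqrt
from statistics import NormalDist

def weekly_mde_table(weekly_n, weeks, sigma, alpha=0.05, power=0.8):
    # One row per week: cumulative sample size per variant and the MDE it supports.
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return [(w, weekly_n * w, z * sigma * sqrt(2 / (weekly_n * w)))
            for w in range(1, weeks + 1)]

# Hypothetical numbers: 10,000 visitors per variant per week, AOV sd of 25.
for week, n, mde in weekly_mde_table(weekly_n=10_000, weeks=4, sigma=25.0):
    print(f"week {week}: n = {n:>6}, MDE = {mde:.2f}")
```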
I touched on why this is a better way to think about sample sizes/MDE in part 1 of this two-part post, but hopefully the image above makes it clearer. The beauty of this approach is that experiment owners no longer need to guess what impact a test feature will have on the primary metric. All they need to do is make a judgement call on how long they can afford to run an experiment. If they believe the impact of a new feature is high, they can run the test for a shorter period of time; if they're unsure about the magnitude of the impact, they might choose to run the test for longer and ensure it will be powered to detect smaller changes.
Fixed horizon hypothesis testing rules still apply
Please keep in mind that the same rules of frequentist hypothesis testing still apply. The table does not give you permission to keep moving the goalposts if you don't reach significance at the chosen sample size. You pick a sample size and you run the test; if you don't get a significant result, move on (iterate, scrap, analyse, or whatever "move on" means for you).
If the rearranged formulas above give you a headache (the formula for the proportions MDE definitely gives me one), then you can try a Microsoft Excel hack (or something similar in R or Python), which is what I used to do and is the reason we ended up here.
The Excel secret is to use the built-in "Goal Seek" tool to work the sample size formulas backwards.
Start by creating a formula for sample size in Excel cells (image 1 below shows an example of calculating sample size for means).
Next, go to Data > What-If Analysis > Goal Seek.
In the dialogue box, set "Set cell" to the "Total Sample" cell and "To value" to your available sample size for one week. "By changing cell" should be your MDE cell. Note: it doesn't matter what value is in there to begin with, as we're going to tell Excel to calculate the MDE from the sample size we want it to Goal Seek for.
Excel will then cycle through a range of values until it converges on the desired value (sorry about the poor-quality GIF). You can repeat this process for week 2, week 3… week n to create a table similar to the one above.
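The same trick can be scripted. Here's a rough Python analogue of Goal Seek (a plain bisection; the function names and example numbers are mine), shown inverting the sample-size formula for means:

```python
from statistics import NormalDist

def sample_size_means(mde, sigma, alpha=0.05, power=0.8):
    # Per-variant sample size for a two-sided test of means, equal groups (r = 1).
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * (z * sigma / mde) ** 2

def goal_seek(f, target, lo, hi, iters=100):
    # Assumes f is monotone decreasing on [lo, hi], as sample size is in the MDE.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical traffic: 5,000 visitors per variant in week 1, metric sd of 30.
mde = goal_seek(lambda d: sample_size_means(d, sigma=30.0),
                target=5_000, lo=0.01, hi=30.0)
```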
I hope you found this two-part blog post useful. My goal was to help people get away from the trial-and-error method of dealing with the MDE. We may not have historical tests or research from which to make robust assumptions about effect sizes, but we do have the luxury of knowing what traffic is available to us and how much of it we want to use to validate features. Balancing statistical knowledge with sensible decision making will ensure that you run tests for an appropriate amount of time.
You can follow me on Twitter or with minimal effort you can follow me and CRAP Talks on Medium.
I am really eager to use this as I 100% agree: estimating an MDE for a given sample size is far more practical for planning tests. However, no matter how many times I try, I can't get the sample size estimation for a proportion test to match the calculation set out by one of my predecessors following this: https://towardsdatascience.com/required-sample-size-for-a-b-testing-6f6608dd330a. The method outlined there gives a result that matches a number of online calculators, e.g. https://www.optimizely.com/sample-size-calculator, https://www.abtasty.com/sample-size-calculator, and https://www.evanmiller.org/ab-testing/sample-size.html. Any thoughts on whether I have gotten something wrong in my calculations, or does your method differ? As an example, looking for a 2% relative uplift from a CVR of 50%, your method says a sample of 55,494 is required, but the others come in at around 39,200. The reason I was checking the sample size estimation is that the formula you provided for the MDE was conflicting with known data points, so I was trying to close the loop and work out where I was going wrong. Thanks in advance if you can help!
Should μ1 be in the denominator for this MDE calculation?
I tried replicating it and noticed an error in OP's equation: there should not be a square root. I became suspicious when I found a similar equation here, but without the square root:
If you want to maintain the "ratio" aspect, I had success by converting it to 1/k instead:
numerator = ((z_alpha + z_beta)**2) * ((p_1*q_1) + (p_2*q_2)*(1/k))
denominator = (p_1 - p_2)**2
I've gotten it nearly equal when k = 1, and the difference diverges as k strays further from 1, but it stays reasonable.
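For what it's worth, running the reply's formula as-is (with standard normal quantiles from Python's stdlib; the wrapper function is mine) reproduces the ~39,200 figure from the calculators mentioned in the earlier comment:

```python
from statistics import NormalDist

def n_per_variant(p_1, p_2, k=1, alpha=0.05, power=0.8):
    # Unpooled-variance sample size with allocation ratio k (k = 1: equal groups).
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    q_1, q_2 = 1 - p_1, 1 - p_2
    numerator = ((z_alpha + z_beta) ** 2) * ((p_1 * q_1) + (p_2 * q_2) * (1 / k))
    denominator = (p_1 - p_2) ** 2
    return numerator / denominator

print(round(n_per_variant(0.5, 0.51)))  # 39237, in line with the ~39,200 above
```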