The CRO Math: How to Calculate Whether Your A/B Test Is Worth Running Before You Run It

Sample size, runtime, and significance for ecommerce A/B tests — plus when the math says do not test.

May 19, 2026 · 8 min read · Website Growth

Most A/B tests in DTC are unwinnable before they start. Not because the idea was bad. Because the traffic was wrong, the MDE was wrong, or the runtime was wrong. The math tells you all three in roughly 30 seconds. Most operators skip the math and run the test anyway. Then they read a 1.2% lift after six weeks and treat it like a signal. The signal is noise wearing a suit. This post is the napkin version of A/B test math: the formula you need, three worked examples at common DTC traffic levels, and the rule we use to kill a test before it burns a quarter.

The three inputs you need

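Three numbers determine whether your test will produce a real answer.

Baseline conversion rate (CVR). The current conversion rate of the surface you are testing.

Minimum detectable effect (MDE). The smallest relative lift you want the test to be able to detect.

Weekly traffic per variant. How many visitors each variant will see in a week.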

That is the whole input list. Three numbers.

The math, in plain language

Standard A/B test sample size, for 95% confidence and 80% power, simplifies to:

Sample size per variant ≈ 16 / (baseline × MDE²)

Both inputs as decimals. Baseline 2.5% \= 0.025. MDE 10% relative \= 0.10.
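For anyone wondering where the 16 comes from, here is a sketch of the usual derivation. It assumes a two-sided test at 95% confidence, 80% power, equal-sized variants, and a normal approximation. The symbols p (baseline), δ (absolute difference between variants), and z (standard normal quantiles) are notation introduced here for illustration.

```latex
\[
n \approx \frac{2\,(z_{0.975} + z_{0.80})^2 \; p(1-p)}{\delta^2},
\qquad z_{0.975} \approx 1.96,\quad z_{0.80} \approx 0.84
\]
\[
2\,(1.96 + 0.84)^2 \approx 15.7 \approx 16,
\qquad \delta = p \cdot \mathrm{MDE}
\]
\[
n \approx \frac{16\,p(1-p)}{(p \cdot \mathrm{MDE})^2}
  = \frac{16\,(1-p)}{p \cdot \mathrm{MDE}^2}
  \approx \frac{16}{p \cdot \mathrm{MDE}^2}
  \quad \text{when } p \text{ is small}
\]
```

Dropping the (1 − p) factor overstates the sample size by a few percent at baselines of 2% to 4%, which errs on the safe side.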

Plug it in for a baseline of 2.5% and an MDE of 10%:

16 / (0.025 × 0.01) \= 16 / 0.00025 \= 64,000 visitors per variant

For both variants combined, that is 128,000 visitors. At 5,000 weekly visits per variant, the test runs for 12.8 weeks.

Three months. To detect a 10% relative lift on a 2.5% baseline. With no seasonal contamination. With no other tests running. Tests that long are unreliable for reasons we will hit in a minute.
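If you would rather not do the arithmetic by hand, the formula fits in a few lines of Python. This is a minimal sketch, not the spreadsheet calculator: the function names `sample_size_per_variant` and `runtime_weeks` are made up for illustration, and the result is rounded to the nearest visitor to match the figures in this post.

```python
def sample_size_per_variant(baseline: float, mde: float) -> int:
    """Visitors needed per variant, using the 16 / (baseline * MDE^2) rule of thumb.

    baseline: conversion rate as a decimal (2.5% -> 0.025)
    mde: relative lift to detect, as a decimal (10% -> 0.10)
    """
    if not 0 < baseline < 1:
        raise ValueError("baseline must be a decimal between 0 and 1")
    if mde <= 0:
        raise ValueError("mde must be positive")
    # Rounded to the nearest visitor (an assumption, chosen to match the post's numbers).
    return round(16 / (baseline * mde ** 2))


def runtime_weeks(sample_size: int, weekly_visitors_per_variant: float) -> float:
    """Weeks needed for each variant to collect its sample."""
    if weekly_visitors_per_variant <= 0:
        raise ValueError("weekly_visitors_per_variant must be positive")
    return sample_size / weekly_visitors_per_variant


# The worked example from the text: 2.5% baseline, 10% relative MDE,
# 5,000 weekly visits per variant.
n = sample_size_per_variant(0.025, 0.10)
print(n, "visitors per variant")
print(runtime_weeks(n, 5000), "weeks")
```

Swap in your own baseline, MDE, and weekly traffic to get a first read before you open a testing tool.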

Three worked examples at common DTC traffic levels

Example A: Small DTC, $20K/month, 500 weekly PDP visits per variant

Baseline CVR: 2.0% (0.02)

MDE: 20% relative lift (anything smaller is below your noise floor)

Sample size needed: 16 / (0.02 × 0.04) \= 20,000 per variant

Runtime: 20,000 / 500 \= 40 weeks

Forty weeks. Roughly nine months. The test is dead before you press start. Your traffic budget is too small for meaningful tests on the homepage or PDP. You are testing on vibes whether you write up the result or not.

Honest answer for this tier: do not run A/B tests on site conversion. Run them on email subject lines or ad creative, where the audience is bigger and the math is on your side. For site CRO, use judgment and intentional design choices instead. Test a redesign with cohort tracking month over month, not split traffic.

Example B: Mid-DTC, $200K/month, 4,000 weekly PDP visits per variant

Baseline CVR: 3.0% (0.03)

MDE: 15% relative lift (a meaningful change, not a button-color twitch)

Sample size needed: 16 / (0.03 × 0.0225) \= 23,704 per variant

Runtime: 23,704 / 4,000 \= 6 weeks

Six weeks is on the edge. You will pick up some weekly noise. Seasonal drift starts to bite if you cross a promo window or a paid-spend change.

Real move for this tier: bump your MDE to 20% relative. That gets you to ~13,300 per variant, 3.3 weeks runtime, which is the sweet spot. You will test fewer things, but the tests you run will settle.
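The same formula can be turned around: fix the runtime you can live with and solve for the smallest lift the test could detect. A rough sketch, assuming the 16 / (baseline × MDE²) approximation holds; `min_detectable_lift` and its `max_weeks` parameter are hypothetical names, and the 4-week default is only a placeholder cap.

```python
import math


def min_detectable_lift(baseline: float,
                        weekly_visitors_per_variant: float,
                        max_weeks: float = 4.0) -> float:
    """Smallest relative lift detectable within max_weeks.

    Inverts n = 16 / (baseline * MDE^2), with n = weekly traffic * weeks,
    to get MDE = sqrt(16 / (baseline * n)).
    """
    if not 0 < baseline < 1:
        raise ValueError("baseline must be a decimal between 0 and 1")
    visitors = weekly_visitors_per_variant * max_weeks
    if visitors <= 0:
        raise ValueError("traffic and max_weeks must be positive")
    return math.sqrt(16 / (baseline * visitors))


# Example B traffic: 3% baseline, 4,000 weekly visits per variant.
print(f"{min_detectable_lift(0.03, 4000, max_weeks=4):.1%} in 4 weeks")
print(f"{min_detectable_lift(0.03, 4000, max_weeks=3.3):.1%} in 3.3 weeks")
```

At Example B traffic, a 3.3-week window works out to roughly a 20% relative MDE, which matches the numbers above.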

Example C: Bigger ecom, $750K+/month, 15,000 weekly PDP visits per variant

Baseline CVR: 3.5% (0.035)

MDE: 10% relative lift

Sample size needed: 16 / (0.035 × 0.01) \= 45,714 per variant

Runtime: 45,714 / 15,000 \= 3 weeks

This is the comfort zone. You have enough traffic to test meaningful changes in a clean window. Your math says go. The risk at this tier is the opposite: testing too many things at once, or testing things that should have been a design decision instead of a split.

The 4-week rule

If the runtime math says your test takes longer than 4 weeks, kill it.

Reasons your test corrupts past week 4:

Seasonal drift. Black Friday energy in mid-November is not the same as the energy of the first week of December. Your variants pick up different buyer cohorts.

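Audience drift. Returning visitors convert differently than first-time visitors, and the mix between them shifts over a long test. An email push during the test sends a spike of one audience type into both variants on the days it lands, so different stretches of the test measure different audiences.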

Ad creative drift. If you change your Meta creative during the test, the audience composition changes. Your test is no longer testing what you think.

Promotional contamination. A sitewide 10%-off promotion that runs for a week wipes out the baseline.

A test over 4 weeks in DTC is a test of "did the world change while we measured" more than "does B beat A."

If your test does not fit under 4 weeks at your current traffic, you have three options.

The MDE trap

The most common mistake we see in DTC operator-run tests: an MDE of 3% or 5% on a 2-3% baseline.

A 3% relative lift on a 3% baseline is a 0.09 percentage point absolute difference. To detect it at 95%/80% with a 3% baseline:

16 / (0.03 × 0.0009) \= 592,593 per variant

You would need over a million total visitors to that surface to see a real signal. At 5,000 weekly per variant, that is 118 weeks. Two and a quarter years.
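The reason small MDEs blow up so fast is the squared term: halve the MDE and the sample size quadruples. A quick sketch that tabulates this at a 3% baseline and 5,000 weekly visits per variant; the list of MDE values is an arbitrary choice for illustration.

```python
def sample_size_per_variant(baseline: float, mde: float) -> int:
    """16 / (baseline * MDE^2), rounded to the nearest visitor."""
    return round(16 / (baseline * mde ** 2))


BASELINE = 0.03   # 3% conversion rate
WEEKLY = 5000     # weekly visits per variant

rows = []
for mde in (0.03, 0.05, 0.10, 0.15, 0.20):
    n = sample_size_per_variant(BASELINE, mde)
    rows.append((mde, n, n / WEEKLY))

# Columns: relative MDE, visitors per variant, runtime in weeks.
for mde, n, weeks in rows:
    print(f"{mde:>4.0%}  {n:>9,}  {weeks:>6.1f} weeks")
```

Only the 20% row lands under four weeks at this traffic level; the 3% row is the 592,593 figure from above.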

Nobody runs a test that long. So what happens: the test runs 4 weeks, hits some random fluctuation, shows a "winner" with 87% statistical confidence (the calculator default), and the operator ships the change. The change is a coin flip wearing a confidence interval.
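To see how an underpowered "winner" looks in numbers, here is a sketch of a pooled two-proportion z-test using only the standard library. Testing tools compute "confidence" in different ways, so treat this as one common frequentist reading and not as what your calculator reports. The function name and the visitor and conversion counts are made up for illustration.

```python
import math


def two_proportion_confidence(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided confidence (1 - p-value) that the two conversion rates differ.

    Uses a pooled two-proportion z-test with a normal approximation.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    z = abs(p_b - p_a) / se
    # For a standard normal Z, P(|Z| <= z) = erf(z / sqrt(2)).
    return math.erf(z / math.sqrt(2))


# Hypothetical 4-week result at 5,000 weekly visits per variant:
# A converts 600 of 20,000 (3.0%), B converts 640 of 20,000 (3.2%).
confidence = two_proportion_confidence(600, 20000, 640, 20000)
print(f"{confidence:.0%} confidence")
```

In this made-up case B shows a 6.7% relative lift, yet the confidence comes out at roughly 75%, well short of 95%. That is the kind of result that gets shipped as a win.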

If you only care about 3% relative lifts, you should not be running split tests. You should be redesigning surfaces meaningfully and measuring period-over-period.

Quick gut-check rules

These are the rules we apply on every test before we approve runtime.

Below 500 conversions per variant in your runtime window? Kill the test. Below this floor, the result is dominated by random visitor variance.

Runtime over 4 weeks? Kill the test or raise the MDE.

MDE below 5% relative? Raise it or kill the test. Tests this sensitive will lie to you.

Baseline below 1%? You need much bigger sample sizes than this article covers. Test higher-funnel surfaces first.

Multiple tests running on the same traffic? Run them sequentially or use a proper experimentation platform with mutual exclusion. Otherwise you are reading interference noise.

If your test passes those five gates, run it. If it does not, the math is telling you to stop trying to test and start trying to decide.
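The five gates are simple enough to encode as a pre-flight check. A minimal sketch under a few assumptions: `ExperimentPlan` and `failed_gates` are hypothetical names, "conversions per variant" is read as baseline conversions expected at the planned sample size, and the overlap gate is a flag you set yourself.

```python
from dataclasses import dataclass


@dataclass
class ExperimentPlan:
    baseline: float                      # conversion rate as a decimal
    mde: float                           # relative lift as a decimal
    weekly_visitors_per_variant: float
    overlapping_tests: bool = False      # other tests share this traffic without mutual exclusion


def failed_gates(plan: ExperimentPlan) -> list[str]:
    """Return the names of the gut-check gates this plan fails (empty list = run it)."""
    sample_size = 16 / (plan.baseline * plan.mde ** 2)
    weeks = sample_size / plan.weekly_visitors_per_variant
    # Assumption: expected conversions = planned sample size * baseline rate.
    expected_conversions = sample_size * plan.baseline

    failures = []
    if expected_conversions < 500:
        failures.append("conversions")   # below 500 conversions per variant
    if weeks > 4:
        failures.append("runtime")       # runtime over 4 weeks
    if plan.mde < 0.05:
        failures.append("mde")           # MDE below 5% relative
    if plan.baseline < 0.01:
        failures.append("baseline")      # baseline below 1%
    if plan.overlapping_tests:
        failures.append("overlap")       # multiple tests on the same traffic
    return failures


# Example C from above: 3.5% baseline, 10% MDE, 15,000 weekly visits per variant.
print(failed_gates(ExperimentPlan(0.035, 0.10, 15000)))
# Example A from above: 2.0% baseline, 20% MDE, 500 weekly visits per variant.
print(failed_gates(ExperimentPlan(0.02, 0.20, 500)))
```

An empty list means the plan clears every gate. Anything else names the rule that says stop.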

What to do instead, when the math says no

Most small and mid-DTC sites should A/B test less and design with more conviction.

A few moves that beat underpowered split-testing:

Cohort redesign tracking. Ship a redesigned PDP. Track the next 30 days against the prior 30 days, controlling for traffic source and AOV. Imperfect, but cleaner than a 40-week split test.

Qualitative research. Five customer interviews tell you more about your PDP than 5,000 underpowered visits to a button-color test.

Reading the actual decision being made. The 5-Decision Funnel post linked below is the framework we use. Figure out which of the five shopper decisions is leaking. Fix the decision with intentional creative work. Skip the test.
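For the cohort redesign tracking move above, "controlling for traffic source" can be as simple as re-weighting the new period's conversion rates to the old period's traffic mix. The sketch below shows one way to do that (direct standardization). It does not handle AOV, it assumes both periods have the same set of sources, and every number in it is invented for illustration.

```python
def raw_cvr(period: dict[str, tuple[int, int]]) -> float:
    """Overall conversion rate; period maps source -> (visitors, conversions)."""
    visitors = sum(v for v, _ in period.values())
    conversions = sum(c for _, c in period.values())
    return conversions / visitors


def traffic_mix(period: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Share of total visitors coming from each source."""
    total = sum(v for v, _ in period.values())
    return {source: v / total for source, (v, _) in period.items()}


def mix_adjusted_cvr(period: dict[str, tuple[int, int]],
                     reference_mix: dict[str, float]) -> float:
    """Conversion rate the period would show under the reference traffic mix."""
    return sum(
        reference_mix[source] * conversions / visitors
        for source, (visitors, conversions) in period.items()
    )


# Hypothetical 30-day windows: source -> (visitors, conversions).
prior = {"paid": (10000, 200), "email": (2000, 100), "organic": (8000, 240)}
after = {"paid": (8000, 176), "email": (6000, 300), "organic": (6000, 192)}

adjusted = mix_adjusted_cvr(after, traffic_mix(prior))
print(f"prior CVR:          {raw_cvr(prior):.2%}")
print(f"after CVR (raw):    {raw_cvr(after):.2%}")
print(f"after CVR (same mix): {adjusted:.2%}")
```

In this invented data the raw numbers jump from 2.70% to 3.34%, but most of that comes from a heavier email mix. Held to the prior mix, the redesign period comes out at 2.88%.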

The takeaway

The math is brutal for most DTC sites: your traffic does not support the tests you want to run, your MDE is too small, your runtime is too long. The honest answer is to test less and design with more conviction.

When you do test, run the math first.

Baseline CVR, MDE, weekly traffic per variant.

Sample size \= 16 / (baseline × MDE²).

Runtime \= sample size / weekly traffic per variant.

Kill if runtime \> 4 weeks, or expected conversions per variant \< 500.

Three inputs. One formula. Thirty seconds.

Grab the calculator

We built the spreadsheet version of this math. Type in your baseline, your MDE, and your weekly traffic. It tells you whether to run the test, how long the test will take, and what MDE you would need to make the test feasible at your current traffic.

[Get the A/B test calculator spreadsheet](/contact). Free. Email-only signup.

For more on how we think about CRO before we ever get to a split-test, read [the 5-Decision Funnel](/blog/the-5-decision-funnel-dtc-conversion). That is the underlying framework. The math here is what we use to decide whether a hypothesis from that framework is worth A/B testing or whether to ship the change outright.

