
A few years back I was consulting for a retail brand running a product page test in Adobe Target. The variant had bolder CTAs and tighter copy. After two weeks, the numbers looked... fine. Flat. The control won by a hair and the team was ready to call it and move on.
Something felt off. The control had 52,000 sessions. The variant had 46,000. We'd set it to a 50/50 split. That 6,000-session gap shouldn't exist in a balanced allocation.
We ran a chi-squared test. The p-value came back far below 0.0001. The test was broken.
That was a sample ratio mismatch, and it had silently invalidated two weeks of data.
What SRM Actually Is
Sample ratio mismatch (SRM) happens when the observed visitor counts across variants don't match the ratio you configured. Set a 50/50 split and get 53/47 on 500 sessions? Might be noise. Get 53/47 on 100,000 sessions? Almost certainly not.
Detection is a chi-squared goodness-of-fit test comparing observed counts against expected counts. Microsoft's ExP team uses a threshold of p < 0.0005 for their internal platform. Many practitioners use a looser 0.01. Either way, if the p-value falls below your threshold, the test data is compromised and you should not act on the results.
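To make that concrete, here is a minimal sketch of the test applied to the 52,000 vs. 46,000 split from the opening story. The function name `srm_p_value` is my own, and the shortcut through `math.erfc` only holds for exactly two variants (one degree of freedom), so treat it as an illustration rather than a general-purpose implementation.

```python
import math

def srm_p_value(control: int, variant: int, control_share: float = 0.5) -> tuple[float, float]:
    """Chi-squared goodness-of-fit for two groups; returns (statistic, p_value)."""
    total = control + variant
    expected_control = total * control_share
    expected_variant = total * (1.0 - control_share)
    statistic = ((control - expected_control) ** 2 / expected_control
                 + (variant - expected_variant) ** 2 / expected_variant)
    # With two groups there is 1 degree of freedom, where the chi-squared
    # survival function reduces to erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

statistic, p_value = srm_p_value(52_000, 46_000)
print(f"chi2 = {statistic:.1f}, p = {p_value:.3g}")
for threshold in (0.01, 0.0005):
    print(f"p < {threshold}: {p_value < threshold}")
```

With a gap that size, the p-value lands well below both thresholds, so it doesn't matter which one you prefer.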
Here's the uncomfortable part: this is not rare. Microsoft's ExP team found SRM in roughly 6% of their internal A/B tests. LinkedIn reported closer to 10% in certain test cohorts. At those rates, a company running 100 tests a year could have somewhere between 6 and 10 of them quietly producing invalid results. And most teams never check.
Where SRM Hides in Enterprise Setups
The Microsoft Research team mapped SRM to four stages in the experiment pipeline. Knowing which stage you're in narrows the diagnosis fast.
Assignment
The most fundamental failure point. Your randomization is splitting users into buckets, but the split isn't landing correctly. Common causes: user ID inconsistency (mixing logged-in vs. anonymous IDs mid-session), carryover from a previous test that used the same buckets, or uneven ramp-up where someone turned the variant on for a slice of traffic first and the logs got mixed.
In Adobe Target specifically, I've seen this happen when the mbox fires inconsistently because of async loading. The user gets assigned but the assignment doesn't log before the page unloads. That missing log shows up as fewer users in the variant, not fewer page views overall.
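Of the assignment causes, carryover is the easiest to picture in code. The sketch below assumes a salted-hash bucketing scheme, which is a common approach but not a description of how Adobe Target or any specific platform assigns users. The salts `test-A` and `test-B`, the synthetic user IDs, and the helper names are all hypothetical.

```python
import hashlib

def bucket(user_id: str, salt: str, buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, buckets) from a salted hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def in_variant(user_id: str, salt: str) -> bool:
    # Buckets 0-49 get the variant, 50-99 the control (a 50/50 split).
    return bucket(user_id, salt) < 50

def variant_share(group: list[str], salt: str) -> float:
    """Fraction of the given users who land in the variant under this salt."""
    return sum(in_variant(u, salt) for u in group) / len(group)

users = [f"user-{i}" for i in range(10_000)]  # synthetic IDs

# Users who saw the variant in an earlier, hypothetical test.
previous_variant = [u for u in users if in_variant(u, "test-A")]

# Reusing the salt puts every one of them in the new variant too.
reused = variant_share(previous_variant, "test-A")
# A fresh salt re-randomizes them, so roughly half land in each arm.
fresh = variant_share(previous_variant, "test-B")
print(f"reused salt: {reused:.2f}, fresh salt: {fresh:.2f}")
```

If the earlier variant changed how often those users come back, reusing the buckets carries that difference straight into the new test's sample sizes.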
Execution
Redirect tests are the worst offenders here. When your variant is a full-page redirect, some users get counted at assignment but drop off before the redirect completes. Bot detection can also cause this asymmetry: if bots hit your control disproportionately, your control sample inflates.
The MSN image carousel test at Microsoft is a concrete example. A test that looked like a negative result turned out to have SRM because the users who engaged heavily enough to trip bot filtering were concentrated in one variant. Once the SRM was accounted for, the conclusion flipped to positive. Two completely different business decisions depending on whether you caught it.
Log Processing
This one is insidious because the experiment itself is fine but your analysis is wrong. A bad join between your assignment table and your conversion table creates a phantom mismatch. Maybe your analytics event fires on 95% of sessions but your experiment assignment logs 100% of them. If the missing 5% isn't spread evenly across variants, the joined data shows a mismatch the experiment never had.
If you're running Adobe Target with Adobe Analytics via the A4T integration, watch for this. A4T stitching relies on a supplemental hit, and if that hit doesn't fire consistently, users drop from Analytics reporting but not from Target's built-in report. You get two different session counts and neither is obviously wrong until you compare them directly.
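One way to compare the two sources directly is to measure, per variant, what share of assigned users ever shows up in the analytics data. This is a rough sketch on made-up data. The `assignments` mapping, the `analytics_ids` set, and the loss rates are all hypothetical stand-ins for whatever your own exports look like, not anything Target or Analytics produces in this shape.

```python
from collections import Counter

def coverage_by_variant(assignments: dict[str, str], analytics_ids: set[str]) -> dict[str, float]:
    """Share of assigned users per variant that also appear in analytics."""
    assigned = Counter(assignments.values())
    matched = Counter(v for uid, v in assignments.items() if uid in analytics_ids)
    return {variant: matched[variant] / assigned[variant] for variant in assigned}

# Made-up data: 1,000 users per arm, assigned evenly.
assignments = {f"c{i}": "control" for i in range(1000)}
assignments.update({f"v{i}": "variant" for i in range(1000)})

# Hypothetical tracking loss: 2% of control users and 10% of variant users
# never reach analytics.
analytics_ids = {f"c{i}" for i in range(980)} | {f"v{i}" for i in range(900)}

coverage = coverage_by_variant(assignments, analytics_ids)
for variant, share in coverage.items():
    print(f"{variant}: {share:.1%} of assigned users found in analytics")
```

Equal coverage in both arms means the join is losing data but not biasing the split. Unequal coverage, like the 98% vs. 90% here, points at the log pipeline rather than the experiment.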
Analysis
The sneakiest kind. Your test is fine, your pipeline is fine, but someone applies a post-hoc filter: "let's look at mobile only" or "let's exclude users who bounced in under 5 seconds." If that filter applies differently across variants, you've introduced the bias yourself. This is especially common when you segment by a metric that the treatment itself can influence.
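Here's a small sketch of how that plays out, using invented counts. Assignment is perfectly balanced, but the variant reduces quick bounces, so filtering out bouncers keeps more variant users than control users. The helper `srm_p_value` is my own naming and assumes a two-arm 50/50 test.

```python
import math

def srm_p_value(control: int, variant: int) -> float:
    """p-value of a chi-squared goodness-of-fit test against a 50/50 split."""
    expected = (control + variant) / 2.0
    statistic = ((control - expected) ** 2 + (variant - expected) ** 2) / expected
    # One degree of freedom: survival function is erfc(sqrt(statistic / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))

# Made-up counts: balanced assignment, but the variant reduces quick bounces.
assigned = {"control": 50_000, "variant": 50_000}
bounced = {"control": 20_000, "variant": 17_000}

# Post-hoc filter: keep only users who did not bounce.
kept = {arm: assigned[arm] - bounced[arm] for arm in assigned}

p_all = srm_p_value(assigned["control"], assigned["variant"])
p_filtered = srm_p_value(kept["control"], kept["variant"])
print(f"all users:    p = {p_all:.3g}")
print(f"non-bouncers: p = {p_filtered:.3g}")
```

The unfiltered population passes cleanly, while the filtered one fails the check by a wide margin, even though nothing about the experiment itself changed.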
How to Detect It
The mechanics are straightforward. Chi-squared test, two groups, comparing observed sizes against expected. In Python: scipy.stats.chisquare. In R: chisq.test. In a spreadsheet: CHISQ.TEST.
Platforms like Statsig and Eppo flag SRM automatically before you see any lift metrics. If you're on a platform that doesn't check by default (and several enterprise tools still don't), build your own check. It's ten lines of code and it should run before any result gets surfaced to stakeholders.
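A home-built check can look something like the sketch below, which uses `scipy.stats.chisquare` and handles any number of variants and uneven splits. The function name `check_srm`, the dictionary inputs, and the default threshold are my own choices for illustration, so adapt them to however your pipeline exposes counts.

```python
from scipy.stats import chisquare

SRM_THRESHOLD = 0.01  # assumption: swap in 0.0005 for a stricter check

def check_srm(observed: dict[str, int], configured: dict[str, float],
              threshold: float = SRM_THRESHOLD) -> tuple[float, bool]:
    """Return (p_value, srm_detected) for observed counts vs. configured split."""
    arms = list(observed)
    total = sum(observed.values())
    weight_sum = sum(configured[arm] for arm in arms)
    # Scale the configured shares so expected counts sum to the observed total.
    expected = [total * configured[arm] / weight_sum for arm in arms]
    result = chisquare([observed[arm] for arm in arms], f_exp=expected)
    return float(result.pvalue), bool(result.pvalue < threshold)

p_value, srm_detected = check_srm(
    {"control": 52_000, "variant": 46_000},
    {"control": 0.5, "variant": 0.5},
)
print(f"p = {p_value:.3g}, SRM detected: {srm_detected}")
```

Wiring the boolean into your reporting so that lift metrics stay hidden while it is true is the part that actually changes behavior.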
One practical habit: check SRM within 48 hours of launch, not just at the end of the test. If there's a redirect issue or a broken firing condition, you can catch it early and restart before you've burned two weeks of traffic.
What to Do When You Find It
Stop. Do not declare a winner. Do not try to segment your way to an answer by filtering to a cleaner-looking date range or device type. The data is compromised in ways you can't fully see, and any slicing you do will mix biased and unbiased observations in unknowable proportions.
Investigate in this order:
- Compare your assignment logs to the configured split at the bucketing step, not just in your analytics tool
- Check for any redirects in the variant that don't exist in control
- Look for bot filtering rules that apply asymmetrically across variants
- Verify your analytics firing condition is identical in both variants
- Check for mid-test changes: audience enablement, segment rollouts, traffic spikes from a campaign
Once you've found and fixed the root cause, restart the test with a clean date range. Don't try to salvage the old data.
Make It a Pre-Readout Habit
The experimentation programs I've seen produce the most trustworthy results do three things before calling a winner: check statistical power before launch, check SRM within 48 hours of going live, and hold results until both pass.
No ML model required. Just ten minutes and a chi-squared test. Given that 6-10% of enterprise tests have this problem, that's about the best ROI you can get on ten minutes.