A/B Test Calculator: The Technical Standard for Statistical Significance
The UXR Institute A/B Test Calculator is a precision instrument for quantitative research and behavioral validation. Built on a frequentist two-proportion z-test, the tool gives researchers the mathematical rigor needed for decision science: it evaluates statistical significance while helping to limit the risk of Type I (false positive) and Type II (false negative) errors reaching the product roadmap.
Frequentist · Two-prop Z-test
Evaluate statistical significance and test power — for pre-test planning or post-test analysis.
How to read results
- p-value — how surprising the result is if there is truly no difference.
- Power — how likely you are to detect an effect (shown two ways).
- Z-score — difference measured in "standard errors."
⚠️ Low data warning: When conversions are very low, the normal-approximation z-test can be unreliable.
🚨 Sample Ratio Mismatch (SRM): The visitor split is far from what you expected.
Outputs reported:
- CR Control (A) — conversions / visitors
- CR Variant (B) — conversions / visitors
- Relative uplift — (CRB − CRA) / CRA
- p-value
- Z-score — measured in standard deviations
- Test power — power of the two-prop z-test
- CI threshold power
Chart: Sampling distributions — Control (A) vs Variant (B)
| Quantity | Formula |
| --- | --- |
| Standard Error A | sqrt( CRa × (1−CRa) / Va ) |
| Standard Error B | sqrt( CRb × (1−CRb) / Vb ) |
| Std. Error of Difference | sqrt( SEa² + SEb² ) |
| Critical Z-value | at chosen confidence level |
| A CI Upper Threshold | CRa + Zcrit × SEa |
| A CI Lower Threshold | CRa − Zcrit × SEa |
| Alpha (α) | 1 − confidence level |
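The formula rows above translate directly into code. A minimal sketch using only Python's standard library (`statistics.NormalDist` supplies the critical z-value; the two-sided case puts α/2 in each tail):

```python
from math import sqrt
from statistics import NormalDist

def detailed_stats(conv_a, n_a, conv_b, n_b, confidence=0.95, two_sided=True):
    """Compute the detailed quantities from the formula table (sketch)."""
    cr_a, cr_b = conv_a / n_a, conv_b / n_b
    se_a = sqrt(cr_a * (1 - cr_a) / n_a)      # Standard Error A
    se_b = sqrt(cr_b * (1 - cr_b) / n_b)      # Standard Error B
    alpha = 1 - confidence                    # Alpha = 1 - confidence level
    tail = alpha / 2 if two_sided else alpha
    z_crit = NormalDist().inv_cdf(1 - tail)   # Critical Z-value
    return {
        "SE_a": se_a,
        "SE_b": se_b,
        "SE_diff": sqrt(se_a**2 + se_b**2),   # Std. Error of Difference
        "z_crit": z_crit,
        "alpha": alpha,
        "a_ci_upper": cr_a + z_crit * se_a,   # A CI Upper Threshold
        "a_ci_lower": cr_a - z_crit * se_a,   # A CI Lower Threshold
    }
```

For 95% confidence (two-sided) the critical value comes out at the familiar 1.96.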
What this calculator is for
Use this tool to answer two practical questions:
- Did the variant likely change behavior? The calculator estimates how surprising your observed difference would be if there were actually no difference between A and B.
- How confident can I be in the result? It reports p-value, z-score, and two power readouts so you can judge whether your study was capable of detecting the effect you care about.
This is a frequentist two-proportion z-test (for conversion rates: “converted” vs “didn’t convert”). It’s appropriate when each visitor has one chance to convert and visitors are independently assigned to A or B.
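The test itself is short to write down. A sketch using the unpooled standard error (the same form shown in the detailed-stats formulas); other tools may pool the rates under the null, which gives nearly identical results at typical sample sizes:

```python
from math import erf, sqrt

def norm_cdf(x: float) -> float:
    # Standard normal CDF via the error function (stdlib only).
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_prop_z_test(conv_a, n_a, conv_b, n_b, two_sided=True):
    """Two-proportion z-test on conversion counts; returns (z, p)."""
    cr_a, cr_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error of the difference: sqrt(SEa^2 + SEb^2)
    se = sqrt(cr_a * (1 - cr_a) / n_a + cr_b * (1 - cr_b) / n_b)
    z = (cr_b - cr_a) / se
    if two_sided:
        p = 2.0 * (1.0 - norm_cdf(abs(z)))
    else:
        p = 1.0 - norm_cdf(z)  # one-sided H1: B > A
    return z, p
```

For example, 200/10,000 conversions in A vs 240/10,000 in B yields z ≈ 1.93 and a two-sided p ≈ 0.054, just short of significance at the 95% level.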
Before you trust any result: check experiment quality
A/B math assumes the experiment was run cleanly. If these assumptions don’t hold, the p-value can be misleading (often too optimistic).
1. Random Assignment and Stable Exposure
People should land in A or B randomly and stay there. If you have cross-device issues, changing bucketing rules, or mid-test routing changes, results can be biased.
2. Sample Ratio Mismatch (SRM) warning
If you intended a 50/50 split but got something like 60/40, this tool will flag an SRM warning. SRM is a classic sign something is off (mis-bucketing, eligibility filtering, instrumentation gaps, bot traffic, caching, etc.). Treat SRM as an experiment validity issue—not just a stats detail.
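An SRM check can be quantified with a p-value of its own. A sketch using a two-sided normal approximation to the binomial (production systems often use a chi-square test instead); a very small p-value here means the split itself is suspect:

```python
from math import erf, sqrt

def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Two-sided p-value for the observed split vs the intended one,
    via a normal approximation to Binomial(n, expected_share_a)."""
    n = n_a + n_b
    mean = n * expected_share_a
    sd = sqrt(n * expected_share_a * (1 - expected_share_a))
    z = (n_a - mean) / sd
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
```

A 6,000/4,000 split against an intended 50/50 yields a p-value near zero: clear SRM, and a reason to debug the experiment before reading any metric.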
3. Low conversion counts
When conversions are very low, normal-approximation z-tests get shaky. The calculator warns when a cell has fewer than ~5 conversions. In that case, consider:
- running longer,
- aggregating a less-rare metric, or
- using an exact test / Bayesian approach.
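For the exact-test route, Fisher's exact test can be computed with nothing but the standard library. A sketch that enumerates all possible 2x2 tables with the same margins (fine for small counts; prefer an optimized library implementation for large samples):

```python
from math import comb

def fisher_exact_p(conv_a, n_a, conv_b, n_b):
    """Two-sided Fisher exact p-value for a 2x2 conversion table.
    Sums the probabilities of all tables at most as likely as observed."""
    n = n_a + n_b
    k = conv_a + conv_b                      # total conversions (fixed margin)
    total = comb(n, k)
    def pmf(x):                              # P(x conversions land in arm A)
        return comb(n_a, x) * comb(n_b, k - x) / total
    p_obs = pmf(conv_a)
    lo, hi = max(0, k - n_b), min(k, n_a)
    return sum(pmf(x) for x in range(lo, hi + 1)
               if pmf(x) <= p_obs * (1 + 1e-9))
```

Unlike the z-test, this makes no normal approximation, so it stays valid even with a handful of conversions per cell.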
What to enter
Visitors = number of unique users exposed to that variation (A or B).
Conversions = number of those visitors who achieved the outcome (e.g., “completed signup”). This calculator assumes each visitor is counted once per variation and your metric is binary (converted vs not).
How to read the core outputs
Conversion rate (CR)
CR is simply:
- CR(A) = conversionsA / visitorsA
- CR(B) = conversionsB / visitorsB
The “Relative uplift” is:
(CR(B) − CR(A)) / CR(A)
This is a useful UX framing (“how much better is B?”), but remember it can look huge when the baseline is small.
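In code these are one-liners (a trivial sketch; CR values are proportions in [0, 1]):

```python
def conversion_rate(conversions, visitors):
    # CR = conversions / visitors for one variation
    return conversions / visitors

def relative_uplift(cr_a, cr_b):
    # (CRb - CRa) / CRa: a 0.020 -> 0.024 move is a +20% relative uplift
    return (cr_b - cr_a) / cr_a
```

Note the baseline caveat in action: the same +0.004 absolute difference reads as +20% on a 2% baseline but only +2% on a 20% baseline.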
Z-score
The z-score is how many “standard errors” away your observed difference is from zero.
Bigger absolute z means the difference is less likely due to random chance.
A z-score is helpful for intuition:
- around 1.28 lines up with ~90% one-sided
- around 1.96 lines up with ~95% two-sided
- around 2.58 lines up with ~99% two-sided
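These rules of thumb can be checked numerically. A sketch converting a z-score to a p-value with the stdlib error function:

```python
from math import erf, sqrt

def z_to_p(z, two_sided=True):
    """Convert a z-score to a p-value under the standard normal."""
    upper_tail = 1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))
    return 2.0 * upper_tail if two_sided else upper_tail

# Sanity checks against the thresholds above:
# z_to_p(1.28, two_sided=False) ~ 0.10
# z_to_p(1.96)                  ~ 0.05
# z_to_p(2.58)                  ~ 0.01
```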
p-value
The p-value is not the probability that your variant works.
It’s:
If there were truly no difference between A and B, how often would we see a result at least this extreme just by random fluctuation?
So:
p = 0.03 means “about 3% of the time, chance alone would produce a difference this large (or larger).”
Smaller p-values mean stronger evidence against “no effect.”
Confidence level and the “significant / not significant” label
Confidence determines the cutoff:
- 90% confidence ⇒ α = 0.10
- 95% confidence ⇒ α = 0.05
- 99% confidence ⇒ α = 0.01
This tool marks results “Statistically Significant” when:
- p < α
That label is a decision rule, not a guarantee of user impact.
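The label reduces to a single comparison (a sketch):

```python
def significance_label(p_value, confidence=0.95):
    alpha = 1.0 - confidence          # e.g. 95% confidence -> alpha = 0.05
    return "Statistically Significant" if p_value < alpha else "Not Significant"
```

The same p-value can flip labels as you move the confidence level: p = 0.07 is "significant" at 90% confidence but not at 95%, which is why the threshold should be chosen before the test, not after.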
One-sided vs two-sided (choose based on your study intent)
One-sided: “B is better than A”
Choose one-sided when:
- you only care whether the variant improves the metric, and
- you would not act on (or even interpret) a negative effect the same way.
Typical UX use case: a guided flow change intended to increase completion.
Two-sided: “B is different from A (better or worse)”
Choose two-sided when:
- any change matters (including harm), or
- you’re exploring, validating, or risk-managing.
Typical UX use case: high-stakes changes where regressions matter (trust, safety, revenue, retention).
If you’re unsure, default to two-sided.
Power: why you might see two different numbers
Many online calculators report “observed power,” but they don’t all mean the same thing. That’s why this tool shows two power readouts—both are valid, but they answer different questions.
1. Test Power (Z-test)
This is the probability your two-proportion z-test would declare significance if the true effect were the one you observed.
It aligns with the actual test decision rule.
Interpretation:
- ≥ 80% is often used as a planning target.
- If Test Power is low, “not significant” may simply mean “not enough data,” not “no effect.”
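A hedged sketch of this power calculation: it treats the observed rates as the true ones, uses the unpooled standard error, and drops the negligible opposite-tail term (exact conventions vary across tools):

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(conv_a, n_a, conv_b, n_b, confidence=0.95, two_sided=True):
    """Approximate power of the two-proportion z-test if the true
    effect equals the observed one (unpooled SE; sketch only)."""
    nd = NormalDist()
    cr_a, cr_b = conv_a / n_a, conv_b / n_b
    se = sqrt(cr_a * (1 - cr_a) / n_a + cr_b * (1 - cr_b) / n_b)
    alpha = 1 - confidence
    z_crit = nd.inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_effect = abs(cr_b - cr_a) / se
    return nd.cdf(z_effect - z_crit)  # opposite tail ignored
```

A borderline result like 200/10,000 vs 240/10,000 lands near 50% power: roughly a coin flip that a true effect of that size would be detected at this sample size.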
2. CI Threshold Power (common in other calculators)
This estimates the probability that B would clear a confidence bound around A (a “confidence interval threshold” framing). Some tools use this as their default “observed power.”
Interpretation:
- It often looks higher than Test Power, because it’s tied to a different rejection criterion.
- Useful if your team uses CI-style decision rules (“ship if B clears A’s bound”), but it’s not identical to the z-test power.
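One common way this framing is computed: the probability that B's sampled rate, treated as normally distributed around its observed value, clears A's upper confidence bound. A sketch; the exact bound convention (two-sided vs one-sided) is an assumption here and varies by tool:

```python
from math import sqrt
from statistics import NormalDist

def ci_threshold_power(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """P(B's sampled rate clears A's upper CI bound), assuming
    B ~ Normal(CRb, SEb). One 'CI threshold' convention among several."""
    nd = NormalDist()
    cr_a, cr_b = conv_a / n_a, conv_b / n_b
    se_a = sqrt(cr_a * (1 - cr_a) / n_a)
    se_b = sqrt(cr_b * (1 - cr_b) / n_b)
    z_crit = nd.inv_cdf(1 - (1 - confidence) / 2)  # two-sided bound on A
    a_upper = cr_a + z_crit * se_a
    return 1.0 - nd.cdf((a_upper - cr_b) / se_b)
```

For 200/10,000 vs 240/10,000 this gives roughly 0.79, visibly higher than the z-test power for the same data, illustrating why the two readouts should not be used interchangeably.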
Bottom line: if you’re using the p-value/z-test to make decisions, treat Test Power as the primary power metric.
A UX-researcher-friendly way to interpret outcomes
Think of results as a combination of evidence strength and decision risk:
- Significant + high power: strong evidence; effect is likely real and the study had enough sensitivity.
- Not significant + low power: inconclusive; you may need more traffic, a higher-frequency metric, or a larger effect.
- Significant + low power: possible “lucky” detection; replicate or run longer if the decision is high impact.
- Huge uplift on tiny baseline: check absolute differences and confidence; relative metrics can overdramatize.
A/B testing is one input to product judgment. Pair it with:
- qualitative insights (why it happened),
- segmentation (who it helped/hurt),
- guardrail metrics (did it break anything).
Quick checklist before you share results
✅ No SRM warning (or you expected the split)
✅ Conversions aren’t extremely low
✅ Hypothesis (one- vs two-sided) matches the decision you’d actually make
✅ You’re looking at both significance and power
✅ You can explain the change in plain language (“what did users do differently, and why?”)