Chapter Summary: Preparing for an A/B Test
A/A Testing
Before starting an A/B test, we run an A/A test to make sure that:
- The results are not affected by anomalies or outliers in the statistical population
- The tool for splitting traffic works correctly
- Data is sent to analytical systems correctly
A/A tests are similar to A/B tests, but in this case each group is shown the same version of a page. A/A tests also help determine how long the A/B test should last and the method for analyzing the results.
Here are the criteria for a successful A/A test:
- The number of users in different groups doesn't vary by more than 1%
- For all groups, data on the same event is recorded and sent to analytical systems
- None of the key metrics differs among the groups by a statistically significant amount (typically no more than 1%)
- Users remain within their groups until the end of the test. If they see different versions of the page during the study, it won't be clear which version influenced their decisions, so the reliability of the results will be compromised
The extent to which the key metrics may differ among groups depends on how sensitive the experiment needs to be.
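Here's a minimal sketch of these checks in Python, assuming each row of a pandas DataFrame is one user; the column names group and converted are hypothetical, and proportions_ztest from statsmodels is used to compare conversion between the groups:

```python
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

def check_aa(df: pd.DataFrame, alpha: float = 0.05) -> None:
    # One row per user: the group they were assigned to ('group')
    # and whether they converted ('converted', 0 or 1)
    sizes = df.groupby('group')['converted'].agg(['count', 'sum'])

    # Criterion 1: group sizes shouldn't differ by more than ~1%
    imbalance = (sizes['count'].max() - sizes['count'].min()) / sizes['count'].mean()
    print(f'Group size imbalance: {imbalance:.2%}')

    # Criterion 3: the key metric (here, conversion) shouldn't differ
    # between the groups by a statistically significant amount
    _, p_value = proportions_ztest(sizes['sum'], sizes['count'])
    print(f'p-value: {p_value:.3f} ->',
          'investigate the split' if p_value < alpha else 'groups look alike')
```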
Type I and Type II Errors in Testing Hypotheses. Power and Significance
A type I error is a false positive result. Here there is no difference between the groups being compared, but the test yielded a p-value lower than the significance level. As a result, there are grounds for rejecting H₀. Thus, the probability of making a type I error is equal to the significance level, α.
A type II error is a false negative result. This means there is a difference between the groups, but the test yielded a p-value greater than α, so there's no reason to reject H₀. If we call the probability of making a type II error β, then 1 − β will be the hypothesis test's statistical power. If β is the probability of making a mistake, 1 − β is the probability of not making one, i.e. of correctly rejecting the null hypothesis when it's false.
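To make these definitions concrete, here's a small simulation, a sketch assuming normally distributed data and a two-sided t-test: when H₀ is true, the test rejects it in roughly α of runs (type I errors); when a real difference exists, the rejection rate estimates the test's power, 1 − β.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, trials = 0.05, 100, 5000

# H0 is true: both groups come from the same distribution,
# so every rejection is a false positive (type I error)
false_positives = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < alpha
    for _ in range(trials)
)
print(f'Type I error rate: {false_positives / trials:.3f}')  # close to alpha

# H0 is false: the second group's mean is shifted by 0.4,
# so the rejection rate estimates the test's power (1 - beta)
true_positives = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0.4, 1, n)).pvalue < alpha
    for _ in range(trials)
)
print(f'Power (1 - beta): {true_positives / trials:.3f}')
```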
Multiple Comparisons: A/B and A/B/n tests
Making several comparisons with the same data is called multiple testing. It's important to note that the probability of making a type I error increases with each new hypothesis test.
If the probability of making a mistake is α each time, then the probability of not making a mistake in a single test is 1 − α. So the probability of not making any mistakes in the course of n comparisons will be:
(1 − α)ⁿ
The probability of making at least one mistake in the course of n comparisons will be:
1 − (1 − α)ⁿ
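A quick numeric check of this formula shows how fast the probability of at least one false positive grows:

```python
# Family-wise error rate: 1 - (1 - alpha)^n for n comparisons
alpha = 0.05
for n in (1, 3, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** n
    print(f'{n:>2} comparisons -> P(at least one false positive) = {fwer:.2%}')
# With 10 comparisons the chance of at least one false positive is already ~40%
```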
To decrease the probability of false positive results in multiple comparisons, experts have developed various methods of correcting the significance level, which help lower the family-wise error rate (FWER).
The Bonferroni correction is the most common because of its simplicity: just divide the desired significance level by the number of comparisons made with the same data, without the need to make new observations for each test. If you do collect new data for each hypothesis test, each test is performed in the standard way, with the significance level α selected as you did in the part of the course on statistics.
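Here's a minimal sketch of applying the correction in Python; the p-values are made up for illustration, and multipletests from statsmodels implements the same logic as the manual division:

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.01, 0.03, 0.04, 0.20]      # one p-value per comparison
alpha = 0.05

# Manually: compare each p-value against alpha / number of comparisons
bonferroni_alpha = alpha / len(p_values)  # 0.0125
print([p < bonferroni_alpha for p in p_values])

# Or use the ready-made implementation from statsmodels
reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha,
                                         method='bonferroni')
print(reject, p_adjusted)
```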
Calculating Sample Size and Test Duration
Analysts have to take into account the conditions in which A/B test samples are generated, including how long the test lasts and whether the peeking problem is relevant.
When deciding on test duration, cyclical changes in traffic (daily, weekly, monthly) and the time it takes a customer to decide to make a purchase are taken into account.
The peeking problem arises when intermediate results are checked before the test has collected enough data: random fluctuations at the beginning of the test can considerably distort the overall result.
This is a manifestation of the law of large numbers. If there are few observations, their dispersion will be greater; if there are many, random outliers cancel each other out. This means that when the sample is too small, you're more likely to see differences between the groups even when there's no real effect. For a statistical test, such a random fluctuation can temporarily push the p-value below the significance level, giving you grounds to reject the null hypothesis prematurely.
To make up for the peeking problem, sample size is determined before the test begins.
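To see why peeking is dangerous, here's a small simulation, a sketch assuming no real difference between the groups and a two-sided t-test: if you check the p-value after every new batch of observations and stop as soon as it dips below 0.05, you'll declare a winner far more often than the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
experiments, false_stops = 500, 0

for _ in range(experiments):
    # Both groups come from the same distribution: H0 is true
    a = rng.normal(0, 1, 2000)
    b = rng.normal(0, 1, 2000)
    # Peek every 100 observations and stop at the first "significant" result
    if any(stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05
           for n in range(100, 2001, 100)):
        false_stops += 1

print(f'Stopped on a false positive in {false_stops / experiments:.0%} of runs')
# Well above the nominal 5%, even though there is no real difference
```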
Here's the correct procedure for A/B testing: calculate the required sample size in advance, run the test until that sample size is reached, and only then analyze the results.
And here's the wrong way to do it: launch the test without a fixed sample size, check the results repeatedly as the data comes in, and stop as soon as the difference between the groups looks significant.
Calculators of Test Duration and Sample Size
One of the simplest ways of determining the sample size you need is to calculate it online; there are many A/B test calculators available (Evan Miller's sample size calculator is one well-known example).
These services are good for estimating the minimum sample size at which a change in the metric would be noticeable. That, in turn, helps calculate the minimum duration of the test.
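For illustration, here's roughly what such a calculator does under the hood, sketched with statsmodels; the 10% baseline conversion and the 2-percentage-point minimum detectable effect are made-up inputs:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10   # current conversion rate (made-up input)
mde = 0.02        # minimum detectable effect, absolute (made-up input)

# Cohen's h effect size for the two proportions
effect = proportion_effectsize(baseline + mde, baseline)

# Solve for the sample size that gives 80% power at alpha = 0.05
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative='two-sided'
)
print(f'Required sample size per group: {n_per_group:.0f}')
```

Dividing the required sample size by the expected daily traffic per group then gives a rough lower bound on the test's duration.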
Graphical Analysis of Metrics and Scope Definition
The pros of calculators:
- Simple calculation of sample size for conversion
- A solution to the peeking problem (sample size is fixed before the test starts)
- The possibility of estimating the minimum test duration
The cons:
- Sample size: necessary but often not sufficient for a test to be valid
- Calculators cannot account for the fact that in real life conversion and the relative minimum detectable effect never remain the same throughout the test
- Calculators only work well for calculating sample size for conversion. There are calculators for other metrics as well, but they're much more complex.
Determining the Timing and Minimum Duration of a Test Based on the Industry
When determining test duration and timing, you need to know what kinds of surges in activity characterize your audience.
Possible reasons for surges include:
- Workdays or weekends
- Holidays (increased demand for gifts)
- Sales, offers, marketing activities (discounts increase the audience's activity, changing its purchasing behavior)
- Special events (e.g. buying back-to-school goods in the fall)
- Product seasonality (e.g. heaters)
- Competitors' activity (e.g. competitors lower the price of a product, so your customers' activity falls)
- Changes in the political and economic situation (downturns, inflation, embargoes, rising prices because of additional tariffs)
In addition to surges in activity, you should also be aware of the realization cycle of the metric being measured. More often than not, it's linked with the buying decision process, or the time between the very first thought of buying a product and the actual purchase.