Implementing New Functionality
Implementation planning
A/B testing, or split testing, is a hypothesis-testing technique that helps measure the impact of service or product changes on users. The population is split into a control group (A) and a treatment group (B). Group A uses the regular service with no changes. Group B uses the new version, the one we need to test.
The experiment lasts for a fixed period of time (e.g., two weeks). If the key metric in the treatment group improves compared to the control group, the new functionality is implemented.
Before the A/B test, an A/A test (validity check) is often performed: both groups receive the same version, and the metrics should show no significant difference.
A/B test duration
A/B testing suffers from the peeking problem: if we check the results repeatedly while data is still coming in, the overall conclusion gets distorted.
With a small number of observations, the variance tends to be greater. If the sample is too small, random fluctuations can look like real differences. For a statistical test, this means the p-value may drop low enough to reject the null hypothesis of no difference, even when there is none.
To address the peeking problem, the sample size must be set before the start of the test.
Here's the correct procedure for A/B testing:
And here is how you should not conduct an A/B test:
The easiest way to calculate the sample size is to use an online calculator like this one.
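As a sketch of what such a calculator computes: for two groups and a chosen standardized effect size, significance level, and power, the standard normal-approximation formula gives the number of observations needed per group. The function name and the example effect size are ours, not from the source.

```python
# Sketch of a sample-size calculation for a two-sided test comparing two means.
# The normal-approximation formula used here is the standard one:
# n per group = 2 * (z_{1-alpha/2} + z_{power})^2 / effect_size^2
from math import ceil

from scipy import stats as st

def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate number of observations needed in each group."""
    z_alpha = st.norm.ppf(1 - alpha / 2)  # critical value for the significance level
    z_beta = st.norm.ppf(power)           # critical value for the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# a "small" standardized effect of 0.2 requires roughly 393 observations per group
print(sample_size_per_group(0.2))
```

The smaller the effect we want to detect, the larger the sample we need, which is why the size must be fixed before the experiment starts.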
Comparing the means
Measurement results and mean values contain an element of randomness and therefore a random error component. We cannot predict each observation's exact value, but we can estimate it using statistical methods.
Suppose our null hypothesis states that the new functionality does not improve metrics. Then the alternative hypothesis is that the new functionality improves metrics.
At the stage of hypothesis testing, two types of errors are possible:
- Type I error — when the null hypothesis is correct, but it is rejected (false positive result; new functionality is approved hence positive)
- Type II error — when the null hypothesis is wrong, but it is accepted (false negative result)
To decide whether to reject the null hypothesis, let's calculate the p-value (probability value) and compare it with the chosen significance level.
Note that if the p-value is greater than the threshold value, the null hypothesis should not be rejected. If it is less than the threshold, the null hypothesis can be rejected.
The mean values are compared using the methods for testing one-sided hypotheses.
If the data distribution is close to normal, the standard t-test is used to compare the means.
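For example, scipy.stats.ttest_ind() compares the means of two independent samples. The group values below are hypothetical:

```python
from scipy import stats as st

# hypothetical metric values for the control (A) and treatment (B) groups
group_a = [10.2, 9.8, 11.1, 10.5, 9.9, 10.3, 10.0, 10.7]
group_b = [10.9, 11.4, 10.8, 11.6, 11.0, 11.2, 10.9, 11.5]

# two-sample t-test; by default it tests the two-sided hypothesis
results = st.ttest_ind(group_a, group_b)
print('p-value:', results.pvalue)

if results.pvalue < 0.05:
    print('Reject the null hypothesis: the means differ')
else:
    print('Failed to reject the null hypothesis')
```

For a one-sided test, recent SciPy versions accept an alternative argument (e.g., alternative='less' to test whether the mean of group A is smaller).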
Confidence interval
A confidence interval is an interval on the number line that the population parameter of interest falls into with a predetermined probability. If the value falls into the range from 300 to 500 with 99% probability, then the 99-percent confidence interval for this value is (300, 500).
When calculating the confidence interval, we usually drop the same portion of extreme values from each of its ends.
The confidence interval is not just a range of random values: it also expresses how confident we are in the estimate.
Calculating the confidence interval
We can build a confidence interval for the mean based on the sample using the central limit theorem.
Denote the sample mean as x̄, the average of the n sample observations.
The central limit theorem says that the means of all possible samples of size n are normally distributed around the true mean of the population. Their variance is equal to the population variance divided by n (the sample size).
The standard deviation of this distribution is called the standard error (standard error of the mean, or SEM): SEM = s / √n, where s is the estimated standard deviation.
The larger the sample size, the smaller the standard error. The larger our sample, the more accurate the estimate.
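A quick check of this relationship, as a sketch with simulated normal data (the sizes 100 and 10,000 are arbitrary):

```python
import numpy as np
import pandas as pd

state = np.random.RandomState(54321)

# the standard error shrinks as the sample grows
sem_by_size = {}
for n in (100, 10000):
    # simulated observations with true mean 100 and standard deviation 15
    sample = pd.Series(state.normal(loc=100, scale=15, size=n))
    # pandas computes SEM as std(ddof=1) / sqrt(n)
    sem_by_size[n] = sample.sem()
    print(n, sem_by_size[n])
```

A 100-fold increase in the sample size cuts the standard error roughly tenfold, in line with the √n in the denominator.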
From the standard normal distribution, take the 5% percentile z(0.05) and the 95% percentile z(0.95) to obtain the 90% confidence interval:
Reexpressed for the mean, the interval is (x̄ + z(0.05)·SEM, x̄ + z(0.95)·SEM).
To calculate the standard error we need the population variance, which we estimate from the sample.
Since the variance is unknown and only estimated, we can't rely on the normal distribution and describe the sample mean with the Student distribution instead. By putting the 5% percentile t(0.05) and the 95% percentile t(0.95) into the formula, we obtain: (x̄ + t(0.05)·SEM, x̄ + t(0.95)·SEM).
The calculation can be simplified by using the Student distribution scipy.stats.t. It has a function for the confidence interval, interval(), with the following parameters:
- alpha — confidence level (e.g., 0.95 for a 95% interval; called confidence in newer SciPy versions)
- df — number of degrees of freedom (equal to n − 1)
- loc — mean of the distribution, equal to the mean estimate; for a sample it is calculated as sample.mean()
- scale — standard error of the distribution, equal to the standard error estimate; calculated as sample.sem()

```python
import pandas as pd
from scipy import stats as st

confidence_interval = st.t.interval(
    alpha,                # confidence level
    len(sample) - 1,      # degrees of freedom
    loc=sample.mean(),    # distribution mean
    scale=sample.sem()    # standard error
)
```
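As a sketch, here is st.t.interval() applied to a small hypothetical sample of purchase amounts:

```python
import pandas as pd
from scipy import stats as st

# hypothetical purchase amounts
sample = pd.Series([428, 382, 463, 418, 441, 459, 404, 391, 436, 412])

confidence_interval = st.t.interval(
    0.95,                 # confidence level
    len(sample) - 1,      # degrees of freedom
    loc=sample.mean(),    # mean estimate
    scale=sample.sem()    # standard error estimate
)

print('mean:', sample.mean())
print('95% confidence interval:', confidence_interval)
```

The interval is centered on the sample mean; a higher confidence level or a smaller sample would make it wider.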
Bootstrap
Values whose distribution is hard to derive analytically can be estimated with the help of the bootstrap technique.
To obtain the desired value, we repeatedly draw subsamples from the source dataset and calculate the statistic (for example, the mean) from each of them.
Bootstrap is applicable to any samples. It is useful when:
- Observations are not described by normal distribution;
- There are no statistical tests for the target value.
Bootstrap for confidence interval
Let's find out how to form subsamples for bootstrap. You're already familiar with the sample() function. For this task we need to call it in a loop. But here we hit a problem:
```python
for i in range(5):
    # extract one random element from sample 1
    # specify random_state for reproducibility
    print(data.sample(1, random_state=54321))
```
Since we specify the same random_state, the random element is always the same. To address that, create a RandomState() instance from the numpy.random module:
```python
from numpy.random import RandomState

state = RandomState(54321)
```
This instance can be passed to the random_state argument of any function. Importantly, with each new call its internal state changes, so we get different subsamples:
```python
for i in range(5):
    # extract one random element from sample 1
    print(data.sample(1, random_state=state))
```
Another important detail when creating subsamples is that elements should be selected with replacement. To do this, specify replace=True for the sample() function.
```python
import pandas as pd

example_data = pd.Series([1, 2, 3, 4, 5])

print("Without replacement")
print(example_data.sample(frac=1, replace=False, random_state=state))

print("With replacement")
print(example_data.sample(frac=1, replace=True, random_state=state))
```
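Putting the pieces together, a percentile bootstrap confidence interval for the mean can be sketched as follows (the data is hypothetical):

```python
import pandas as pd
from numpy.random import RandomState

state = RandomState(54321)

# hypothetical skewed data; we want a 90% confidence interval for the mean
data = pd.Series([1, 1, 2, 2, 3, 3, 4, 5, 8, 21])

means = []
for i in range(1000):
    # a subsample of the same size, drawn with replacement
    subsample = data.sample(frac=1, replace=True, random_state=state)
    means.append(subsample.mean())

means = pd.Series(means)
lower = means.quantile(0.05)   # 5% percentile of the bootstrap means
upper = means.quantile(0.95)   # 95% percentile of the bootstrap means
print('90% confidence interval:', (lower, upper))
```

No normality assumption is needed here: the quantiles are taken directly from the bootstrap distribution of the mean.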
Bootstrap for A/B test analysis
Bootstrap is also used to analyze the results of A/B testing.
We calculate the actual difference of target parameters between the groups. Then we form and test the hypotheses. The null hypothesis is that there is no difference between the target parameters of both groups. The alternative hypothesis is that in the treatment group, the target parameter value is larger.
Now, let's find the probability that such a difference was obtained by chance (this will be our p-value).
Create many subsamples from the combined data and divide each subsample in two at the index equal to the size of group A.
Find the difference of the average purchase amount between the two parts.
Then assess the share of bootstrap differences that turned out to be no less than the actual difference between the original samples:
```python
import pandas as pd
import numpy as np

# actual difference between the means in the groups
AB_difference = samples_B.mean() - samples_A.mean()

alpha = 0.05

state = np.random.RandomState(54321)

bootstrap_samples = 1000
count = 0
for i in range(bootstrap_samples):
    # calculate how many times the difference
    # between the means will exceed
    # the actual value, provided that the null hypothesis is true
    united_samples = pd.concat([samples_A, samples_B])
    subsample = united_samples.sample(
        frac=1,
        replace=True,
        random_state=state
    )

    subsample_A = subsample[:len(samples_A)]
    subsample_B = subsample[len(samples_A):]
    bootstrap_difference = subsample_B.mean() - subsample_A.mean()

    if bootstrap_difference >= AB_difference:
        count += 1

pvalue = 1. * count / bootstrap_samples
print('p-value =', pvalue)

if pvalue < alpha:
    print("Reject null hypothesis: average purchase amount is likely to increase")
else:
    print("Failed to reject null hypothesis: average purchase amount is unlikely to increase")
```
Bootstrap for models
Bootstrap can also be used to assess confidence intervals for the metrics of machine learning models.
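As a sketch, one can bootstrap the test set to get a confidence interval for a model's RMSE. The true values and predictions below are hypothetical:

```python
import numpy as np
import pandas as pd

state = np.random.RandomState(54321)

# hypothetical model output: true values and predictions on a test set
y_true = pd.Series([3.1, 2.8, 4.0, 3.5, 5.1, 2.2, 4.4, 3.9, 2.9, 4.8])
y_pred = pd.Series([3.0, 3.0, 3.8, 3.6, 4.9, 2.5, 4.1, 4.0, 3.1, 4.6])

rmse_values = []
for i in range(1000):
    # resample the test set with replacement, keeping (true, prediction) pairs aligned
    indices = state.randint(0, len(y_true), size=len(y_true))
    errors = y_true.values[indices] - y_pred.values[indices]
    rmse_values.append((errors ** 2).mean() ** 0.5)

rmse_values = pd.Series(rmse_values)
lower = rmse_values.quantile(0.025)
upper = rmse_values.quantile(0.975)
print('95% confidence interval for RMSE:', (lower, upper))
```

The same pattern works for any metric: resample the evaluation set, recompute the metric, and take quantiles of the resulting distribution.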