Chapter Summary: Analyzing the Results of A/B Tests
Testing the Hypothesis that Proportions Are Equal
A typical task in statistics is to test hypotheses about the equality of proportions of populations. As with the mean, sample proportions will be normally distributed around the actual one.
The difference between the proportions we observe in our samples will be our statistic. Here's how you can prove that it's distributed normally:

$$Z = \frac{(p_1 - p_2) - (P_1 - P_2)}{\sqrt{P(1 - P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \sim N(0, 1)$$

$Z$ is the standard value for a criterion with a standard normal distribution, where the mean is 0 and the standard deviation is 1. All this is stated in the right part of the formula after the $\sim$ sign, which says the expression is distributed as $N(0, 1)$.

$n_1$ and $n_2$ represent the sizes of the two samples being compared. $p_1$ and $p_2$ are the proportions observed in the samples, and $P$ is the proportion in the combined sample made up of samples 1 and 2. $P_1$ and $P_2$ are the actual proportions in the populations we're comparing.

With A/B testing, one usually tests the hypothesis that $P_1 = P_2$. Then, if the null hypothesis is true, the term $(P_1 - P_2)$ in the numerator equals 0, and the criterion can be calculated using only the sample data.
Statistics obtained this way will be normally distributed, making it possible to carry out two-sided and one-sided (bilateral and unilateral) tests. Using the same null hypothesis that two populations' proportions are equal, we can test the alternative hypotheses that either the proportions simply aren't equal, or that one proportion is larger or smaller than the other.
from scipy import stats as st
import numpy as np
import math as mth

alpha = .05  # significance level

successes = np.array([78, 120])
trials = np.array([830, 909])

# success proportion in the first group:
p1 = successes[0] / trials[0]

# success proportion in the second group:
p2 = successes[1] / trials[1]

# success proportion in the combined dataset:
p_combined = (successes[0] + successes[1]) / (trials[0] + trials[1])

# the difference between the datasets' proportions
difference = p1 - p2
Let's calculate the statistic in terms of standard deviations of the standard normal distribution:
# calculating the statistic in standard deviations of the standard normal distribution
z_value = difference / mth.sqrt(p_combined * (1 - p_combined) * (1/trials[0] + 1/trials[1]))

# setting up the standard normal distribution (mean 0, standard deviation 1)
distr = st.norm(0, 1)
If the proportions were equal, the difference between them would be 0. Let's calculate how far from 0 our statistic turned out to be.
Since the statistic's distribution is normal, we'll call the cdf() method of the distribution object. We'll take the absolute value of the statistic with the built-in abs() function and double the tail probability, because the test is two-sided.
# calculating the statistic in standard deviations of the standard normal distribution
z_value = difference / mth.sqrt(p_combined * (1 - p_combined) * (1/trials[0] + 1/trials[1]))

# setting up the standard normal distribution (mean 0, standard deviation 1)
distr = st.norm(0, 1)

p_value = (1 - distr.cdf(abs(z_value))) * 2

print('p-value: ', p_value)

if (p_value < alpha):
    print("Rejecting the null hypothesis: there is a significant difference between the proportions")
else:
    print("Failed to reject the null hypothesis: there is no reason to consider the proportions different")
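The block above runs a two-sided test. As a minimal sketch that isn't part of the original summary, a one-sided variant (testing whether the first proportion is smaller than the second) could reuse z_value, distr, and alpha from above:

# one-sided alternative: the first proportion is smaller than the second
# reuses z_value, distr, and alpha defined above
p_value_one_sided = distr.cdf(z_value)  # probability of a statistic this low or lower under the null hypothesis

print('one-sided p-value: ', p_value_one_sided)

if (p_value_one_sided < alpha):
    print("Rejecting the null hypothesis: the first proportion is significantly smaller")
else:
    print("Failed to reject the null hypothesis")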
Normality Tests. The Shapiro–Wilk Test
In real life, many variables diverge from the normal distribution because they contain outliers that cannot be ignored. To test whether a dataset is accurately modeled by the normal distribution, we use normality tests.
According to the central limit theorem, sample means are normally distributed around the population's true mean. This is also true for distributions containing major outliers. However, each individual sample could contain outliers that shift the results.
First you need to test the hypothesis that a sample was taken from a normally distributed population.
One way to do this is $\chi^2$ (chi-squared). The sum of squared differences between observed and expected values is divided by the expected values:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

$O_i$ in this formula is the observed values, while $E_i$ is the expected ones. It's assumed that the difference between the expected and observed values is normally distributed around 0, so that the probability of deviations decreases as you move in either direction away from the expected values. So, this criterion will be distributed as the sum of $n$ squared standard normal distributions, where $n$ is the number of observations in a sample. This is the chi-squared distribution.
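To make the formula concrete, here's a minimal sketch (not from the original summary) with made-up observed and expected counts; scipy.stats.chisquare computes the same statistic:

from scipy import stats as st
import numpy as np

# hypothetical observed and expected counts (illustration only)
observed = np.array([18, 55, 27])
expected = np.array([25, 50, 25])

# the criterion by the formula: sum of squared differences divided by the expected values
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# the same calculation with scipy
chi2_scipy, p_value = st.chisquare(observed, expected)

print(chi2_manual, chi2_scipy, p_value)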
The Shapiro–Wilk test is better suited for testing normality.
It is significantly more complex, and its high power is easier to demonstrate empirically on a range of datasets than to prove theoretically.
The calculation of the Shapiro–Wilk criterion is built into the scipy.stats module. Let's see how it works in practice.

sample_1 stores data on the weekly numbers of user sessions on a website over a year. Let's use the st.shapiro(x) function to test whether the variable can be considered normally distributed:
from scipy import stats as st

alpha = .05  # significance level

results = st.shapiro(sample_1)
p_value = results[1]  # the second value in the array of results (index 1) is the p-value

print('p-value: ', p_value)

if (p_value < alpha):
    print("Null hypothesis rejected: the distribution is not normal")
else:
    print("Failed to reject the null hypothesis: the distribution seems to be normal")
The Wilcoxon-Mann-Whitney Nonparametric Test
When your data contains substantial outliers, algebraic metrics don't work very well. They take every value into account, so one outlying value can throw everything off.
Algebraic criteria for testing hypotheses about the normality of the original data are parametric, meaning you use a sample to estimate the parameters of the expected distribution (e.g., the mean). So let's now look at a test based on a structural approach, or a nonparametric test. The method we'll use for A/B testing is st.mannwhitneyu() (the Mann-Whitney U test).
The key idea behind the test is to rank two samples in ascending order and compare the ranks of values that appear in both samples. If the differences between their ranks are the same from sample to sample, this means the shift is typical. That means some values were simply added, causing the rest to shift.
Nontypical shifts mean that a real change occurred. The sum of such shifts in rank (from #1 to #4 would be 3, etc.) is the value of the criterion. The higher it is, the higher the probability that the distributions of the two samples differ.
The probabilities of getting various values from a Mann-Whitney test have been calculated theoretically, which makes it possible for us to conclude there is or isn't a difference for whatever significance level was set.
Nonparametric methods are useful because they do not make assumptions about how the data is distributed. Such methods are often used when it's difficult (or even impossible) to estimate parameters because of a large number of outliers.
alpha = .05  # significance level

results = st.mannwhitneyu(messages_old, messages_new)

print('p-value: ', results.pvalue)

if (results.pvalue < alpha):
    print("Null hypothesis rejected: the difference is statistically significant")
else:
    print("Failed to reject the null hypothesis: we can't make conclusions about the difference")
Stability of Cumulative Metrics
To prevent the peeking problem, analysts examine graphs of cumulative metrics. If you plot a graph with cumulative data, at the first-day point you'll have the metric values for that day, at the second-day point the sum of the metrics for the first two days, and so on.
According to the central limit theorem, the values of cumulative metrics often converge and settle around a particular mean.
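To make the idea of a cumulative metric concrete, here's a minimal sketch with made-up numbers (the daily_revenue table is hypothetical, not from the original) that builds the cumulative series with pandas' cumsum():

import pandas as pd

# hypothetical daily metric values (illustration only)
daily_revenue = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=5, freq='D'),
    'revenue': [120, 95, 143, 110, 130],
})

# each point is the sum of the metric from day one up to that day
daily_revenue['cumulative_revenue'] = daily_revenue['revenue'].cumsum()

print(daily_revenue)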
To make differences between groups more obvious, analysts plot relative difference graphs. Each point is calculated as follows: group B cumulative metric / group A cumulative metric - 1.
Other functions that are quite useful for calculating cumulative metrics are np.logical_and(), np.logical_or(), and np.logical_not(). They allow us to apply Boolean operations to Series objects.
1np.logical_and(first_condition, second_condition)2np.logical_or(first_condition, second_condition)3np.logical_not(first_condition)
Let's say we have two DataFrames, orders and visitors, which contain data on an online store:
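The aggregation code below relies on a datesGroups table that isn't defined in this summary. One plausible way to build it (an assumption, since the original definition isn't shown here) is to take every unique date-group pair from orders:

# numpy and pandas are needed for the aggregation below
import numpy as np
import pandas as pd

# assumption: datesGroups holds every unique date-group combination from the orders table
datesGroups = orders[['date', 'group']].drop_duplicates()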
# getting aggregated cumulative daily data on orders
ordersAggregated = datesGroups.apply(lambda x: orders[np.logical_and(orders['date'] <= x['date'], orders['group'] == x['group'])].agg({'date' : 'max', 'group' : 'max', 'orderId' : pd.Series.nunique, 'userId' : pd.Series.nunique, 'revenue' : 'sum'}), axis=1).sort_values(by=['date','group'])

# getting aggregated cumulative daily data on visitors
visitorsAggregated = datesGroups.apply(lambda x: visitors[np.logical_and(visitors['date'] <= x['date'], visitors['group'] == x['group'])].agg({'date' : 'max', 'group' : 'max', 'visitors' : 'sum'}), axis=1).sort_values(by=['date','group'])

# merging the two tables into one and giving its columns descriptive names
cumulativeData = ordersAggregated.merge(visitorsAggregated, left_on=['date', 'group'], right_on=['date', 'group'])
cumulativeData.columns = ['date', 'group', 'orders', 'buyers', 'revenue', 'visitors']
Let's plot cumulative revenue graphs by day and A/B test group:
import matplotlib.pyplot as plt

# DataFrame with cumulative orders and cumulative revenue by day, group A
cumulativeRevenueA = cumulativeData[cumulativeData['group']=='A'][['date','revenue', 'orders']]

# DataFrame with cumulative orders and cumulative revenue by day, group B
cumulativeRevenueB = cumulativeData[cumulativeData['group']=='B'][['date','revenue', 'orders']]

# plotting the group A revenue graph
plt.plot(cumulativeRevenueA['date'], cumulativeRevenueA['revenue'], label='A')

# plotting the group B revenue graph
plt.plot(cumulativeRevenueB['date'], cumulativeRevenueB['revenue'], label='B')

plt.legend()
Now let's plot average purchase size by group. We'll divide cumulative revenue by the cumulative number of orders:
plt.plot(cumulativeRevenueA['date'], cumulativeRevenueA['revenue']/cumulativeRevenueA['orders'], label='A')
plt.plot(cumulativeRevenueB['date'], cumulativeRevenueB['revenue']/cumulativeRevenueB['orders'], label='B')
plt.legend()
Let's plot a relative difference graph for the average purchase sizes. We'll add a horizontal reference line at zero with the axhline() method (i.e., a horizontal line across the axis):
# gathering the data into one DataFrame
mergedCumulativeRevenue = cumulativeRevenueA.merge(cumulativeRevenueB, left_on='date', right_on='date', how='left', suffixes=['A', 'B'])

# plotting a relative difference graph for the average purchase sizes
plt.plot(mergedCumulativeRevenue['date'], (mergedCumulativeRevenue['revenueB']/mergedCumulativeRevenue['ordersB'])/(mergedCumulativeRevenue['revenueA']/mergedCumulativeRevenue['ordersA'])-1)

# adding a horizontal reference line at zero
plt.axhline(y=0, color='black', linestyle='--')
Analyzing Outliers and Surges: Extreme Values
One problem you may encounter during A/B test analysis is outliers/anomalies. An anomaly is a value that appears rarely in a statistical population but can introduce error when it does.
Distribution histograms and diagrams are quite helpful when it comes to analyzing anomalies.
If we divide an ordered set into 100 parts instead of four, we'll get percentiles. These work on the same principle as quartiles: the $n$th percentile marks the value greater than $n$ percent of the values in the sample. The chance that a random value is smaller than the $n$th percentile is $n$ percent.
To calculate percentiles, you'll need the percentile() function from numpy:
# values - the range of values
# percentiles - the array of percentiles to be calculated

import numpy as np
np.percentile(values, percentiles)
To detect anomalies, you'll need to analyze the 95th, 97.5th, and 99th percentiles.
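As an illustration (a sketch, not from the original summary), the same percentiles can be applied to the revenue column of the orders table, treating everything above the 99th percentile as an anomaly:

import numpy as np

# the revenue percentiles used to judge how extreme the largest orders are
print(np.percentile(orders['revenue'], [95, 97.5, 99]))

# treating everything above the 99th percentile as an anomaly (one possible cutoff)
revenue_limit = np.percentile(orders['revenue'], 99)
anomalous_orders = orders[orders['revenue'] > revenue_limit]
print('share of anomalous orders:', len(anomalous_orders) / len(orders))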
Common Mistakes in A/B Test Analysis
Splitting test traffic incorrectly
This is a common problem when users get incorrectly divided into segments.
For instance, you can't consider traffic division correct if the A group uses the mobile version of a site while the B group uses the desktop version.
Having groups of different sizes distorts the results, too.
Ignoring statistical significance
Decisions about differences in test results are often made exclusively on the basis of relative change, without checking whether that difference is statistically significant.
The peeking problem
Checking the data and taking action before the A/B test is over should be avoided. Analysts are often asked to make decisions based on interim results. Don't give in!
The sample is too small
A small sample size means the impact of each individual observation will be too strong, so there will be neither significance nor precision.
The test was too short
You launched a test, got your sample, and made a decision, but the test results varied a lot under real-life conditions.
You should only make decisions when you're sure of the results.
The test was too long
You shouldn't go to the other extreme, either. Sometimes the results haven't stabilized yet and are still fluctuating, but continuing the experiment won't have any impact on your decision.
Failing to analyze anomalies
Never forget about anomalies, and keep them in mind when analyzing results.
Neglecting to correct statistical significance with multiple comparisons
The more groups you have in your test, the more often you'll get a false positive result for at least one comparison.
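One common way to compensate is the Bonferroni correction, which divides the significance level by the number of comparisons. Here's a minimal sketch with hypothetical p-values (illustration only, not prescribed by this summary):

# hypothetical p-values from three pairwise comparisons (illustration only)
p_values = [0.04, 0.012, 0.30]

alpha = .05  # overall significance level
bonferroni_alpha = alpha / len(p_values)  # adjusted level for each individual comparison

for i, p_value in enumerate(p_values, start=1):
    if p_value < bonferroni_alpha:
        print(f"Comparison {i}: rejecting the null hypothesis (p = {p_value})")
    else:
        print(f"Comparison {i}: failed to reject the null hypothesis (p = {p_value})")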