Table of contents
Overview
Choose a metric
Review statistics
Design
Analyze
0 - Definition of A/B testing
A general methodology used online to test a new product or feature by comparing a control set of users with an experiment set of users.
1 - What A/B testing can do
A/B testing can evaluate a wide variety of changes, such as:
new features;
additions to the UI;
a different look for the website;
ranking changes;
backend changes such as loading time;
the layout of the initial page.
2 - What A/B testing cannot do
New experiences (change aversion and novelty effects). Two issues with testing a new experience: 1) what is the baseline for comparison? 2) how much time do users need to adapt to the new experience (the plateaued experience)?
A/B testing cannot tell you if you’re missing something;
Long-horizon questions, e.g., will a change increase repeat customers or referrals for a car-selling or apartment-rental website? Such effects take too long to measure and there isn't enough data.
3 - Other complementary techniques to A/B testing
Logs of user behavior can be used retrospectively or observationally to develop hypotheses about what is causing changes in user behavior.
User experience research, focus groups, surveys, human evaluation and so on can give deep, qualitative data.
4 - History of A/B testing
It came from fields such as agriculture: divide a field into several sections for experiments.
Clinical trials in medicine also use it to test treatments.
What's different for online A/B testing: more data but lower resolution. The goal is to design an experiment that is robust and gives reproducible results.
5 - A business example
A typical user flow (customer funnel): Homepage visit –> Explore the site –> Create account –> Complete (e.g., finish a course).
A simple experiment: change the color of a button on the online course website "Audacity". Hypothesis: changing the button color from orange to pink will increase the number of users who explore the website's courses.
Metric
Click-through rate (CTR): (# of clicks) / (# of pageviews). Use a rate when measuring the usability of the site.
Click-through probability: (# of unique visitors who click) / (# of unique visitors to the page). We are interested in whether users progress to the second level of the funnel by clicking the button, so a probability is more appropriate than a rate. Use a probability when measuring the total impact.
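A minimal sketch in Python (hypothetical visitor log, not from the course) of how the two metrics differ: the rate counts every click, while the probability counts each unique visitor at most once.

```python
events = [  # (visitor_id, clicked) -- one row per pageview, made-up data
    ("u1", True), ("u1", True), ("u1", True),
    ("u2", False), ("u3", True), ("u4", False),
]

clicks = sum(clicked for _, clicked in events)            # 4 clicks
pageviews = len(events)                                   # 6 pageviews
ctr = clicks / pageviews                                  # click-through rate ~= 0.67

visitors = {uid for uid, _ in events}                     # {u1, u2, u3, u4}
clickers = {uid for uid, clicked in events if clicked}    # {u1, u3}
ctp = len(clickers) / len(visitors)                       # click-through probability = 0.5
print(ctr, ctp)
```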
Review statistics
- We expect the click-through probability to follow a binomial distribution. Instead of using the total number of successes (clicks) as the random variable, we use the proportion.
Binomial distributions have three assumptions: 1) exactly two types of outcomes (success or failure); 2) independent events; 3) identical distribution (the probability p is the same for all events).
Use the standard error from the binomial distribution to estimate how variable we expect the overall click-through probability to be (if X follows B(n, p) and n is large and/or p is close to 0.5, then X is approximately N(np, npq)). Construct, for example, a 95% confidence interval: if we repeated the experiment many times, we would expect the intervals constructed around the sample mean to cover the true value 95% of the time.
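A small sketch of this normal-approximation confidence interval, with made-up counts (100 clicks out of 1,000 pageviews):

```python
import math

# Hypothetical numbers: 100 clicks out of 1,000 pageviews.
n, x = 1000, 100
p_hat = x / n                                   # estimated click-through probability
se = math.sqrt(p_hat * (1 - p_hat) / n)         # standard error of the proportion
z = 1.96                                        # z* for a 95% confidence interval
ci = (p_hat - z * se, p_hat + z * se)
print(f"p_hat = {p_hat:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```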
Choice of two- / one- tailed tests depends on what action you will take based on the results. If you’re going to launch the experiment for a statistically significant positive change, and otherwise not, then you don’t need to distinguish between a negative result and no result, so a one-tailed test is good enough. If you want to learn the direction of the difference, then a two-tailed test is necessary.
Analyze results
Hypothesis testing (inference) asks how likely your results are to be due to chance. The null hypothesis could be that there is no difference in click-through probability: $H_0: d = p_e - p_c = 0$, where $\hat{d}=\hat{p}_e-\hat{p}_c$ is the observed difference. The alternative hypothesis could be that one is higher, lower, or simply different from the other.
The pooled standard error can be used to compare the two groups. Suppose $x_c$ and $x_e$ are the numbers of users who click in the control and experiment groups, respectively, and $N_c$ and $N_e$ are the corresponding total numbers of users. The pooled probability is then $\hat{p}=\frac{x_c+x_e}{N_c+N_e}$ and the pooled standard error of $\hat{d}$ is $$s_p=\sqrt{\hat{p}\cdot(1-\hat{p})\cdot(\frac{1}{N_c}+\frac{1}{N_e})}.$$
Under $H_0$, $\hat{d}\sim N(0, s_p^2)$ approximately; if $|\hat{d}| > 1.96 \cdot s_p$, reject $H_0$.
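A hedged sketch of the pooled test above with hypothetical counts; the decision rule is the $1.96 \cdot s_p$ threshold from the text:

```python
import math

# Hypothetical counts for control and experiment groups.
x_c, n_c = 974, 10072    # clicks, users in control
x_e, n_e = 1242, 9886    # clicks, users in experiment

p_pool = (x_c + x_e) / (n_c + n_e)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_e))

d_hat = x_e / n_e - x_c / n_c     # observed difference in click-through probability
margin = 1.96 * se_pool           # 95% margin of error under H0

# Reject H0 if the observed difference lies outside +/- margin.
print(f"d_hat = {d_hat:.4f}, margin = {margin:.4f}, reject H0: {abs(d_hat) > margin}")
```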
From a business perspective, what change in the click-through probability is practically significant (substantive)? Whether a change is practically significant depends on the specific situation, for example, the amount of investment needed to make the change.
When the difference doesn't meet the bar for both statistical and practical significance, one could run additional tests to gain more confidence before drawing a conclusion. When there is not enough time to do so, communicate with decision makers; other factors such as strategic business issues can be useful as well.
Design experiment
The smaller the change to be detected and the higher the desired confidence level, the larger the experiment needs to be.
$$\alpha=P(\text{reject } H_0 \mid H_0 \text{ is true})$$ $$\beta=P(\text{fail to reject } H_0 \mid H_1 \text{ is true})$$ $1-\beta$ is the statistical power. A bigger sample size increases the power $1-\beta$ without increasing $\alpha$.
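A rough illustration, under the usual normal approximation and with made-up numbers, of how power depends on the sample size and the minimum difference you want to detect:

```python
from scipy.stats import norm

# Sketch under a normal approximation (hypothetical values): power to detect
# a minimum difference d_min in a two-proportion test.
p_base, d_min = 0.10, 0.02     # baseline probability and effect to detect
n, alpha = 1000, 0.05          # users per group, significance level

se = (2 * p_base * (1 - p_base) / n) ** 0.5   # SE of the difference under the baseline
z_crit = norm.ppf(1 - alpha / 2)              # two-tailed critical value
power = 1 - norm.cdf(z_crit - d_min / se)     # P(reject H0 | true difference = d_min)
print(f"power = {power:.3f}")                  # grows with n, shrinks for smaller d_min
```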
Policy and Ethics for Experiments
This lesson introduces the principles and questions that you should think about when designing, running, and analyzing experiments.
Main principles
Risk: what risk is the participant undertaking?
Benefits: what benefits might result from the study?
Alternatives: what other choices do participants have?
Data sensitivity: what data is being collected, and what is the expectation of privacy and confidentiality?
Choosing and Characterizing Metrics
Define metrics
Think about what you are going to use the metric for before deciding how to define it.
There are different types of metrics for different purposes; for example, a metric can be used for invariant checking or for evaluation.
One can create a composite metric when there are multiple metrics (an objective function, or an OEC (Overall Evaluation Criterion): a weighted function that combines the different metrics). Alternatively, work on individual metrics separately for more detail on how each metric moves under different situations.
Examples of choosing a metric for tests of the Audacity website: 1) probability of progressing from the course list to a course page, for the effect of updating a description on the course list page; 2) click-through rate, for the effect of increasing the size of a button on how easy it is to find.
Other metrics may include mean, median, quantiles …
Difficulties with metrics: 1) no access to the data; 2) it takes too long to collect the data.
Other techniques
- Surveys, retrospective analysis, focus groups, user experience research … See instructor’s notes for more details.
Build intuition of metrics
1) Define what data will be used to compute the metric. Do you need to filter the data to remove robots/spam? (Checking trends over time, by country, or year-over-year differences can help identify whether the trends are normal.) 2) How to summarize the metric? Mean? Median? ...
De-bias the data by filtering out both external effects (malicious or fraudulent visits, ...) and internal effects (the change only affects the traffic of a subset of users, for example, it may only apply to English versions of the website).
Make sure you don't introduce bias when filtering. For example, if a metric only uses data from logged-in users, there may be a bias because new users who haven't created an account are excluded.
Slice the data (for example, by country, language, or week) to examine whether the data are biased, and to build intuition about what changes to expect.
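A small sketch (hypothetical pageview log, pandas assumed available) of slicing a click-through metric by country and week to spot a suspicious segment:

```python
import pandas as pd

# Made-up pageview counts; slice the metric before trusting the aggregate.
df = pd.DataFrame({
    "country":   ["US", "US", "DE", "DE", "FR", "FR"],
    "week":      [1, 2, 1, 2, 1, 2],
    "clicks":    [120, 130, 40, 5, 60, 65],
    "pageviews": [1000, 1050, 400, 390, 600, 610],
})

sliced = df.groupby(["country", "week"])[["clicks", "pageviews"]].sum()
sliced["ctr"] = sliced["clicks"] / sliced["pageviews"]
print(sliced)   # DE in week 2 stands out (~0.01 vs ~0.10 elsewhere) and deserves a closer look
```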
Characterize
Summary metrics (summaries of direct measurements such as the number of clicks or pageviews)
Sums and counts (e.g., number of users who visited pages.)
Means, medians and percentiles (e.g., mean age of users who completed a course, median pageload latency)
Probabilities and rates
Ratios (can be any number) (e.g., ratio of revenue from clicks to the number of all clicks)
Sensitivity and robustness of metrics
We want a metric that is sensitive enough to detect the changes we care about, while also being robust.
Run experiments, or use experiments you already have, to test sensitivity and robustness: use A/A tests to see if the metric is too sensitive, and look back at past experiments that were run earlier.
Use retrospective analysis when no experiment can be run. For example, when choosing a metric for video latency, segment the data into different categories (check the latency distribution of each video), then analyze and compare different metrics.
Variability: to obtain a confidence interval, one needs 1) the variance/standard deviation and 2) a distribution. Analytical confidence intervals are easier to find for metrics such as means, probabilities, and counts; for metrics such as medians or ratios, the distribution and variance can be more complicated to compute, and one needs to look at the underlying data to decide on a further step.
Non-parametric methods make no assumptions about the distribution and can also be useful. For example, a sign test can tell you how likely the observed change in the metric is to occur by chance, but it cannot tell you the size of the effect.
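A minimal sign-test sketch with made-up daily results, assuming scipy >= 1.7 for `binomtest`:

```python
from scipy.stats import binomtest

# Hypothetical data: the experiment beat the control on 14 out of 16 days;
# under H0 each day is a fair coin flip (p = 0.5).
n_days, n_positive = 16, 14
result = binomtest(n_positive, n_days, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.4f}")   # a small p-value => change unlikely to be chance
```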
Empirical variability is an option when the distribution is complicated. A/A tests across the board can be used to estimate the empirical variability of the metrics. Another option is to run one bigger A/A test and use the bootstrap when it's not easy to run many tests.
What an A/A test can do: 1) compare results to what you expect (sanity check); 2) estimate variance and calculate confidence intervals (with an assumed distribution of the metric); 3) directly estimate an empirical confidence interval without assuming a distribution of the metric (run multiple tests and take percentiles of the metric values).
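A sketch of option 3: simulate repeated A/A tests (hypothetical traffic numbers) and take percentiles of the observed differences as an empirical confidence interval:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 40 A/A tests: both groups are drawn from the same distribution,
# so the spread of the differences reflects pure noise in the metric.
n_tests, n_users, p_true = 40, 5000, 0.10
diffs = []
for _ in range(n_tests):
    a = rng.binomial(n_users, p_true) / n_users
    b = rng.binomial(n_users, p_true) / n_users
    diffs.append(a - b)

lower, upper = np.percentile(diffs, [2.5, 97.5])
print(f"empirical 95% CI for the difference under no change: ({lower:.4f}, {upper:.4f})")
```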
Designing an Experiment
Choose “subject”
- How do we assign events to either the control or experiment group? Random assignment per event? Not suitable for user-visible changes, as a user might see the change appear and disappear while reloading the page. User ID? One user may have multiple IDs. Cookie? A cookie differs across a user's devices.
Commonly used unit of diversion:
1- User ID: stable, unchanging, personally identifiable.
2- Anonymous ID (cookie): changes when you switch browsers or devices, and users can clear cookies.
3- Event: users may not have a consistent experience; better used for non-user-visible changes.
Less commonly used unit of diversion:
4- Device ID: only available on mobile, tied to a specific device, unchangeable by the user, personally identifiable.
5- IP address: changes when location changes.
Consistency of diversion: for user-visible changes, a cookie or user ID ensures that a user has a consistent experience, and is better when a learning effect of the change is of interest. An event-based unit is probably fine for checking the ranking of search results or loading time, where a learning effect is not a concern.
Ethical consideration of diversion: informed consent, security and confidentiality questions.
Variability: the unit of analysis (the denominator of the metric, if applicable) can differ from the unit of diversion. When they differ, the empirical variability of the metric may not match the analytical variance.
Choose “population”
Inter- and intra-user experiments: A/B testing mainly uses inter-user experiments (i.e., different people on the A and B sides). Intra-user experiments use the same group of people on both sides. For example, to test the reordering of a list, an interleaved ranking experiment can be an option. See more details at http://www.cs.cornell.edu/people/tj/publications/chapelle_etal_12a.pdf.
Target population: only run the experiment on the affected traffic. Relevant factors include browser, language, geo location...
Cohort: users who enter the experiment around the same time within a population. A cohort is harder to analyze because it needs more data and you lose users over time. A cohort is useful when you're looking at user stability (learning effects, user retention, increased user activity, or anything that requires established users). The cohort choice can also affect the variability of the metric.
Size
- The choice of size depends on $\alpha$, $\beta$, practical significance level, unit of diversion, target population… See here for the sample code on how to compute the desired size.
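A hedged sketch of one common analytic sample-size formula for comparing two proportions (the baseline, effect size, $\alpha$, and power below are made up; real choices depend on the unit of diversion and target population):

```python
from scipy.stats import norm

# Hypothetical inputs for a two-proportion comparison.
p_base, d_min = 0.10, 0.02      # baseline probability, practical significance level
alpha, power = 0.05, 0.80

z_a = norm.ppf(1 - alpha / 2)   # critical value for a two-tailed test
z_b = norm.ppf(power)           # quantile matching the desired power
p_alt = p_base + d_min

n_per_group = ((z_a + z_b) ** 2 *
               (p_base * (1 - p_base) + p_alt * (1 - p_alt)) / d_min ** 2)
print(f"need about {int(round(n_per_group))} users per group")
```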
Duration and Exposure
Translate the ideal population size into practical decisions: what is the duration, and when should the experiment run?
There may be weekly/monthly variation in traffic and in the metric, so one may run the experiment over a mix of weekdays and weekends. For a risky change (for example, one that may have bugs), one may run longer on a smaller proportion of the population / a smaller percentage of daily traffic.
Learning effect: whether a user adapts to a change or not. The key measures are time, a stateful unit of diversion (e.g., cookie or user ID), and dosage (how often a user sees the change).
Analyzing Results
Running A/B tests is an iterative process.
Sanity checks
Check population sizing metrics based on the unit of diversion: if the experiment and control populations are actually comparable, the numbers of users, cookies, and events in the two groups should be comparable.
Check the invariant metrics that you don't expect to change when you run your experiment, and make sure they don't change.
How do you figure out whether an observed difference is within expectations? Each unit of the invariant metric (e.g., cookies, users, events) is randomly assigned to the control or experiment group with probability 0.5. Compute a confidence interval around 0.5 using the standard deviation $sd = \sqrt{(0.5 \cdot 0.5)/N_{total}}$, then check whether the observed fraction in the two groups is within the interval or not.
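A sketch of that sanity check with hypothetical cookie counts:

```python
import math

# Hypothetical counts: cookies (the unit of diversion) should split roughly
# 50/50 between control and experiment.
n_cont, n_exp = 50342, 50169
n_total = n_cont + n_exp

sd = math.sqrt(0.25 / n_total)            # SD of the fraction under p = 0.5
margin = 1.96 * sd                        # 95% confidence interval half-width
lower, upper = 0.5 - margin, 0.5 + margin

observed = n_cont / n_total
print(f"CI = ({lower:.4f}, {upper:.4f}), observed = {observed:.4f}, "
      f"pass = {lower <= observed <= upper}")
```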
Make sure you pass the sanity checks before proceeding (debug, retrospective analysis, check pre- and post-periods of the experiment). Common reasons for a failed sanity check include data capture issues, experiment set-up errors, and infrastructure problems.
Single metric
Goal: make a business decision on whether your experiment has favorably impacted your metrics (whether the result is statistically significant).
Analyze the results in more detail to make sure the differences between the control and experiment groups are significant (statistically significant, practically significant, and significant on a sign test for sliced data).
Simpson's paradox: a phenomenon in probability and statistics in which a trend appears in several different groups of data but disappears or reverses when the groups are combined (Wikipedia). In this case, dig deeper to understand the reason first (experiment set-up, different changes in different subgroups, ...).
Multiple metrics
The more tests you do at the same time, the more likely you are to see a significant difference by chance.
If the significance happens by chance, it shouldn't be repeatable. Repeating the experiment, bootstrap analysis, or running the experiment on slices will reveal more about whether it's happening by chance.
Multiple-comparison corrections adjust significance levels to account for how many metrics or tests are involved. See https://en.wikipedia.org/wiki/Multiple_comparisons_problem for more details.
Solutions to avoid false positives:
1) Use a higher confidence level for each individual metric, since with independent tests the overall chance of at least one false positive is $\alpha_{total} = 1 - (1-\alpha)^n$.
2) Use the Bonferroni correction, $\alpha = \alpha_{total}/n$ (simple and assumption-free, but sometimes too conservative due to possible correlations among tests/metrics); a minimal sketch follows this list.
3) In practice, it may come down to a judgement call, possibly based on business strategy.
4) Control familywise error rate (FWER) $\alpha_{overall}$, the probability that any metric shows a false positive.
5) Control the false discovery rate (FDR), $E\left[\frac{\#\,\text{false positives}}{\#\,\text{rejections}}\right]$. This only makes sense with a huge number of metrics.
More information can be found at https://en.wikipedia.org/wiki/Closed_testing_procedure, https://en.wikipedia.org/wiki/Boole%27s_inequality, https://en.wikipedia.org/wiki/Holm%E2%80%93Bonferroni_method and https://en.wikipedia.org/wiki/False_discovery_rate.
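A minimal sketch of the Bonferroni correction from item 2, with hypothetical p-values for three metrics:

```python
# Made-up p-values for three metrics tested in the same experiment.
alpha_overall = 0.05
p_values = {"ctr": 0.012, "revenue_per_user": 0.030, "latency": 0.20}

alpha_individual = alpha_overall / len(p_values)   # 0.05 / 3 ~= 0.0167
for metric, p in p_values.items():
    print(f"{metric}: p = {p:.3f}, significant = {p < alpha_individual}")
```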
Draw a conclusion from the analysis. Questions to ask: do you understand the change? Do you want to launch the change?
Gotchas: changes over time
Ramp-up experiment: start with a small percent of population for changes and ramp up.
The effect observed in a small percentage of the population may flatten out as you ramp up the change, for example, due to seasonal or event-driven impact such as holidays. One strategy is to use a holdback: launch the change to everyone except a small holdback group that doesn't get the change, and continue comparing their behavior to the control.
Resources
Revision history:
2021-04-06: Add a/b testing calculator resources at the end
2019-08-25: First version