## With an implementation in Python

The online world gives us a great opportunity to run experiments and evaluate ideas scientifically. Because these experiments are data-driven and leave no room for instincts or gut feelings, we can establish causal relationships between changes and their influence on user behavior. By leveraging such experiments, many organizations can understand their customers’ preferences while avoiding the so-called HiPPO effect😅

A/B testing is a common methodology to test new products or new features, especially regarding user interface, marketing and eCommerce. The main principle of an A/B test is to split users into two groups: showing the existing product or feature to the **control group** and the new product or feature to the **experiment group**. Finally, we evaluate how users in the two groups respond differently and decide which version is better. Even though A/B testing is a common practice of online businesses, a lot can easily go wrong, from setting up the experiment to interpreting the results correctly.

In this article, you will learn how to design a robust A/B test that gives you repeatable results, which pitfalls of A/B testing require additional attention, and how to interpret the results.

You can check out the Jupyter Notebook on my GitHub for the full analysis.

Before getting deeper into A/B testing, let’s answer the following questions.

**1. What can be tested?**

Both visible and invisible changes can be tested with A/B testing. Examples of **visible changes** include new additions to the UI, changes in design and layout, or headline messages. A very popular example is Google’s 41 (yes, not 2) different shades of blue experiment, where each shade of blue was randomly shown to 2.5% of users to understand which shade earned more clicks. Examples of **invisible changes** include page load time or testing different recommendation algorithms. A popular example is Amazon’s A/B test showing that every 100ms increase in page load time decreased sales by 1%.

**2. What can’t be tested?**

New experiences are not suitable for A/B tests, because a new experience can trigger **change aversion**, where users don’t like changes and prefer to stick to the old version, or a **novelty effect**, where users feel very excited and want to test out everything. In both cases, defining a baseline for comparison and deciding the duration of the test is difficult.

**3. How can we choose the metrics?**

Metric selection needs to consider both sensitivity and robustness. **Sensitivity** means that metrics should be able to catch the changes, and **robustness** means that metrics shouldn’t change too much due to irrelevant effects. For example, if the metric is a “mean”, it is usually sensitive to outliers but not robust. If the metric is a “median”, it is robust but not sensitive to changes in small groups.
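To make this concrete, here is a tiny sketch (with made-up numbers, not from the article) of how a single outlier affects the two metrics:

```python
import numpy as np

# Toy session times in seconds; the last value is an extreme outlier
times = np.array([10, 12, 11, 13, 12, 11, 500])

# The mean is dragged far from the typical value by a single outlier
# (sensitive but not robust) ...
print(np.mean(times))    # ≈ 81.3
# ... while the median barely notices it (robust but not sensitive)
print(np.median(times))  # 12.0
```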

To achieve both sensitivity and robustness in metric selection, we can apply filtering and segmentation while creating the control and experiment samples. Filtering and segmentation can be based on user demographics (e.g. age, gender), the language of the platform, internet browser, device type (e.g. iOS or Android), cohort, etc.

**4. What is the pipeline?**

- Formulate the hypothesis
- Design the experiment
- Collect the data
- Inference/Conclusions

The process of A/B testing starts with a hypothesis. The baseline assumption, or in other words the **null hypothesis**, assumes that the treatments are equal and that any difference between the control and experiment groups is *due to chance*. The **alternative hypothesis** assumes that the null hypothesis is wrong and that the outcomes of the control and experiment groups are more different than what chance might produce. An A/B test is designed to test the hypothesis in such a way that the observed difference between the two groups should be either due to random chance or due to a true difference between the groups. After formulating the hypothesis, we collect the data and draw conclusions. **Inference** means applying the conclusions drawn from the experiment samples to the entire population.

## Let’s see an example...

Imagine that you are running a UI experiment where you want to understand the difference between the conversion rates of your initial layout and a new layout (let’s say you want to understand the impact of changing the color of the “buy” button from red to blue🔴🔵).

In this experiment, the null hypothesis assumes the conversion rates are equal and any difference is only due to *the chance factor*. In contrast, the alternative hypothesis assumes there is a statistically significant difference between the conversion rates.

Null hypothesis -> Ho : CR_red = CR_blue

Alternative hypothesis -> H1 : CR_red ≠ CR_blue

After formulating the hypothesis and performing the experiment, we collected the data shown in the contingency table below.

The conversion rate of CG is: 150/(150+23567) = 0.632%
The conversion rate of EG is: 165/(165+23230) = 0.705%
From these conversion rates, we can calculate the relative uplift between conversion rates: (0.705%-0.632%)/0.632% = 11.5%
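A quick sketch reproducing these figures from the raw conversion and non-conversion counts quoted above (the EG rate works out to ≈0.705%):

```python
# Counts from the experiment: conversions and non-conversions per group
conv_cg, nonconv_cg = 150, 23567   # control group (red button)
conv_eg, nonconv_eg = 165, 23230   # experiment group (blue button)

cr_cg = conv_cg / (conv_cg + nonconv_cg)
cr_eg = conv_eg / (conv_eg + nonconv_eg)
relative_uplift = (cr_eg - cr_cg) / cr_cg

print(f"CG conversion rate: {cr_cg:.3%}")   # ≈ 0.632%
print(f"EG conversion rate: {cr_eg:.3%}")   # ≈ 0.705%
print(f"Relative uplift:    {relative_uplift:.1%}")
```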

As seen in the calculation above, changing the layout increased the conversion rate by about 0.07 percentage points. But is it by chance, or the success of the color change❔

We can analyze the results in the following two ways:

**1. Applying statistical hypothesis test**

Using statistical significance tests, we can measure whether the collected data shows a result more extreme than chance might produce. If the result is beyond the chance variation, it is *statistically significant*. In this example, we have categorical variables in contingency-table format, which follow a Bernoulli distribution: a Bernoulli variable is 1 with some probability and 0 otherwise; in our example, conversion=1 and no conversion=0. Since we are using conversions as the metric, a categorical variable following a Bernoulli distribution, we will use the **Chi-Squared test** to interpret the results.

The null hypothesis of the Chi-Squared test is that the observed frequencies for a categorical variable match the expected frequencies. It calculates a test statistic that follows a chi-squared distribution and is used to reject, or fail to reject, the null hypothesis that the expected and observed frequencies are the same. In this article, we will be using the `scipy.stats` package for the statistical functions.

The probability density function of the Chi-Squared distribution varies with the **degrees of freedom (df)**, which depend on the size of the contingency table and are calculated as `df=(#rows-1)*(#columns-1)`. In this example, df=1.

Key terms we need to know to interpret the test result using Python are the **p-value** and **alpha**. The p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct; it is one of the outcomes of the test. Alpha, also known as the level of statistical significance, is the probability of making a **type I error** (rejecting the null hypothesis when it is actually true). The probability of making a **type II error** (failing to reject the null hypothesis when it is actually false) is called beta, but it is out of scope for this article. In general, alpha is taken as 0.05, indicating a 5% risk of concluding that a difference exists between the groups when there is no actual difference.

In terms of a p-value and a chosen significance level (alpha), the test can be interpreted as follows:

**If p-value <= alpha**: significant result, reject null hypothesis
**If p-value > alpha**: not significant result, do not reject null hypothesis

We can also interpret the test result by using the test statistic and the critical value:

**If test statistic >= critical value**: significant result, reject null hypothesis
**If test statistic < critical value**: not significant result, do not reject null hypothesis

```python
### chi2 test on contingency table
import numpy as np
from scipy import stats

table = np.array([[150, 23717],
                  [165, 23395]])
print(table)

alpha = 0.05
stat, p, dof, expected = stats.chi2_contingency(table)

### interpret p-value
print('significance=%.3f, p=%.3f' % (alpha, p))
if p <= alpha:
    print('Reject null hypothesis')
else:
    print('Do not reject null hypothesis')

### interpret test-statistic
prob = 1 - alpha
critical = stats.chi2.ppf(prob, dof)
print('probability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, stat))
if abs(stat) >= critical:
    print('Reject null hypothesis')
else:
    print('Do not reject null hypothesis')
```

```
[[  150 23717]
 [  165 23395]]
significance=0.050, p=0.365
Do not reject null hypothesis
probability=0.950, critical=3.841, stat=0.822
Do not reject null hypothesis
```

As can be seen from the result, we do not reject the null hypothesis; in other words, the positive relative difference between the conversion rates is not significant.

**2. Performing permutation tests**

The permutation test is one of my favorite techniques because it does not require data to be numeric or binary and sample sizes can be similar or different. Also, assumptions about normally distributed data are not needed.

Permuting means changing the order of a set of values. What a permutation test does is combine the results from both groups and test the null hypothesis by randomly drawing groups (with the experiment groups’ sample sizes) from the combined set and analyzing how much they differ from one another. The test repeats this as many times as the user decides (say 1,000 times). In the end, the user should compare the observed difference between the experiment and control groups with the set of permuted differences. If the observed difference lies within the set of permuted differences, we do not reject the null hypothesis. But if the observed difference lies outside most of the permutation distribution, we reject the null hypothesis and conclude that the A/B test result is statistically significant and not due to chance.

```python
### Function to perform permutation test
import random

import numpy as np
import pandas as pd

def perm_fun(x, nA, nB):
    n = nA + nB
    id_B = set(random.sample(range(n), nB))
    id_A = set(range(n)) - id_B
    return x.loc[list(id_B)].mean() - x.loc[list(id_A)].mean()

### Observed difference from experiment
obs_pct_diff = 100 * (150 / 23717 - 165 / 23395)

### Aggregated conversion set
conversion = [0] * 46797
conversion.extend([1] * 315)
conversion = pd.Series(conversion)

### Permutation test
perm_diffs = [100 * perm_fun(conversion, 23717, 23395)
              for _ in range(1000)]

### Probability
print(np.mean([diff > obs_pct_diff for diff in perm_diffs]))
```

```
0.823
```


This result shows us that around 82% of the time we would expect to reach the experiment result by random chance.

Additionally, we can plot a histogram of differences from the permutation test and highlight where the observed difference lies.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 5))
ax.hist(perm_diffs, rwidth=0.9)
ax.axvline(x=obs_pct_diff, lw=2)
ax.text(-0.18, 200, 'Observed\ndifference', bbox={'facecolor': 'white'})
ax.set_xlabel('Conversion rate (in percentage)')
ax.set_ylabel('Frequency')
plt.show()
```

As seen in the plot, the observed difference lies within most of the permuted differences, supporting the “do not reject the null hypothesis” result of the Chi-Squared test.

## Let’s see another example...

Imagine we are using the average session time as our metric to analyze the result of the A/B test. We aim to understand whether the new design of the page gets more attention from users and increases the time they spend on the page.

The first few rows representing different user ids look like the following:
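The dataset itself is not reproduced here, but a hypothetical stand-in with the same structure (a `Page` column holding “Old design”/“New design” and a `Time` column with session times in seconds; sizes and distributions are illustrative, not the article’s data) can be sketched as:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100  # users per variant -- an illustrative size, not the article's data

data = pd.DataFrame({
    "UserId": range(2 * n),
    "Page": ["Old design"] * n + ["New design"] * n,
    # Hypothetical session times in seconds
    "Time": np.concatenate([rng.normal(160, 40, n), rng.normal(180, 45, n)]),
})
print(data.head())
```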

```python
### Average difference between control and test samples
import numpy as np
import seaborn as sns

mean_cont = np.mean(data[data["Page"] == "Old design"]["Time"])
mean_exp = np.mean(data[data["Page"] == "New design"]["Time"])
mean_diff = mean_exp - mean_cont
print(f"Average difference between experiment and control samples is: {mean_diff}")

### Boxplots
sns.boxplot(x=data["Page"], y=data["Time"], width=0.4)
```

```
Average difference between experiment and control samples is: 22.85
```

Again we will be analyzing the results in the following two ways:

**1. Applying statistical hypothesis test**

In this example, we will use the t-Test (or Student’s t-Test) because we have numeric data. The t-Test is one of the most commonly used statistical tests, in which the test statistic follows a Student’s *t*-distribution under the null hypothesis. The t-distribution is used when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown.

t-distribution is symmetric and bell-shaped like the normal distribution but has thicker and longer tails, meaning that it is more prone to produce values far from its mean. As seen in the plot, the larger the sample size, the more normally shaped the t-distribution becomes.
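This convergence is easy to check numerically; a quick sketch using `scipy.stats`:

```python
from scipy import stats

# 97.5th percentile (the two-sided alpha=0.05 cutoff) for growing degrees of freedom
for df in [2, 10, 30, 1000]:
    print(df, round(stats.t.ppf(0.975, df), 3))

# the normal-distribution cutoff the values above converge to
print(round(stats.norm.ppf(0.975), 3))  # 1.96
```

With few degrees of freedom the cutoff is much larger (heavier tails); by df=1000 it is essentially the normal value.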

In this analysis, we will use `scipy.stats.ttest_ind`, which calculates the t-Test for the means of two independent samples. It is a two-sided test for the null hypothesis that two independent samples have identical (expected) average values. We must set the parameter `equal_var=False` to perform Welch’s t-test, which does not assume equal population variance between the control and experiment samples.

```python
### t-Test on the data
test_res = stats.ttest_ind(data[data.Page == "Old design"]["Time"],
                           data[data.Page == "New design"]["Time"],
                           equal_var=False)
print(f'p-value for single sided test: {test_res.pvalue / 2:.4f}')

if test_res.pvalue / 2 <= alpha:
    print('Reject null hypothesis')
else:
    print('Do not reject null hypothesis')
```

```
p-value for single sided test: 0.1020
Do not reject null hypothesis
```

As seen in the result, we do not reject the null hypothesis, meaning that the positive average difference between experiment and control samples is not significant.

**2. Performing permutation tests**

As we did in the previous example, we can perform the permutation test by iterating 1000 times.

```python
nA = data[data.Page == 'Old design'].shape[0]
nB = data[data.Page == 'New design'].shape[0]

perm_diffs = [perm_fun(data.Time, nA, nB) for _ in range(1000)]
larger = [i for i in perm_diffs if i > mean_exp - mean_cont]
print(len(larger) / len(perm_diffs))
```

```
0.102
```

This result shows us that around 10% of the time we would expect to reach the experiment result by random chance.

```python
fig, ax = plt.subplots(figsize=(8, 6))
ax.hist(perm_diffs, rwidth=0.9)
ax.axvline(x=mean_exp - mean_cont, color='black', lw=2)
ax.text(25, 190, 'Observed\ndifference', bbox={'facecolor': 'white'})
plt.show()
```

As seen in the plot, the observed difference lies within most of the permuted differences, supporting the “do not reject the null hypothesis” result of the t-Test.

## Bonus

- To design a robust experiment, it is highly recommended to decide on metrics for **invariant checking**. These metrics shouldn’t change between control and experiment groups and can be used for sanity checking.
- It is important to define sample sizes for the experiment and control groups that represent the overall population. While doing this, we need to pay attention to two things: **randomness** and **representativeness**. Randomness of the sample is necessary to reach unbiased results, and representativeness is necessary to capture all different user behaviors.
- Online tools can be used to calculate the required **minimum sample size** for the experiment.
- Before running the experiment, it is better to decide on the **desired lift value**. Even if a test result is statistically significant, it might not be practically significant. Organizations might not perform a change if it is not going to bring the desired lift.
- If you are working with a sample dataset but want to understand the population behavior, you can include **resampling methods** in your analysis. You can read my article Resampling Methods for Inference Analysis (attached below) to learn more ⚡
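As a sketch of the minimum-sample-size point, the per-group size can also be approximated with the standard two-proportion formula; this helper and its numbers are illustrative, not from the article:

```python
from scipy import stats

def min_sample_size(p_base, lift, alpha=0.05, power=0.8):
    """Approximate per-group sample size for detecting a relative lift
    in a conversion rate (standard two-proportion formula; illustrative)."""
    p_new = p_base * (1 + lift)
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided significance
    z_beta = stats.norm.ppf(power)            # desired power
    var = p_base * (1 - p_base) + p_new * (1 - p_new)
    return int((z_alpha + z_beta) ** 2 * var / (p_base - p_new) ** 2) + 1

# e.g. a 0.6% baseline conversion rate and a hoped-for 10% relative lift
print(min_sample_size(0.006, 0.10))
```

Note how quickly the requirement shrinks as the detectable lift grows, which is why deciding the desired lift up front matters.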

I hope you enjoyed reading the article and found it useful!

*If you liked this article, you can read my other articles here and follow me on Medium. Let me know if you have any questions or suggestions.*✨

