## Airbnb’s method for estimating experimentation impact

When estimating the impact of implemented features, we often use the A/B testing lift for that feature. However, the winner’s curse makes our treatment lifts, on average, overestimate the true impact of a feature.

The adjustment outlined below allows us to account for this bias and develop a more robust estimate of feature impact. See figure 1 for Airbnb’s example.

From an implementation perspective, the bias adjustment simply involves subtracting a term from the feature’s observed lift. It’s computationally efficient and simple to implement. However, developing a robust accuracy measure requires a holdout group, which is difficult to engineer.

**1) Determine the launch criteria for each experiment.** The launch criterion (*b*), the rule that determines whether an experiment will be put into production, is simply the critical value from our reference distribution that indicates statistical significance. For a t-test with a two-sided alpha of 0.05, *b* is 1.96.
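As a sanity check, *b* can be computed directly from alpha with the Python standard library (the 0.05 below is just the conventional choice):

```python
from statistics import NormalDist

alpha = 0.05
# Two-sided critical value from the standard normal distribution.
b = NormalDist().inv_cdf(1 - alpha / 2)
print(round(b, 2))  # 1.96
```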

**2) Calculate the bias for each experiment.** Bias is defined to be the “likelihood” that our experiment lift was observed due to natural variation in our data, conditioned on our selection criteria. This likelihood is represented by an unstandardized z-score.

**3) Subtract the bias term from each experiment and sum the effects.** We can now develop an adjusted estimate of the total experiment lift that takes into account the bias from the winner’s curse.
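The three steps can be sketched in a few lines of Python. The function and variable names below are illustrative (not Airbnb’s actual code), and the bias term follows the description in the adjustment section: the normal density evaluated at the distance between the observed lift and the launch bar, rescaled into metric units.

```python
import math

def adjusted_total_lift(experiments, b=1.96):
    """Sum the observed lifts minus a winner's-curse bias term.

    `experiments` holds (observed_lift, std_error) pairs for launched
    experiments, i.e. those whose lift already cleared the b * se bar.
    """
    def normal_pdf(z):
        # Standard normal density, phi(z).
        return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

    total = 0.0
    for lift, se in experiments:
        # Step 2: the bias term, phi at the standardized distance
        # between the observed lift and the launch bar b * se,
        # rescaled by se back into the metric's units.
        bias = se * normal_pdf((lift - b * se) / se)
        # Step 3: subtract the bias and accumulate.
        total += lift - bias
    return total

# Two launched experiments: a 5% lift (se 1%) and a 3% lift (se 1.2%).
total = adjusted_total_lift([(0.05, 0.01), (0.03, 0.012)])
```

Note how the second experiment, whose lift sits closer to its launch bar, absorbs nearly all of the correction.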

Pretty straightforward, right?

**Winner’s Curse**

OK, to understand what the adjustment is doing, we first need to understand why it’s needed at all. Winner’s curse is a concept borrowed from economics: the winner of an auction tends to overpay. To understand this phenomenon, as well as its connection to online A/B testing, let’s look at a picture.

Here we see a bell curve representing all possible bids in an auction. Most people would bid near the item’s true value, represented by the middle red line. However, some people may be uninterested in the item, so they’d underbid and drop out of the auction early. Others may be really interested, so they’d bid far beyond the item’s true value. The most positively biased person bids the most money and wins the auction.

Given that setup, the winning bid must be larger than all other bids. Let’s define this using a condition where the left side of the pipe “|” is conditioned on the inequality on the right.

`Winning Bid := bid | bid > MAX(all other bids)`

Great. Now that we have this condition, let’s make the connection to online experimentation.

Unlike in an auction, when running an A/B test we don’t know the **true impact** of a treatment. Instead, with some statistical significance to support our conclusion, **we’re using the winning (observed) impact to estimate the true impact**. So, when we select treatments to put into production, we are cherry-picking the best solutions.

If we were selecting both very positive and very negative impacts, our selection would be unbiased on average. But because we only pick treatments with high positive lift, we’re more likely to select overestimates.
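A quick simulation makes this selection effect concrete. Every simulated experiment below has the same true lift, but we only “launch” the ones whose observed lift clears the significance bar (all numbers are made up for illustration):

```python
import random

random.seed(0)

TRUE_LIFT = 0.01          # every experiment's true effect (hypothetical)
NOISE = 0.02              # std error of the observed lift (hypothetical)
THRESHOLD = 1.96 * NOISE  # launch criterion: observed lift > b * sigma

selected = []
for _ in range(100_000):
    observed = random.gauss(TRUE_LIFT, NOISE)
    if observed > THRESHOLD:
        # Cherry-pick "winners" only, as we do when shipping features.
        selected.append(observed)

# The average observed lift among launched experiments overshoots the
# true lift: winner's curse in action.
avg_selected = sum(selected) / len(selected)
```

With these numbers, the launched experiments’ average observed lift lands well above the true 1% lift, even though nothing about the experiments themselves changed.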

Note that winner’s curse can be easily overcome with global holdout groups, i.e., groups that don’t receive any of these features. From there, we can conduct an A/B test where the treatment is *all features are implemented* and the control is *none of the features are*. However, global holdouts pose several problems, a few of which are…

- High engineering lift,
- Users in this holdout won’t see the improved features, and
- Holdouts can’t be implemented to isolate a single team’s impact. If we want to compare the product and ML team’s experimentation impact separately, we’d need a holdout for each.

**The Math of Winner’s Curse**

It turns out that we can mathematically prove that our true experiment effect *on average* is less than our observed experiment impact **for experiments we put into production**. The last part of that sentence is key; if we aren’t selecting experiments to implement, there is no selection bias and the winner’s curse does not apply.

While the math isn’t necessary for implementation, it’s pretty slick, so we’ll quickly review it.

We are looking to prove that the expectation (average) of our observed lifts is greater than that of the true, unmeasurable lifts.

Here, *O* is the sum of the observed experiment lifts and *T* is the sum of our hypothetical true lifts. Note the second quantity isn’t measurable. Also note that these sums are subject to the criteria that we implement their treatments. We will express this implementation condition as *A*.
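Since the figures aren’t reproduced here, the claim can be written out as follows (my reconstruction from the surrounding notation, so treat the exact form as approximate):

```latex
\mathbb{E}[O] \;=\; \mathbb{E}\Big[\sum_{i \in A} O_i\Big]
\;>\; \mathbb{E}\Big[\sum_{i \in A} T_i\Big] \;=\; \mathbb{E}[T]
```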

To include the implementation criteria, we convert this equation to the sum in figure 7.

Here,

- *j* is the iterator over our *n* total users,
- *i* is each experiment in *A*, our set of experiments to launch,
- *Oⱼ* is our effect observed during an A/B test, and
- *Tⱼ* is our true (unmeasurable) effect.

This next step is pretty wild. We define our experiment cutoff from the reference distribution to be *Δ / σ > b*, where *b* is the critical value from our distribution at confidence level alpha. In English, we leverage our statistical significance criterion and substitute it into our equation. Note that a typical value for *b* is 1.96, the critical value for a two-sided test at alpha = 0.05.

The substitution results in the following equation. Note that we also move the expectation into the summation.

From here, we can subtract *Tⱼ* from both sides of our inequality to get figure 9.

Finally, because *Oⱼ* is strictly greater than *bσⱼ* (per our implementation criteria), we can guarantee that *E[O] > E[T]*. Working out each step is pretty verbose, but if you’re curious, check out section 3.2 of the paper for the full derivation.
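The figures aren’t shown here, so as a hedged reconstruction of the fact underlying that last step: conditioning a normal observation on clearing its launch bar pushes its expectation above the true mean (the standard truncated-normal result):

```latex
\mathbb{E}\left[\,O_j - T_j \;\middle|\; O_j > b\,\sigma_j\,\right]
  \;=\; \sigma_j \,
    \frac{\varphi\big((b\,\sigma_j - T_j)/\sigma_j\big)}
         {1 - \Phi\big((b\,\sigma_j - T_j)/\sigma_j\big)}
  \;>\; 0
```

Summing this strictly positive gap over the launched experiments yields *E[O] > E[T]*.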

Not sure about you but I thought that inequality substitution was pretty crazy.

**Bias Adjustment**

Now that we have some intuition developed for our problem, let’s talk solutions.

Our goal is to account for the bias introduced when selecting experiments to implement; because we only select very positive treatments, we are more likely to be selecting overestimates.

To adjust our lift, we subtract the “likelihood” that this observation was an overestimate. Again, this likelihood is represented by a z-score multiplied by our metric’s standard deviation, which converts the standardized units to our metric’s units.

In figure 11, we see the full adjustment formula. The area to focus on is the numerator inside *φ*, the normal probability density function (PDF). The adjustment term gets smaller as our observed lift (*Δ*) gets bigger (due to the negative sign in the normal PDF). **If our lift is much larger than our implementation criterion (*bᵢσᵢ*), it’s more trustworthy and less likely to be biased.**
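Since figure 11 isn’t reproduced here, a hedged reconstruction of the adjustment from the description above (the exact form should be checked against the paper):

```latex
\hat{T} \;=\; \sum_{i \in A}\left[\,\Delta_i \;-\; \sigma_i\,
\varphi\!\left(\frac{\Delta_i - b_i\,\sigma_i}{\sigma_i}\right)\right]
```

The *φ* term is the per-experiment bias estimate: it is largest when *Δᵢ* sits right at the launch bar *bᵢσᵢ* and shrinks as the observed lift moves past it.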

And all those standard deviations (*σ*) are simply used to move between standard normal distribution (*N(0,1)*) units and our metric’s units. They’re unimportant from an intuition perspective; they simply facilitate the use of the normal PDF, *φ*.

**Confidence Intervals**

Now that we are able to estimate the bias due to winner’s curse, we may want to calculate our confidence in this estimate. The paper discusses three confidence interval calculations: naïve, bootstrap, and unbiased bootstrap. If you want to calculate confidence intervals, refer to the paper or any other discussion of these methods.
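For flavor, here is a generic percentile bootstrap, which captures the spirit of the bootstrap interval but is not the paper’s exact procedure (the function, sample lifts, and parameters below are all illustrative):

```python
import random

def bootstrap_ci(values, stat, n_boot=2000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample with replacement and recompute the statistic.
        resample = [rng.choice(values) for _ in values]
        stats.append(stat(resample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Example: a CI around the mean lift of a handful of experiments.
lifts = [0.04, 0.02, 0.05, 0.01, 0.03]
low, high = bootstrap_ci(lifts, stat=lambda xs: sum(xs) / len(xs))
```

In practice you would bootstrap the bias-adjusted total rather than a raw mean, but the resampling mechanics are the same.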

And there you have it; an unbiased measure of experiment impact.

- For this formulation, we assume the experiment impacts are additive, i.e., they were tested sequentially. If this is not the case, you will have to adapt the math to handle multiplicative impacts.
- As implied in the math, we also assume that each graduated experiment has a positive and statistically significant lift, i.e., its standardized lift is greater than *b*. If this is not the case, you must make some adjustments. Feel free to reach out if you have questions on the adjustments.
- When implementing, it’s important to think critically about what your numbers mean. For instance, make sure that the target users in the experiment treatment are scaled up to be representative of your user population. If the experiment targets new users, when implementing globally be sure to scale by the number of new users, not the total number of users.
- As noted earlier, holdout groups are the most accurate way to assess total feature impact. If possible, when first implementing, run a QA on your estimate with a holdout group.

*Thanks for making it through this post! I’ll be writing 49 more posts that bring “academic” research to the DS industry. Check the comments for more links/ideas on sizing experiment impact.*
