And avoiding the common mistakes that derail most test efforts.
This article is the third in my series on A/B testing.
In the first article, I presented the intuition behind A/B testing and the importance of establishing the magnitude of the effect you hope to observe, along with the corresponding sample size.
In the second article, I talked about how product managers can design A/B tests in a manner that speeds the tests up.
In this third article, I will cover another aspect of A/B testing: what factors should you consider when choosing the metric for your A/B test?
Case Study: VRBO Search Landing Pages
When we design an A/B test, we select a primary metric that we hope to improve (along with several secondary metrics) and measure it in both the variant and control groups. If we don’t choose this metric carefully, we are wasting our time.
We’ll use an example that I am familiar with: Search Landing Pages on VRBO. VRBO is a two-sided marketplace where homeowners can list their homes for rent, and travelers can find the right accommodation for their next trip. The purpose of the Search Landing Page is to receive traffic from Google and convert that traffic into visitors who perform higher-intent actions, such as inquiries.
Let’s look at some screenshots, starting with the most common way travelers start their planning process: searching on Google.
Step 1: Thinking about traveling to the Bahamas? Let’s search.
Step 2: Aha! It looks like VRBO has excellent options. Let’s look there.
Step 3: Let’s find out what options I have in the Bahamas.
We built this page for:
- High booking intent users. Users who may already have booked their flights, or at least have a sense of when they want to travel. We hypothesized that for these users, the page’s job-to-be-done was to find homes that were available for their travel dates.
- Low booking intent users. Users who were very early in the planning stage and may not have any sense of when they might travel. We hypothesized that for these users, the page’s job-to-be-done was to help them explore the variety of homes available and to persuade them to visit the Bahamas.
- Google Bot. We wanted Google to index the page for the most relevant user queries.
The whole user journey (from landing on VRBO from Google to booking) looks like this.
There are two specific things to note:
- Between the initial and the final step, there are multiple steps a user must take, and at each step, some users will drop off.
- Since travel is a considered purchase (vs. an impulse purchase), the time between the initial and the final step may be on the order of weeks.
Now let’s look at the mathematics of this conversion funnel, make some assumptions about the conversion from one step to the next, and estimate the overall conversion rate. (Disclaimer: these numbers are for illustrative purposes only.)
Finally, you need a rough order of magnitude for the traffic. Let’s make some assumptions here as well. (Disclaimer: these numbers are for illustrative purposes only.)
- Total unique visitors per month: 10 million
- Unique new visitors arriving on the search landing page: 30% (3 million users)
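To make the funnel math concrete, here is a minimal sketch in Python. The step-by-step conversion rates below are my own illustrative assumptions, not VRBO data; they are simply one set of rates consistent with the figures used in this article (a 30% dated-search rate and an overall conversion of roughly 0.225%):

```python
# Hypothetical step-by-step conversion rates for the funnel.
# These specific numbers are assumptions for illustration only.
funnel = {
    "landing -> dated search": 0.30,
    "dated search -> property view": 0.25,
    "property view -> inquiry": 0.10,
    "inquiry -> booking": 0.30,
}

# Overall conversion is the product of the per-step rates.
overall_conversion = 1.0
for step, rate in funnel.items():
    overall_conversion *= rate

print(f"Overall conversion rate: {overall_conversion:.3%}")
```

The key takeaway is multiplicative decay: each additional step in the funnel shrinks the base rate of the final metric, which (as we'll see below) directly inflates the sample size a test on that metric needs.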
Now imagine you are the product manager for the Search Landing Page. Against these base rates, let’s look at a hypothetical A/B test and consider two possible metrics for your experiment.
Test hypothesis: By adding a background “hero” image on the search landing page that is indicative of the destination, users will feel confident that they are looking at the right destination, leading to more searches and a 2% lift in overall conversion.
You have two choices of metric: the overall conversion rate, and the % of users performing dated searches.
Experiment Design when we choose Overall Conversion as our metric
It’s very tempting for a product manager to use the overall conversion rate as the metric. After all, you can tell your management that you have increased revenue by $$$.
If you decide to choose this as your metric, let’s look at the test parameters: test sample size and overall test duration. Let’s plug our base rate of 0.225% and minimum detectable effect (MDE) of 2% into Evan Miller’s Sample Size calculator.
Overall, you will need 34,909,558 samples across your variant and control groups.
With 3 million unique users per month, the test will require 11–12 months to complete, if you run it correctly. Many people make the mistake of seeing a positive result early, getting impatient, and stopping the experiment prematurely. If you do that, you are most likely looking at a false positive.
Experiment Design when we choose % of users doing a dated search as the primary metric
If you decide to choose this as your metric, let’s look at the test parameters: test sample size and overall test duration. Let’s plug our base rate of 30% and minimum detectable effect (MDE) of 2% into Evan Miller’s Sample Size calculator.
Overall, you will need 183,450 samples across your variant and control groups. With 3 million unique users per month, this will require a few days for your test to complete. [You may want to consider running the test for a whole week to eliminate any chance of a day-of-the-week bias.]
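The two calculations above can be sketched with the standard two-proportion normal approximation. This is an approximation of what calculators like Evan Miller's compute (exact formulas differ slightly, so the results land near, not exactly on, the numbers quoted above); the function name and the 80% power / 5% significance defaults are my assumptions:

```python
import math
from statistics import NormalDist

def sample_size_per_group(base_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-proportion z-test.

    base_rate:    control conversion rate (e.g. 0.00225 for 0.225%)
    relative_mde: minimum detectable effect, relative (e.g. 0.02 for 2%)
    """
    p1 = base_rate
    p2 = base_rate * (1 + relative_mde)
    delta = p2 - p1
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p1 + p2) / 2
    n = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta ** 2
    return math.ceil(n)

monthly_traffic = 3_000_000  # unique visitors reaching the landing page

# Metric 1: overall conversion (0.225% base rate, 2% relative MDE)
n1 = 2 * sample_size_per_group(0.00225, 0.02)  # both groups combined
print(n1, "samples,", round(n1 / monthly_traffic, 1), "months")

# Metric 2: % of users doing a dated search (30% base rate, 2% relative MDE)
n2 = 2 * sample_size_per_group(0.30, 0.02)
print(n2, "samples,", round(30 * n2 / monthly_traffic, 1), "days")
```

The asymmetry comes from the math: required sample size scales roughly with p(1 − p)/δ², and with a relative MDE the absolute difference δ shrinks along with the base rate, so a rare final-step metric needs orders of magnitude more traffic than an abundant top-of-funnel one.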
With this approach, you can run tens of experiments in the same amount of time.
Lessons Learned
If the above situation sounds hypothetical, let me assure you that plenty of product managers (including me) have taken the approach of using the overall conversion rate as the primary metric. Here are some of the lessons I learned that I’d like to share more broadly.
- When you design your test, pay sufficient attention to the metric you pick. If the feature you are testing is higher up in the funnel and your overall conversion rate is less than 1%, your tests will take months to complete. (Unless you are Facebook, Google, Amazon, Indeed, or another top internet site.)
- When your test takes months to complete, the likelihood of a bug creeping in due to an unintended and unrelated change and corrupting your test results will be extremely high. You may have to restart your test.
- The further your feature is from the overall conversion, the lower the likelihood of your change causally impacting the metric.
- The best option is to use a metric that is directly impacted by your change, such as an on-page metric, to measure micro-conversions.
- If you choose an on-page metric like click-through rate, pay attention to unintended consequences by looking at a counterbalancing metric. If we select dated search as our metric, we will also look at the bounce rate from that page as well as the following page. This technique ensures that the product change is not sending unqualified traffic downstream. (More on this topic in a future article.)
If you found this article useful, let me know. If you have any questions or doubts about A/B Testing, drop me a note in the comments and I’ll consider it as a topic for a future post.
This is the third article in my series on A/B Testing. The other articles in the series are:
- The intuition behind A/B Testing — A Primer for New Product Managers
- How to split the traffic in an A/B Test
I want to give credit to Evan Miller for his excellent sample size calculator and his thought leadership on the topic of A/B testing.
About Me: Aditya Rustgi is a product management leader with over 15 years of product and technology leadership experience in multi-sided marketplaces and B2B SaaS business models in the eCommerce and travel industries. Most recently, he was a Director of Product Management at VRBO, an Expedia company.