A/B Testing: 29 Guidelines for Online Experiments (plus a checklist)

A/B testing (aka split testing or online controlled experiments) is hard. It’s sometimes billed as a magic tool that spits out a decisive answer. It’s not. It’s a randomized controlled trial, albeit online and with website visitors or users, and it’s reliant upon proper statistical practices.

At the same time, I don’t think we should hold the standards so high that you need a data scientist to design and analyze every single experiment. We should democratize the practice to the most sensible extent, but we should create logical guardrails so the experiments that are run are run well.

The best way to do that I can think of is with education and a checklist. If it works for doctors, I think we can put it to use, too.

So this article is two things: a high level checklist you can use on a per test basis (you can get a Google Docs checklist here), and a comprehensive guide that explains each checklist item in detail. It’s a choose your own adventure. You can read it all (including outbound links), or just the highlights.

Also, don’t expect it to be completely extensive or cover every fringe case. I want this checklist to be usable by people at all levels of experimentation, and at any type of company (ecommerce, SaaS, lead generation, whatever). As such, I’ll break it into three parts:

  • The Basics – don’t run experiments if you don’t follow these guidelines. If you follow these, ~80% of your experiments should be properly run.
  • Intermediate Topics – slightly more esoteric concepts, but still largely useful for anyone running tests consistently. This should help reduce errors in ~90% of experiments you run.
  • Advanced Topics – won’t matter for most people, but will help you decide on fringe cases and more advanced testing use cases. This should bring you up to ~95-98% error reduction rate in running your tests.

I’ll also break each item into a simple heuristic and a longer description. Depending on your level of nerdiness or laziness, you can skim the rules or dig into the details.

The frustrating part about making a guide or a checklist like this is there is so much nuance. I’m hyper aware that this will never be complete, so I’m setting the goal to be useful. To be useful means it can’t run on for the length of a textbook, though it almost does at ~6000 words.

(In the case that you want to read a textbook, read this one).

I’m not reinventing the wheel here. I’m basically compiling this from my own experiences, my mentors, papers from Microsoft, Netflix, Amazon, Booking.com and Airbnb, and other assorted sources (all listed at the end).

What is A/B Testing?

A/B testing is a controlled experiment (typically online) where two or more different versions of a page or experience are delivered randomly to different segments of visitors. Imagine a homepage where you’ve got an image slider above the fold, and then you want to try a new version instead showing a product image and product description next to a web form. You could run a split test, measure user behavior, and get the answer as to which is optimal:

Statistical analysis is then performed to infer the performance of the new variants (the new experience or experiences, version B/C/D, etc.) in relation to the control (the original experience, or version A).

A/B tests are performed commonly in many industries including ecommerce, publications, and SaaS. In addition to running experiments on a web page, you can set up A/B tests on a variety of channels and mediums, including Facebook ads, Google ads, email marketing workflows, email subject line copy, marketing campaigns, product features, sales scripts, etc. – the limit is really your imagination.

Experimentation typically falls under one of several roles or titles, which vary by industry and company. For example, A/B testing is strongly associated with CRO (conversion optimization or conversion rate optimization) as well as product management, though marketing managers, email marketers, user experience specialists, performance marketers, and data scientists or analysts may also run A/B tests.

The Basics: 10 Rules of A/B Testing

  1. Decide, up front, what the goal of your test is and what metric matters to you (the Overall Evaluation Criterion).
  2. Plan upfront what action you plan on taking in the event of a winning, losing, or inconclusive result.
  3. Base your test on a reasonable hypothesis.
  4. Determine specifically which audience you’ll be targeting with this test.
  5. Estimate your minimum detectable effect, required sample size, statistical power, and how long your test will be required to run before you start running the test.
  6. Run the test for full business cycles, accounting for naturally occurring data cycles.
  7. Run the test for the full time period you had planned, and only then determine the statistical significance of the test (normally, as a rule of thumb, accepting a p value of <.05 as “statistically significant”).
  8. Unless you’re correcting for multiple comparisons, stick to running one variant against the control (in general, keep it simple), and using a simple test of proportions, such as Chi Square or Z Test, to determine the statistical significance of your test.
  9. Be skeptical about numbers that look too good to be true (see: Twyman’s Law).
  10. Don’t shut off a variant mid test or shift traffic allocation mid test.

The Basics of A/B Testing: Explained

1. Decide Your Overall Evaluation Criterion Up Front

Where you set your sights is generally where you end up. We all know the value of goal setting. Turns out, it’s even more important in experimentation.

Even if you think you’re a rational, objective person, we all want to win and to bring results. Whether intentional or not, sometimes we bring results by cherry picking the data.

Here’s an example (a real one, from the wild). Buffer wants to A/B test their Tweets. They launch two of ‘em out:

Can you tell which one the winner was?

Without reading their blog post, I genuinely could not tell you which one performed better. Why? I have no idea what metric they’re looking to move. On Tweet two, clicks went down but everything else went up. If clicks to the website are the goal, Tweet one is the winner. If retweets, Tweet two wins.

So, before you ever set a test live, choose your overall evaluation criterion (or North Star metric, whatever you want to call it), or I swear to you, you’ll start hedging and justifying that “hey, but click through rate/engagement/time on site/whatever increase on the variation. I think that’s a sign we should set it live.” It will happen. Be objective in your criterion.

(Side note, I’ve smack talked this A/B test case study many times, and there are many more problems with it than just the lack of a single metric that matters, including not controlling for several confounding variables – like time – or using proper statistics to analyze it.)

Make sure, then, that you’re properly logging your experiment data, including number of visitors and their bucketing, your conversion goals, and any behavior necessary to track in the conversion funnel.

2. Plan Your Proposed Action Per Test Result

What do you hope to do if your test wins? Usually this is a pretty easy answer (roll it out live, of course).

But what do you plan to do if your test loses? Or even murkier, what if it’s inconclusive?

I realize this sounds simple on paper. You might be thinking, “move onto the next test.” Or “try out a different variation of the same hypothesis.” Or “test on a larger segment of our audience to get the necessary data.”

That’s the point, there are many decisions you could make that affect your testing process as a whole. It’s not as simple as “roll it out live” or “don’t roll it out live.”

Say your test is trending positive but not quite significant at a p value of < .05. You actually do see a significant lift, though, in a micro-conversion, like click through rate. What do you do?

It’s not my place to tell you what to do. But you should state your planned actions up front so you don’t run into the myriad of cognitive biases that we humans have to deal with.

Related reading here.

3. Base your test on a reasonable hypothesis

What is a hypothesis, anyway?

It’s not a guess as to what will happen in your A/B test. It’s not a prediction. It’s one big component of ye old Scientific Method.

A good hypothesis is “a statement about what you believe to be true today.” It should be falsifiable, and it should have a reason behind it.

This is the best article I’ve read on experiment hypotheses: https://patreonhq.com/thats-not-a-hypothesis-25666b01d5b4

I look at developing a hypothesis as a process of being clear in my thinking and approach to the science of A/B testing. It slows me down, and it makes me think “what are we doing here?” As the article above states, not every hypothesis needs to be based on mounds of data. It quotes Feynman: “It is not unscientific to take a guess, although many people who are not in science believe that it is.”

I do believe any mature testing program will require the proper use of hypotheses. Andrew Anderson has a different take, and a super valid one, about the misuse of hypotheses in the testing industry. I largely agree with his take, and I think it’s mostly based on the fact that most people are using the term “hypothesis” incorrectly.

4. Determine specifically which audience you’ll be targeting with this test

This is relatively quick and easy to understand. Which population would you like to test on – desktop, mobile, PPC audience #12, users vs. non-users, customers who read our FAQ page, a specific sequence of web pages, etc. – and how can you take measures to exclude the data of those who don’t apply to that category?

It’s relatively easy to do this, at least for broad technological categorizations like device category, using common A/B testing platforms.

Point is this: you want to learn about a specific audience, and the less you pollute that sample, the cleaner your answers will be.

5. Estimate your MDE, sample size, statistical power, and how long your test will run before you run it

Most of the work in A/B testing comes before you ever set the test live. Once it’s live, it’s easy! And analyzing the test after the fact is much easier if you’ve done the hard, prudent work up front.

What do you need to plan? The feasibility of your test in terms of traffic and time length, what minimum detectable effect you’d need to see to discern an uplift, and the sample size you’ll need to reach to consider analyzing your test.

It sounds like a lot, but you can do all of this with the help of an online calculator.

I actually like to use a spreadsheet that I found on the Optimizely knowledge base (here’s a link to the spreadsheet as well). It visually shows you how long you’d have to run a test to see a specific effect size, depending on the amount of traffic you have to the page and the baseline conversion rate.

You can also use Evan Miller’s Awesome A/B testing tools. Or, CXL has a bunch of them as well. Search Discovery also has a calculator with great visualizations.
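If you’d rather script it than use a calculator, the standard two-proportion sample size formula is simple to implement. Here’s a minimal sketch using only the Python standard library; the baseline rate and minimum detectable effect below are made-up illustration numbers, not recommendations:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, mde_rel, alpha=0.05, power=0.80):
    """Visitors needed per variant to detect a relative lift of mde_rel
    over a baseline conversion rate, with a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 5% baseline conversion rate, hoping to detect a 10% relative lift:
n = sample_size_per_variant(0.05, 0.10)
print(n)  # roughly 31,000 visitors per variant
```

Divide that per-variant number by your weekly traffic to the tested page and you get a feasibility estimate of how many weeks the test must run.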

6. Run the test for full business cycles, accounting for naturally occurring data cycles

One of the first and most common mistakes everyone makes when they start A/B testing is calling a test the moment it “reaches significance.” This, in part, must be because in our daily lives, the term “significance” means “of importance,” so it sounds final and deterministic.

Statistical significance (or the confidence level) is just an output of some simple math that tells you how unlikely a result is given the assumption that both variants are the same.

Huh?

We’ll talk about p-values later, but for now, let’s talk about business cycles and how days of the week can differ.

The days of the week tend to differ quite a bit. Our goal in A/B testing is to get a representative sample of our population, which generally involves collecting enough data to smooth out any jagged edges, like a super Saturday where conversion rates tank and website behavior looks different.

Website data tends to be non-stationary (as in, it changes over time) or sinusoidal – in other words, it looks like this:

While we can’t reduce the noise to zero, we can run our tests for full weeks and business cycles to try to smooth things out as much as possible.

7. Run the test for the full time period you had planned

Back to those pesky p-values. As it turns out, an A/B test can dip below a .05 p-value (the commonly used rule to determine statistical significance) at many points during the test, and at the end of it all, sometimes it can turn out inconclusive. That’s just the nature of the game.

Anyone in the CRO space will tell you that the single most common mistake people make when running A/B tests is ending the test too early. It’s the ‘peeking’ problem. You see that the test has “hit significance,” so you stop the test, celebrate, and launch the next one. Problem? It may not have been a valid test.

The best post written about this topic, aptly titled, is Evan Miller’s “How Not To Run An A/B Test.” He walks through some excellent examples to illustrate the danger of this type of peeking.

Essentially, if you’re running a controlled experiment, you’re generally setting a fixed time horizon at which you view the data and make your decision. When you peek before that time horizon, you’re introducing more points at which you can make an erroneous decision, and the risk of a false positive goes wayyy up.
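To make the fixed-horizon idea concrete, here’s a sketch of the simple two-proportion z-test mentioned in the basics checklist – something you run once, at the planned end of the test, rather than checking every day. The conversion counts are invented for illustration:

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Evaluated once, after the planned full business cycles have elapsed:
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(round(z, 2), round(p, 4))
```

The point isn’t the math – it’s that this calculation happens one time, at the horizon you committed to up front.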

8. Stick to testing only one variant (unless you’re correcting for it…)

Here we’ll introduce an advanced topic: the multiple comparisons problem.

When you test several variants, you run into a problem known as “cumulative alpha error.” Basically, with each variant, sans statistical corrections, you risk a higher and higher probability of seeing a false positive. KonversionsKraft made a sweet visualization to illustrate this:

This looks scary, but here’s the thing: almost every major A/B testing tool has some built in mechanism to correct for multiple comparisons. Even if your testing tool doesn’t, or if you use a home-brew testing solution, you can correct for it yourself very simply using one of many methods, such as the Bonferroni or Šidák corrections.
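As a rough sketch of what such a correction looks like, the Bonferroni and Šidák adjustments simply tighten the per-comparison alpha based on how many variants you compare against the control:

```python
def bonferroni_alpha(alpha, k):
    """Per-comparison alpha after the Bonferroni correction for k comparisons."""
    return alpha / k

def sidak_alpha(alpha, k):
    """Per-comparison alpha after the (slightly less conservative) Sidak correction."""
    return 1 - (1 - alpha) ** (1 / k)

# An A/B/C/D/E test compares 4 variants against the control:
print(bonferroni_alpha(0.05, 4))       # 0.0125
print(round(sidak_alpha(0.05, 4), 5))  # 0.01274
```

In other words, with four variants you’d only call a comparison significant at roughly p < .0125 instead of p < .05, which keeps the overall false positive rate near 5%.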

However, if you’re not a nerd and you just want to test some shit and maybe see some wins, start small. Just one v one.

When you do feel more comfortable with experimentation, you can and should look into expanding into A/B/n tests with multiple variants.

This is a core component of Andrew Anderson’s Discipline Based Testing Methodology, and if I can, I’ll wager that it’s because it increases the beta of the options, or the differences between each one of the experiences you test. This, at heart, decreases your reliance on hard opinions or preconceived ideas about “what works” and opens you up to trying things you may not have in a simple A/B test.

But start slowly, keep things simple.

9. Be skeptical about numbers that look too good to be true

If there’s one thing CRO has done to my personality, it’s heightened my level of skepticism. If anything looks too good to be true, I assume something went wrong. Actually, most of the time, I’m poking and prodding at things, seeing where they may have been broken or set up incorrectly. It’s an exhausting mentality, but a necessary one when dealing with so many decisions.

Ever see those case studies that proclaim a call to action button color change on a web page led to a 100%+ increase in conversion rate? Almost certainly bullshit. If you see something like this, even if you just get a small itch where you think, “hmm, that seems…interesting,” go after it. Also second guess data, and triple guess yourself.

As the analytics legend Chris Mercer says, “trust but verify.”

And read about Twyman’s Law here.

10. Don’t shut off a variant mid test or shift traffic allocation mid test

I guess this is sort of related to two previous rules here: run your test for the full length and start by only testing one variant against the control.

If you’re testing multiple variants, don’t shut off a variant because it looks like it’s losing and don’t shift traffic allocation. Otherwise, you may risk Simpson’s Paradox.

Intermediate A/B Testing Issues: A Whole Lot More You Should Maybe Worry About

  1. Control for external validity factors and confounding variables
  2. Pay attention to confidence intervals as well as p-values
  3. Determine whether your test is a Do No Harm or a Go For It test, and set it up appropriately.
  4. Consider which type of test you should run for which problem you’re trying to solve or answer you’re trying to find (sequential, one tail vs two tail, bandit, MVT, etc)
  5. QA and control for “flicker effect”
  6. Realize that the underlying statistics are different for non-binomial metrics (revenue per visitor, average order value, etc.) – use something like the Mann-Whitney U-Test or robust statistics instead.
  7. Trigger the test only for those users affected by the proposed change (lower base rates lead to greater noise and underpowered tests)
  8. Perform an A/A test to gauge variance and the precision of your testing tool
  9. Correct for multiple comparisons
  10. Avoid multiple concurrent experiments and make use of experiment “swim lanes”
  11. Don’t project precise uplifts onto your future expectations from those you see during an experiment.
  12. If you plan on implementing the new variation in the case of an inconclusive test, make sure you’re running a two-tailed hypothesis test to account for the possibility that the variant is actually worse than the original.
  13. When attempting to improve a “micro-conversion” such as click through rate, make sure it has a downstream effect and acts as a causal component to the business metric you care about. Otherwise, you’re just shuffling papers.
  14. Use a hold-back set to calculate the estimated ROI and performance of your testing program

Intermediate A/B Testing Issues: Explained

1. Control for external validity factors and confounding variables

Well, you know how to calculate statistical significance, and you know exactly why you should run your test for full business cycles in order to capture a representative sample.

This, in most cases, will reduce the chance that your test will be messed up. However, there are plenty more validity factors to worry about, particularly those outside of your control.

Anything that reduces the representativeness or randomness of your experiment sample can be considered a validity factor. In that regard, some common ones are:

  • Bot traffic/bugs
  • Flicker effect
  • PR spikes
  • Holidays and external events
  • Competitor promotions
  • Buggy measurement setup
  • Cross device tracking
  • The weather

I realize this tip is frustrating, because the list of potential validity threats is expansive, and possibly endless.

However, understand: A/B testing always involves risks. All you need to do is understand that and try to document as many potential threats as possible.

You know how in an academic paper, they have a section on limitations and discussion? Basically, you should do that with your tests as well. It’s impossible to isolate every single external factor that could affect behavior, but you can and should identify clearly impactful things.

For instance, if you raised a round of capital and you’re on the front page of TechCrunch and Hacker News, maybe that traffic isn’t exactly representative? Might be a good time to pause your experiments (or exclude that traffic from your analysis).

2. Pay Attention to Confidence Intervals as Well as P-Values

It’s common knowledge among experimenters that one should only call a test “significant” if the p-value is below .05. This threshold, while technically arbitrary, caps the risk in our decision making at a tolerable level. We’re sort of saying: 5% of experiments may show results purely due to chance, but we’re okay with that in the long run.

Many people, however, fail to understand or use confidence intervals in decision making.

What’s a confidence interval in relation to A/B testing?

Confidence intervals express the uncertainty around an estimate in A/B testing – a range of values within which the true effect plausibly lies. Here’s an example outlined by PRWD:

Basically, if your results, including confidence intervals, overlap at all, then you may be less confident that you have a true winner.

John Quarto-vonTivadar has a great visual explaining this:

Of course, the greater your sample size, the lower the margin of error becomes in an A/B test. As is usually the case with experimentation, high traffic is a luxury and really helps us make clearer decisions.
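Here’s a minimal sketch of how you might compute those intervals yourself, using a simple Wald interval on each variant’s conversion rate (the conversion counts are invented, and the “do the intervals overlap?” check is a deliberately conservative heuristic):

```python
import math
from statistics import NormalDist

def conversion_ci(conversions, visitors, confidence=0.95):
    """Wald confidence interval for a single conversion rate."""
    p = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p * (1 - p) / visitors)
    return p - margin, p + margin

ci_a = conversion_ci(400, 10000)   # control: 4.0% observed
ci_b = conversion_ci(520, 10000)   # variant: 5.2% observed
overlap = ci_a[1] >= ci_b[0]       # do the two intervals overlap at all?
print(ci_a, ci_b, overlap)
```

With these numbers the intervals don’t overlap, which strengthens the case that the variant is genuinely better; if they did overlap, you’d want more data before celebrating.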

3. Determine whether your test is a Do No Harm or a Go For It test, and set it up appropriately.

As you run more and more experiments, you’ll find yourself less focused on any individual test and more on the system as a whole. When this shift happens, you begin to think more in terms of risk, resources, and upside, and less in terms of how much you want your new call to action button color to win.

A fantastic framework to consider comes from Matt Gershoff. Basically, you can bucket your test into two categories:

  1. Do No Harm
  2. Go For It

In a Do No Harm test, you care about the potential downside and you need to mitigate or avoid it. In a Go For It test, there is no additional cost to making a Type 1 error (false positive), so no direct cost is incurred by making a given decision.

In the article, Gershoff gives headline optimization as an example:

“Each news article is, by definition, novel, as are the associated headlines.

Assuming that one has already decided to run headline optimization (which is itself a ‘Do No Harm’ question), there is no added cost, or risk to selecting one or the other headlines when there is no real difference in the conversion metric between them. The objective of this type of problem is to maximize the chance of finding the best option, if there is one. If there isn’t one, then there is no cost or risk to just randomly select between them (since they perform equally as well and have the same cost to deploy). As it turns out, Go For It problems are also good candidates for Bandit methods.”

Highly suggested that you read his full article here.

4. Consider which type of test you should run for which problem you’re trying to solve or answer you’re trying to find (sequential, one tail vs two tail, bandit, MVT, etc)

The A/B test is sort of the gold standard when it comes to online optimization. It’s the clearest way to infer a difference between two experiences. There are, however, other methods for learning about your users.

Two in particular that are worth talking about:

  1. Multivariate testing
  2. Bandit tests (or other algorithmic optimization)

Multivariate experiments are wonderful for testing multiple micro-components (e.g. a headline change, CTA change, and background color change) and determining their interaction effects. You find which elements work optimally with each other, instead of a grand and macro-level lift without context as to which micro-elements are impactful.

In my anecdotal experience, I’d say good testing programs usually run one or two multivariate tests for every 10 experiments run (the rest being A/B/n).

Bandit tests are a different story, as they are algorithmic. The hope is that they minimize “regret” – the amount of time you’re exposing your audience to a suboptimal experience. The algorithm updates in real time, showing the winning variant to more and more people over time.

In this way, it sort of “automates” the A/B testing process. But bandits aren’t always the best option. They sway with new data, so there are contextual problems associated with, say, running a bandit test on an email campaign.

However, bandit tests tend to be very useful in a few key circumstances:

  • Headlines and Short-Term Campaigns (e.g. during holidays or short term, perishable campaigns)
  • Automation for Scale (e.g. when you have tons and tons of tests you’d like to run on thousands of templatized landing pages)
  • Targeting (we’ll talk about predictive targeting in “advanced” stuff)
  • Blending Optimization with Attribution (i.e. testing, while at the same time, determining which rules and touch points contribute to the overall experience and goals).
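For intuition, the core of a bandit can be sketched in a few lines. This toy Thompson sampling loop (with made-up conversion rates) keeps a Beta posterior for each variant and gradually shifts traffic toward the better one – exactly the “minimize regret” behavior described above:

```python
import random

random.seed(42)

true_rates = [0.05, 0.10]  # hidden conversion rates; variant 1 is truly better
wins = [0, 0]              # observed conversions per variant
losses = [0, 0]            # observed non-conversions per variant
pulls = [0, 0]             # how often each variant was served

for _ in range(5000):
    # Draw a plausible conversion rate from each variant's Beta posterior...
    samples = [random.betavariate(wins[i] + 1, losses[i] + 1) for i in range(2)]
    arm = samples.index(max(samples))  # ...and serve the most promising variant
    pulls[arm] += 1
    if random.random() < true_rates[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1

print(pulls)  # the better variant ends up with most of the traffic
```

Note how this differs from a fixed 50/50 A/B test: allocation is adaptive, which is precisely why bandit results are awkward to interpret as controlled experiments.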

5. QA and control for “flicker effect”

Flicker effect is a very special type of A/B test validity threat. It’s basically when your testing tool causes a slight delay on the experiment variation, briefly flashing the original content before serving the variation.

There are tons of ways to reduce flicker effect that I won’t go into here (read this article instead). A broader point is simply that you should “measure twice, cut once,” and QA your test on all major devices and categories before serving it live. Better to be prudent and get it right than to fuck up your test data and waste all the effort.

6. Realize that the underlying statistics are different for non-binomial metrics (revenue per visitor, average order value, etc.) – use something like the Mann-Whitney U-Test instead of a Z test.

When you run an A/B test with the intent to increase revenue per visitor or average order value, you can’t just plug your numbers into the same statistical significance calculator as you would with conversion rate tests.

Essentially, you’re looking at a different underlying distribution of your data. Instead of a binomial distribution (did convert vs. didn’t convert), you’re looking at a variety of order sizes, and that introduces the concept of outliers and variance into your calculations. It’s often the case that you’ll have a distribution affected by a very small number of bulk purchasers, who skew the distribution to the right:

In these cases, you’ll want to use a statistical test that does not assume a normal distribution, such as the Mann-Whitney U-Test.
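In practice you’d reach for a library implementation (e.g. scipy.stats.mannwhitneyu), but the test itself is simple enough to sketch with the standard library. This simplified version skips the tie correction that real libraries apply, and the order values (including one whale) are invented:

```python
import math
from statistics import NormalDist

def mann_whitney_u(x, y):
    """Mann-Whitney U test, normal approximation, no tie correction (a simplification)."""
    n1, n2 = len(x), len(y)
    combined = sorted(list(x) + list(y))
    # Assign average ranks, handling tied values:
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    r1 = sum(ranks[v] for v in x)
    u1 = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    return u1, 2 * (1 - NormalDist().cdf(abs(z)))

a = [20, 25, 22, 30, 24, 21, 500, 23, 26, 22]  # control order values (one whale)
b = [35, 40, 33, 38, 36, 42, 34, 39, 37, 41]   # variant order values
u, p = mann_whitney_u(a, b)
print(u, p)
```

Notice that the 500-dollar whale barely matters: the test works on ranks, so one outlier can’t drag the whole result around the way it would with a t-test on raw means.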

7. Trigger the test only for those users affected by the proposed change (lower base rates lead to greater noise and underpowered tests)

Only those affected by the test should be bucketed and included for analysis. For example, if you’re running a test on a landing page, where a modal pops up after scrolling 50%, you’d only want to include those who scroll 50% in the test (those who don’t would never have been the audience intended for the new experience anyway).

The mathematical reasoning for this is that filtering out unaffected users can improve the sensitivity (statistical power) of the test, reducing noise and making it easier for you to find effects/uplifts.

Most of the time, this is a fairly simple solution involving triggering an event at the moment where you’re looking to start analysis (at 50% scroll depth in the above example).

Read more on triggering here.
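The power argument is easy to see with numbers. Suppose (hypothetically) only 40% of landing page visitors ever scroll far enough to see the modal; analyzing everyone dilutes the lift and blows up the required sample size. This sketch reuses the standard two-proportion sample size formula, and assumes untriggered users convert at the baseline rate:

```python
import math
from statistics import NormalDist

def sample_size(p1, p2, alpha=0.05, power=0.80):
    """Visitors per variant needed to detect the difference between p1 and p2."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)

trigger_rate = 0.40   # share of visitors who actually see the change
p1, p2 = 0.10, 0.12   # conversion rates among triggered users

# Analyzing only the triggered users:
n_triggered = sample_size(p1, p2)

# Analyzing everyone: the same lift is spread across all visitors,
# so the measurable difference shrinks by the trigger rate.
p2_diluted = p1 + (p2 - p1) * trigger_rate
n_all = sample_size(p1, p2_diluted)

print(n_triggered, n_all)  # the diluted test needs several times more visitors
```

Same change, same users affected – but the diluted analysis needs a multiple of the traffic to reach the same power, which is the whole case for triggering.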

8. Perform an A/A test to gauge variance and the precision of your testing tool

There’s constant debate as to whether A/A tests are important or not; it sort of depends on your scale and what you hope to learn.

The purpose of an A/A test – testing the original vs the original – is mainly to establish trust in your testing platform. Basically, with a p-value threshold of < .05, you’d expect to see statistically significant results – despite the variants being identical – about 5% of the time.

In reality, A/A tests often open up and introduce you to implementation errors like software bugs. If you truly operate at high scale and run many experiments, trust in your platform is pivotal. An A/A test can help provide some clarity here.
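You can simulate this expectation. The sketch below runs many simulated A/A comparisons on identical 10% converters and counts how often a naive z-test flags “significance” – it should hover around the 5% alpha (all the numbers here are arbitrary simulation parameters):

```python
import math
import random
from statistics import NormalDist

random.seed(7)

def simulate_conversions(n, p):
    """Draw the number of converters out of n visitors at true rate p."""
    return sum(random.random() < p for _ in range(n))

def p_value(c_a, c_b, n):
    """Two-sided two-proportion z-test with equal sample sizes."""
    p_pool = (c_a + c_b) / (2 * n)
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = ((c_b - c_a) / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

trials, n, p = 500, 500, 0.10
false_positives = 0
for _ in range(trials):
    c_a = simulate_conversions(n, p)
    c_b = simulate_conversions(n, p)  # identical experience, by construction
    if p_value(c_a, c_b, n) < 0.05:
        false_positives += 1

rate = false_positives / trials
print(rate)  # share of A/A tests that falsely "won"
```

If your real platform’s A/A results deviate wildly from that ~5%, suspect bucketing bugs, sample pollution, or a broken stats engine before trusting any A/B result it gives you.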

This is a big topic. Ronny Kohavi wrote a great paper on it, which you can find here.

9. Correct for multiple comparisons whenever applicable

We’ve talked a bit about the multiple comparisons problem, and how, when you’re just starting out, it’s best to run a simple A/B test. But you’re eventually going to get curious, and you’ll want to run a test with multiple variants, say an A/B/C/D/E test. This is good – you can often get more consistent results from your program when you test a greater variety of options – but you do want to correct for multiple comparisons when doing this.

It’s fairly simple mathematically. Just use Dunnett’s test or the Sidak correction.

You also need to keep this multiple comparisons problem in mind when you do post-test analysis on segments. Basically, if you look at enough segments, you’ll find a statistically significant result. The same principle applies (you’re increasing the risk of a false positive with every new comparison).

When I do post-test segmentation, I often use it more as a tool to find research questions than to find answers and insights to base decisions on. So if I find a “significant” lift in a given segment, say Internet Explorer visitors in Canada, I note that as an insight that may or may not be worth testing. I don’t just implement a personalization rule, as doing that each time would certainly lead to organizational complexity, and would probably result in many false positives.

10. Avoid multiple concurrent experiments and make use of experiment “swim lanes”

Another problem that comes with scale is running multiple concurrent experiments. Basically, if you run two tests, and they’re being run on the same sample, you may have interaction effects that ruin the validity of the experiment.

Best case scenario: you (or your testing tool) create technical swim lanes where a given visitor can only be exposed to one experiment at a time. This automatically prevents that sort of cross-pollination and reduces sample pollution.

A scrappier solution, one more fit for those running fewer tests, is to run your proposed experiments through a central team who gives the green-light and can see, at a high level, where there may be interaction effects, and avoid them.

11. Don’t project precise uplifts onto your future expectations from those you see during an experiment.

So, you got a 10% lift at 95% statistical significance. That means you get to celebrate that win in your next meeting. You do want to state the business value of an experiment like this, of course – a 10% relative lift means little in isolation – so you also include a projection of what this 10% lift means for the business. “We can expect this to bring us 1,314 extra subscriptions per month,” you say.

While I love the idea of tying things back to the business, you want to tread lightly in matters of certainty, particularly when you’re dealing with projections.

An A/B test, despite misconceptions, can only truly tell you the difference between variants during the time the experiment is running. We do hope that differences between variants persist past the duration of the test itself, which is why we go through so much trouble in our experiment design to make sure we’re randomizing properly and testing on a representative sample.

But a 10% lift during the test does not mean you’ll see a 10% lift during the next few months.

If you do absolutely need to project some sort of expected business results, at least do so using confidence intervals or a margin of error.

“We can expect, given the limitations of our test, to see X more subscriptions on the low side, and on the high side, we may see as many as Y more subscriptions, but there’s a level of uncertainty involved in making these projections. Regardless, we’re confident our result is positive and will result in an uptick in subscriptions.”

Nuance may be boring and disappointing, but expectation setting is cool.

12. If you plan on implementing the new variation in the case of an inconclusive test, make sure you’re running a two-tailed hypothesis test to account for the possibility that the variant is actually worse than the original.

One-tail vs. two-tail a/b testing. This can seem like a somewhat pedantic debate in many cases, but if you’re running an A/B test where you expect to roll out the variant even if the test is inconclusive, you will want to protect your downside with a two-sided hypothesis test.

Read more on the difference between one-tail and two-tail A/B tests here.
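For intuition, here’s a small sketch showing how the same test statistic can clear a one-tailed significance threshold while missing the two-tailed one (the z value of 1.8 is hypothetical):

```python
import math

def p_values(z):
    """One- and two-tailed p-values for a z statistic (normal approximation)."""
    one_tailed = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z)
    return one_tailed, 2 * one_tailed               # two-tailed doubles the tail

# A hypothetical test statistic of z = 1.8:
one, two = p_values(1.8)
print(f"one-tailed p = {one:.4f}, two-tailed p = {two:.4f}")
# The same result can pass alpha = 0.05 one-tailed yet fail it two-tailed.
```

The two-tailed p-value is simply twice the one-tailed tail probability, which is why the two-sided test is the more conservative choice when the variant could plausibly be worse.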

13. When attempting to improve a “micro-conversion” such as click through rate, make sure it has a downstream effect and acts as a causal component to the business metric you care about. Otherwise, you’re just shuffling papers.

Normally, you should choose a metric that matters to your business: conversion rate, revenue per visitor, activation rate, etc.

Sometimes, however, that’s not possible or feasible, so you work on moving a “micro-conversion” like click-through rate or the number of people who use a search function. Often, these micro-conversions are correlated with your important business metric – they tend to move together – but the relationship isn’t necessarily causal.

Increased CTR might not increase your bottom line (Image Source)

A good example: you find a piece of data showing that people who use your search bar purchase more often and at higher volumes than those who don’t. So you run a test that tries to increase the number of people using that search feature.

This is fine, but when you’re analyzing the data, make sure your important business metric moves too. You increased the number of people who use the search feature – did that also increase purchase conversion rate and revenue? If not, you’re shuffling papers.
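As an illustrative sketch (all numbers invented), you can run the same significance test on both the micro-conversion and the downstream business metric. If only the first one moves, you’re likely shuffling papers:

```python
import math

def two_prop_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, math.erfc(abs(z) / math.sqrt(2))

# Micro-conversion moved: search usage up from 10% to 12% of 10,000 visitors...
z1, p1 = two_prop_z(1000, 10000, 1200, 10000)
# ...but the business metric barely budged: purchases ~3.0% vs ~3.1%.
z2, p2 = two_prop_z(300, 10000, 310, 10000)
print(f"search usage: p = {p1:.4f}; purchases: p = {p2:.4f}")
```

A clearly significant lift on the micro-conversion paired with a flat, non-significant business metric is the “shuffling papers” scenario described above.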

14. Use a holdback set to estimate the ROI and performance of your testing program

Want to know the ROI of your program? Some top programs make use of a “holdback set” – keeping a small subset of your audience on the original version of your experience. This is actually crucial when analyzing the merits of personalization/targeting rules and machine learning-based optimization systems, but it’s also valuable for optimization programs overall.

A universal holdback – keeping, say, 5% of traffic as a constant control group – is just one way to parse out your program’s ROI. You can also do:

  • Victory Lap – Occasionally, run a split test combining all winning variants over the last 3 months against a control experience to confirm the additive uplift of those individual experiments.
  • Re-tests – Re-test individual, winning tests after 6 months to confirm that “control” still underperforms (and the rate at which it does).

If you’re only running a test or two per month, these system-level decisions may be less important. But if you’re running thousands of tests, it’s important to start learning about program effectiveness as well as the potential “perishability” or decay of any given test result.
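Mechanically, a universal holdback is usually implemented with deterministic hash-based bucketing, so the same user always lands in the same group across every experiment. A minimal sketch – the 5% split and the salt string are arbitrary choices for illustration:

```python
import hashlib

def assign(user_id: str, holdback_pct: float = 0.05) -> str:
    """Deterministically bucket a user; the holdback slice never sees experiments.

    Hashing the salted user ID gives a stable number in [0, 1], so the
    same user always maps to the same bucket across experiments.
    """
    digest = hashlib.sha256(f"holdback:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "holdback" if bucket < holdback_pct else "experiments"

print(assign("user-42"))  # same answer every time for the same user
```

The salt (`"holdback:"` here) keeps this assignment independent of any per-experiment bucketing you do downstream, which is what makes the holdback a clean, long-lived control group.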

Here are a bunch of other ways to analyze the ROI of a program (just don’t use a simple time period comparison, please).

Advanced A/B Testing Issues – Mostly Fringe Cases That Some Should Still Consider

  1. Look out for sample ratio mismatch.
  2. Consider the case for a non-inferiority test when you only want to mitigate potential downsides on a proposed change.
  3. Use predictive targeting to exploit segments who respond favorably to an experience.
  4. Use a futility boundary to mitigate regret during a test.
  5. When a controlled experiment isn’t possible, estimate significance using a Bayesian causal model.

Advanced A/B Testing Issues: Explained

1. Look out for sample ratio mismatch.

Sample Ratio Mismatch is a special type of validity threat. In an A/B test with two variants, you’d hope that your traffic would be randomly and evenly allocated among both variants. However, in certain cases, we see that the ratio of traffic allocation is off more than would be natural. This is known as “sample ratio mismatch.”

This, however, is another topic I’m going to politely duck out of explaining, and instead, link to the master, Ronny Kohavi, and his work.

He also has a handy calculator so you can see if your test is experiencing a bug like this.
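If you’d rather check in code, a simple SRM test compares the observed allocation against the expected split. This sketch uses a two-sided normal-approximation test with hypothetical traffic counts:

```python
import math

def srm_pvalue(n_control, n_variant, expected_ratio=0.5):
    """Two-sided p-value that the observed allocation matches the expected split."""
    n = n_control + n_variant
    expected = n * expected_ratio
    sd = math.sqrt(n * expected_ratio * (1 - expected_ratio))
    z = (n_control - expected) / sd
    return math.erfc(abs(z) / math.sqrt(2))

# A 50/50 test that delivered 50,000 vs. 51,500 visitors looks close, but...
p = srm_pvalue(50000, 51500)
print(f"SRM p-value: {p:.2e}")  # far below 0.001 -> investigate before analyzing
```

A 50,000 vs. 51,500 split looks harmless at a glance, but at this sample size it is wildly improbable under a true 50/50 allocation – a red flag to debug before trusting any results from the test.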

2. Consider the case for a non-inferiority test when you only want to mitigate potential downsides on a proposed change

Want to run a test solely to mitigate risk and avoid implementing a suboptimal experience? You could try a “non-inferiority” test (as opposed to the normal “superiority” test) for easy-decision tests and for tests whose side benefits fall outside your measurement capability (e.g. brand cohesiveness).

This is a complicated topic, so I’ll link out to a post here.
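Without unpacking the full topic, the core mechanic fits in a few lines: non-inferiority holds when the lower confidence bound on (variant − control) stays above a pre-chosen margin. A sketch with hypothetical numbers and a 1-percentage-point margin:

```python
import math

def non_inferior(conv_a, n_a, conv_b, n_b, margin=0.01):
    """One-sided check: is the variant no worse than control by more than `margin`?

    Non-inferiority is declared if the one-sided 95% lower confidence
    bound on (variant - control) sits above -margin.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = 1.6449  # one-sided 95% critical value
    lower_bound = (p_b - p_a) - z * se
    return lower_bound > -margin

# Variant converts slightly below control (4.9% vs 5.0%), but within the margin:
print(non_inferior(1000, 20000, 980, 20000, margin=0.01))
```

The decision rests entirely on the margin you choose up front, so pick it based on how much downside the business can genuinely tolerate, and pick it before you run the test.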

3. Use predictive targeting to exploit segments who respond favorably to an experience.

A/B testing is cool, as is personalization. But after a while, your organization may be operating at such a scale that it isn’t feasible to manage, let alone choose, targeting rules for all those segments you’re hoping to reach. This is a great use case for machine learning.

Solutions like Conductrics have powerful predictive targeting engines that can find and target segments who respond better to a given experience than the average user. So Conductrics (or another solution) may find that rural visitors using smartphones convert better with Variant C. You can weigh the ROI of setting up that targeting rule and do so, managing it programmatically.

4. Use a futility boundary to mitigate regret during a test

This is basically a testing methodology to improve efficiency and allow you to stop A/B tests earlier. I’m not going to pretend I fully grok this one or have used it, but here’s a guide if you’d like to give it a try. This is something I’m going to look into trying out in the near future.

5. When a controlled experiment isn’t possible, estimate significance using a Bayesian causal model

Often, when you’re running experiments, particularly those that are not simple website changes like landing page CTAs, you may not be able to run a fully controlled experiment. I’m thinking of things like SEO changes, campaigns you’re running, etc.

In these cases, I usually try to estimate how impactful my efforts were using a tool like GA Effect.

It appears my SEO efforts have paid off marginally

Conclusion

As I mentioned up front, by its very nature, A/B testing is a statistical process, and statistics deals with the realm of uncertainty. Therefore, while rules and guidelines can help reduce errors, no decision tree can produce a perfect, error-free testing program.

The best weapon you have is your own mind, inquisitive, critical, and curious. If you come across a fringe issue, discuss it with colleagues or Google it. There are tons of resources and smart people out there.

I’m not done learning about experimentation. I’ve barely scratched the surface. So I may reluctantly come to find, in a few years, that this list is naive or ill-suited for actual business needs. Who knows.

But that’s part of the point: A/B testing is difficult, worthwhile, and there’s always more to learn about it.


Also, thanks to Erik Johnson, Ryan Farley, Joao Correia, Shanelle Mullin, and David Khim for reading this and adding suggestions before publication.

Alex Birkett
Alex Birkett is a product growth and experimentation expert as well as co-founder of Omniscient Digital, a premium content marketing agency. He enjoys skiing, making and experiencing music, reading and writing, and language learning. He lives in Austin, Texas with his dog, Biscuit.
