What’s the Ideal A/B Testing Strategy?

Last Updated on October 11, 2020 by Alex Birkett

A/B testing is, at this point, widespread and common practice.

Whether you’re a product manager hoping to quantify the impact of new features (and avoid the risk of negatively impacting growth metrics) or a marketer hoping to optimize a landing page or newsletter subject line, experimentation is the tried-and-true gold standard.

It’s not only incredibly fun, but also useful and efficient.

In the span of 2-4 weeks, you can try out an entirely new experience and approximate its impact. This, in and of itself, should allow creativity and innovation to flourish, while simultaneously capping the downside of shipping suboptimal experiences.

But even if we all agree on the value of experimentation, there’s a ton of debate and open questions as to how to run A/B tests.

A/B Testing is Not One Size Fits All

One set of open questions about A/B testing strategy is decidedly technical:

  • Which metric matters? Do you track multiple metrics, one metric, or build a composite metric?
  • How do you properly log and access data to analyze experiments?
  • Should you build your own custom experimentation platform or buy from a software vendor?
  • Do you run one-tailed or two-tailed t-tests, Bayesian A/B testing, or something else entirely (sequential testing, bandit testing, etc.)? [1]

The other set of questions, however, is more strategic:

  • What kinds of things should you test?
  • In what order should you prioritize your test ideas?
  • What goes into a proper experiment hypothesis?
  • How frequently should you test, and how many tests should you run?
  • Where do you get ideas for A/B tests?
  • How many variants should you run in a single experiment?

These are difficult questions.

It could be the case that there’s a single, universal answer to these questions, but I personally doubt it. Rather, I think the answers differ based on several factors: the culture of the company you work at, the size and scale of your digital properties, your traffic and testing capabilities, your tolerance for risk and reward, and your philosophy on testing and ideation.

So this article, instead, will cover the various answers for how you could construct an A/B testing strategy — an approach at the program level — to drive consistent results for your organization.

I’m going to break this into two macro-sections:

  1. Core A/B testing strategy assumptions
  2. The three levers that impact A/B testing strategy success on a program level

Here are the sections I’ll cover with regard to assumptions and a priori beliefs:

  1. A/B testing is inherently strategic (or, what’s the purpose of A/B testing anyway?)
  2. A/B testing always has costs
  3. The value and predictability of A/B testing ideas

Then I’ll cover the three factors that you can impact to drive better or worse results programmatically:

  1. Number of tests run
  2. Win rate
  3. Average win size per winning test

At the end of this article, you should have a good idea — based on your core beliefs and assumptions as well as the reality of your context — as to which strategic approach you should take with experimentation.

A/B Testing is Inherently Strategic

A/B testing is strategic in and of itself; by running A/B tests, you’re implicitly deciding that an aspect of your strategy is to spend the additional time and resources to reduce uncertainty in your decision making. A significance test is itself an exercise in quantifying uncertainty.

This is a choice.

You don’t need to validate features as they’re shipped or copy as it’s written. Neither do you need to validate changes as you optimize a landing page; you can simply change the button color and move on, if you’d like.

So, A/B testing isn’t a ‘tactic,’ as many people would suggest. A/B testing is a research methodology at heart – a tool in the toolkit – but by utilizing that tool, you’re making a strategic decision that data will decide, to a large extent, what actions you’ll take on your product, website, or messaging (as opposed to opinion or other methodologies like time series comparison).

How you choose to employ this tool, however, is another strategic matter.

For instance, you don’t have to test everything (but you can test everything, as well).

Typically, there are some decision criteria as to what we test, how often, and how we run tests.

This can be illustrated by a risk quadrant I made: low-risk, low-certainty decisions can be decided with a coin flip, while higher-risk decisions that require higher certainty are great candidates for A/B tests.

Even with A/B testing, though, you’ll never achieve 100% certainty on a given decision.

This is due to many factors, including experiment design (there’s functionally no such thing as 100% statistical confidence) but also things like perishability and how representative your test population is.

For example, macro-economic changes could alter your audience behavior, rendering a “winning” A/B test now a loser in the near future.

A/B Testing Always Has Associated Costs

There ain’t no such thing as a free lunch.

On the surface, you have to invest in the A/B testing technology or at least the human resources to set up an experiment. So you have fixed and visible costs already with technology and talent. An A/B test isn’t going to run itself.

You’ve also got time costs.

An A/B test typically takes 2-4 weeks to run. The period that you’re running that test is a time period in which you’re not ‘exploiting’ the optimal experience. Therefore, you incur ‘regret,’ or the “difference between your actual payoff and the payoff you would have collected had you played the optimal (best) options at every opportunity.”
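
As a rough sketch of what that regret can amount to during a test window (every number below is a made-up assumption, not a benchmark):

```python
# Hypothetical regret accrued while a 4-week, 50/50 A/B test is running
visitors_per_week = 10_000
weeks = 4
control_rate = 0.040       # assumed conversion rate of the current experience
variant_rate = 0.044       # assumed conversion rate of the better challenger
value_per_conversion = 50  # assumed dollars per conversion

# Half of traffic is held on the inferior control for the duration of the test
control_traffic = visitors_per_week * weeks * 0.5
regret = control_traffic * (variant_rate - control_rate) * value_per_conversion
print(f"Regret over the test period: ${regret:,.0f}")  # $4,000
```

In other words, part of the cost of certainty is the conversions you forgo by not sending everyone to the better experience right away.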

This is related to but still distinct from another cost: opportunity costs.

The time you spent setting up, running, and analyzing an experiment could be spent doing something else. This is especially important and impactful at the startup stage, when ruthless prioritization is the difference between a sinking ship and another year above water.

An A/B test also usually has a run up period of user research that leads to a test hypothesis. This could include digital analytics analysis, on-site polls using Qualaroo, heatmap analysis, session replay video, or user tests (including Copytesting). This research takes time, too.

The expected value of an A/B test is the expected value of its profit minus the expected value of its cost (and remember, expected value is calculated by multiplying each of the possible outcomes by the likelihood each outcome will occur and then summing all of those values).

If the expected value of an A/B test isn’t positive, it’s not worth running it.

For example, if the average A/B test costs $1,000 and the average expected value of an A/B test is $500, it’s not economically feasible to run the test. Therefore, you can reduce the costs of the experiment, or you can hope to increase the win rate or the average uplift per win to tip the scales in your favor.
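
Here’s a minimal sketch of that expected-value arithmetic, using the same ballpark figures (the win rate, payoff, and cost are assumptions for illustration):

```python
# Hypothetical expected value of running a single A/B test
win_rate = 0.20        # assumed probability the test produces a winner
payoff_if_win = 2_500  # assumed incremental profit when it does win
payoff_if_not = 0      # inconclusive or losing tests assumed to add nothing
cost_to_run = 1_000    # tooling, design, development, and analysis time

expected_payoff = win_rate * payoff_if_win + (1 - win_rate) * payoff_if_not  # $500
expected_value = expected_payoff - cost_to_run                               # -$500
print(expected_value)  # negative, so this test isn't worth running as specified
```

If the expected payoff per test can’t clear the cost, your options are exactly the levers discussed later: run cheaper tests, win more often, or win bigger.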

A/B testing is a tool used to reduce uncertainty in decision making. User research is a tool used to reduce uncertainty in what you test with the hope that what you test has a higher likelihood of winning and winning big. Therefore, you want to know the marginal value of additional information collected (which is a cost) and know when to stop collecting additional information as you hit the point of diminishing returns. Too much cost outweighs the value of A/B testing as a decision making tool.

This leads to the last open question: can we predict which ideas are more likely to win?

What Leads to Better A/B Testing Ideas?

It’s common practice to prioritize A/B tests. After all, you can’t run them all at once.

Prioritization usually falls on a few dimensions: impact, ease, confidence, or some variation of these factors (see the scoring sketch after this list).

  • Impact is quantitative. Based on the traffic to a given page, or the number of users a test will affect, you can estimate what the impact may be.
  • Ease is also fairly objective. There’s some estimation involved, but with some experience you can estimate the cost of setting up a test in terms of complexity, design and development resources, and the time it will take to run.
  • Confidence (or “potential” in the PIE model) is subjective. It takes into account the predictive capabilities of the individual proposing the test. “How likely is it that this test will win in comparison to other ideas,” you’re asking.
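
Building on those three dimensions, here’s a minimal sketch of an ICE-style prioritization score (the 1–10 scale, equal weighting, and example ideas are assumptions, not a prescribed model):

```python
# Hypothetical ICE-style backlog scoring: rate each idea 1-10 on each dimension
test_ideas = {
    "Simplify checkout form":      {"impact": 8, "confidence": 6, "ease": 5},
    "Add social proof to pricing": {"impact": 6, "confidence": 7, "ease": 9},
    "Rewrite homepage hero copy":  {"impact": 7, "confidence": 4, "ease": 8},
}

def ice_score(scores: dict) -> float:
    # Simple unweighted average; some teams weight impact more heavily
    return (scores["impact"] + scores["confidence"] + scores["ease"]) / 3

for name, scores in sorted(test_ideas.items(), key=lambda kv: ice_score(kv[1]), reverse=True):
    print(f"{ice_score(scores):.1f}  {name}")
```

Impact and ease can be estimated fairly objectively; the confidence number is where belief systems (research, patterns, or gut feel) come in.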

How does one develop the Fingerspitzengefühl to reliably predict winners? It depends on your belief system, but some common methods include:

  • Bespoke research and rational evidence
  • Patterns, competitor examples, historical data (also rational evidence)
  • Gut feel and experience

In the first method, you conduct research and analyze data to come up with hypotheses based on evidence you’ve collected. Forms of data collection tend to be from user testing, digital analytics, session replays, polls, surveys, or customer interviews.

Patterns, historical data, and inspiration from competitors are also forms of evidence collection, but they don’t presuppose that original research is superior to aggregated data collected from other websites or from your own historical tests.

Here, you can group tests of similar theme or with similar hypotheses, aggregate and analyze their likelihood of success, and prioritize tests based on confidence using meta-analyses.

For example, you could group a dozen tests you’ve run on your own site in the past year having to do with “social proof” (e.g., adding micro-copy that says “trusted by 10,000 happy customers”).

You could include data from competitors or from an experiment pattern aggregator like GoodUI. Strong positive patterns could suggest that, despite differences in context, the underlying idea or theme is strong enough to warrant prioritizing this test above others with weaker pattern-based evidence.
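
A minimal sketch of that kind of grouping, assuming you keep a simple log of past test results (the themes, outcomes, and lifts below are invented):

```python
# Hypothetical meta-analysis of past tests grouped by theme
from collections import defaultdict

past_tests = [
    {"theme": "social proof", "winner": True,  "lift": 0.06},
    {"theme": "social proof", "winner": False, "lift": -0.01},
    {"theme": "social proof", "winner": True,  "lift": 0.03},
    {"theme": "urgency",      "winner": False, "lift": 0.00},
    {"theme": "urgency",      "winner": True,  "lift": 0.02},
]

by_theme = defaultdict(list)
for test in past_tests:
    by_theme[test["theme"]].append(test)

for theme, tests in by_theme.items():
    win_rate = sum(t["winner"] for t in tests) / len(tests)
    avg_lift = sum(t["lift"] for t in tests) / len(tests)
    print(f"{theme}: {len(tests)} tests, {win_rate:.0%} win rate, {avg_lift:+.1%} avg lift")
```

Themes with a consistently positive record would earn more confidence in the backlog than themes with a flat or negative one.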

Patterns can also include what we call “best practices.” While we may not always quantify these practices through meta-analyses like GoodUI does, there are indeed many common practices that have been developed by UX experts and optimizers over time. [2]

Finally, some believe that you simply develop an eye for what works and what doesn’t through experience. After years of running tests, you can spot a good idea from a bad one.

As much as I’m trying to objectively lay out the various belief systems and strategies, I have to tell you, I think the last method is silly.

As Matt Gershoff put it, predicting outcomes is basically a random process, so those who end up being ‘very good’ at forecasting are probably outliers or exemplifying survivorship bias (the same phenomenon Nassim Taleb covers in Fooled by Randomness with regard to stock pickers).

Mats Einarsen adds that this will reward cynicism: most tests don’t win, so one can always improve prediction accuracy by being a curmudgeon.

It’s also possible to believe that additional information or research does not improve your chance of setting up a winning A/B test, or at least not enough to warrant the additional cost in collecting it.

In this world of epistemic humility, prioritizing your tests based on the confidence you have in them doesn’t make any sense. Ideas are fungible, and anyway, you’d rather be surprised by a test you didn’t think would win than validate your preconceived notions.

In this world, we can imagine ideas being somewhat random and evenly distributed, some winning big and some losing big, but most doing nothing at all.

This view has backing in various fields. Take, for instance, this example from The Mating Mind by Geoffrey Miller (bolding mine):

“Psychologist Dean Keith Simonton found a strong relationship between creative achievement and productive energy. Among competent professionals in any field, there appears to be a fairly constant probability of success in any given endeavor. Simonton’s data show that excellent composers do not produce a higher proportion of excellent music than good composers — they simply produce a higher total number of works. People who achieve extreme success in any creative field are almost always extremely prolific. Hans Eysenck became a famous psychologist not because all of his papers were excellent, but because he wrote over a hundred books and a thousand papers, and some of them happened to be excellent. Those who write only ten papers are much less likely to strike gold with any of them. Likewise with Picasso: if you paint 14,000 paintings in your lifetime, some of them are likely to be pretty good, even if most are mediocre. Simonton’s results are surprising. The constant probability-of-success idea sounds very counterintuitive, and of course there are exceptions to this generalization. Yet Simonton’s data on creative achievement are the most comprehensive ever collected, and in every domain that he studied, creative achievement was a good indicator of the energy, time, and motivation invested in creative activity.”

So instead of trying to predict the winners before you run the test, you throw out the notion that that’s even possible, and you just try to run more options and get creative in the options you’ll run.

As I’ll discuss in the “A/B testing frequency” section, this accords with something like Andrew Anderson’s “Discipline-Based Testing Methodology,” but also with what I call the “Evolutionary Tinkering” strategy. [3]

Either you can try to eliminate or crowd out lower probability ideas, which implies you believe you can predict with a high degree of accuracy the outcome of a test.

Or you can iterate more frequently or run more options, essentially increasing the probability that you will find the winning variants.

Summary of A/B Testing Strategy Assumptions

How you deal with uncertainty is one factor that could alter your A/B testing strategy. How you think about costs versus rewards is another. Finally, how you assess the quality and predictability of ideas also shapes your approach to A/B testing.

As we walk through various A/B testing strategies, keep these things in mind:

  • Attitudes and beliefs about information and certainty
  • Attitudes and beliefs about predictive validity and quality of ideas
  • Attitudes about costs vs rewards and expected value, as well as quantitative limitations on how many tests you can run and detectable effect sizes.

These factors will change one or both of the following:

  • What you choose to A/B test
  • How you run your A/B tests, singularly and at a program level

What Are the Goals of A/B Testing?

One’s goals in running A/B tests can differ slightly, but they all tend to fall into one or more of these buckets:

  1. Increase/improve a business metric
  2. Risk management/cap downside of implementations
  3. Learn things about your audience/research

Of course, running an A/B test will naturally accomplish all of these goals. Typically, though, you’ll be more interested in one than the others.

For example, you hear a lot of talk around this idea that “learning is the real goal of A/B testing.” This is probably true in academia, but in business that’s basically total bullshit.

You may, periodically, run an A/B test solely to learn something about your audience, though this is typically done with the assumption that the learning will help you either grow a business metric or cap risk later on.

Most A/B tests in a business context wouldn’t be run if there weren’t the underlying goal of improving some aspect of your business. No ROI expectation, no buy-in and resources.

Therefore, there’s not really an “earn vs learn” dichotomy (with the possible exception of algorithmic approaches like bandits or evolutionary algorithms); you’ll learn something from every test you run, but more importantly, the primary goal is to add business value.

So if we assume that our goals are either improvement or capping the downside, then we can use these goals to map onto different strategic approaches to experimentation.

The Three Levers of A/B Testing Strategy Success

Most companies want to improve business metrics.

Now, the question becomes, “what aspects of A/B testing can we control to maximize the business outcome we hope to improve?” Three things (a back-of-the-envelope model follows this list):

  1. The number of tests (or variants) you run (aka frequency)
  2. The % of winning tests (aka win rate)
  3. The effect size of winning tests (aka average win size)
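
Taken together, a crude back-of-the-envelope model simply multiplies those three levers (every number below is an assumption):

```python
# Hypothetical program-level model: value ~ tests run x win rate x average win size
tests_per_year = 50
win_rate = 0.20                    # roughly 1 in 5 tests wins
avg_annual_value_per_win = 20_000  # assumed incremental value of a winning test

program_value = tests_per_year * win_rate * avg_annual_value_per_win
print(f"Expected annual program value: ${program_value:,.0f}")  # $200,000
```

Doubling any one of the three inputs doubles the output, which is why the rest of this section treats them as separate levers.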

1. A/B Testing Frequency – Number of Variants

The number of variants you test could be the number of A/B tests you run or the number of variants in a single A/B/n test – and there’s debate between the two approaches here – but the goal of either is to maximize the number of “at bats,” or attempts at success.

This can be for two reasons.

First, to cap the downside and manage risk at scale, you should test everything you possibly can. No feature or experience should hit production without first making sure it doesn’t worsen your business metrics. This is common in large companies with mature experimentation programs, such as booking.com, Airbnb, Facebook, or Microsoft.

Second, tinkering and innovation require a lot of attempts. The more attempts you make, the greater the chance of success. This is particularly true if you believe ideas are fungible (i.e., any given idea is not special or more likely than any other to move the needle). My quote above from Geoffrey Miller’s “The Mating Mind” illustrates why this is the case.

Another reason for this approach: a shitload of studies (the appropriate scientific term for “a large quantity”) have shown that most A/B tests are inconclusive, and the few wins tend to pay for the program as a whole, not unlike venture capital portfolios.

Take, for example, this histogram Experiment Engine (since acquired by Optimizely) put out several years ago:

Most tests hover right around that 0% mark.

Now, it may be the case that all of these tests were run by idiots and you, as an expert optimizer, could do much better.

Perhaps.

But this sentiment is replicated by both data and experience.

Take, for example, VWO’s research that found 1 out of 7 tests are winners. A 2009 paper pegged Microsoft’s win rate at about 1 out of 3. And in 2017, Ronny Kohavi wrote:

“At Google and Bing, only about 10% to 20% of experiments generate positive results. At Microsoft as a whole, one-third prove effective, one-third have neutral results, and one-third have negative results.”

I’ve also seen a good amount of research that wins we do see are often illusory; false positives due to improper experiment design or simply lacking in external validity. That’s another issue entirely, though.

Perhaps your win rate will be different. For example, if your website has been neglected for years, you can likely get many quick wins using patterns, common sense, heuristics, and some conversion research. Things get harder when your digital experience is already good, though.

If we’re to believe that most ideas are essentially ineffective, then it’s natural to want to run more experiments. This increases your chance of big wins simply due to more exposure. This is a quote from Nassim Taleb’s Antifragile (bolding mine):

“Payoffs from research are from Extremistan; they follow a power-law type of statistical distribution, with big, near-unlimited upside but, because of optionality, limited downside. Consequently, payoff from research should necessarily be linear to number of trials, not total funds involved in the trials. Since the winner will have an explosive payoff, uncapped, the right approach requires a certain style of blind funding. It means the right policy would be what is called ‘one divided by n’ or ‘1/N’ style, spreading attempts in as large a number of trials as possible: if you face n options, invest in all of them in equal amounts. Small amounts per trial, lots of trials, broader than you want. Why? Because in Extremistan, it is more important to be in something in a small amount than to miss it. As one venture capitalist told me: “The payoff can be so large that you can’t afford not to be in everything.”

Maximizing the number of experiments run also deemphasizes ruthless prioritization based on subjective ‘confidence’ in hypotheses (though not entirely) and instead seeks to reduce the cost of experimentation and enable a broader swath of employees to run experiments.

The number of variants you test is capped by the amount of traffic you have, your resources, and your willingness to try out and source ideas. These limitations can be represented by testing capacity, velocity, and coverage.
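
To make the traffic cap concrete, here’s a rough two-proportion sample-size estimate using the standard normal approximation (the baseline rate and target lift are assumptions):

```python
# Rough sample size per variant to detect a relative lift in conversion rate
# (two-sided alpha = 0.05, power = 0.80, normal approximation)
from math import ceil, sqrt

def sample_size_per_variant(baseline_rate: float, relative_lift: float) -> int:
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha, z_beta = 1.96, 0.84  # critical values for 5% significance, 80% power
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * pooled * (1 - pooled))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Detecting a 10% relative lift on a 3% baseline takes roughly 53,000 visitors per variant
print(sample_size_per_variant(0.03, 0.10))
```

Each additional variant needs its own share of that traffic, which is why traffic, not ideas, is often the binding constraint.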

Claire Vo, one of the sharpest minds in experimentation and optimization, gave a brilliant talk on this at CXL Live a few years ago.

2. A/B Testing Win Rate

The quality of your tests matters, too. Doesn’t matter if you run 10,000 tests in a year if none of them move the needle.

While many people may think running a high tempo testing program is diametrically opposed to test quality, I don’t think that’s necessarily the case. All you need is to make sure your testing is efficient, your data is trustworthy, and you’re focusing on the impactful areas of your product, marketing, or website.

Still, if you’re focused on improving your win rate (and you believe you can predict the quality of ideas or improve the likelihood of success), it’s likely you’ll run fewer tests and place a higher emphasis on research and crafting “better” tests.

As I mentioned above, there are two general ways that optimizers try to increase their win rate: research and meta-analysis patterns.

Conversion research

Research includes both quantitative and qualitative research – surveys, heat maps, user tests and Google Analytics. One gathers enough data to diagnose what is wrong and potentially some data to build hypotheses as to why it is wrong.

See the “ResearchXL model” as well as most CRO agencies’ and in-house programs’ approaches. This approach is what I’ll call the “Doctor’s Office Strategy.” Before you begin operating on a patient at random, you first want to take the time to diagnose what’s wrong with them.

Patterns, best practices, and observations

Patterns are another source of data.

You can find experiences that have been shown to work in other contexts and infer transferability onto your situation. Jakub Linowski, who runs GoodUI, is an advocate of this approach:

“There are thousands and thousands of experiments being run and if we just pay attention to all that kind of information and all those experiments, there’s most likely some things that repeat over and over that reproduced are largely generalizable. And those patterns I think are very interesting for reuse and exploitation across projects.”

Other patterns can be more qualitative. One can read behavioral psychology studies, Cialdini’s Influence, or just look at other companies’ websites, take what they seem to be doing, and try it on your own site.

Both the research and the patterns approach have this in common: they inherently assume that a certain quality and quantity of collected information leads to better experiment win rates.

Additionally, the underlying ‘why’ of a test (sometimes called the ‘hypothesis’) is very important in these strategies. By contrast, in something like the Discipline-Based Testing Methodology, the narrative or the “why” doesn’t matter, only that the test is efficient and makes money. [4] [4.5]

3. Effect Size of A/B Testing Wins

Finally, the last input is the effect size of a winning test. Patterns and research may help predict if a test will win, but not by how much.

This input, then, typically involves the most surprise and serendipity. It still requires that you diagnose the areas of exposure that have the highest potential for impact (e.g. running a test on a page with 1,000 visitors is worse than running a test on a page with 1,000,000).

Searching for big wins also requires a bit of “irrational” behavior. As Rory Sutherland says, “Test counterintuitive things because no one else will!” [5]

The mark of a team working to increase the magnitude of a win is a willingness to try out wacky, outside-the-box, creative ideas. Not only do you want more “at bats” (thus exposing yourself to more potential positive black swans), but you also want to increase the beta of your options, or the diversity and range of feasible options you test. This is sometimes referred to as “innovative testing” vs. incremental testing. To continue the baseball analogy, you’re seeking home runs, not just grounders to get on base.

All of us want bigger wins as well as a greater win rate. How we go about accomplishing those things, though, differs.

CXL’s ResearchXL model seeks to maximize the likelihood of a winning test through understanding the users. Through research, one can hone in on high impact UX bottlenecks and issues with the website, and use further research to ideate treatments.

Andrew Anderson’s Discipline-Based Testing Methodology also diagnoses high-impact areas of the property, likely through quantitative ceilings, though this approach ‘deconstructs’ the proposed treatments. Instead of taking research or singular experiences, this approach starts from the assumption that we don’t know what will work and that, in fact, being wrong is the best possible thing that can happen. As Andrew wrote:

“The key thing to think about as you build and design tests is that you are maximizing the beta (range of feasible options) and not the delta. It is meaningless what you think will win, it is only important that something wins. The quality of any one experience is meaningless to the system as a whole.

This means that the more things you can feasibly test while maximizing resources, and the larger the range you test, the more likely you are to get a winner and more likely to get a greater outcome. It is never about a specific test idea, it is about constructing every effort (test) to maximize the discovery of information.”

In this approach, then, you don’t just want to run more A/B tests; you want to run the maximum number of variants possible, including some that are potentially “irrational.” One can only hope that Comic Sans wins a font test, because we can earn money from the surprise.

Reducing the Cost of Experimentation Increases Expected Value, Always

To summarize, you can increase the value from your testing program in two ways: lower the cost, or increase the upside.

Many different strategies exist to increase the upside, but all cost reduction strategies look similar:

  • Invest in accessible technology
  • Make sure your data is accessible and trustworthy
  • Train employees on experimentation and democratize the ability to run experiments

The emphasis here isn’t primarily on predicting wins or win rate; rather, it’s on reducing the cost, organizationally and technically, of running experiments.

Sophisticated companies with a data-driven culture usually have internal tools, data pipelines, and center-of-excellence programs that encourage, enable, and educate others to run their own experiments (think Microsoft, Airbnb, or booking.com).

When you seek to lower the cost of experimentation and run many attempts, I call that the “Evolutionary Tinkering Strategy.”

No one A/B test will make or break you, but the process of testing a ton of things will increase the value of the program over time and, more importantly, will let you avoid shipping bad experiences.

This is different than the Doctor’s Office Strategy for two reasons: goals and resources.

Companies employing the Doctor’s Office Strategy are almost always seeking to improve business metrics, and they almost always have a very real upper limit on traffic. Therefore, it’s crucial to avoid wasting time and traffic testing “stupid” ideas (I use quotes because “stupid” ideas may end up paying off big, but it’s usually a surprise if so).  [5]

The “get bigger wins” strategy is often employed due to both technical constraints (limited statistical power to detect smaller wins) and opportunity costs (small wins not worth it from a business perspective).

Thus, I’ll call this the “Growth Home Run Strategy.”

We’re not trying to avoid a strikeout; we’re trying to hit a home run. Startups and growth teams often operate like this because they have limited customer data to do conversion research, patterns and best practices tend to be implemented directly rather than tested, and opportunity costs mean you want to spend your time making bigger changes and seeking bigger results.

This approach is usually decentralized and a bit messier. Ideas can come from anywhere — competitors, psychological studies, research, other teams, strikes of shower inspiration, etc. With greater scale, this strategy usually evolves into the Evolutionary Tinkering Strategy as the company becomes more risk averse as well as capable of experimenting more frequently and broadly.

Conclusion

This was a long article covering all the various approaches I’ve come across from my time working in experimentation. But at the end of the journey, you may be wondering, “Great, but what strategy does Alex believe in?”

It’s a good question.

For one, I believe we should be more pragmatic and less dogmatic. Good strategists know the rules but are also fluid. I’m willing to apply the right strategy for the right situation.

In an ideal world, I’m inclined towards Andrew Anderson’s Discipline-Based Testing Methodology. This would assume I have the traffic and political buy-in to run a program like that.

I’m also partial to strategies that democratize experimentation, especially at large companies with ample testing capacity. I see no value in gatekeeping experimentation to a single team or to a set of approved ideas that “make sense.” You’re leaving a lot of money on the table if you always want to be right.

If I’m working with a new client or an average eCommerce website, I’m almost always going to employ the ResearchXL model. Why? I want to learn about the client’s business and users, and I want to find the best possible areas to test and optimize.

However, I would also never throw away best practices, patterns, or even ideas from competitors. I’ve frustratingly sat through hours of session replays, qualitative polls, and heat maps, only to have “dumb” ideas I stole from other websites win big.

My ethos: experimentation is the lifeblood of a data-driven organization, being wrong should be celebrated, and I don’t care why something won or where the idea came from. I’m a pragmatist and just generally an experimentation enthusiast.

Notes

[1]

How to run an A/B test is a subject for a different article (or several, which I’ve written about in the past for CXL and will link to in this paragraph). I’ve touched on a few variations here, including the question of whether you should run many subsequent tests or one single A/B/n test with as many variants as possible. Other technical test methodologies alter the accepted levels of risk and uncertainty. Such differences include one-tail vs two-tail testing, multivariate vs A/B tests, bandit algorithms or evolutionary algorithms, or flexible stopping rules like sequential testing. Again, I’m speaking to the strategic aspects of experimentation here, less so to the technical differences. Though, they do relate.

[2]

Best practices are either championed or derided, but something being considered a “best practice” is just one more data input you can use to choose whether or not to test something and how to prioritize it. As Justin Rondeau put it, a “best practice” is usually just a “common practice,” and there’s nothing wrong with trying to match customers’ expectations. In the early stages of an optimization program, you can likely build a whole backlog off of best practices, which some call low hanging fruit. However, if something is so obviously broken that fixing it introduces almost zero risk, then many would opt to skip the test and just implement the change. This is especially true of companies with limited traffic, and thus, higher opportunity costs.

[3]

This isn’t precisely true. Andrew’s framework explicitly derides “number of tests” as an important input. He, instead, optimizes for efficiency and wraps up as many variants in a single experiment as is feasible. The reason I wrap these two approaches up is that, ideologically at least, they’re both trying to increase the “spread” of testable options. This is opposed to an approach that seeks to find the “correct” answer before running the test, and then only uses the test to “validate” that assumption.

[4]

Do you care why something won? I’d like to argue that you shouldn’t. In any given experiment, there’s a lot more noise than signal with regard to the underlying reasons for behavior change. A blue button could win against a red one because blue is a calming hue and reduces cortisol. It could also win because the context of the website is professional, and blue is prototypically associated with a professional aesthetic. Or perhaps it’s because blue contrasts better with the background, and thus is more salient. It could be because your audience likes the color blue better. More likely, no one knows or can ever know why blue beat red. Using a narrative to spell out the underlying reason is more likely to lead you astray, not to mention waste precious time storytelling. Tell yourself too many stories, and you’re liable to limit the extent of your creativity and the options you’re willing to test in the future. See: narrative fallacy.

[4.5]

Do we need to have an “evidence-based hypothesis”? I don’t think so. After reading Against Method, I’m quite convinced that the scientific method is much messier than we were all taught. We often stumble into discoveries by accident. Rory Sutherland, for instance, wrote about the discovery of aspirin:

“Scientific progress is not a one-way street. Aspirin, for instance, was known to work as an analgesic for decades before anyone knew how it worked. It was a discovery made by experience and only much later was it explained. If science didn’t allow for such lucky accidents, its record would be much poorer – imagine if we forbade the use of penicillin, because its discovery was not predicted in advance? Yet policy and business decisions are overwhelmingly based on a ‘reason first, discovery later’ methodology, which seems wasteful in the extreme.”

More germane to A/B testing, he summarized this as follows:

“Perhaps a plausible ‘why’ should not be a pre-requisite in deciding a ‘what,’ and the things we try should not be confined to those things whose future success we can most easily explain in retrospect.”

[5]

An Ode to “Dumb Ideas”

“To reach intelligent answers, you often need to ask really dumb questions.” – Rory Sutherland

Everyone should read Alchemy by Rory Sutherland. It will shake up your idea of where good ideas (and good science) come from.

Early in the book, Sutherland tells of a test he ran with four different envelopes used by a charity to solicit donations. They randomized delivery across four sample groups of 100,000 each: one announced that the envelopes had been delivered by volunteers, one encouraged people to complete a form that meant their donation would be boosted by a 25% tax rebate, one used better-quality envelopes, and one used portrait-format envelopes. The only “rational” one of these was the “increase donation by 25%” option, yet that reduced contributions by 30% compared to the plain control. The other three tests increased donations by over 10%.

As Sutherland summarized:

“To a logical person, there would have been no point in testing three of these variables, but they are the three that actually work. This is an important metaphor for the contents of this book: if we allow the world to be run by logical people, we will only discover logical things. But in real life, most things aren’t logical – they are psycho-logical.”