10 Ways to Wreck an Experimentation Program

Last Updated on November 12, 2021 by Alex Birkett

I’ve written a lot about how to create an experimentation program, improve data literacy, and promote a culture of experimentation.

Let’s talk about the opposite: how to sabotage an A/B testing program.

Why Not Focus on How to *Build* an Experimentation Program? Why Focus on the Negative?

I love looking at things through a via negativa lens. Instead of thinking “what can I *add* to make this a success,” you think, “what can I subtract?”

When I play tennis, if I focus too much on fancy moves and backspin, I mess up and I lose. If I focus on not making errors, I usually win.

In diet and fitness, there are a million ways to succeed (supplements, 204 different effective diets and exercise programs, etc.). This leads to analysis paralysis and a lack of action. When starting out, it might be best just to focus on avoiding sugar and avoiding injuries while working out.

I believe that if you simply avoid messing up your A/B testing program, the rest of the details tend to fall into place.

And by the way, I’ve written enough articles about how to build a culture of experimentation. It’s easy enough to tell you to “get executive buy-in” from my writer’s vantage point. My hardest lessons have come through learning from the following mistakes, however.

Love this Nassim Taleb quote on lessons by subtracting errors:

“I have used all my life a wonderfully simple heuristic: charlatans are recognizable in that they will give you positive advice, and only positive advice, exploiting our gullibility and sucker-proneness for recipes that hit you in a flash as just obvious, then evaporate later as you forget them.”

So here are some ways to destroy an experimentation program before it has a chance to take root (avoid them and watch your program flourish).

1. “Quick Wins” Forever

One of the strongest red flags during a sales call for my agency is when someone expects instant results with content marketing.

Instant results are rare, and the only times they happen are through sheer dumb luck or because a company already has a huge audience and foundations for content and SEO set up.

But those companies aren’t the ones who expect quick wins. It’s early-stage startups who *need* ROI tie-back, and they need it fucking fast.

The problems with this are two-fold:

  • First, content as a channel is simply a long game. It’s the only way to really win it.
  • Second, there are tradeoffs to chasing quick wins over sustainable growth.

These problems also correspond to experimentation programs.

Quick wins are great. We all love quick wins, and they certainly help you get more buy-in for your program.

Here’s the problem with them:

  • They don’t last forever
  • They’re an opportunity cost for more complex and potentially higher impact projects
  • They set poor expectations for the value of experimentation

The first problem is easy to explain: unless your website is total shit, your “quick wins” will run out fairly fast.

One can run a conversion research audit and pick away most of the obvious stuff. Then you pick the likely stuff. Then you run out of obvious and likely stuff, and suddenly your manager starts wondering why your win rate went from 90% to 20%.

If we knew what would win, we wouldn’t need experimentation. We’d be clairvoyant (and extravagantly wealthy).

Experimentation, ideally, is an operating system and research methodology through which you make organizational decisions, whether it’s a change to your product or a new landing page on your website. Through iterative improvement and the accumulation of customer insights, you build a flywheel that spins faster and faster.

Certainly, there are “quick wins” from time to time — namely, from fixing broken shit.

Eventually, broken shit gets fixed and you hit a baseline level of optimization. “Quick wins” dry up, but the expectation for them is still alive. When reality and expectation diverge, disappointment ensues.

One must start with a strong understanding of the value of experimentation and an eye for the long term.

If you’re hired in an experimentation role, sure, index on the obvious stuff at first. But make sure your executive team knows and understands that fixing broken buttons and improving page speed have a limited horizon, and eventually, one must wade into the uncertainty to get value from the program (especially at scale).

Similarly, constantly chasing quick wins is an opportunity cost in many cases. By focusing only on well-worn patterns, you trade away bigger experiments, investment in infrastructure, and the customer research that builds up new bases of knowledge.

Experimentation is like portfolio allocation, and some of it should be aimed at “quick wins,” but some of it should be bigger projects.

2. Rely on Industry Conversion Rate Benchmarks

Many misunderstand metrics when it comes to conversion rate optimization.

A conversion rate is a proportion. It depends not only on the number of people who convert, but also on the number (and the composition) of the people who come to the website in the first place.

And that composition of people is fucking contextual.

Knowing that an average landing page conversion rate is 10% does nothing for your program. Even knowing that your closest competitor converts at 5% is completely meaningless information (and I mean *completely* — there is zero value in this).

Here you are, getting win after win and increasing your conversion rate month over month (I’ll talk later about the problems with that KPI), and then…bam! You’re hit with this industry report and you realize that, while your conversion rate has improved from 2% to 4%, the industry average is 8%. What now?

I’ll just copy and paste something from CXL’s blog here:

“The average conversion rate of a site selling $10,000 diamond rings vs an ecommerce site selling $2 trinkets is going to be vastly different. Context matters.

Even if you compare conversion rates of sites in the same industry, it’s still not apples to apples. Different sites have different traffic sources (and the quality of traffic makes all the difference), traffic volumes, different brand perception and different relationships with their audiences.”

Here’s Peep’s solution, which I agree with:

“The only true answer to “what’s a good conversion rate” is this: a good conversion rate is better than what you had last month.

You are running your own race, and you are your own benchmark. Conversion rate of other websites should have no impact on what you do since it’s not something that you control. But you do control your own conversion rate. Work hard to improve it across segments to be able to acquire customers cheaper and all that.

And stop worrying about ‘what’s a good conversion rate’. Work to improve whatever you have. Every month.”

Caring about your industry conversion rate is also an opportunity cost and a diversion.

Focus on learning more about your customers, running more and better experiments, improving your own metrics, and innovating on your own business.

3. Staff Your Crucial Roles with Mercenaries and Interns

This is so common that it has become a trope in the growth space.

Executive goes to a conference, hears about the importance of growth, hires a growth person, and expects the world with no resource allocation.

I’m all for scrappiness, but one must calibrate expectations or face sure destruction of the program in the long run (and talent burnout, too).

This expectation flows to any experimentation-centric role.

That’s why it’s important not to hire an experimentation person too early. There’s a lot you need in place to get value from experimentation:

  • Traffic adequate for running experiments and getting ROI from the program
  • Data infrastructure adequate for tracking and quantifying effects
  • A tech stack capable of running experiments and integrating with your other tools
  • Design and development to actually run worthwhile tests

If you have none of that and expect your one sole experimentation person to fill in the gaps in all of those areas, you’re going to be disappointed in the results, and that person is going to leave, whether through burnout or because they found a more serious organization.

Just look at how much a conversion rate optimization process entails:

Data is the heartbeat of experimentation. Get a professional to orchestrate your analytics.

Real designers and real developers open up new worlds when it comes to testable solutions. Don’t make your experimentation person hack together shitty JavaScript or just run copy tests all day.

Get serious about investing in the team; otherwise it will never hit escape velocity.

Or just augment your team with agencies. There are tons of good ones.

4. Document Nothing

I’ve had multiple experiences where I came into a company to run experiments, built out some hypotheses and a roadmap through customer research and heuristic analysis, and presented the plan.

“Oh, we’ve already run [X Test] before. It didn’t work.”

Alright, when was it run? What were the results? Do you have the statistics / creative / experiment doc?

“A few years ago. Nope, we don’t have any of that.”

Well, shit. What now?

Personally, I hate doing documentation. It feels like busywork, but it’s not.

Writing an experiment document out in advance helps you plan a test, from the statistical design to the creative to the limitations. Writing and storing the results helps you cement and communicate learnings at the time, and storing them in an archive or knowledge base helps everyone else (including you) remember what you tested and learned from the test.
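
If it helps, here’s the bare minimum I’d capture per test, as a quick sketch in Python. The field names and the example test are made up, not a standard schema; store it in whatever archive you actually use.

```python
# A bare-bones experiment record. Field names and values are illustrative,
# not a standard; adapt them to your own archive (Airtable, Notion,
# Effective Experiments, whatever you have).

experiment_record = {
    "name": "Pricing page: annual-first toggle",   # hypothetical test
    "hypothesis": "Defaulting to annual pricing lifts plan selections",
    "primary_metric": "plan_selection_rate",
    "guardrail_metrics": ["refund_rate", "support_tickets"],
    "start_date": "2021-06-01",
    "end_date": "2021-06-28",
    "sample_size_per_variant": 18_000,             # made-up number
    "result": "no significant difference (p = 0.31)",
    "decision": "keep control; revisit with a segmented analysis",
    "links": {"creative": "<link>", "analysis": "<link>"},
}
```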

And if you’re not learning from your tests — wins, losses, inconclusives — you’re not really testing.

And if you have any hopes of scaling your experimentation program you’ll eventually hire new people. Help them out. They weren’t here 4 years ago and have no idea what was tested. Give them an Airtable table or something (better yet, use a tool like Effective Experiments).

Similarly, if marketers are testing something, product managers could learn something from that test. If you’re not documenting it, you’re probably doing redundant work.

The tools are free, the templates are available. If you don’t use them, you’re just lazy.

5. Goal (Only) On CVR Increases and Winning Tests

I’ll admit, experimentation program goals and KPIs are hard to determine.

Goal setting in general is hard. When setting goals, I look for KPIs that are:

  • Useful versus overly precise and complex.
  • Not easily gamed.
  • Not burdened with strategic tradeoffs.

Imagine a sales team goaled only on meetings booked.

Well, in that case, the metric is useful and clearly discernible, but it has clear strategic tradeoffs. One can easily book a ton of worthless meetings that eat up sales reps’ time but produce no actual sales or ROI.

In experimentation, two of the most common metrics used to judge a program are:

  • Increases to baseline conversion rate
  • Number of winning tests

For the first metric, there are many, many problems.

Imagine your website conversion rate is 5%. We’ll ignore the fact that conversion rate data is non-stationary and might fluctuate by a percentage point depending on the month or season.

Now imagine your company raises $100 million. You’ll probably get a lot of media attention from Hacker News, Tech Crunch, Wall Street Journal, whatever. This will result in traffic, let’s say an extra 50,000 visitors.

These 50,000 visitors convert at 1/10 the rate of your normal traffic. That lowers your blended conversion rate even if you win every test that quarter (which is also unlikely, but we’ll get to that).
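
Here’s the arithmetic as a quick sketch. The 5% baseline and the 50,000 press visitors at 1/10 the usual rate come from the scenario above; the 100,000 regular visitors are a number I’m assuming purely for illustration.

```python
# How a traffic spike drags down a blended conversion rate even though
# nothing about the site got worse (and total conversions actually went up).

regular_visitors = 100_000        # assumed for illustration
regular_cvr = 0.05                # 5% baseline from the scenario above

press_visitors = 50_000           # extra traffic from the funding news
press_cvr = regular_cvr / 10      # converts at 1/10 the usual rate

conversions = regular_visitors * regular_cvr + press_visitors * press_cvr
visitors = regular_visitors + press_visitors

print(f"Conversions: {conversions:,.0f}")        # 5,250 vs. 5,000 before
print(f"Blended conversion rate: {conversions / visitors:.2%}")  # 3.50%, down from 5.00%
```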

Because your conversion rate decreased, is your program failing?

Fuck no. You’re doing well. You’re winning tests, incrementally moving the needle. All that press actually brought marginally more leads, but at a cost to the proportion metric. It’s an ‘external validity factor,’ a confounding variable. It shouldn’t make executives disappointed in the wins you’ve gotten.

Similarly, you could probably increase your conversion rate by turning off campaigns and traffic sources that convert below average, at the cost of the leads those campaigns bring in. Is this a good thing? Nope. It’s costing your business leads and revenue.

Now, winning tests. This is a better metric, because it carries more signal and less noise than raw conversion rate increases.

However, winning tests have an incentive problem. Namely, if you’re only incentivized to produce winning tests, and losers are punished, two emergent behaviors are likely:

  • You’ll test “safer” items — “low hanging fruit” — at the cost of more innovative and risky experiments
  • In some cases, teams will cherry pick data and run tests in a way that makes them appear as “winners”

The latter can be mitigated through good processes and guardrails (e.g., setting uncertainty thresholds, using experiment QA checklists, having the analysis done independently by someone with different goals, etc.).

But the first is a real concern, especially after you’ve passed the point of adequate optimization. How do you move the needle when you’ve already fixed all the broken shit? Well, you have to try some riskier shit. Which means some tests will lose big. And you have to be okay with that.

After all, that’s a core value of experimentation: you cap the downside, which opens up uncapped upside. A losing variant only loses for 2-4 weeks, but the learning that comes out of it could be game changing.

As for program KPIs that work, it really depends on the program. Ben Labay, Managing Director at Speero, told me it’s really a matter of improving test velocity and test quality.

I also like program metrics, like number of tests run, win rate, and win per test.
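
As a rough illustration, those program metrics are trivial to compute once you have a documented test log (see point 4 above). The records and numbers below are made up.

```python
# Hypothetical program-level metrics computed from a simple test log.

tests = [
    {"name": "hero copy",        "winner": True,  "lift": 0.04},
    {"name": "checkout steps",   "winner": False, "lift": -0.01},
    {"name": "pricing toggle",   "winner": True,  "lift": 0.07},
    {"name": "social proof bar", "winner": False, "lift": 0.00},
]

tests_run = len(tests)
wins = sum(t["winner"] for t in tests)
win_rate = wins / tests_run
avg_lift_per_win = sum(t["lift"] for t in tests if t["winner"]) / wins

print(f"Tests run: {tests_run}")                                  # 4
print(f"Win rate: {win_rate:.0%}")                                # 50%
print(f"Average lift per winning test: {avg_lift_per_win:.1%}")   # 5.5%
```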

But it depends on the scale your program is already operating at and what your experimentation strategy is.

6. Ignore Experiment Design and Statistics

Here’s my thinking:

A bad test is worse than no test.

If you don’t run a test, you’re inherently saying that you’re okay using your gut and taking that risk. You can approach it with some humility: there’s no data, so let’s just go with our gut.

But when you run a bad test, you have none of the certainty involved with proper statistics, but you have all of the confidence that you’ve made a “data-driven decision.”

If you’re just starting out in experimentation, you’ll likely need to run a few bad tests. Learn as you go. Makes sense, especially if you don’t have data science resources.

But if you’re serious about your program and want to be an experimentation-centric company like Booking.com or Netflix, it might pay to invest in some data literacy and basic experiment design.

It’s a messy process, so you’ll never fully iron out statistical noise and mistakes. But you can do a lot to eliminate the most common ones.
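
One of the most common mistakes is calling tests that never had enough traffic to detect a realistic effect. Here’s a minimal pre-test sample size sketch using statsmodels; the 5% baseline and the 10% relative minimum detectable effect are assumptions you should swap for your own numbers.

```python
# Rough pre-test sample size estimate for a two-variant A/B test.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_cvr = 0.05          # assumed baseline conversion rate
relative_mde = 0.10          # smallest relative lift worth detecting (10%)
target_cvr = baseline_cvr * (1 + relative_mde)

effect_size = abs(proportion_effectsize(baseline_cvr, target_cvr))
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # 5% false positive rate
    power=0.80,              # 80% chance of detecting the effect if it exists
    alternative="two-sided",
)

print(f"Visitors needed per variant: {n_per_variant:,.0f}")  # roughly 15-16k
```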

7. Don’t Question or Invest in Your Data Infrastructure

I’ve done a lot of consulting and I’ve worked at several companies.

I’ve never seen a perfect Google Analytics setup.

In many of my roles, I’ve spent up to 50% of my time working on data infrastructure. If you can’t measure it, you can’t experiment on it. And if your data is not trustworthy, your experiments won’t be trustworthy.

And if you can’t trust your experiments, you probably shouldn’t run them. It costs time and money to run experiments, and if you can’t trust the results, what are you really gaining from running them?

Number one, expect to spend a lot of time and money investing in your data infrastructure. If you’re not prepared to do this, you’re not ‘data-driven,’ you’re just talking the talk (AKA lying, even if just to yourself).

And even after investing in data infrastructure and talent, question the numbers you see. If something looks wrong, chances are, it is. As my friend Mercer always says, “trust, but verify.”
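
What does “trust, but verify” look like in practice? At a minimum, reconcile what your analytics tool reports against your own backend. A toy sketch with made-up numbers:

```python
# Compare conversions reported by analytics against the backend database,
# day by day, and flag anything that drifts past a tolerance. The counts
# here are fabricated; in reality you'd pull them from your own exports.

analytics_conversions = {"2021-11-01": 240, "2021-11-02": 255, "2021-11-03": 310}
database_conversions  = {"2021-11-01": 238, "2021-11-02": 251, "2021-11-03": 262}

TOLERANCE = 0.05  # flag anything more than 5% off

for day, db_count in database_conversions.items():
    tracked = analytics_conversions.get(day, 0)
    drift = abs(tracked - db_count) / db_count
    if drift > TOLERANCE:
        print(f"{day}: analytics says {tracked}, database says {db_count} "
              f"({drift:.0%} off) -- investigate before trusting test results")
```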

8. Change Strategy Frequently

Sometimes companies say they value experimentation, but what they actually mean is they flip flop on strategy constantly and have no vision.

They misappropriate the word “experiment” as many people do. Sometimes people mean “try something new” when they say experiment. Sometimes people mean they don’t have a strategy, so they pivot fast.

Regardless, experimentation is best used as scaffolding to support a cogent business strategy.

Experiments can and should help bolster, alter, or change strategies when necessary.

But if the strategy itself is constantly changing, your experimentation team will never have the runway it needs to invest in longer-term projects, to accumulate incremental wins, or to learn enough about particular UI features or customer segments to exploit that knowledge.

Experimentation *does* decouple strategy from top-down planning, in a way; or rather, it sets limits and checks on managerial decisions led by gut feel. It kills the HiPPO (the highest paid person’s opinion).

But it doesn’t replace the need for strategy and vision, which effectively communicates the long term game plan as well as what you’re not going to do.

Strategy should guide the direction of experimentation and vice versa. But changing strategy frequently in the absence of very good justification just results in whiplash and disappointment.

9. Cherry Pick Results

If you do it right, an experiment can be interesting beyond the aggregate delta in conversion rate or your metric of choice.

Even if you’re trying to improve, say, average conversion rate on an ecommerce site, you’ll still learn other stuff.

You’ll learn if a given segment responds more or less favorably to the treatment. You’ll learn if there are any tradeoffs with an improved conversion rate (does it, for example, reduce average order value?). And you’ll learn what effects a treatment has on varying user behavior signals like repeat visits, engagement, and pageviews.

You cannot, however, pick and choose which of these metrics determines if an experiment was a winner *after the fact.* That’s called “HARKing” (hypothesizing after results are known). It’s the “Texas sharpshooter fallacy,” painting a bullseye only after the shots have landed in the side of the barn.

It’s easy to make any experiment or campaign look like a winner if you search far and long enough for a metric that looks favorable.
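
To put a number on it: if you go metric shopping across several independent metrics, each judged at a 5% significance threshold, the odds of at least one false “winner” pile up quickly. A back-of-the-envelope sketch (assuming independent metrics, which is generous):

```python
# Probability of at least one false positive when checking N independent
# metrics at alpha = 0.05, with no real effect anywhere.

alpha = 0.05
for n_metrics in (1, 3, 5, 10):
    p_false_win = 1 - (1 - alpha) ** n_metrics
    print(f"{n_metrics:>2} metrics checked -> "
          f"{p_false_win:.0%} chance of at least one false 'win'")
```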

Do this consistently, and you’ll reduce the value of experimentation to rubble. It will then be used as validation for what you already wanted to do, which increases the cost of experimentation, lowers the value, and in turn, destroys the expected value of the program.

I’ve given this example a few times before, and it’s an older one, so forgive me. Buffer wrote about A/B testing and gave this example:

Version A:

Version B:

Now, without the “top tweet” pin, I couldn’t tell you which of those was the winner. There are five metrics, and some favor variant A while others favor variant B (the sample size also looks somewhat small, but that’s beside the point).

Anyway, they somehow chose Version B, I suppose because of higher retweets and mentions. But what if Clicks was the metric that mattered?

People cherry pick metrics for two reasons:

  • Ignorance
  • Incentives

The first is simple enough to combat: invest in data literacy and processes that mitigate these mistakes.

The second is a matter of goals and incentives. If you have a culture of “success theater” that promotes only winning campaigns, surely you’ll start only hearing about winning campaigns. People aren’t incentivized to fail or to show their failures; they’re incentivized to become spin doctors, messengers with only good news.

This is bad news. Create a culture where failure is expected, but you learn from it and improve based on what you learn. That’s a huge value of an experimentation program.

Who hasn’t heard someone justify a campaign that failed to bring revenue by saying something like “yeah, but it was still worth it because it raised our brand awareness / engagement / etc.”?

If brand awareness was the goal up front, fine. But choosing that as a justification after the fact just obfuscates the experiment design and therefore learning.

10. Keep a Tight Grip on Experimentation at Your Company

Unfortunately, the worst thing you can do for an experimentation program is dictate it entirely from the top down.

You probably have good intentions, and in the beginning days, *someone* needs to dictate what is tested.

But people are inspired by autonomy and mastery, and when you take those two things away from them, you demotivate the team and people start going through the motions. Not only that, but you bottleneck the flow of ideas and insights, resulting in a narrower range of experiments.

It may be tempting as the CEO, CMO, VP, whatever, to tell your team what they should and shouldn’t test. But resist. Give some ownership and trust.

Beyond that, experimentation ideally expands beyond your own team.

Here’s the path I often see experimentation programs follow: Decentralized > Centralized > Center of Excellence

In the decentralized model, individual teams and people start running experiments with little or no oversight, guardrails, or strategy. Someone takes a Reforge or CXL class, wants to run tests, and starts doing it.

Executives then realize the value of experimentation, so they hire or spin up a specialized team. Sometimes this is called a CRO team, sometimes a growth team, sometimes an experimentation team. They’re focused on experimentation and optimization and own all efforts. They’re solely accountable and responsible for experiments; no one else can run them.

At this stage, you eliminate many errors with experiments, but you cap the value. To become an experimentation driven organization, you need to democratize, support, and enable other teams to run experiments.

This leads to the center of excellence model, where you have a centralized specialist team of experimenters and data scientists who own and manage the tools, processes, and culture around experimentation, and they enable and educate others. They become cheerleaders for experimentation in a way. Their focus moves away from individual experiments to helping others get up and running autonomously.

When I say that CRO is an operating system, this is what I mean. To unlock the value of experiments, it can’t be bottlenecked inside one person or team’s brain. It has to be a methodology by which any team can make better decisions using data and controlled trials.

This is how the best programs in the world (Microsoft, Booking.com, Netflix, Shopify, etc.) operate.

Conclusion

Unfortunately, there are more ways to ruin an experimentation program than there are to build and maintain one.

To build and maintain one, you need just a couple of things:

  • Highly motivated individuals
  • Executive buy-in and understanding
  • Sufficient traffic
  • Sufficient budget

Have those, and the rest will fall into place.

However, an experimentation program can be derailed by seemingly subtle things, like cherry picking results, rewarding only winning tests, choosing the wrong metrics, or underinvesting in resources and infrastructure.

You may think “I’ll hire a specialist and give them Google Optimize, and that’s enough,” but it’s not. Experimentation is inherently difficult and cross-functional. It’s a garden that requires nourishing, but if you water it and care for it, it’ll be a perennially productive asset for you.