Last Updated on September 1, 2021 by Alex Birkett
I believe in the power of experimentation.
But most companies have stumbled tremendously in building powerful experimentation programs.
See, the value of experimentation doesn’t rest upon the single hyperbolic A/B test win.
The value is in building an experimentation program and culture that scales: one that caps risk, enables innovation and creative adaptation, and generates insights and learnings that feed a data-driven company culture.
Oversold and misunderstood, conversion rate optimization specialists and experimentation experts are often expected to come into a company and magically boost performance through sheer will and experience.
I’ve worked on experimentation at several companies now, including some where I started or formed the foundations of their experimentation programs.
This article will cover the 5 critical pieces you need in place to build out a program. While this won’t be as useful to companies that already have successful programs, it should be useful for those who are struggling to get one started.
Importantly, I’m also going to outline what elements are overrated or unnecessary when building out an experimentation program.
Preamble: on Expected Value and Marginal Utility
You need a certain amount of traffic and scale to warrant experimentation.
There are only so many actions you can take within a finite time horizon, so any action incurs an opportunity cost by displacing something else you could otherwise have done.
If you don’t have enough traffic, the expected value of your experimentation program will almost certainly be negative. Think about it at a high level: experimentation costs money. You have to hire program managers, designers, and developers (or partition some of their time to experimentation). You have to invest in tooling to accomplish this stuff. And the experiments you run incur an opportunity cost as well.
Let’s assume you can win 40% of your experiments – an incredible win rate. If the value of those wins doesn’t exceed the costs of the program, the expected value is still negative.
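To make that concrete, here’s a back-of-the-envelope sketch in Python – the numbers are entirely made up, so plug in your own:

```python
# Entirely hypothetical numbers -- swap in your own
tests_per_year = 40
win_rate = 0.40            # a very generous assumption
value_per_win = 15_000     # average incremental value of a winning test
program_cost = 300_000     # people, tooling, and opportunity cost over the same year

expected_value = tests_per_year * win_rate * value_per_win - program_cost
print(f"Expected value: {expected_value:,.0f}")
# 40 * 0.4 * 15,000 - 300,000 = -60,000: a negative-EV program despite a 40% win rate
```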
Additionally, low traffic experiences are incredibly hard to work on. Functionally, it means you need to either accept higher levels of uncertainty in your results or work much more slowly to build out “no-brainer tests” (in which case, you might as well just launch them and not run the experiment).
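If you want to sanity-check where you stand, a quick power calculation shows how traffic translates into test duration. Here’s a minimal sketch with hypothetical numbers, using statsmodels:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical inputs -- swap in your own baseline conversion rate and traffic
baseline = 0.03      # 3% conversion rate
lift = 0.10          # the 10% relative lift you hope to detect
effect = proportion_effectsize(baseline * (1 + lift), baseline)

# Visitors needed per variant for 80% power at a 5% significance level
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)

weekly_visitors_per_variant = 2_500
print(f"~{n_per_variant:,.0f} visitors needed per variant")
print(f"~{n_per_variant / weekly_visitors_per_variant:.1f} weeks to reach sample size")
```

With those assumptions, detecting a 10% relative lift takes on the order of ten weeks – and that’s a single test.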
Many companies have been sold on the value of experimentation, which is great. But not every company needs to hire and build out a whole program – yet.
The one caveat: if you’ve got sufficient runway and traction and you want to start building the experimentation “muscle” and culture, your program can “operate at a loss” as long as you understand that’s what you’re doing and you’re building for the long term.
The 2 Bottlenecks to Building Experimentation Programs
Broadly, the two things you need to figure out are:
- Technical challenges
- Cultural challenges
The first category is the functional ability to run and analyze experiments. The amount of traffic you have, as well as your tools’ ability to properly randomize units, sets a threshold on how many tests you can run and analyze. I’d also put human resources into this category, because you need someone to ideate, run, and analyze your experiments.
Cultural challenges are somewhat more nebulous. They involve education and enablement and visibility, buy-in and evangelizing from leadership, and also, human resources (yes, humans bridge both challenges).
You can’t build an experimentation program without getting both of these in order. Think about these themes as you read about the 5 pillars of experimentation programs.
The 5 Pillars of an Experimentation Program
- Trustworthy Data
- Human Resources
- Leadership and Strategic Alignment
- Experimentation Technology
- Education and Cultural Buy-In
1. Trustworthy Data
The heartbeat of experimentation is data. Bad or no data means no experimentation.
The process of experimentation isn’t about knowing what works on a user interface and applying it. It’s not about sprinkling on some social proof here, some authority and trust symbols there, and slapping on an urgency-provoking headline (though those could all be great tactics).
No, experimentation is a process for reducing uncertainty in decision making, thus capping your downside risk and enabling creative innovation.
It’s a methodology that uses data as feedback to quickly determine the efficacy of a treatment and make a decision based on that feedback.
So, logically, if your data / feedback is flawed, your decisions will be, too. And if the decisions resulting from your experiments are flawed, you’re better off not running them (remember expected value: there’s always a cost to experimentation, and the resulting reward needs to outweigh it).
Unfortunately, bad or missing data is the most common problem I’ve seen when working with companies, either full time in experimentation roles, consulting for CRO, or through running my content agency. Everyone’s got a messed up Google Analytics setup!
That should give you pause, but also give you solace. You’re not alone. You just need to spend the time and resources to clean up your analytics – aka, invest in infrastructure – which many companies are unwilling to do.
This is one of those “slow down to speed up” steps.
Hire an analyst or at least a consultant. Have them determine the following:
- Are you tracking everything you need to be?
- Is the tracking precise or is it flawed?
- Is the data accessible to the right people at the right times?
- Are you integrating your data to fulfill a more holistic picture of user behavior?
By the way, these data audits aren’t a one-and-done type of thing. I like to revisit at least once a year, but preferably quarterly. And the best thing you can do is hire a directly responsible individual (DRI) to own your website or product analytics setup.
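To make one slice of that audit concrete, here’s a minimal sketch that compares a (hypothetical) tracking plan against the events that actually show up in an analytics export – the event names and file path are made up, so adapt them to your own setup:

```python
import csv

# Events your tracking plan says should exist (hypothetical)
expected_events = {"signup_started", "signup_completed", "trial_activated", "purchase"}

# Events actually observed in an export from your analytics tool (hypothetical file)
with open("analytics_events_export.csv", newline="") as f:
    observed_events = {row["event_name"] for row in csv.DictReader(f)}

missing = expected_events - observed_events
unexpected = observed_events - expected_events

print("Missing from tracking:", sorted(missing) or "none")
print("Untracked / undocumented events:", sorted(unexpected) or "none")
```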
2. Human Resources
Many companies think first about which CRO tools they need, but before the tools, you need to figure out the people.
Get the right people, and the technology (while still important) doesn’t matter as much.
Who do you need?
It will massively depend on your context – industry, company, stage, resources, etc.
For early-stage tech companies, most experimentation efforts can live within the broader growth program. As such, you can get a well-rounded, T-shaped marketer – someone who has taken Reforge or CXL courses – who can get you from zero to one.
But for anyone hoping to build a robust and long term program, I don’t think it’s easy to do without at least one data scientist or analytical support.
Make no mistake: experimentation is hard.
There’s the strategic and program side of things, which should be filled by a PM or experimentation leader. There’s the technical side, which should be filled by dedicated growth engineers and designers who work with the PM or experimentation leader. And there’s the analysis side of things, and this is where so many people underinvest.
It’s also why most A/B test results are total bullshit – smoke and mirrors. It’s not due to malevolence; it’s because statistics and experimentation analysis are hard skills that take a long time to learn and apply.
I’ve been down this rabbit hole for years. I’ve written articles about A/B testing statistics and analyzed hundreds of tests. But I still prefer to have an actual analyst or data scientist to guide my decisions, because there are a million things that I don’t know and couldn’t know unless I dedicated my entire career to this knowledge set.
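For what it’s worth, the mechanical part of an A/B test analysis is the easy bit. Here’s a minimal sketch of a two-proportion z-test on made-up numbers, using statsmodels; the hard part is everything around it – metric choice, stopping rules, segment effects, interpretation:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions and visitors for control vs. variant
conversions = [310, 355]
visitors = [10_000, 10_000]

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
# Running the test takes three lines; deciding what it means is where analysts earn their keep
```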
You’ve also got to decide how you want experimentation to sit in your organization. There are three common models:
- Centralized
- Decentralized
- Center of Excellence
Most people I’ve talked to seem to converge on the belief that the Center of Excellence model is the ideal end point, but it’s likely you’ll have to start out either centralized or decentralized.
I recommend reading Merritt Aho’s great article on how to structure an experimentation team. Optimizely also has a great article on this.
Whatever structure you choose, you’ll need to have some of the following people dedicated to experimentation (some can be outsourced or freelance if you don’t have in-house hires):
- Program leader / PM
- Analysts / data scientists
- Growth designers
- Growth engineers
- Growth marketing partners
3. Leadership and Strategic Alignment
If leadership isn’t strongly involved and bought-in on experimentation, the program is doomed to fail.
This includes a few components:
- Leaders must understand the value of experimentation and how it works
- Leaders must align on KPIs and motivators for the experimentation program
- Leaders must create conditions of psychological safety to allow for failure and tinkering
Two of these are summed up in Ben Labay’s great graphic on experimentation culture (trust the team and trust the goals).
Let’s look at the hypothetical opposite of the above components (which is quite common).
You get recruited to work for a company. Your title is growth manager, or experimentation manager, or CRO specialist. They recruited you because they read an HBR article on how booking.com and Microsoft are running experiments, or because they saw case studies on landing page optimization producing 300% conversion lifts.
So you come in and you’re dropped into a situation with no context, no KPIs, unclean data, and the expectation that you’ll get wins immediately. Losing tests are looked at as failures and they’re to be avoided. There’s no time to do customer research or audit your data; no, just go look at landing pages and tell us what to do differently.
Again, I want to reiterate: a CRO or experimentation expert isn’t a magician or landing page ninja. What we know is far outweighed by what we don’t know, and experimentation’s true value is unlocking unknown wins and rewards through risk-capped tinkering.
Let’s look at a better hypothetical situation:
You come in with the same title, but your VP of growth understands experimentation is a long game.
You’ve got resources to build out a team or hire agency or freelance resources. You’re given a window to explore the business context, set appropriate KPIs with the leadership team, and map out technological and process gaps in the system (for instance, you don’t have proper data logging and orchestration, so you prioritize that before headline tests).
You work cross-functionally with teams like product marketing, sales, customer success, brand, and product to understand the customer. You build a customer research protocol that includes UX panels, message testing, exploratory data analysis, and surveys.
This gets funneled into your new prioritization model, which is adapted to your organization’s specific capacity and needs. You weight dimensions like ease and impact, and slot ideas into your process based on engineering and design resources and testing capacity (calculated from traffic, conversions, and resource constraints).
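As an illustration, here’s a minimal sketch of a weighted prioritization score – the dimensions, weights, and ideas are hypothetical, so tune them to your own capacity:

```python
# Hypothetical weights and ideas; scores are 1-10 judgments from the team
weights = {"impact": 0.5, "confidence": 0.2, "ease": 0.3}

ideas = [
    {"name": "Pricing page value-prop test", "impact": 8, "confidence": 6, "ease": 5},
    {"name": "Signup flow step removal",     "impact": 9, "confidence": 5, "ease": 3},
    {"name": "Homepage headline test",       "impact": 4, "confidence": 7, "ease": 9},
]

# Weighted score per idea, then sort the backlog from highest to lowest
for idea in ideas:
    idea["score"] = sum(weights[d] * idea[d] for d in weights)

for idea in sorted(ideas, key=lambda i: i["score"], reverse=True):
    print(f"{idea['score']:.1f}  {idea['name']}")
```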
Through this, you come up with clear quantitative KPIs and projects that will generate the most impact, and you also build out input (program) KPIs, such as conclusive test rates, time to production, and win rates.
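Those program KPIs don’t need fancy tooling; they can be computed from a simple test log. A minimal sketch with made-up data, using pandas:

```python
import pandas as pd

# Hypothetical test log -- in practice this lives in your knowledge-sharing tool
log = pd.DataFrame({
    "test": ["T1", "T2", "T3", "T4"],
    "result": ["win", "loss", "inconclusive", "win"],
    "days_idea_to_launch": [21, 14, 35, 18],
})

win_rate = (log["result"] == "win").mean()
conclusive_rate = (log["result"] != "inconclusive").mean()

print(f"Win rate: {win_rate:.0%}, conclusive rate: {conclusive_rate:.0%}, "
      f"median time to production: {log['days_idea_to_launch'].median():.0f} days")
```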
Now, as you scale, you’ve got the statistical context to understand your test results, the business and customer context to run appropriate tests, and the resources to properly plan your testing program.
Andrew Anderson, a mentor of mine and head of optimization at ZenBusiness, explained the power of experimentation as the ability to generate impact at high leverage points, which requires leadership alignment and the ability to uncover unknown problems and solutions:
“What is important is that you have the ability to change the user experience at your highest scale points (Landing pages, home page, product page) and that you can track behavior to the single end goal that matters (in almost all cases it is RPV or leads). As long as you can do those parts and you can think in terms of what is possible and not in terms of just what you think will work, you will achieve great results.
Optimization can help take you in directions you never knew were important. It can make channels valuable that never were before, it can change how and where you interact with users, it can change what products matter and what doesn’t.
The only key is to let your users and the data tell you where to go and to not get too caught up on specific tactics or visions.”
Experimentation can’t be isolated from the broader business context; there’s no universal formula for CTA buttons, headlines, or landing page design. Leadership has to have the buy-in and understanding of the process.
If you don’t have this yet, I recommend having these difficult conversations with your leaders and managers:
- What are our resource constraints?
- What are our strategic goals?
- What do we know about the customer and what do we want to know?
- What will our experimentation program look like in 6, 12, and 18 months?
Here are a few A/B testing books that are written for business folks (less technical) that can help build context and understanding for the program:
- Experimentation Matters: Unlocking the Potential of New Technologies for Innovation
- The Power of Experiments: Decision Making in a Data-Driven World
- The Innovator’s Hypothesis: How Cheap Experiments Are Worth More than Good Ideas
These two articles are also great:
- The Surprising Power of Online Experiments (HBR)
- Get more wins: Experimentation metrics for program success (Optimizely)
Or just send them to a conference like CXL Live.
4. Experimentation Technology
Which specific A/B testing tool you choose isn’t important, but you’ve gotta have the tech stack to enable your efforts.
The experimentation tech stack, holistically, is quite important (and increasingly complex if you root it all the way down to your data collection). Nowadays, the tools that you’ll be using might entail:
- Data collection tools (Google Analytics, Snowplow, Heap, Wynter, HotJar, etc.)
- Data integration and database tools (Segment, Workato, Snowflake, etc.)
- Data accessibility tools (Data Studio, Tableau)
- Data cataloging tools (data.world, Confluence)
- Statistical analysis tools (R, Python)
- Experimentation platforms (Conductrics, Convert, Optimizely, VWO)
- Knowledge sharing tools (Effective Experiments, Notion)
- Project management tools (Airtable, Clickup, Notion)
I can’t pretend to be able to tell you which tools to choose – that’s why it’s so important to think through the people you need first. Hire the right program manager and they’ll be able to evaluate what you need for your specific situation.
Additionally, your experimentation technology should scale and mature with your program. It’s likely you’ll start out with minimal tooling, often using free or cheap tools (Google Optimize, Google Analytics, and HotJar are the common ones).
As you build trust and generate ROI, the experimentation program flywheel takes hold. You find bottlenecks in your stack, invest in a more robust system, and that helps you generate further ROI.
5. Education and Cultural Buy-In
Finally, if you really want your experimentation program to hum and not sputter, the company needs to be excited about it.
Not just you, lone wolf experimentation expert — you need to get sales, support, product marketing, campaigns, etc. stoked on experimentation.
Why?
Experimentation in isolation has a hard ceiling. There are only so many ways you can reorganize a landing page before you hit a local maximum.
To truly unlock the power of experimentation, you need to break down the barriers and experiment everywhere. You can’t do it alone. This is where the shift happens towards a center of excellence.
But with this shift comes the need for education and enablement; otherwise your company’s experimentation sophistication will be heterogeneous. Some teams will crush it, and some teams will run weakly powered button color tests.
So I look at this pillar in three sections:
- Cheerleading and evangelizing
- Enablement and support
- Education and improvement
Cheerleading and evangelizing
If a tree (experiment) falls in the forest (is run), but no one is around to hear it (no one knows the result), does it make a sound (does it matter)?
Problem: most people have no idea what the fuck experimentation means.
Solution: it’s your job to fix that.
Regular cross-functional meetings, weekly or monthly experimentation review newsletters, office hours and learning sessions, and really, really good reporting and data visualizations help.
So does just dropping the term “hypothesis” and “experiment” in your casual meetings.
You’ve gotta be a cheerleader for A/B testing, otherwise y’all will slink back into HiPPO-driven decisions and gut feel.
Enablement and support
Both when you’re running your centralized program and when you start to scale to supporting other teams, it’s important to build a standardized system for experimentation with guardrails and known processes.
Craig Sullivan added this comment on Ben’s LinkedIn thread:
“Democratisation (various things tied up in this but essentially freedom, autonomy, standardisation, shared methods, systems, data).”
Completely agree.
It’s one thing to house all this interesting experimentation in your own brain, but there’s a limit to how much you can get done. Scale yourself, and build repeatable processes that others can follow. Shift your thinking from being a player to being a coach.
Education and Improvement
Your team will grow, the company will hire new people, and new technologies and limitations will continually pop up.
The most adaptable teams with growth-mindsets as well as the tangible programs to fuel education win in the long term.
Personally, I think you can outsource a lot of this at this point. Sure, Airbnb has a Data University. But they’ve also got a ton of resources to spend on it.
You can probably just get a team account to CXL Institute.
You can also give an unlimited budget for books.
And you can set aside office hours and internal learning sessions.
But keep teaching and keep learning. That’s how you stay on top and become a top 1% experimentation program.
Ignore These Three Things (For Now…)
Everything comes down to expected value and marginal utility. Don’t run before you can walk. Eventually, the cool shit becomes valuable (or at least a cost you can afford). But first, get your room in order:
- Clean up your data.
- Hire the right people.
- Get your leadership bought-in and involved.
- Get your core tech stack in place.
- Educate and support those running experiments.
Ignore the following until you’ve got that done:
Advanced Exploratory Data Analysis & Statistical Techniques
Matt Gershoff wrote an excellent article on what makes a useful A/B testing program.
He talks about what not to worry about, using the metaphor of mice and tigers:
“As for the mice, they are legion. They have nests in all the corners of any business, whenever spotted causing people to rush from one approach to another in the hopes of not being caught out. Here are a few of the ‘mice’ that have scampered around AB Testing:
- One Tail vs Two Tails (eek! A two tailed mouse – sounds horrible)
- Bayes vs Frequentist AB Testing
- Fixed vs Sequential designs
- Full Factorial Designs vs Taguchi designs
There is a pattern here. All of these mice tend to be features or methods that were introduced by vendors or agencies as new and improved, frequently over-selling their importance, and implying that some existing approach is ‘wrong’. It isn’t that there aren’t often principled reasons for preferring one approach over the other. In fact, often, all of them can be useful (except for maybe Taguchi MVT – I’m not sure that was ever really useful for online testing) depending on the problem. It is just that none of them, or others, will be what makes or breaks a program’s usefulness.”
As you’re building up your program, it’s likely that there are fewer nodes or points of leverage than you think. Find them, and move the needle through a structured experimentation process.
Perhaps when you reach Facebook’s scale, you can worry about tracking literally everything, doing continuous exploratory data analysis, and uncovering every little insight through data analysis. Perhaps you can then worry about quasi-experimentation frameworks and the ideological differences between Bayesian and Frequentist statistics.
But in terms of cost vs benefits, it just doesn’t balance out for most programs.
Here’s how Andrew Anderson put it:
“There are a lot of things that are unnecessary or just not valuable at all to a program, no matter what phase it is in. These include a fancy testing tool, deep analysis of existing user patterns (which are only what the user is doing, not what the user should be doing), the greatest design in the world or even tracking on all parts of your website.
Micro conversions and empty analysis are just tools to make you feel better about a path, they don’t actually make the path that much more valuable. Even worse is thinking that optimization is just something you do once you have everything else squared away.”
As I’ve written about previously, too much data can be a real problem. It causes you to miss the forest for the trees and spend your time spinning your wheels looking for insights.
Nassim Taleb actually summed it up best in Antifragile:
“More data – such as paying attention to the eye colors of the people around when crossing the street – can make you miss the big truck. When you cross the street, you remove data, anything but the essential threat.”
Personalization
If you haven’t nailed down your A/B testing strategy, I’d put personalization on the back burner.
Personalization – the process, at least – should sit under the same strategic umbrella as your experimentation efforts. How you choose to target segments, design content and experiences, and adapt these over time are all functionally parts of the same decision-theoretic process you develop by running A/B tests.
So don’t run before you can crawl or walk.
Personalization introduces further complexity. Instead of managing a two-part multiverse, you end up managing an N-part multiverse made up of very small segments that each get different treatments and experiences.
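A rough illustration of why that matters: the per-variant sample size you needed for a single A/B test now has to be met inside every segment. Hypothetical numbers again:

```python
# Hypothetical traffic and sample size requirement (in the ballpark of the earlier power calc)
weekly_visitors = 20_000
needed_per_variant = 26_000

for n_segments in (1, 4, 10):
    # Traffic gets split across segments, then across two variants within each segment
    per_segment_weekly = weekly_visitors / n_segments / 2
    weeks = needed_per_variant / per_segment_weekly
    print(f"{n_segments} segment(s) -> ~{weeks:.0f} weeks per segment to reach sample size")
```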
If you can’t get trustworthy data and run conclusive A/B tests, please try to ignore the hyperbolic marketing messages that personalization vendors are putting out. A/B testing isn’t dead, and no piece of personalization technology can save you from your existing weaknesses. In fact, when vendors use this kind of language, I’d use it as a negative signal.
Run in the opposite direction when you hear news of a “silver bullet” solution.
The Latest Shiny Piece of Technology
In general, don’t get swept away by ephemeral hype cycles. Like content marketing, the fundamentals are what win ball games.
If you don’t have engineering resources to scope tests beyond button colors and headline tests, you probably shouldn’t invest in building a home brew A/B testing platform.
If you can’t learn what messaging resonates with your users, you probably don’t need predictive personalization or bandit algorithms.
The tool is an extension of the strategy, never the other way around. A lot of human resource opportunity cost is wasted on vendor demos for shit that won’t move the needle for you anyway.
Matt Gershoff puts it like this:
“At least to me, the biggest tiger in AB Testing is fixating on solutions or tools before having defined the problem properly. Companies can easily fall into the trap of buying, or worse, building a new testing tool or technology without having thought about: 1) exactly what they are trying to achieve; 2) the edge cases and situations where the new solution may not perform well; and 3) how the solution will operate within the larger organizational framework.”
Conclusion
I get that there’s no one way to build an experimentation program, that all companies differ in their scale, complexity, and context.
But you really can’t do experimentation without trustworthy data, human talent to run the program, and leadership and goal alignment. And while you need a tech stack that allows you to run experiments, chasing shiny new tools is a recipe for disappointment.
So those are the core foundational building blocks, and by process of elimination – via negativa – you also know the things to avoid when building up your program.
Truth is, a lot of success in the business world (and elsewhere) seems to come from really mastering the unsexy foundational building blocks. And I’ve outlined my ideas on those building blocks of experimentation here.
Do you disagree with me? Any additions? Please, comment below (or email me and debate me on my podcast about it).