In recent years the use of online A/B testing has skyrocketed, fueled by a growing appreciation of its value and by the relatively low cost of technology for conducting it. Today digital firms and, increasingly, conventional companies each run tens of thousands of online experiments annually measuring whether “A,” a control (usually the current approach), is inferior to “B,” a proposed improvement to a product, service, or offer. By quickly revealing users’ reactions to modifications, the experiments help firms identify the best ways to update digital products and create new ones. Because they push innovations out to a small, randomly selected group before they’re released to everyone, the tests also lessen the risk of unintended adverse side effects. And their unique ability to objectively measure the impact of a change enables firms to disentangle any growth in revenues, engagement, or other key business metric that improvements produce from growth that would have happened anyway. That vital information allows firms to spot opportunities and accurately assess their return on investment.
For many firms A/B testing is now an integral part of the product-development cycle. Decisions about when or whether to launch brand-new products or alter existing ones, whether or how to penetrate untapped markets or customer segments, and how to allocate capital to different areas of the business are all based on test results. It’s no exaggeration to say that successful A/B testing is critical to these firms’ future. But often companies make serious mistakes in conducting experiments. In our research at Harvard Business School and our experiences as data science leaders at Netflix and LinkedIn, we’ve identified three major pitfalls in the approaches companies take. In this article we describe how you can avoid those traps by applying techniques that have worked at Netflix and LinkedIn and that will help you use experiments more effectively to improve your firm’s performance.
Pitfall 1: Not Looking Beyond the Average
One common mistake is focusing on the impact that innovations have on the mean, or average, of the business metrics in question. By doing so, companies are essentially measuring the effect on a fictional average person and ignoring the enormous differences in how real customer segments behave. A change may cause usage to jump among one type of customer but totally turn off another type.
Imagine launching a new product that increases average user spending by $1. Our instinct is to assume that each user is spending an extra dollar. However, that increase could also occur if a few users started spending a lot more and all the others started taking their business elsewhere. Typical A/B testing dashboards, which report only a difference in the global average, don’t differentiate between those two scenarios.
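To make the arithmetic concrete, here is a small Python sketch (all numbers invented) showing how the same $1 average lift can come from two very different realities:

```python
from statistics import mean

n = 10_000
control = [10.0] * n  # every control user spends $10

# Scenario 1: every treated user spends $1 more.
uniform = [x + 1.0 for x in control]

# Scenario 2: 10% of treated users spend $19 more, while the other
# 90% spend $1 less -- perhaps on their way out the door.
concentrated = [x + 19.0 for x in control[: n // 10]] + \
               [x - 1.0 for x in control[n // 10 :]]

# A dashboard that reports only the average lift sees no difference:
avg_lift_uniform = mean(uniform) - mean(control)            # 1.0
avg_lift_concentrated = mean(concentrated) - mean(control)  # 1.0

# Percentiles of the per-user lift tell the two stories apart:
lift = sorted(c - b for c, b in zip(concentrated, control))
p05, p50, p95 = lift[n * 5 // 100], lift[n // 2], lift[n * 95 // 100]
# p05 == p50 == -1.0 while p95 == 19.0: most users are spending less.
```

Reporting a few percentiles of the per-user lift alongside the mean is often enough to distinguish the two scenarios.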
Whenever core business metrics are dominated by a small number of large clients or superusers, averages are especially misleading. Unless decision makers stop thinking of their customers as one idealized, representative person, they risk optimizing for heavy users at the expense of light users. That’s dangerous because finding ways to get light users to increase consumption is often a firm’s biggest opportunity.
In some cases the answer might be to find the best single version (or, in experimentation lingo, “treatment”) for all users. But in others, it might make sense to create different versions tailored to the preferences of important segments of users. A/B testing can help companies do that. They can segment by using predefined groups such as country, industry, and past engagement, or by applying machine-learning techniques to identify groups that will respond differently to innovations. Even when not all insights are actionable, the test results allow firms to size up potential opportunities and discover ways to tap them.
To address the heterogeneity of customers, firms should do the following:
Use metrics and approaches that reflect the value of different customer segments.
Netflix wants to increase the benefits that it provides all its members—not just those who use its service the most. Consider what might happen if popular TV shows appeared more often in all users’ recommendations. That could induce frequent users to watch even more programming, dramatically increasing the average time users spent on Netflix. But such a change wouldn’t consider the needs of members who use Netflix to stream niche content and therefore may watch less overall. This is a problem: In general, less-engaged Netflix members don’t receive as much value from the service as heavy users do and are more likely to cancel their subscriptions. Therefore, increasing the amount of content that less-engaged members want to stream, even by a small percentage, is better for Netflix than getting heavy users to watch a few more hours of programming.
To navigate such issues, Netflix takes two approaches. First, it uses A/B testing designs in which each user’s experience alternates between “A” and “B”: The user receives the control experience on day one and the treatment on day two, or vice versa. That allows Netflix to identify the most-promising innovations while accounting for different members’ behaviors. Second, instead of looking at the raw average of streaming minutes, it has developed a metric that balances the effects on lightly and heavily engaged members and ensures that product changes don’t benefit one segment at the expense of another.
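The exact weighting Netflix uses isn’t public, but a balanced metric of this general kind can be sketched in Python. Here, within-segment relative lifts are averaged so that light and heavy members count equally; all numbers and the weighting scheme itself are illustrative assumptions:

```python
from statistics import mean

# Hypothetical daily streaming minutes for ten members.
light = [5, 1, 10, 2, 4, 3, 1, 6]   # eight lightly engaged members
heavy = [300, 400]                  # two heavily engaged members

def raw_lift(d_light, d_heavy):
    """Change in the plain average of streaming minutes when each
    light member gains d_light minutes and each heavy one d_heavy."""
    total = len(light) * d_light + len(heavy) * d_heavy
    return total / (len(light) + len(heavy))

def balanced_lift(d_light, d_heavy):
    """Average of the within-segment *relative* lifts, so a small
    absolute gain for light members counts as much as a large one
    for heavy members."""
    return (d_light / mean(light) + d_heavy / mean(heavy)) / 2

# Treatment A adds 30 min per heavy member; treatment B adds 3 min
# per light member.
# raw_lift(0, 30) = 6.0 vs. raw_lift(3, 0) = 2.4: the raw average picks A.
# balanced_lift(3, 0) > balanced_lift(0, 30): the balanced metric picks B.
```

The design choice is that a 75% jump in a light member’s viewing matters at least as much as a modest bump for an already heavy viewer.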
Measure the impact across different levels of digital access.
By “digital access” we mean whether customers have fast and reliable or slow and unstable internet connections; have the latest, most sophisticated devices or older or less powerful ones; and so on. Designing and analyzing A/B tests for those distinct cohorts allows you to match users with the experience that’s best suited to their digital environment.
For technical metrics (such as app loading times, delays before playback commences, and crash rates), it’s particularly important to understand individual members’ perceptions of how a modification affects the quality of their service. To do this, both Netflix and LinkedIn track the upper, middle, and lower percentiles of these metrics and how their mean values change. Has the treatment slowed app loading relative to the control for both users in the fifth percentile of loading times (those with the fastest internet connections) and users in the 95th percentile (those with the slowest)? Or has the treatment benefited users in the fifth percentile while harming those in the 95th percentile? Netflix uses this approach to test innovations aimed at improving the quality of streaming video playback for different devices and network connectivity conditions.
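As a rough illustration of this percentile-level analysis, the following Python sketch (with synthetic loading times) reports the treatment-minus-control gap at the 5th, 50th, and 95th percentiles:

```python
from statistics import quantiles

def percentile_report(control_ms, treatment_ms, cuts=(5, 50, 95)):
    """Treatment-minus-control gap in app-loading time at selected
    percentiles, rather than only at the mean."""
    c = quantiles(control_ms, n=100)  # 99 percentile cut points
    t = quantiles(treatment_ms, n=100)
    return {p: t[p - 1] - c[p - 1] for p in cuts}

# Synthetic loading times (ms): this treatment speeds up users with
# fast connections (low percentiles) but slows down users with slow
# ones (high percentiles).
control = [100 + i for i in range(1000)]
treatment = [50 + 1.3 * i for i in range(1000)]
report = percentile_report(control, treatment)
# report[5] is negative (faster) while report[95] is positive
# (slower) -- opposite effects that a mean-only dashboard would blur.
```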
Always account for group-specific behavior.
LinkedIn’s A/B testing platform automatically computes the effects of experiments by group. For example, it separately calculates the impact of new features for each country, because something that works well in the United States may not be as successful in India. It also groups individuals by the reach of their social networks—since a communication enhancement will affect well-connected and sparsely connected individuals differently. In a recent initiative, for instance, LinkedIn found that instantly notifying job seekers of new job listings disproportionately increased the likelihood that sparsely connected individuals would apply for positions, because they’re less likely to hear of openings through other means than well-connected people are.
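A per-group breakdown of this kind is straightforward to compute. Here is a minimal Python sketch, with made-up records, of the sort of segment-level effect calculation an experimentation platform might automate:

```python
from collections import defaultdict
from statistics import mean

# Synthetic records: (segment, experiment arm, metric value per user).
records = [
    ("US", "control", 1.00), ("US", "treatment", 1.12),
    ("US", "control", 0.96), ("US", "treatment", 1.08),
    ("IN", "control", 1.00), ("IN", "treatment", 0.85),
    ("IN", "control", 1.04), ("IN", "treatment", 0.91),
]

def effects_by_segment(records):
    """Treatment-minus-control difference in the mean metric,
    computed separately for each segment (e.g., country)."""
    groups = defaultdict(lambda: defaultdict(list))
    for segment, arm, value in records:
        groups[segment][arm].append(value)
    return {
        seg: mean(arms["treatment"]) - mean(arms["control"])
        for seg, arms in groups.items()
    }

# effects_by_segment(records) shows a positive effect in the US and a
# negative one in India -- a split that pooling the records would hide.
```

With these numbers the pooled effect is nearly zero, even though the feature clearly helps one market and hurts the other.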
Finally, LinkedIn tracks the impact of changes on inequality itself by checking whether an innovation is increasing or decreasing the share of revenue, page views, and other top-line metrics contributed by the top 1% of users. This ensures that LinkedIn is not overoptimizing for the most active members at the expense of the less engaged.
Segment key markets.
Identifying country-specific differences has enabled LinkedIn and Netflix to continue serving their primary regions and grow into new ones without forcing the same experience onto all. In India, for instance, where people primarily access the internet via mobile devices, any initiative that slows down app loading speeds lowers engagement significantly more than it does in the United States and other markets where consumers aren’t as likely to rely on older mobile devices or slower 3G networks. Accordingly, to serve the needs of India and similar markets, LinkedIn developed the LinkedIn Lite version of its main application. To make it work more quickly, Lite has a lower image quality and a modified user interface, reducing the amount of data the app has to process. At Netflix, market-specific research on device usage has led the firm to experiment with and ultimately release a mobile-only membership plan for India.
Pitfall 2: Forgetting That Customers Are Connected
Standard A/B testing, in which you compare group A with group B, assumes there is no interaction between users in the two groups. That premise is often reasonable in traditional randomized experiments—such as clinical trials measuring the effectiveness of a new medicine. But interactions among the participants in an online A/B test can affect the outcome.
Consider an experiment that tests a change designed to make it easier to start a conversation with people in your LinkedIn network—for instance, by notifying you when a contact is using LinkedIn at that moment, or about contacts at a firm with a job opening you might be interested in, and then letting you message them from the notification page. Since users without the update may receive and, in turn, respond to more messages (sent by people who got the update), there will most likely be a positive impact on the control group. If decision makers don’t take this “contamination” into account, severe mismeasurement can occur, leading to the wrong decision—such as concluding that a bad treatment is good or that a good one is bad. Here are ways to avoid that pitfall:
Use network A/B testing.
LinkedIn has developed techniques that allow it to measure the extent of group interactions or avoid them altogether. It does the latter by isolating users in group A from users in group B—by making sure that if a user is in group A, all the other users who could influence her behavior are also in group A. It then does the same with group B. These techniques capture a more detailed picture of user behavior. Consider a new content-recommendation algorithm that shows more long-form text content, such as news articles, and fewer images. Typically, images generate a lot of likes and a few comments, and news articles generate fewer likes but more comments. Users, however, are more likely to interact with or respond to content that one of their contacts has commented on than content the contact has merely liked. While a standard A/B test will show that the new algorithm is generating fewer likes, network A/B testing will capture both the likes and the positive downstream impact initiated by more comments from the exposed users. More broadly, network A/B testing has helped LinkedIn’s decision makers understand the total impact of their initiatives and on multiple occasions has led to significant changes in strategy.
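One way to operationalize this isolation is cluster-based randomization: assign whole clusters of connected users, rather than individuals, to each arm. The sketch below is a simplified Python illustration; the cluster names and membership are invented, and detecting the clusters in the real social graph (e.g., via community detection) is a separate, harder problem assumed to have happened upstream:

```python
import random

def cluster_randomize(clusters, seed=0):
    """Assign whole network clusters -- not individual users -- to an
    arm, so connected users share a treatment and cannot contaminate
    the other arm through their interactions."""
    rng = random.Random(seed)
    assignment = {}
    for cluster_id, members in clusters.items():
        arm = rng.choice(["A", "B"])
        for user in members:
            assignment[user] = arm
    return assignment

# Toy clusters of tightly connected users (illustrative only).
clusters = {
    "c1": ["ana", "ben", "caro"],
    "c2": ["dev", "eli"],
    "c3": ["fay", "gus", "hana", "ivan"],
}
assignment = cluster_randomize(clusters)
# Everyone in the same cluster lands in the same arm, e.g.
# assignment["ana"] == assignment["ben"] == assignment["caro"].
```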
Use time-series experiments.
These are A/B tests that randomly switch between exposing the whole market to treatment A and exposing it to treatment B. Online marketplaces where many buyers and sellers interact (such as platforms for online ad auctions or ride sharing) are particularly prone to contamination. There, even small A/B tests that target only some users can shift the market equilibrium in ways that don’t represent what would have happened had everyone been exposed to the change. Time-series experiments, however, can accurately gauge the true impact on the entire market.
For example, imagine that LinkedIn develops a new algorithm for matching job seekers with job openings. To measure its effectiveness, LinkedIn would simultaneously expose all job postings and seekers in a given market to the new algorithm for 30 minutes. In the next 30-minute period, it would randomly decide to either switch to the old algorithm or stay with the new one. It would continue this process for at least two weeks to ensure that it sees all types of job search patterns. Netflix’s interleaving strategy is a special application of this more general methodology.
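A time-series (sometimes called switchback) design of this kind can be sketched in a few lines of Python; the window length, seed, and metric below are all illustrative:

```python
import random

def switchback_schedule(periods, seed=42):
    """Randomly assign each 30-minute window to the old ("A") or new
    ("B") algorithm; the whole marketplace is in one arm at a time."""
    rng = random.Random(seed)
    return [rng.choice(["A", "B"]) for _ in range(periods)]

def estimate_effect(schedule, metric_by_period):
    """Difference in the market-level metric between B- and A-windows."""
    a = [m for arm, m in zip(schedule, metric_by_period) if arm == "A"]
    b = [m for arm, m in zip(schedule, metric_by_period) if arm == "B"]
    return sum(b) / len(b) - sum(a) / len(a)

# Two weeks of 30-minute windows, enough to observe weekday and
# weekend job-search patterns:
periods_per_day = 48
schedule = switchback_schedule(14 * periods_per_day)
```

Because every window exposes the entire market to a single algorithm, the comparison reflects the market equilibrium under each treatment rather than a mix of the two.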
Pitfall 3: Focusing Predominantly on the Short Term
For A/B testing to be successful, experiments have to run for a sufficient period of time. Focusing solely on short-term signals can throw off a business for several reasons. First, the initial signal from a test is often different from the results seen once members grow accustomed to a new experience. This is particularly true of changes to user interfaces, where novelty, or “burn-in,” effects are common: Users will often show an initial heightened engagement with new features that decays over time. Second, innovations may lead to long-term but slow-to-materialize changes in how users interact with the product. For example, ongoing incremental improvements to recommendation algorithms or app performance may have no immediate measurable effects but may slowly but significantly increase customer satisfaction. Here’s how to account for these behaviors:
Get the length of experiments right.
You need to ensure that you’re measuring the steady-state impact of a new feature, as opposed to a short-term novelty effect. How long is long enough? That varies, because users respond differently to, say, a change to a user interface than they do to changes in a recommendation system. So you should run A/B tests until user behavior stabilizes. Both LinkedIn and Netflix monitor how engagement with new features evolves over time and have found that, for most of the tests they run, results typically stabilize after about a week.
Run “holdout” experiments.
In these a small subset of users is not exposed to the changes for a preset period of time (usually over a month) while everyone else is. This approach helps companies measure slow-to-materialize effects. LinkedIn has found that holdout experiments are beneficial in scenarios where the cumulative impact of many incremental changes may eventually lead to improvement or where it may take users a while to discover a new feature.
Imagine you’re testing a feature that highlights professional milestones (such as getting a new job) achieved by network connections in a social media feed. This feature would likely be triggered intermittently, perhaps only once or twice a week, depending on who is in a member’s network. In such cases an experimental period of several weeks or months may be needed to ensure that members of the treatment group are exposed to enough updates to test the feature’s effect on feed quality, or how relevant users perceive the content to be.
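Mechanically, a holdout is often implemented with stable hash-based bucketing, so that the same small share of users keeps the old experience for the entire period. Here is a hypothetical Python sketch; the salt string and the 5% share are invented choices, not a description of either company’s system:

```python
import hashlib

def in_holdout(user_id, holdout_pct=5, salt="2024-q3-holdout"):
    """Deterministically place roughly holdout_pct% of users in a
    holdout that keeps the old experience for the whole period.
    Hashing makes the assignment stable across sessions and devices.
    (The salt and percentage are illustrative choices.)"""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < holdout_pct

# The same user always lands in the same bucket, so the holdout
# stays clean for the full duration of the experiment.
```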
Online A/B testing offers a powerful way to gain insights into the impact of potential changes on different customer segments and markets. But standard approaches, which tend to focus on the short-term impact of a new experience on the average user, may lead companies to draw the wrong conclusions. The techniques we’ve described can help managers avoid common mistakes and identify the most valuable short- and long-term opportunities, both globally and for strategically important customer segments.