The Ultimate Guide to A/B Testing: Lessons from Ronny Kohavi
I recently watched a video titled The Ultimate Guide to A/B Testing | Ronny Kohavi. Kohavi, a seasoned expert on A/B testing, lays out critical lessons surrounding experimentation culture that resonate well with data-driven organizations.
Kohavi emphasizes an essential axiom: “Test everything.” Every code change or new feature should be run as a controlled experiment, because even small tweaks can yield surprising results. For instance, Kohavi cites an example from Bing where a minor change to how ads were displayed led to a 12% revenue increase, worth around $100 million annually at the time.
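To make the “test everything” idea concrete, here is a minimal sketch of how a code change might be gated behind an experiment. The function name, the hashing scheme, and the 50/50 split are illustrative assumptions on my part, not details from the video:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user into control or treatment.

    Hashing (experiment, user_id) keeps a user's assignment stable across
    sessions and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical experiment gating a small change, e.g., a new ad layout:
variant = assign_variant("user-42", "ad-layout-v2")
print(variant)  # product code would branch on this to render the old or new layout
```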
The concept of high-risk, high-reward ideas is pivotal as well. Kohavi argues that organizations need to be prepared to embrace failure; he suggests that failing around 80% of the time while chasing major innovations is the norm. This perspective sets realistic expectations for teams venturing into A/B testing.
In terms of success rates, Kohavi’s experience shows a wide spectrum of failure rates across organizations: roughly 66% at Microsoft, 85% at Bing, and 92% at Airbnb. In other words, most experiments will not yield positive outcomes. At Airbnb, out of approximately 250 experiments in search relevance, only 8% (roughly 20) had a positive impact on key metrics. Such statistics underline the stark reality of experimentation.
The discussion naturally leads to the topic of p-values, a term that is frequently misunderstood. Kohavi clarifies that a p-value of 0.05 does not mean there is a 95% probability that the tested treatment is better than the control. Rather, the p-value is the probability of observing data at least as extreme as the data collected, assuming the null hypothesis (no difference between treatment and control) is true. A surprisingly strong result should therefore prompt further scrutiny; this is Twyman’s law, which holds that any figure that looks too good to be true usually is.
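As a worked illustration of that definition (my own sketch, not code from the video), a two-sided p-value for a difference in conversion rates can be computed with nothing but the standard library; the traffic numbers below are made up:

```python
from math import erf, sqrt

def two_proportion_p_value(conv_c: int, n_c: int, conv_t: int, n_t: int) -> float:
    """Two-sided p-value for a difference in conversion rates.

    The p-value is P(data at least this extreme | null hypothesis of no
    difference) -- not the probability that the treatment is better.
    """
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    # Two-sided tail probability from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# 5.0% vs. 5.5% conversion with 10,000 users per arm:
print(two_proportion_p_value(500, 10_000, 550, 10_000))
# ~0.11: not significant at 0.05 despite a 10% relative lift
```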
Kohavi also discusses the importance of establishing trustworthy experimentation processes. A well-structured experimentation platform drives down the marginal cost of each experiment, enabling organizations to run many tests without significant overhead. He recommends investing in this infrastructure because it ultimately increases both the speed and the effectiveness of experimentation.
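One concrete trustworthiness check, which Kohavi emphasizes in his broader work even though it is not detailed above, is a sample ratio mismatch (SRM) test: if an experiment configured as a 50/50 split does not receive roughly 50/50 traffic, its results should not be trusted. A minimal sketch with made-up counts:

```python
from scipy.stats import chisquare

def srm_p_value(n_control: int, n_treatment: int, treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit test for sample ratio mismatch (SRM)."""
    total = n_control + n_treatment
    expected = [total * (1 - treatment_share), total * treatment_share]
    _, p = chisquare([n_control, n_treatment], f_exp=expected)
    return p

# A nominally 50/50 experiment that actually logged 50,000 vs. 50,800 users:
print(srm_p_value(50_000, 50_800))
# ~0.01: investigate the assignment pipeline before reading any metric
```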
Kohavi further highlights that organizations need strong documentation practices to maintain an institutional memory of experiments, including the failures. This practice not only breeds a learning culture but also informs future work. He encourages setting up quarterly reviews of surprising experiments, recognizing that a robust experimentation culture is built by learning from both successes and unexpected failures.
The overall evaluation criterion (OEC) is another essential aspect of Kohavi’s framework. Organizations should not optimize for a single metric, like revenue, in isolation; they should also track guardrail metrics so that vital user-centric signals, such as satisfaction and retention, do not suffer in the process. This perspective aligns experimentation with long-term business goals, ensuring sustainable growth.
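Here is a small sketch of how a launch decision might combine the optimized metric with guardrails; every metric name, delta, and threshold is hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical relative deltas (treatment vs. control) from one experiment readout.
results = {
    "revenue_per_user": 0.021,    # +2.1%: the metric being optimized
    "sessions_per_user": -0.002,  # guardrail: engagement
    "retention_7d": -0.015,       # guardrail: retention
}

GUARDRAIL_FLOOR = -0.01  # block the launch if any guardrail drops more than 1%

def ship_decision(results: dict[str, float]) -> str:
    guardrails = {k: v for k, v in results.items() if k != "revenue_per_user"}
    breached = [k for k, v in guardrails.items() if v < GUARDRAIL_FLOOR]
    if breached:
        return "do not ship: guardrail(s) regressed: " + ", ".join(breached)
    return "ship" if results["revenue_per_user"] > 0 else "do not ship"

print(ship_decision(results))  # blocked: 7-day retention fell 1.5%
```

The point the example makes is exactly Kohavi’s: a revenue win alone is not a ship decision if a guardrail like retention regresses.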
Finally, the video is a reminder that in the dynamic landscape of experimentation, organizations should be prepared to iterate. Kohavi advises against embarking on large redesigns without incremental testing: start with small changes, measure the impact, and evolve based on concrete data. This iterative approach shields organizations from the sunk cost fallacy, which otherwise pushes teams to stick with poor designs simply because they have already invested in them.
For those interested in diving deeper into A/B testing, I strongly recommend checking out the video for a wealth of knowledge and insights. Kohavi’s methodology and experience underline that while the road to robust experimentation can be fraught with obstacles, the rewards, both in terms of data-driven decision-making and business outcomes, are invaluable.