Abstract: Firms often have many competing innovations they could implement, but limited information about which is likely to yield the highest return. An emerging approach to evaluating these innovations is A/B testing, which gives firms a data-driven basis for strategic innovation decisions. In current practice, almost all A/B tests are evaluated by performing simple t-tests across groups to estimate the average treatment effect. However, this otherwise valid technique has a hidden shortcoming: even when version A outperforms version B on average, version B may outperform version A among the firm’s most valuable customers. By relying on averages and ignoring this heterogeneity in customer response, managers risk choosing consequential business strategies that actually harm profits. In this project, we plan both to document this effect in archival data from real A/B tests and to provide managers with insight into how to avoid this pitfall. Specifically, we aim to identify which types of tests are most susceptible to these adverse results and to develop novel statistical evaluation methodologies that account for these subtle but consequential interaction effects.
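
The phenomenon described above can be illustrated with a small simulation. The sketch below uses entirely hypothetical numbers (segment sizes, effect sizes, and the "high-value" label are invented for illustration, not drawn from the proposal's data): version A wins the overall comparison, yet version B is better among the high-value segment, so a single averaged t-test points the manager at the wrong choice for the customers who matter most.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical population: 20% of customers are high-value, and the
# effect of version B (relative to A) flips sign in that segment.
high_value = rng.random(n) < 0.2
treat = rng.random(n) < 0.5               # True -> saw version B, False -> version A
effect = np.where(high_value, 2.0, -1.0)  # (B - A) effect by segment, in outcome units
outcome = 5.0 + treat * effect + rng.normal(0.0, 1.0, n)

def welch_t(x, y):
    """Welch's t-statistic for the difference in means (x - y)."""
    return (x.mean() - y.mean()) / np.sqrt(
        x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y)
    )

# Standard practice: a single test on the overall average treatment effect.
overall_diff = outcome[treat].mean() - outcome[~treat].mean()
print(f"overall  (B - A): {overall_diff:+.2f}  "
      f"(t = {welch_t(outcome[treat], outcome[~treat]):+.1f})")

# Subgroup comparison reveals the reversal among high-value customers.
hv_diff = outcome[treat & high_value].mean() - outcome[~treat & high_value].mean()
lv_diff = outcome[treat & ~high_value].mean() - outcome[~treat & ~high_value].mean()
print(f"high-value (B - A): {hv_diff:+.2f}")
print(f"low-value  (B - A): {lv_diff:+.2f}")
```

With these assumed parameters the overall difference is roughly −0.4 (A looks better, and significantly so at this sample size), while the high-value segment shows a difference of roughly +2.0 in favor of B: exactly the averaging pitfall the proposal targets.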