Abstract: Randomized controlled trials—often called A/B tests in industrial settings—are an increasingly important element in the management of many organizations. Such experiments are meant to bring the benefits of scientific rigor and statistical measurement to the domain of managerial decision making. But just as this practice is reaching widespread adoption, the problem of 𝑝-hacking—by which experimenters try several statistical analyses until they find one that produces a sufficiently small 𝑝-value—has emerged as a prevalent concern among statisticians, industrial practitioners, and the scientific community at large. In A/B testing in particular, experimenters can watch their data arrive in real time and stop experiments as soon as their 𝑝-values cross a threshold of statistical significance. Such behavior, which is known to inflate false discoveries, can lead managers to make economically significant mistakes. In this paper, we study the prevalence of this form of 𝑝-hacking in a sample of 2,482 experiments from 245 e-commerce firms conducted on a third-party A/B testing platform. After developing a statistical method to detect this effect, we apply it to our data and find (across several specifications) little to no evidence of 𝑝-hacking in our sample of experiments. Using counterfactual simulations, we show that if even a modest degree of 𝑝-hacking were present in our dataset, our methodology would have high power to detect it at our current sample size. In addition to outlining a robust method for detecting 𝑝-hacking in similar datasets, our finding serves as a valuable data point in an increasingly important discussion of how economic agents use data and statistics for strategic decision making.
Read the working paper here.
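The optional-stopping behavior the abstract describes—peeking at accumulating data and stopping the moment 𝑝 drops below the significance threshold—inflates the false-discovery rate well above the nominal level. A minimal Monte Carlo sketch of this mechanism (not the paper's detection method; all function names and parameter values here are illustrative assumptions) simulates A/A experiments with a true effect of zero and counts how often a peeking experimenter "finds" significance:

```python
import math
import random

def z_test_p(mean, n, sigma=1.0):
    # Two-sided p-value for H0: mu = 0, given the mean of n N(0, sigma^2) draws.
    z = mean * math.sqrt(n) / sigma
    return math.erfc(abs(z) / math.sqrt(2))

def run_experiment(rng, n_max=500, peek_every=10, alpha=0.05):
    # One simulated A/A test: the true effect is zero, so any "discovery"
    # is false. The experimenter peeks every `peek_every` observations and
    # stops as soon as p < alpha. Returns True if that ever happens.
    total = 0.0
    for i in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)
        if i % peek_every == 0 and z_test_p(total / i, i) < alpha:
            return True
    return False

def false_discovery_rate(n_sims=2000, seed=0, **kwargs):
    # Fraction of null experiments stopped early at "significance".
    rng = random.Random(seed)
    return sum(run_experiment(rng, **kwargs) for _ in range(n_sims)) / n_sims

if __name__ == "__main__":
    # With repeated peeking, the realized Type I error rate is far above
    # the nominal 5% an experimenter might believe they are controlling.
    print(false_discovery_rate())
```

Under these illustrative settings, the realized false-positive rate is several times the nominal 5% level—precisely the inflation the paper's detection method is designed to look for in field data.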