by Phanish Puranam
Lori and Dan’s post “From Quasi-Replication to Generalization: Making ‘Basis Variables’ Visible” gives us a nice way to think about generalization in terms of “basis variables”. I’d like to extend their idea with a complementary way of thinking about generalization using machine learning (ML) techniques. Generalization can be thought of as the special case of replication in new contexts, and so it’s useful to first consider why results don’t replicate. If we set aside the ambiguity that results from operationalization and methodological (in)competence, there are two important reasons: sampling error and omitted variables (and the combination of the two).
Sampling error is often why other samples from the same context fail to replicate a result, raising suspicions that the original result was overfitted. This is where I think the discussion of the replication crisis in social psychology and medicine is centred today. Machine learning techniques address the sampling error problem through regularization and cross-validation. For more on this topic, please see my paper on algorithmic induction with He, Shrestha and Von Krogh (OS, 2020).
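To make the regularization-plus-cross-validation idea concrete, here is a minimal sketch in pure Python. The data-generating process, the penalty grid, and the fold count are all hypothetical choices for illustration; the point is only that the penalty is selected by out-of-sample performance rather than in-sample fit, which is what guards against overfitting to sampling error.

```python
# Sketch: choose a ridge penalty by k-fold cross-validation.
# All numbers (sample size, noise, penalty grid) are hypothetical.
import random

random.seed(0)

# Simulated sample from one "context": true relationship y = 2*x + noise.
n = 60
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [2.0 * x + random.gauss(0, 1) for x in xs]

def ridge_slope(x, y, lam):
    """Closed-form ridge estimate for a univariate, no-intercept model."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

def cv_error(x, y, lam, k=5):
    """Mean squared prediction error on held-out folds."""
    total, count = 0.0, 0
    for i in range(k):
        hold = set(range(i, len(x), k))
        xtr = [x[j] for j in range(len(x)) if j not in hold]
        ytr = [y[j] for j in range(len(x)) if j not in hold]
        b = ridge_slope(xtr, ytr, lam)
        for j in hold:
            total += (y[j] - b * x[j]) ** 2
            count += 1
    return total / count

# Pick the penalty that predicts best out of sample, then refit.
best_lam = min([0.0, 0.1, 1.0, 10.0], key=lambda lam: cv_error(xs, ys, lam))
slope = ridge_slope(xs, ys, best_lam)
print("chosen penalty:", best_lam, "estimated slope:", round(slope, 2))
```

The same logic scales to many predictors, where regularization matters far more; the univariate case is used here only to keep the sketch short.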
Another important part of the discussion on the replication crisis in social psychology and medicine needs to include omitted variables that moderate the key relationships and whose values vary across contexts; these also explain why results may not hold in other contexts. The idea of context dependence, and of limits to generalizability, is formally equivalent to the problem of unobserved moderators that vary by context (see for instance Bareinboim and Pearl, 2016). Dan and Lori recommend finding, explicitly measuring, and theorizing about these moderators, which they term basis variables. In meta-analyses, this would be equivalent to finding and coding study-level moderators (Hunter and Schmidt, 2003). That’s an unarguably good move.
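A tiny simulation makes the point about unobserved moderators vivid. Suppose (hypothetically) the true slope of y on x is 1 + 2m, where m is a context-level moderator we never measure. An analyst who estimates the slope in context A will get a result that not only fails to replicate in context B, but actually reverses sign:

```python
# Sketch: an unobserved context-level moderator m makes a result fail to
# "travel" across contexts. The data-generating process is hypothetical.
import random

random.seed(1)

def simulate(m, n=500):
    """Draw (x, y) pairs where the x->y slope is moderated by m."""
    data = []
    for _ in range(n):
        x = random.gauss(0, 1)
        y = (1.0 + 2.0 * m) * x + random.gauss(0, 1)  # true slope = 1 + 2m
        data.append((x, y))
    return data

def ols_slope(data):
    """Ordinary least squares slope, no intercept."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, y in data)
    return sxy / sxx

slope_a = ols_slope(simulate(m=0.0))   # context A: true slope = 1
slope_b = ols_slope(simulate(m=-1.0))  # context B: true slope = -1
print("context A:", round(slope_a, 2), " context B:", round(slope_b, 2))
```

Nothing about the estimation in context A was wrong; the result simply does not generalize because m was omitted, which is exactly why measuring and theorizing about such basis variables pays off.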
In addition, we can extend the idea by thinking of entire clusters of basis variables. What if the structure of inter-relationships among an entire set of variables is similar in contexts A and B? For instance, think of the functional forms for gravitational pull in Newton’s law and electrostatic attraction in Coulomb’s law: they are very similar despite arising in very different contexts. A more prosaic example is pay-per-use business models, which surface in what superficially appear to be very different contexts, so that many relationships between strategy variables might generalize across those contexts. Prothit Sen at ISB and I have been working on a project that uses ML to discover such “structural relatedness” between industrial contexts that no human analyst might ever stumble on unaided. We’ll keep you posted!
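One simple way to operationalize structural relatedness (a sketch of my own, not the actual method of the project mentioned above) is to compare the correlation structure among the same set of variables measured in different contexts, and call two contexts related when those structures are close:

```python
# Sketch: "structural relatedness" as distance between correlation
# structures. Variables, data, and distance measure are all hypothetical.
import math
import random

random.seed(2)

def simulate_context(coupling, n=400):
    """Three variables; v1 and v2 share a common driver, v3 is unrelated."""
    rows = []
    for _ in range(n):
        base = random.gauss(0, 1)
        v1 = base + random.gauss(0, 0.5)
        v2 = coupling * base + random.gauss(0, 0.5)
        v3 = random.gauss(0, 1)
        rows.append((v1, v2, v3))
    return rows

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)

def corr_matrix(rows):
    cols = list(zip(*rows))
    k = len(cols)
    return [[corr(list(cols[i]), list(cols[j])) for j in range(k)]
            for i in range(k)]

def structural_distance(m1, m2):
    """Frobenius-style distance between two correlation matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for r1, r2 in zip(m1, m2)
                         for a, b in zip(r1, r2)))

# Contexts A and B share the same coupling structure; context C does not.
ma = corr_matrix(simulate_context(coupling=1.0))
mb = corr_matrix(simulate_context(coupling=1.0))
mc = corr_matrix(simulate_context(coupling=-1.0))
print("A-B distance:", round(structural_distance(ma, mb), 2))
print("A-C distance:", round(structural_distance(ma, mc), 2))
```

Under this (deliberately crude) measure, contexts A and B come out as structurally related while C does not, even though all three contain the same surface variables; richer versions could compare estimated functional forms rather than correlations.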