Econometrics vs data mining
November 16, 2013
Before I worked professionally as a data-miner, I worked as a senior research economist at one of Australia’s premier microeconomics research bodies. Much of my time was spent building econometric models of the world.
I remember, when I made the move to work as a data-miner, one of my colleagues jokingly suggested that I was going over to the dark side. He was referring to the philosophical difference between econometrics and data-mining.
Breiman on the ‘two cultures’
The late Leo Breiman was one of the key figures in the development of data mining (among other contributions, he invented the random forest). His 2001 paper Statistical Modeling: The Two Cultures contains an excellent discussion of the differences between econometrics and data mining.
He says:
There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model [econometrics]. The other uses algorithmic models and treats the data mechanism as unknown [data mining]. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems.
This highlights the key difference between the two cultures - econometricians start with a model or view of how the world works (a data model), and try to estimate the parameters that describe that model. Data miners, by contrast, explicitly make no assumptions about the data-generating mechanism; they try to build the best predictive model they can, algorithmically, from the available data.
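The contrast can be made concrete with a toy sketch (entirely hypothetical data and function names, not from Breiman's paper): the "data model" culture assumes a functional form - say y = a + b·x plus noise - and estimates a and b, while the algorithmic culture makes no such assumption and simply lets the data speak, here via a hand-rolled k-nearest-neighbour average.

```python
# Culture 1 (data model): assume y = a + b*x + noise, estimate a and b.
# Culture 2 (algorithmic): assume no functional form; predict y at a new x
# by averaging the y-values of the k nearest observed x's (k-NN regression).
# All names and data below are illustrative only.

def fit_ols(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def knn_predict(xs, ys, x_new, k=3):
    """Average the y's of the k training points closest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

# Data actually generated by y = x^2, so the linear data model is misspecified.
xs = [0, 1, 2, 3, 4, 5]
ys = [x * x for x in xs]

a, b = fit_ols(xs, ys)
print(a, b)                      # the straight line misses the curvature
print(knn_predict(xs, ys, 2.5))  # local averaging adapts to the true shape
```

The point of the sketch: when the assumed data model is wrong, the parameter estimates describe a world that does not exist, whereas the algorithmic method carries no such commitment.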
Which is more science-like?
I would argue that data-mining is closer to real science than econometrics. Econometricians might take offence at that statement, but let's try to examine it objectively.
Allow me to use this definition of science (from Good math, bad math):
What science does is make observations, and then based on those observations produce models of the universe. Then, using that model, it makes predictions, and compares those predictions with further observations. By doing that over and over again, we get better and better models of how the universe works. Science is never sure about anything - because all it can do is check how well the model works. It’s always possible that any model doesn’t describe how things actually work. But it gives us a good approximation, in a way that allows us to understand how things work. Or, not quite how things work, but how we can affect the world by our actions. Our model might not capture what’s really happening - but it’s got predictive power.
To be clear - I am not saying that econometrics doesn't apply this methodology. Rather, I am asserting that data mining is more effective at producing models of the universe based on observations, and is therefore a more effective way to understand how the world works.
Why is data mining more effective? Because it relaxes the assumption that the modeller knows how the world works - an assumption that forms the cornerstone of econometrics.
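The scientific loop described above - build a model from observations, predict, then check the predictions against observations the model has never seen - can itself be sketched in a few lines (a minimal illustration with made-up data; the deliberately naive model and the holdout split are my own example, not anything from the post):

```python
# The "check how well the model works" loop: fit on one half of the
# observations, then score predictions on the half the model never saw.
# Data and model here are hypothetical and deliberately simple.

def mean_model(train_ys):
    """A naive 'model': always predict the training-sample mean."""
    mu = sum(train_ys) / len(train_ys)
    return lambda x: mu

def mse(model, xs, ys):
    """Mean squared prediction error on unseen observations."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical observations, split into training and holdout halves.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 10, 12, 14, 16]
train_x, test_x = xs[:4], xs[4:]
train_y, test_y = ys[:4], ys[4:]

model = mean_model(train_y)
# A large holdout error tells us the model is a poor approximation of
# the world - exactly the verdict the scientific loop is designed to give.
print(mse(model, test_x, test_y))
```

Nothing in this loop cares whether the model came from a stochastic data model or from an algorithm - predictive power on unseen data is the common yardstick, which is precisely why Breiman's algorithmic culture sits so comfortably inside it.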