David Schwinn summarizes the entire history of statistics and warns about the dangers of some contemporary applications of what is known as Big Data.
This month’s column comes from a convergence of finishing my article, “Statistical thinking for OD Professionals,” for the OD Practitioner Journal, and my reading of “How Statistics Lost their Power – and Why We Should Fear What Comes Next” in the Guardian (https://www.theguardian.com/politics/2017/jan/19/crisis-of-statistics-big-data-democracy) and Weapons of Math Destruction by Cathy O’Neil (2016, New York: Crown Publishing Group). I modestly titled this column “A History of Statistics,” even though I have only six credit hours of statistics education. A better title might be “Dave’s Pretty True Story of the Evolution of Statistics.” Here goes.
The Guardian article led me to believe that once upon a time, rulers of nations wanted to know what was going on in their countries. When they looked around and sent out scouts, they got very different observations. They had trouble making sense out of a lot of different views of the same country. They wanted a single “objective” view of the whole. Measuring and counting things seemed to be a good way to establish “objectivity.” A mathematician proposed getting a single view by finding a mathematical central tendency such as an average.
Those mathematicians went one step further. They also proposed understanding how much things varied. As they studied that variation, they noticed that things frequently varied in a way that could be graphically depicted by a bell-shaped curve. They logically called this a normal distribution. They even proposed quantifying that variation. Along came the range. When they noticed that some distributions were not normally distributed, they concluded that the range was perhaps too crude a measure. They came up with the standard deviation and noticed that almost everything they attempted to measure or count fell within plus or minus three standard deviations of the average. Statistics was born!
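That plus-or-minus-three-standard-deviations observation is easy to check numerically. Here is a small Python sketch using simulated, made-up measurements (not data from this column) to see what fraction of a bell-shaped sample falls within three standard deviations of the average:

```python
import random
import statistics

# Simulate 10,000 measurements from a roughly bell-shaped (normal) process.
random.seed(42)
data = [random.gauss(100, 15) for _ in range(10_000)]

mean = statistics.mean(data)
stdev = statistics.stdev(data)

# Count how many observations fall within plus or minus three standard deviations.
lower, upper = mean - 3 * stdev, mean + 3 * stdev
within = sum(lower <= x <= upper for x in data)

print(f"mean = {mean:.1f}, standard deviation = {stdev:.1f}")
print(f"{within / len(data):.1%} of observations fall within 3 standard deviations")
```

For a truly normal distribution the theoretical figure is about 99.7%, which is why the handful of points outside that spread stood out as outliers.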
While their statistical system was working pretty well for doing what it was designed to do…objectively describing a complex system like a country…they noticed that sometimes the stuff they were measuring or counting didn’t fall within their convenient six standard deviation spread. They decided to call these outliers and not to pay too much attention to them. After all, there were only a few of them at most.
The discipline of statistics was well on its way. Some mathematicians began calling themselves statisticians and they began doing much more sensitive analysis of other distributions beyond the normal one that started the whole thing. They even figured out how to estimate central tendencies and variabilities based on just a sample of all the stuff that was to be analyzed. The next step along the way, however, took a real turn.
In the 1920s, Dr. Walter Shewhart, a physicist, wanted to use statistics to help Bell Labs control the quality of the products they were producing. Shewhart started tracking product measurements over time on a run chart. He noticed that sometimes things went haywire, and an adjustment would have to be made in the production process. One of the primary signals for when things went haywire was that outliers would appear. Rather than ignore the outliers, Shewhart found that he needed to react to the outliers to bring the process back to where it needed to be.
He also noticed that sometimes the workers would make adjustments intended to improve the process that, in fact, made it worse. He decided that if he could find a balance between these two kinds of errors, he could make the best products with the least effort. He and many others began conducting experiments to figure out the balance between these two kinds of errors (adjusting a process when nothing significant had changed and not making an adjustment when something weird happened that required a process adjustment). As a result of these experiments, Shewhart invented a new statistical tool, the control chart, and a new purpose for statistics. This new kind of statistics proved so valuable that a colleague of Shewhart’s, Dr. W. Edwards Deming, took it beyond the walls of the Bell Labs.
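Shewhart's balance between the two kinds of errors is embodied in the control chart's three-sigma limits: react to points outside the limits, leave the process alone otherwise. The sketch below uses invented widget measurements and one common construction, the individuals (XmR) chart, which estimates spread from the average moving range; the d2 constant 1.128 is standard for that chart, but everything else here is hypothetical:

```python
# Hypothetical widget diameters in mm; the final measurement has gone haywire.
measurements = [10.02, 9.98, 10.01, 9.97, 10.03, 10.00, 9.99, 10.02, 9.98, 10.45]

center = sum(measurements) / len(measurements)

# Estimate process spread from the average moving range (successive
# differences), divided by the d2 constant 1.128 for subgroups of size 2.
moving_ranges = [abs(b - a) for a, b in zip(measurements, measurements[1:])]
sigma_hat = (sum(moving_ranges) / len(moving_ranges)) / 1.128

# Shewhart's control limits: centre line plus or minus three estimated sigmas.
ucl = center + 3 * sigma_hat
lcl = center - 3 * sigma_hat

# Points outside the limits signal a process change worth reacting to;
# points inside them are best left alone.
out_of_control = [(i, x) for i, x in enumerate(measurements) if not lcl <= x <= ucl]
print(f"centre = {center:.3f}, LCL = {lcl:.3f}, UCL = {ucl:.3f}")
print("out-of-control points:", out_of_control)
```

Only the last point falls outside the limits, so only it calls for an adjustment; tampering with the process over the ordinary wiggle of the other nine points would be the first kind of error Shewhart warned about.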
During World War II, Deming and others taught Shewhart’s techniques to the folks producing weapons for the war effort in America. They significantly helped improve the quality of American armaments while helping keep the cost of scrap and rework down. Deming called this new kind of statistics analytic studies as opposed to the descriptive and inferential studies that had been developed earlier to simply and objectively describe a system. The purpose of analytic studies was to guide decisions that would influence the future, a very different but very important purpose.
The next step in the evolution of statistics is called Big Data.
One of the things that occurred as we used statistics to describe and improve systems was to gather extra data just in case we needed it. Sometimes we did need it, but more often its existence caused us to do analyses that took us off track from our intended purpose. We decided to be more disciplined about when and where to analyze that “just in case” data, but we kept the data because we still might need it at another time. Now Big Data is helping us use that data, because that data is in our computer systems and our social media. Big Data taps into that data and, in some cases, creates new data. Although I am not a Big Data guy, Cathy O’Neil is.
O’Neil’s book, Weapons of Math Destruction, indicates that Big Data looks for patterns and correlations. Because Big Data is big…and fast, it can seemingly find cause-effect relations quickly and decisively. For example, if I want to increase sales at my restaurant, I can search the database of my customers for characteristics or behaviors that seem common among my customers. I might find that most of my customers live within 10 miles of my restaurant and make $40,000-$60,000 a year. I can then focus my advertising and promotions on people with those characteristics. There are, however, some Big Problems with Big Data.
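The restaurant query above amounts to a simple filter over customer records. Here is a toy sketch; the names, distances, and incomes are all invented for illustration:

```python
# Made-up customer records for the restaurant example.
customers = [
    {"name": "A", "miles_away": 3,  "income": 52_000},
    {"name": "B", "miles_away": 25, "income": 48_000},
    {"name": "C", "miles_away": 8,  "income": 41_000},
    {"name": "D", "miles_away": 5,  "income": 75_000},
    {"name": "E", "miles_away": 9,  "income": 58_000},
]

# Select the segment to target: within 10 miles and $40,000-$60,000 a year.
target = [
    c for c in customers
    if c["miles_away"] <= 10 and 40_000 <= c["income"] <= 60_000
]
print([c["name"] for c in target])
```

The filter itself is trivial; the Big Data questions are about what happens when such segments are built from thousands of opaque variables and then used to make decisions about people.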
When using analytic studies, correlations like the one above are usually used to verify cause-effect relations that we already believe to be true. We all know about correlations that do not indicate cause-effect relations. There is the old example that the more ice cream cones people eat in NYC, the higher the murder rate. That is a correlation, but not a cause-effect relationship. Big Data establishes correlations without necessarily establishing cause-effect. In analytic studies, we usually substitute scatter plots for correlation analyses because they are easier to understand. The analyses Big Data conducts are usually opaque because the firms that use them consider their analysis techniques proprietary; they believe that it gives them a strategic advantage over their competitors. Weapons of Math Destruction expands on this shortfall and explains other shortfalls. It further summarizes the dangers of Big Data:
- A Big Data model frequently does not clearly disclose its purpose, inputs, analysis techniques, outputs, and ability to test its own veracity and learn from that feedback. Clients and others usually know virtually nothing about how the Big Data supplier came up with its recommendation or how to test its success.
- Big Data sometimes harms the very people it is analyzing. A few examples will follow.
- Big Data is sometimes scalable in such a way that, while it may work in one application, it may cause major harm when applied in others.
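The ice-cream-and-murder-rate correlation mentioned above can be reproduced numerically: when two series share a hidden driver (hot weather, say), they correlate strongly with each other even though neither causes the other. All the numbers below are fabricated for illustration:

```python
import random

# Invented data: hot weather drives both ice cream sales and street incidents.
random.seed(1)
temperature = [random.uniform(30, 95) for _ in range(200)]
ice_cream_sales = [2.0 * t + random.gauss(0, 10) for t in temperature]
incidents = [0.5 * t + random.gauss(0, 5) for t in temperature]

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Sales and incidents correlate strongly, yet neither causes the other --
# the shared driver (temperature) produces the whole relationship.
print(f"r(sales, incidents) = {pearson(ice_cream_sales, incidents):.2f}")
```

A scatter plot of sales against incidents would show the same tight cloud, which is exactly why a correlation coefficient alone, however quickly computed, cannot establish cause and effect.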
One set of examples is how Big Data can use zip codes or other geographic descriptors to concentrate police resources, resulting in more arrests, frequently for minor offenses that send more citizens to jail, make it harder for them to get jobs, and therefore increase the crime rate. Big Data has created similar death spirals around creditworthiness, employability, and educational success. Big Data has been used to rank colleges by “excellence.” We all know that excellence is an emergent property, one of Dr. Deming’s “most important” characteristics that are “unknown and unknowable.” Big Data created surrogate metrics to simulate “excellence” without really knowing the quality of that simulation. The colleges figured out the algorithm and spent resources to game the system, knowing full well that there was no real cause-and-effect relationship between the things the money was spent on and the education the students got.
In one Big Data application, FICO scores were used as a surrogate for trustworthiness, a desired employability attribute. If a person had a low FICO score, perhaps because of some short-term medical or other personal emergency, they had trouble getting a job as a result of a Big Data model. If they couldn’t get a job, their FICO score declined…another death spiral.
Another example revolved around the “A Nation at Risk” report criticizing K-12 teachers because U.S. SAT scores were going down. Big Data forgot to factor in the fact that a higher proportion of students had begun taking the SAT. We are still paying a price for that Big Data error.
In Weapons of Math Destruction, O’Neil argues that, left unchecked, Big Data can increase inequality and even threaten our democracy. I think she may be right. We are the statistical experts. Even though the Big Data providers will likely tell us that their black box is proprietary and too complex for us to understand, we must tell them that if that is so, it may be too dangerous for us to use. We are all smart enough to understand it and, I hope, ethical enough to keep their analytical secrets. I think Big Data can be a valuable tool when used with analytic studies to improve systems, but also that it needs to be transparent, to do no harm, and to be applied at an appropriate scale.
As always, I treasure your thoughts and questions.