The Limits of Big Data

But introduce a second variable, for example, economic conditions in early development, and the problem gets interesting. It is interesting precisely because there is a hidden attractor that is clustering the larger population. Suddenly the graph falls into two or perhaps three different populations, each with its own peak and distribution. (Remember the technical sales rep’s magic graph.) Economic status and achievement scores are not independent variables; they are dependent on one another. Our new model is "economic well-being in early schooling causes more students to get higher ability scores." The "model" is both a statement of our conclusion and a prediction of what we would find if we looked at more people, that is, at larger data sets. The model tells us what we can do with the Big Data set. Almost everything interesting to find in Big Data is about discovering dependencies and turning them into a model of how things are and will be.

Not all dependencies are causal. Often we believe that if things are related and dependent then one causes the other. ‘Poor economic conditions cause students to perform poorly in school.’ That is not an easy statement to verify in the mathematics of Big Data. Things can be dependent because they are both the result of yet another hidden mover which causes each to different degrees. In our example above, what happens when motivation is added as a third variable? Many things interact to cause many different things: this is a simple description of "complexity".
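The point about a hidden mover can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a method from the article: the names `hidden`, `a`, and `b`, the weights 0.8 and 0.5, and the noise level are all invented for illustration. Two quantities that never influence each other still correlate because both depend on the same hidden variable, and the correlation largely disappears once that variable's contribution is subtracted out.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical hidden mover that drives both observed quantities.
hidden = [rng.gauss(0, 1) for _ in range(n)]

# Each observed quantity depends on the hidden mover to a different
# degree (assumed weights 0.8 and 0.5) plus its own independent noise.
a = [0.8 * h + rng.gauss(0, 0.5) for h in hidden]
b = [0.5 * h + rng.gauss(0, 0.5) for h in hidden]

# a and b are correlated even though neither causes the other.
raw_corr = pearson(a, b)

# Remove the hidden mover's contribution and correlate what is left.
resid_a = [x - 0.8 * h for x, h in zip(a, hidden)]
resid_b = [y - 0.5 * h for y, h in zip(b, hidden)]
resid_corr = pearson(resid_a, resid_b)

print(f"correlation of a and b:           {raw_corr:.2f}")
print(f"after removing the hidden mover:  {resid_corr:.2f}")
```

In real data the hidden mover is not observed and its weights are unknown, which is exactly why the causal statement is hard to verify from the correlations alone.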

The second component of the mathematical definition of truth is very subtle. It is the probable correctness of the model itself. A model allows you to ask questions of the data. This is equivalent to manipulating the graph parameters in our original anecdote.

It is possible to find correlations among data using a false premise. Elaborate models can be built that "explain" these dependencies. This is an issue: the questions you ask and the answers you expect can color your interpretations. The history of science is an evolution away from false models toward better ones. This is the process Kuhn described as a scientific revolution. If the model is wrong, it undermines the truth of the correlations found with that model. You can take the wrong business approach and get results wildly different from what you expect. You can confidently change a parameter in a router configuration file and destabilize the whole network. But determining whether a model is wrong is hard.

Many questions turn out to be nonsense, but this is interesting. Most likely the question is nonsense because the model itself was wrong: it had a low probability of being true. It is very hard to determine this before experiencing a large number of predictive failures with the bad model. When this occurs, it is "back to the drawing board" to create a new, better model. One rule of thumb in evaluating the truth, or accuracy, of a model is to examine the outlier data. In real data sets, some data just does not fit the value ranges predicted by the model. This data lies well off the curve of the graph. If it is randomly distributed, then it is likely noise and might not contradict the existing interpretation of the data. However, if it is clustered off by itself, it points to something the model has not addressed properly. Examining outliers and adjusting the model to account for them leads to truer models.
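The outlier rule of thumb can be turned into a crude check. The sketch below is one possible heuristic, not an established test: the function name `outlier_concentration`, the threshold, the bin count, and the toy residuals are all assumptions made for illustration. It takes residuals (observed value minus the model's prediction, computed elsewhere), flags those beyond a threshold, and reports what fraction of the flagged points fall in the single busiest region of the x-axis.

```python
def outlier_concentration(xs, residuals, threshold, bins=10):
    """Fraction of outliers that land in the single busiest x-bin.

    Low values suggest outliers scattered across the range (plausibly
    noise); values near 1.0 suggest a cluster the model does not explain.
    """
    outlier_xs = [x for x, r in zip(xs, residuals) if abs(r) > threshold]
    if not outlier_xs:
        return 0.0
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all xs are equal
    counts = [0] * bins
    for x in outlier_xs:
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    return max(counts) / len(outlier_xs)


xs = list(range(100))

# Toy case 1: five outliers spread across the range, on both sides.
scattered = [0.0] * 100
for i, sign in zip([5, 27, 49, 71, 93], [1, -1, 1, -1, 1]):
    scattered[i] = 5.0 * sign

# Toy case 2: five outliers bunched together, all on the same side.
clustered = [0.0] * 100
for i in range(60, 65):
    clustered[i] = 5.0

scattered_score = outlier_concentration(xs, scattered, threshold=3.0)
clustered_score = outlier_concentration(xs, clustered, threshold=3.0)

print("scattered outliers:", scattered_score)
print("clustered outliers:", clustered_score)
```

A high score is only a prompt to look closer; deciding what the model has missed still takes judgment about the data.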

Still other questions have no answers and that itself leads to understanding of better models: “Which came first: The data or the math?” Existential questions help us understand the underlying nature of something which is not readily addressed by commonly used language. Following the analysis approach used in Big Data, neither the chicken nor the egg came first. Controlling for time, they are the same thing in different oscillating states of varying information content.

How Big is Big?

Today we are inundated with articles and headlines describing the explosion of data that our civilization is capturing. For example, IBM estimates that “Every day, we create 2.5 quintillion bytes of data — so much that 90 percent of the data in the world today has been created in the last two years alone.” IBM appears to be talking about "data at rest"; that is, data placed in some form of storage. There is no way for us to comprehend this directly. This quantity, 2.5 quintillion bytes, seems huge, but it is only a fraction of the total information content contained in any simple thing in your environment. Reality is still way denser than data.

It should be well understood that digital network traffic is simply transmitted information, so-called "data on the move". Cisco relies on characterization statements to aid comprehension: “Globally, IP traffic will reach an annual run rate of 2.3 Zettabytes in 2020, up from an annual run rate of 870.3 Exabytes in 2015. Globally, IP traffic will reach 25 Gigabytes per capita in 2020, up from 10 Gigabytes per capita in 2015. Globally, average IP traffic will reach 592 Tbps in 2020, and busy hour traffic will reach 3.2 Pbps.” These characterizations of the data represent both a static measurement for a defined interval (2015) and a projection for a future interval (2020). Together they form a model. This gives us a mathematical formula for an expanding progression: in this case a compound annual growth rate of 22 percent. We, as strategists and executives, understand what to do about a growth rate.
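The growth rate can be checked against the two run rates in the Cisco quote. The sketch below uses the standard compound annual growth rate formula, (end / start)^(1 / years) - 1; the function and variable names are invented for illustration. Because the quoted 2.3 Zettabytes is itself a rounded figure, the rate implied by the two endpoints comes out slightly under the quoted 22 percent.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def project(start_value, rate, years):
    """Value after compounding `rate` annually for `years` years."""
    return start_value * (1.0 + rate) ** years


# Annual run rates from the Cisco quote, both expressed in exabytes.
traffic_2015_eb = 870.3
traffic_2020_eb = 2300.0  # 2.3 zettabytes = 2300 exabytes

rate = cagr(traffic_2015_eb, traffic_2020_eb, years=5)
print(f"CAGR implied by the two endpoints: {rate:.1%}")

# Going the other way: compound the quoted 22 percent from 2015.
projected_2020_eb = project(traffic_2015_eb, 0.22, years=5)
print(f"2015 run rate grown at 22% for 5 years: {projected_2020_eb:.0f} EB")
```

The same two functions work for any pair of endpoints, which is what makes a growth rate a usable model: it turns one measurement and one assumption into a forecast that can later be compared with what actually happened.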
