The Limits of Big Data

By: Wedge Greene

A Particularly Hard Problem

Big Data. What is it, and what can we do with it? This is a practical imperative we must answer in our ICE business environment. We gather huge amounts of data because we can, yet much of it may turn out to be of no value because we don’t know what questions to ask. This is unacceptable. So whom do you need to hire, and what tools do they need, to bring value from Big Data? To get these answers, I must ask you to take an intellectual walk with me…

I was introduced to the current market trend of Big Data analysis almost 10 years ago, when a fledgling startup asked us for help creating a value message for their company. They were expanding from finance into the telecom industry; there was then no ‘Big Data’ marketing term. The two-person team included an accomplished technical sales rep. We watched as he manipulated complex 3D graphics, teasing hidden information from a large data set of many variables. He seemed to discover new trends and associations just by changing, or as they described it, “playing with,” the plotting parameters. New discoveries occurred almost faster than I could follow. Of course, it was prestidigitation: well-practiced skill and familiarity with both the product and the information present in the data set. The product was not magic - but the technical rep was a stage magician.

All the complex analysis must have been done beforehand, because Big Data is a seriously hard subject. Teasing out hidden signals from a realistic, noisy data set calls for complex analysis, and not every signal that emerges is true, or, rarer still, meaningful. Performing an accurate and "true" (in the technical sense of the mathematics) analysis can lead to original insights contained in the behavior of Big Data clusters - sometimes quite profound revelations. Before Big Data entered the commercial market, it was the tool of physics, geology, biology, and social science, with less rigorous side trips into economics. In science, teasing out a meaningful hidden signal in the data was always worth a peer-reviewed publication, sometimes even a prize. For example, finding and confirming a new subatomic particle is a Big Data problem. In business, it can mean new tools to leverage markets.

Today we are pitched Big Data analysis products that can “cross reference” data sets from different company domains and find important information that will let executives (or sales directors, or NOC engineers) find a new market opportunity, or diagnose a difficult, persistent service issue. Products today can actually do this. However, any product manager or salesperson who attempts to convince you that Big Data is simple or easily exploited is lying by act, omission, or ignorance. It is important you understand how the real magic is done, because then you can understand the limitations of what can be done and recognize new insights. Key to this is understanding what ‘true’ means in the mathematics of Big Data.


True conclusions occur because, or I could say when, data is dependently related to other data. Something is true when two fundamental things occur. First: the conclusion, or discovery, is not likely to have been caused by chance. This is the confidence interval; you see it as the "degree of error" in a political poll, or the percentage of certainty ("mistaken in less than one chance in a thousand") in a physics discovery. In Bayesian terms, this is the likelihood: the probability of the data given that the model is true. It is the first term in a formal definition of truth.
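The "not likely to have been caused by chance" test above can be sketched with a small simulation. The sketch below is illustrative only; the scenario (a fair coin, 100 flips, a 60-heads cutoff) is my assumption, not an example from the article. It estimates how often a seemingly striking result arises by pure chance, which is exactly the quantity a confidence level summarizes.

```python
import random

random.seed(0)

def chance_of_extreme(n_flips=100, threshold=60, trials=10_000):
    """Estimate how often a fair coin shows >= threshold heads
    in n_flips purely by chance, by simulating many experiments."""
    hits = 0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 for _ in range(n_flips))
        if heads >= threshold:
            hits += 1
    return hits / trials

p = chance_of_extreme()
# p is small (around 3%): 60 heads out of 100 looks dramatic,
# yet chance alone produces it a few times per hundred tries.
# A discovery is only "true" in this first sense when such a
# chance explanation is far less likely than the claimed effect.
```

The smaller this chance probability, the more confident we can be that a pattern found in a large data set is signal rather than noise.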

This extends from the statistics you took in early college. Almost everyone today understands a Gaussian curve - at least in its common parlance as the bell curve. For example, it describes grade scores on a test, or the classification of ability in an IQ test: a graph of a population against the value differences of a specific variable. Similarly, it describes the distribution of monthly sales totals across a population of salespeople. This normal distribution works to demarcate relative value across a single variable against a population. People are the population and their score is the variable; the group of all test-takers is related to the group of all scores.
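The population-against-variable idea above can be made concrete with a short sketch. The numbers are illustrative assumptions on my part (an IQ-style scale with mean 100 and standard deviation 15), not data from the article; the point is how a normal distribution demarcates relative standing.

```python
import random
import statistics

random.seed(0)

# A hypothetical population of 10,000 test-takers; each person's
# score is one sample of the single variable being measured.
scores = [random.gauss(100, 15) for _ in range(10_000)]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)

# The bell curve demarcates relative value: roughly 68% of the
# population falls within one standard deviation of the mean,
# so a score outside that band already marks an unusual member.
within_one_sd = sum(mean - sd <= s <= mean + sd for s in scores) / len(scores)
```

Knowing where a score sits on this curve is what turns a raw number into a statement of relative value across the population.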

