The Big Data Hubris

By: Jesse Cryderman

Big data has been with us for years; it just hasn't always gone by that name. Back in 2008, when smartphones were just beginning their meteoric rise in popularity, Google debuted a big data tool that was heralded as a poster child for the technology: Google Flu Trends (GFT). The tool tracked 45 flu-related search terms across billions of searches, monitoring trends and making correlations to predict flu outbreaks and their severity. Improving healthcare with smart number crunching--what's not to love? Well, a recent paper in Science pointed out a rather large un-lovable: GFT is nearly always wrong, and often by more than 50 percent.

According to the paper, “GFT overestimated the prevalence of flu in the 2012–2013 season and overshot the actual level in 2011–2012 by more than 50%. From 21 August 2011 to 1 September 2013, GFT reported overly high flu prevalence 100 out of 108 weeks.”

The authors of the paper, themselves proponents of big data solutions, did not approach the research with an agenda. In fact, they found that simply projecting the recent trend in C.D.C. reports of influenza-like illness from doctors, which lag by two weeks, would have been a more accurate predictor than Google Flu Trends.

This highlights a trend some have called the "hubris of big data."

The ability to capture mountains of data in real time doesn't automatically translate into value or predictive power; sometimes it translates into excessive cost and incorrect guidance. Here are some common pitfalls, and the lessons that can be learned from them.

Using one data source

In the case of GFT, a major problem was reliance on a single source--Google searches--as the foundation for analysis. “The mash-up is the way to go,” said David Lazer, the paper's lead author. His analysis shows that combining Google Flu Trends with C.D.C. data, and applying a few tweaking techniques, works best.
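As a rough illustration of why a mash-up helps, the Python sketch below (with entirely invented numbers, not real flu data) blends two imperfect views of a toy prevalence series--a noisy real-time signal and a steadier report modeled as independent noise--and compares their errors against the truth:

```python
import math
import random
import statistics

random.seed(1)

WEEKS = 500

# A toy "true" flu-prevalence series (arbitrary units).
truth = [2 + math.sin(2 * math.pi * w / 52) for w in range(WEEKS)]

# Two imperfect views of it: a noisy real-time search signal, and a
# steadier CDC-style report. Both are modeled here as truth plus
# independent noise; the noise levels are illustrative, not measured.
search = [t + random.gauss(0, 0.4) for t in truth]
cdc = [t + random.gauss(0, 0.3) for t in truth]

# The "mash-up": a simple 50/50 blend of the two sources.
blend = [(s + c) / 2 for s, c in zip(search, cdc)]

def mae(est):
    """Mean absolute error of an estimate series against the truth."""
    return statistics.fmean(abs(e - t) for e, t in zip(est, truth))

print(f"search-only MAE: {mae(search):.3f}")
print(f"CDC-only MAE:    {mae(cdc):.3f}")
print(f"blended MAE:     {mae(blend):.3f}")
```

Averaging independent errors shrinks their variance, which is the statistical intuition behind combining sources; the recalibration techniques Lazer describes are of course more sophisticated than this 50/50 blend.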

Effectively collecting, processing, and embedding both structured and unstructured data into daily operations is key to accelerating business. Accuracy and availability are critical: a decision based on incorrect, incomplete, or missing information can put a business at risk. GFT relied on one data source--user searches--and Google's search algorithms change depending on the user, the advertising, and other factors, so using that tool as the baseline for data collection didn't reflect reality.

Don't force data to fit

Another major issue with GFT was the way data was shoehorned into categories without enough consideration of causal relationships. “They overfit the data. They had fifty million search terms, and they found some that happened to fit the frequency of the ‘flu’ over the preceding decade or so, but really they were getting idiosyncratic terms that were peaking in the winter at the time the ‘flu’ peaks … but wasn’t driven by the fact that people were actually sick with the ‘flu’,” Lazer says.

3 Big Data Myths in Telecommunications

Myth 1: You need all of the data

The foundations of statistical analysis did not change when hard drives became cheaper and more capacious. Storing all data from all sources is costly and usually unnecessary: fractional data sampling is nearly always as accurate as wholesale collection, although many vendors don't want you to believe this.
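A quick sketch of that claim, using made-up data: a 1% random sample of a million-record metric recovers the full-dataset mean to within a few percent, at 1% of the storage.

```python
import random
import statistics

random.seed(0)

# A "full" dataset: one metric (say, session duration in seconds) for a
# million subscribers. The distribution here is purely illustrative.
population = [random.expovariate(1 / 300) for _ in range(1_000_000)]

full_mean = statistics.fmean(population)

# A 1% simple random sample.
sample = random.sample(population, 10_000)
sample_mean = statistics.fmean(sample)

error = abs(sample_mean - full_mean) / full_mean
print(f"full mean = {full_mean:.1f}s, 1% sample mean = {sample_mean:.1f}s, "
      f"relative error = {error:.2%}")
```

The sampling error shrinks with the square root of the sample size, not the population size, which is why a modest random sample answers most aggregate questions as well as the whole archive does.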

