The Big Data Hubris


Big data has been with us for years; it just hasn't always been called that. Back in 2008, when smartphones were just beginning their meteoric rise in popularity, Google debuted a big data tool that was heralded as a poster child for the technology: Google Flu Trends (GFT). The tool tracked 45 flu-related search terms over billions of searches, monitoring trends and drawing correlations to predict flu outbreaks and their severity. Improving healthcare with smart number crunching--what's not to love? Well, a recent paper in Science pointed out a rather large un-lovable: GFT has overestimated flu prevalence almost every week in recent seasons, at times by more than 50%.

According to the paper, “GFT overestimated the prevalence of flu in the 2012–2013 season and overshot the actual level in 2011–2012 by more than 50%. From 21 August 2011 to 1 September 2013, GFT reported overly high flu prevalence 100 out of 108 weeks.”

The authors of the paper, themselves proponents of big data solutions, did not approach the research with an agenda against the technology. Even so, they found that simply using the recent trend of C.D.C. reports from doctors on influenza-like illness, which lag by two weeks, would have been a more accurate predictor than Google Flu Trends.

This highlights a trend some have called the "hubris of big data."

The ability to capture mountains of data in real time doesn't immediately translate into value or predictive power; in fact, sometimes it translates into excessive cost and incorrect guidance. Here are some common pitfalls, and some lessons that can be learned.

Using one data source

In the case of GFT, a major problem was reliance on a single source--Google searches--as the foundation for analysis. “The mash-up is the way to go,” said David Lazer, the lead author of the Science paper. His analysis shows that combining Google Flu Trends with C.D.C. data, and applying a few tweaking techniques, works best.

Effectively collecting, processing, and embedding both structured and unstructured data into daily operations is key to accelerating business. Accuracy and availability are critical; a decision based on incorrect, incomplete, or missing information can put your business at risk. GFT drew on a single source, user searches, and what people search for is shaped by Google's own search algorithms, which change with the person, the advertising, and other factors. Using search behavior as the baseline for data collection therefore didn’t reflect reality.
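To make the "mash-up" idea concrete, here is a minimal Python sketch of blending two imperfect sources. It is not the method used in the Science paper, whose "tweaking techniques" are not described here. All data is synthetic, and the names `blend_weight` and `blend_demo`, the 1.3 bias factor, and the noise levels are assumptions chosen purely for illustration.

```python
import random

def mse(pred, truth):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

def blend_weight(a, b, truth):
    """Least-squares weight w for the blend w*a + (1 - w)*b."""
    num = sum((t - y) * (x - y) for x, y, t in zip(a, b, truth))
    den = sum((x - y) ** 2 for x, y in zip(a, b))
    return num / den

def blend_demo(weeks=104, seed=0):
    rng = random.Random(seed)
    # Synthetic weekly "true" flu level: a random walk floored at zero.
    truth = [5.0]
    for _ in range(weeks + 1):
        truth.append(max(0.0, truth[-1] + rng.gauss(0, 1)))
    actual = truth[2:]    # what we want to estimate each week
    lagged = truth[:-2]   # stand-in for official reports that are two weeks old
    # Stand-in for a search-based estimate: timely, but noisy and biased high.
    search = [1.3 * t + rng.gauss(0, 1) for t in actual]
    w = blend_weight(search, lagged, actual)
    blended = [w * s + (1 - w) * l for s, l in zip(search, lagged)]
    return {
        "weight_on_search": w,
        "mse_search": mse(search, actual),
        "mse_lagged": mse(lagged, actual),
        "mse_blended": mse(blended, actual),
    }

if __name__ == "__main__":
    for name, value in blend_demo().items():
        print(f"{name}: {value:.3f}")
```

Because the weight is fitted on the same weeks it is scored on, the blended error can never exceed the error of either source alone in this sketch. A real evaluation would fit the weight on past weeks and score it on later ones.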

Don't force data to fit

Another major issue with GFT is the way data was shoehorned into categories without enough consideration for causal relationships. “They overfit the data. They had fifty million search terms, and they found some that happened to fit the frequency of the ‘flu’ over the preceding decade or so, but really they were getting idiosyncratic terms that were peaking in the winter at the time the ‘flu’ peaks … but wasn’t driven by the fact that people were actually sick with the ‘flu’,” Lazer says.
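The overfitting Lazer describes can be reproduced in a few lines. The Python sketch below is an illustration under stated assumptions, not a reconstruction of how GFT selected its terms: it generates purely random "search term" series, picks the one that best matches a random target over a training window, and then checks that same series on held-out weeks. The function name `best_spurious_fit` and the window lengths are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def best_spurious_fit(n_candidates, n_train=20, n_test=20, seed=0):
    """Select the random series that best matches the target in training,
    then report its correlation on the training and held-out windows."""
    rng = random.Random(seed)
    n = n_train + n_test
    target = [rng.gauss(0, 1) for _ in range(n)]  # stand-in for flu activity
    best_train, best_test = 0.0, 0.0
    for _ in range(n_candidates):
        cand = [rng.gauss(0, 1) for _ in range(n)]  # unrelated "search term"
        r_train = pearson(cand[:n_train], target[:n_train])
        if abs(r_train) > abs(best_train):
            best_train = r_train
            best_test = pearson(cand[n_train:], target[n_train:])
    return best_train, best_test

if __name__ == "__main__":
    for k in (10, 100, 5000):
        train_r, test_r = best_spurious_fit(k)
        print(f"{k:>5} candidates: train r = {train_r:+.2f}, held-out r = {test_r:+.2f}")
```

The more candidates are searched, the stronger the best in-sample match becomes, even though none of the series has any real relationship to the target. The held-out correlation is the honest measure, and for unrelated series it typically falls back toward zero.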

3 big data myths in telecommunications

Myth 1: You need all of the data

The foundations of statistical analysis did not change when hard drives became cheaper and more capacious. Storing all data from all sources is costly and frequently unnecessary. A properly drawn random sample is nearly always accurate enough to stand in for wholesale collection, because a sample's precision depends mainly on its size rather than on the size of the population it is drawn from, although many vendors don't want you to believe this.
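A short simulation shows why sampling holds up. This Python sketch assumes a hypothetical population of 200,000 records in which about 3% carry some attribute of interest (say, a dropped call), then compares the rate measured over every record with the rate measured over a 5% random sample. The function name, population size, and rate are all made up for illustration.

```python
import random

def sample_vs_full(population_size=200_000, sample_size=10_000,
                   true_rate=0.03, seed=0):
    """Compare a rate computed over all records with a random-sample estimate."""
    rng = random.Random(seed)
    # Hypothetical records: 1 means the record has the attribute of interest.
    population = [1 if rng.random() < true_rate else 0
                  for _ in range(population_size)]
    full_rate = sum(population) / population_size
    # Simple random sample without replacement.
    sample = rng.sample(population, sample_size)
    sample_rate = sum(sample) / sample_size
    # Approximate standard error of the sampled proportion.
    std_err = (sample_rate * (1 - sample_rate) / sample_size) ** 0.5
    return full_rate, sample_rate, std_err

if __name__ == "__main__":
    full, sampled, se = sample_vs_full()
    print(f"full: {full:.4f}  sample: {sampled:.4f}  std err: {se:.4f}")
```

The standard error shrinks with the square root of the sample size, so storing and processing twenty times more records buys comparatively little extra precision. The caveat is that the sample must be drawn at random; a convenience sample can be biased no matter how large it is.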
