Big Data and Machine Learning

ORDER REPRINTS DOWNLOAD COMMENT DISCUSS SHARE

The current and future big social media data has its own language of expressing opinions and sentiments.

Learning from the brain

Another development in the machine learning area is the development of learning algorithms which are essentially the models of the neocortex. The neocortex is the organ of intelligence in the brain. Some of the learning algorithms developed can faithfully capture how layers of neurons in the neocortex learn. At the heart of this is the Cortical Learning Algorithm (CLA), which is based on the following principles:

Sparse Distributed representations: These are the representation used in the brain. It means that only a sparse or few neurons are active out of the large population and the activation of these distributed neurons is required in order to represent something.
Temporal inference: The algorithm learns by finding patterns and then sequences of patterns. It looks for patterns which occur often together (also known as spatial patterns) and how these spatial patterns occur over time which is called the temporal pattern.
On-line learning: The algorithms learn from each new input. There is no separate learning phase from an inference phase.

These algorithms are being used to predict future values as well as find anomalies in the data. A real challenge would be to apply algorithms to big data and evaluate their performance. This would answer whether or not a brain model, which works well in case of limited data, can work in accurately classifying big data.

Pre-processing

The current and future big social media data has its own language of expressing opinions and sentiments. This language has to be interpreted first to analyze the orientation of the human sentiment towards a product or service. We expect to see lot of development in the pre-processing of the data so that the input to the machine learning is more correct. For example, “bad” might mean “good” in a certain context (think Michael Jackson).

Deep learning

Another area which is gaining importance is representation learning in which data has to be represented correctly for machine learning. The complexity of Big Data which is due to high dimensionality, large number of classes and their unstructured elements throws a challenge for success of machine learning algorithms. With respect to this, a lot of work is being done in deep learning. Deep learning methods attempt to learn in multiple levels, corresponding to different levels of abstraction. These methods, such as deep belief networks, sparse coding-based methods, convolutional networks, deep Boltzmann machines, and dropout have shown promise as a means of learning invariant representations of data and have already been successfully applied to a variety of tasks in computer vision, audio processing, natural language processing, information retrieval, robotics, drug design and finance.

Cutting the noise

Big data brings along with it noisy data. Take the case of functional magnetic resonance image (fMRI) data which is very high dimensional. Every day the fMRI machines keep scanning the human brains and constantly produce new imaging data. The fMRI images are "noisy" due to technological limitations and possible head motion of the subjects. Analyzing or treating such noisy data is a challenge in the pre-processing step for machine learning. In the area of genomics, experiments are conducted in different laboratories leading to misalignment of data and missing values. Identifying outliers from this data becomes tricky. The treatment of missing values is yet another challenge in the pre-processing step for machine learning.

The field is still maturing

As these challenges illustrate, machine learning as a field is still not mature. Although the research is promising and the use cases are logical, it will take considerable efforts to realize the predictive power of machine learning in the big data domain. Machine learning cannot be simply picked up by researchers and applied in any domain.

For these reasons, a data scientist has never been more important to an organization, and the creation of a data team is an excellent strategy. The proper problem formulation, selection of features, choice of algorithms, choice of the statistics behind the algorithms and the problem have to be decided by an expert in this field—this is not an easy task. Thus these skills are multidisciplinary and they require time, experience, intelligence, and inquisitiveness to develop.