Data Smashing

Data smashing applications. Pairwise distance matrices, identified clusters and three-dimensional projections of Euclidean embeddings for epileptic pathology identification (a), identification of heart murmur (b) and classification of variable stars from photometry (c). In these applications, the relevant clusters are found unsupervised. Credit: Ishanu Chattopadhyay, Hod Lipson

The availability of huge data streams is pushing researchers to find innovative ways to analyse their content, compare it and derive meaning. This is not easy at all.

So far, data mining approaches have been based on a hunch from a human being, which is then coded into an algorithm to see if it is true. Data are analysed to prove or disprove a certain hypothesis. As an example, you may want to see if the pattern of buying a product is related to some ads broadcast on different media and in different places. This works pretty well, and much market research is based on such algorithms.

The same goes for assessing the effectiveness of certain drug combinations in fighting a disease, or for capturing the first signs of an epidemic.

However, we don't have algorithms that can point out something they were not programmed to look for. They lack the serendipity of our eyes (and brain).

This is what researchers at Cornell University and the University of Chicago set out to change.

In a nice paper, they propose a new technique, which they call data smashing, for discovering unexpected information.

The approach is based on the idea that if the same data recur in different streams, you can annihilate the recurrences, and what you are left with is genuinely new information. By applying data smashing along with the classical analyses used in big data (where the point is to look for patterns), you get both the similarities and the differences. Out of this, patterns may emerge.

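They have tested their hypothesis on a number of data streams, and indeed they have been able to discover patterns that were not coded into the algorithms. The figure presents three examples: identification of epileptic pathology, identification of heart murmur and classification of variable stars from photometry.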
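To make the pipeline in the figure a bit more concrete (pairwise distance matrix first, then clusters found without supervision), here is a minimal Python sketch. It is not the authors' algorithm: the distance used here is a simple stand-in (total variation distance between the k-gram frequencies of two symbol streams), the grouping is a plain threshold rule, and all function names, the toy streams and the threshold value are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def kgram_distribution(stream, k=2):
    """Empirical frequencies of the overlapping k-grams in a symbol stream."""
    counts = Counter(stream[i:i + k] for i in range(len(stream) - k + 1))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def stream_distance(a, b, k=2):
    """Stand-in distance (NOT the smashing metric): total variation
    distance between the k-gram distributions of the two streams."""
    pa, pb = kgram_distribution(a, k), kgram_distribution(b, k)
    grams = set(pa) | set(pb)
    return 0.5 * sum(abs(pa.get(g, 0.0) - pb.get(g, 0.0)) for g in grams)


def distance_matrix(streams, k=2):
    """Symmetric matrix of pairwise distances between all streams."""
    n = len(streams)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = stream_distance(streams[i], streams[j], k)
    return dist


def cluster(dist, threshold):
    """Group streams that are linked by a chain of distances <= threshold.
    No labels or hypotheses are supplied: groups come from the data alone."""
    n = len(dist)
    labels = [None] * n
    next_label = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and dist[i][j] <= threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Toy streams (hypothetical): two come from one periodic process,
# two from another, each pair shifted in phase.
streams = ["01" * 50, "10" * 50, "0011" * 25, "1100" * 25]
dist = distance_matrix(streams)
labels = cluster(dist, threshold=0.1)  # threshold chosen by hand for this toy
print(labels)
```

The point of the sketch is only the shape of the workflow: nobody tells the code which streams belong together, yet the two phase-shifted pairs end up in separate groups because their pairwise distances are small within a pair and large across pairs.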

We can expect that in the coming years radically new approaches to mining huge data sets will be discovered (invented). Some of these may well derive from advances in our understanding of the serendipity that so often seems to guide our discoveries when we just look around...

Author - Roberto Saracco

© 2010-2018 EIT Digital IVZW. All rights reserved.