Friday, 17 July 2015

The peril of data mining


‘Data mining is the computational process of discovering patterns in data’: that is the definition of data mining you get from Wikipedia. Now the question is, which DATA? Most commonly it is the AVAILABLE data that is mined to UNCOVER patterns. What about the data that is not available?
Data can be unavailable in two ways-

  • You know the data exists but don’t have access to it
  • You are unaware that the data exists

Such data, if you had it, could also be used to uncover patterns. But that’s not the peril. The peril is that its absence can lead to misleading business conclusions.

Let us take an example. A manufacturing company wants to start a campaign for prospective customers in a given region and wants to know which population group it should target. It decides to leverage its existing customer data and target the prospects who look similar to existing customers. After mining the available data, it comes up with the insight that anyone who is college educated and earns more than $90,000/year is a good prospect. Now, what is the problem with this finding? The problem is that this may simply be a common profile among the population of that region. If most people in the region fit the profile, it says little about who is likely to become a customer. Some other data is needed to solve the business problem, and that data is the demographic characteristics of the region's population as a whole. Your prospect should be someone who is similar to existing customers but ALSO differs from the general population.
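The comparison against the general population can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Person` record, the helper names, and the sample figures are all made up for the example and do not come from any real company or region. It computes how much more common the profile is among customers than in the population at large (often called lift).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    # Hypothetical record; a real dataset would have many more attributes.
    college_educated: bool
    income: float  # annual income in USD


def matches_profile(person, min_income=90_000):
    """True if the person fits the mined profile from the example."""
    return person.college_educated and person.income > min_income


def profile_share(people):
    """Fraction of a group that fits the profile."""
    people = list(people)
    if not people:
        raise ValueError("group is empty")
    return sum(matches_profile(p) for p in people) / len(people)


def profile_lift(customers, population):
    """How much more common the profile is among customers than in general."""
    base = profile_share(population)
    if base == 0:
        raise ValueError("profile does not occur in the population sample")
    return profile_share(customers) / base


# Made-up illustrative samples: 8 of 10 customers fit the profile,
# but so do 7 of 10 people in the region as a whole.
customers = [Person(True, 120_000)] * 8 + [Person(False, 60_000)] * 2
population = [Person(True, 110_000)] * 7 + [Person(True, 50_000)] * 3

lift = profile_lift(customers, population)
print(f"customer share:   {profile_share(customers):.2f}")
print(f"population share: {profile_share(population):.2f}")
print(f"lift:             {lift:.2f}")
```

A lift close to 1 means the profile is just the regional norm and does little to single out good prospects. A lift well above 1 suggests the profile really does distinguish customers from everyone else.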

There is no defined way to know IF and WHAT data is unavailable. Here are a few checks that can help in this case-

  • Check what other data are available from the same source
  • Check what other data are available in the same domain
  • Check a super-set of the available data along other dimensions, such as time and geography
  • And most importantly, opinions are as important as data

Having said all this, always remember-
     In god we trust. All others JUST bring data.