Friday, 17 July 2015

The peril of data mining


‘Data mining is the computational process of discovering patterns in data’: that is the definition of data mining you get from Wikipedia. Now the question is, which DATA? Most commonly it is the AVAILABLE data that is used to UNCOVER patterns. What about the data that is not available?
Data can be unavailable for two reasons-

  • You don’t have access to that data
  • You are unaware that such data exists

Unavailable data of either kind could also be used to uncover patterns, but that is not the peril. The peril is that its absence can lead to misleading business conclusions.

Let us take an example. A manufacturing company wants to start a campaign in a given region for prospective customers and wants to know which population group it should target. It decides to leverage its existing customer data and target the prospects who look similar to existing customers. After mining the available data, it comes up with the insight that anyone who is college educated and earns more than $90,000/year is a good prospect. Now, what is the problem with this finding? The problem is that this can be a common profile among the population of the given region, in which case it says little about who is likely to buy. Some other data is needed to solve the business problem: the demographic characteristics of the region's population as a whole. Your prospect should be someone who is similar to existing customers but ALSO differs from the general population.
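One way to make this check concrete is to compare how common a profile is among existing customers with how common it is in the region as a whole. The sketch below uses made-up numbers; the data frames `customers` and `population`, the `high.income` flag (standing in for "earns more than $90,000/year"), and the helper `profile.share` are all hypothetical and only illustrate the idea.

```r
# Illustrative data only: TRUE/FALSE flags for two hypothetical attributes
customers <- data.frame(
  college     = c(TRUE, TRUE, TRUE, FALSE, TRUE),
  high.income = c(TRUE, TRUE, FALSE, FALSE, TRUE)
)
population <- data.frame(
  college     = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  high.income = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Share of rows matching the profile "college educated AND high income"
profile.share <- function(df) mean(df$college & df$high.income)

customer.share   <- profile.share(customers)
population.share <- profile.share(population)

# Ratio of the two shares: how over-represented the profile is among customers
lift <- customer.share / population.share
```

A ratio close to 1 would suggest the profile is about as common in the general population as among customers, so it does little to single out good prospects. A ratio well above 1 would suggest the profile really does set customers apart.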

There is no defined way to know IF and WHAT data is unavailable. Here are a few points which can help in this case-

  • Check what other data is available from the same source
  • Check what other data is available in the same domain
  • Check supersets of the available data from different perspectives such as time, geography etc.
  • And most importantly, opinions are as important as data

Having said all this, always remember-
     In god we trust. All others JUST bring data. 

Friday, 24 April 2015

Indian Railways Network using ‘R’


Indian Railways is a lifeline for the Indian people. Millions of people travel by rail every day. The Indian rail network comprises 71 thousand miles of track over a route of 40 thousand miles, and more than 7 thousand stations.

Here I try to show the rail network using ‘R’ software. Here is the list of all the mail, express, and super-fast trains provided on the Indian Railways site-
 http://www.indianrail.gov.in/mail_express_trn_list.html.
It contains the names of the source and destination stations for each train. I will leverage this data to draw the network. Before looking into the R code details, let us see the final output-



There is a link connecting two stations on the map if one or more direct trains run between them. The more trains there are, the more prominent the link is. As I have drawn a straight line between stations, a link does not capture the exact route of the train. Also, the color of the link is the same as that of the map if there is only 1 train between two stations. I have done this to avoid a messy appearance. The purpose of the map is to give a general sense of the Indian rail network. It can be seen that the 4 major stations are Delhi, Mumbai, Kolkata, and Chennai.

Here are the details of the ‘R’ code-
First load the required ‘R’ packages. I use ‘maps’ to draw the India map, ‘ggmap’ to get longitude and latitude information for all railway stations (it uses the Google Maps API to get the required information), and ‘dplyr’ to count and join the data.
library(maps) 
library(ggmap) 
library(dplyr)

Next read the data. I have saved the data from the above-mentioned link in csv format.
rail.data <- read.csv("rail_data_csv.csv")

Now I do some pre-processing on the station names so that they can be passed to the Google API to get location coordinates-
# Convert station names into character strings
rail.data$Train.Source.Stn <- as.character(rail.data$Train.Source.Stn)
rail.data$Train.Destination.Stn <- as.character(rail.data$Train.Destination.Stn)
# Remove 'JN', which represents junction, from station names
rail.data$Train.Source.Stn <- gsub("JN", "", rail.data$Train.Source.Stn)
rail.data$Train.Destination.Stn <- gsub("JN", "", rail.data$Train.Destination.Stn)
# Append "INDIA" in station name to remove any ambiguity in location identification by API
rail.data$Train.Source.Stn <- paste(rail.data$Train.Source.Stn,", INDIA",sep='')
rail.data$Train.Destination.Stn <- paste(rail.data$Train.Destination.Stn,", INDIA",sep='')
# Get all unique stations including source and destination stations 
all.stations <- unique(c(rail.data$Train.Source.Stn,rail.data$Train.Destination.Stn))

Get longitude/latitude information for all station locations using the ‘geocode’ function in the ‘ggmap’ library, which uses the Google Maps API to get coordinates.
# Pre-allocate numeric vectors, one slot per station
all.longitudes <- rep(NA_real_, length(all.stations))
all.latitudes <- rep(NA_real_, length(all.stations))

# Look up the coordinates of each station in turn
for (i in seq_along(all.stations))
  {
      coordinates <- geocode(all.stations[i])
      all.longitudes[i] <- coordinates$lon
      all.latitudes[i] <- coordinates$lat
   }
# Join coordinates with station names
all.locations <- data.frame(name=all.stations, lon=all.longitudes, lat=all.latitudes, stringsAsFactors=FALSE)
# Get total number of trains between two stations using ‘group_by’ from ‘dplyr’ package
no.of.trains <- as.data.frame(rail.data[,c("Train.Source.Stn","Train.Destination.Stn")] %>% group_by(Train.Source.Stn,Train.Destination.Stn) %>% summarise(count=n()))

# Sort data based upon number of trains
no.of.trains <- no.of.trains[order(no.of.trains$count),]
# Merge coordinates to rail data
no.of.trains$name <- no.of.trains$Train.Source.Stn
no.of.trains <- left_join(no.of.trains, all.locations, by="name")
no.of.trains <- rename(no.of.trains, source.lon=lon, source.lat=lat)
no.of.trains$name <- no.of.trains$Train.Destination.Stn
no.of.trains <- left_join(no.of.trains, all.locations, by="name")
no.of.trains <- rename(no.of.trains, dest.lon=lon, dest.lat=lat)
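
To see what the counting and merging steps produce, here is a small sketch on made-up data. It uses base R's `aggregate` and `merge` as stand-ins for the dplyr calls, so it runs without any extra packages. The names `toy.trains`, `toy.locations` and `toy.counts`, the station labels, and the coordinates are all illustrative and are not output from the geocoding step.

```r
# Toy stand-in for rail.data: one row per train (illustrative values only)
toy.trains <- data.frame(
  src = c("A", "A", "B", "A"),
  dst = c("B", "B", "C", "C"),
  stringsAsFactors = FALSE
)
# Toy stand-in for all.locations (coordinates are made up)
toy.locations <- data.frame(
  name = c("A", "B", "C"),
  lon  = c(77.2, 72.9, 88.4),
  lat  = c(28.6, 19.1, 22.6),
  stringsAsFactors = FALSE
)

# Count trains per (source, destination) pair
toy.trains$count <- 1
toy.counts <- aggregate(count ~ src + dst, data = toy.trains, FUN = sum)

# Attach source coordinates, then rename them before the second merge
toy.counts <- merge(toy.counts, toy.locations, by.x = "src", by.y = "name", all.x = TRUE)
names(toy.counts)[names(toy.counts) %in% c("lon", "lat")] <- c("source.lon", "source.lat")

# Attach destination coordinates in the same way
toy.counts <- merge(toy.counts, toy.locations, by.x = "dst", by.y = "name", all.x = TRUE)
names(toy.counts)[names(toy.counts) %in% c("lon", "lat")] <- c("dest.lon", "dest.lat")
```

Each row of `toy.counts` then describes one link: the two stations, the number of trains between them, and the coordinates of both end points. That is the shape the drawing loop needs.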

Draw the map of India using the ‘map’ function. It allows restricting the drawn map to a user-provided longitude/latitude range. I will use these parameters (xlim and ylim) to get the India map.

xlim <- c(67, 98)  
ylim <- c(7, 37)
map("world", col="lavender", fill=TRUE, bg="white", lwd=.1, xlim=xlim, ylim=ylim)

Set a color palette with different shades of color. This is required so that we can vary the prominence of links: a link representing more trains is more prominent, and vice versa.
color.palette <- colorRampPalette(c("lavender", "red"))
all.colors <- color.palette(7)

# Get maximum number of trains between any two stations- this is required to choose the right shade of color
max.count <- max(no.of.trains$count)

Loop over all the links in the data and draw the corresponding stations and lines, each with the right shade, on top of the map produced earlier-
for (i in 1:nrow(no.of.trains))
 {
    points.lon <- c(no.of.trains$source.lon[i],no.of.trains$dest.lon[i])
    points.lat <- c(no.of.trains$source.lat[i],no.of.trains$dest.lat[i])
    # Scale the count to a palette position; use at least 1 so small counts do not give index 0
    color.index <- max(1, round( (no.of.trains$count[i] / max.count) * length(all.colors) ))
    lines(x = points.lon, y = points.lat , col=all.colors[color.index], lwd=.8)
    points(x = points.lon, y = points.lat, col = "blue", pch=20, cex=.8)
  }
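
The shade selection can be pulled out into a small helper to make the mapping from train count to palette position easier to check. This is only a sketch: the function name `shade.index` and the example counts are hypothetical, and clamping the index to at least 1 is an assumption made here so that the smallest counts land on the first palette color and not on an invalid index 0.

```r
# Palette running from the map color to red (colorRampPalette is in grDevices)
color.palette <- colorRampPalette(c("lavender", "red"))
all.colors <- color.palette(7)

# Map a train count to a palette position between 1 and n.colors
shade.index <- function(count, max.count, n.colors) {
  index <- round((count / max.count) * n.colors)
  pmin(pmax(index, 1), n.colors)
}

# Hypothetical counts, with 20 as the assumed maximum
example.counts <- c(1, 5, 12, 20)
example.index  <- shade.index(example.counts, max.count = 20,
                              n.colors = length(all.colors))
example.colors <- all.colors[example.index]
```

With these assumed counts, the single-train link falls on the first (lavender) shade and the busiest link on the last (red) shade, which matches the behaviour described for the map.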

As mentioned earlier, links with just 1 train are not visible on the map, as they are assigned the same color as the map fill. Otherwise the map gets cluttered with an unmanageable number of links.

Thanks for now!