Friday, 17 July 2015

The peril of data mining


‘Data mining is the computational process of discovering patterns in data’- that is the definition Wikipedia gives. Now the question is: which DATA? Most commonly it is the AVAILABLE data which is used to UNCOVER patterns. What about the data which is not available?
There are two ways in which data can be unavailable-

  • You don’t have access to that data
  • You are unaware that such data exists

Such data could also be used to uncover patterns, but that is not the peril. The peril is that the unavailability of this data can lead to misleading business conclusions.

Let us take an example. A manufacturing company wants to run a campaign for prospective customers in a given region and needs to know which population group to target. It decides to leverage its existing customer data and target prospects who look similar to existing customers. After mining the available data, it comes up with the insight that anyone who is college educated and earns more than $90,000/year is a good prospect. Now, what is the problem with this finding? The problem is that this may simply be the most common profile in the population of that region. You are certainly missing some other data needed to solve the business problem- namely, the demographic characteristics of the region's population as a whole. A good prospect should be someone who is similar to existing customers but ALSO differs from the general population.
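To make this concrete, here is a minimal sketch in R. The data frames ‘customers’ and ‘census’ and their columns are hypothetical, purely to illustrate the comparison-

# Hypothetical profile flags for existing customers and for the
# region's population as a whole (e.g. from census data)
customers <- data.frame(college = c(TRUE, TRUE, FALSE, TRUE),
                        income.90k = c(TRUE, TRUE, TRUE, FALSE))
census <- data.frame(college = c(TRUE, FALSE, TRUE, FALSE),
                     income.90k = c(TRUE, FALSE, FALSE, FALSE))

# Share of people matching the "college educated, >$90K" profile
profile.share <- function(df) mean(df$college & df$income.90k)

# Lift over the general population- a value near 1 means the mined
# "insight" merely restates the region's demographics
profile.share(customers) / profile.share(census)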

There is no defined way to know IF and WHAT data is unavailable. Here are a few points which can help in this case-

  • Check what other data are available from the same source
  • Check what other data are available in the same domain
  • Check super-sets of the available data from different perspectives, such as time and geography
  • And most importantly, opinions are as important as data

Having said all this, always remember-
     In God we trust. All others JUST bring data.

Friday, 24 April 2015

Indian Railways Network using ‘R’


Indian Railways is the lifeline of India- millions of people travel by rail every day. The Indian rail network comprises 71 thousand miles of track over a route of 40 thousand miles and serves more than 7 thousand stations.

Here I try to show the rail network using ‘R’. Here is the list of all the mail, express, and super-fast trains, provided on the Indian Railways site-
 http://www.indianrail.gov.in/mail_express_trn_list.html.
It contains the names of the source and destination stations for each train. I will leverage this data to draw the network. Before looking into the R code details, let us see the final output-



There is a link connecting two stations on the map if one or more direct trains run between them. The more trains, the more prominent the link. As I have drawn a straight line between stations, the link does not capture the exact route of the train. Also, the color of a link is the same as that of the map if there is only 1 train between the two stations; I have done this to avoid a messy appearance. The purpose of the map is to give a general sense of the Indian rail network. It can be seen that the 4 major stations are Delhi, Mumbai, Kolkata, and Chennai.

Here are the details of the ‘R’ code-
First, load the required ‘R’ packages. I use ‘maps’ to draw the India map and ‘ggmap’ to get longitude and latitude information for all railway stations; ‘ggmap’ uses the Google Maps API to get the required information.
library(maps) 
library(ggmap) 
library(dplyr)

Next, read the data. I have saved the data from the above-mentioned link in CSV format.
rail.data <- read.csv("rail_data_csv.csv")

Now I do some pre-processing on the station names so that they can be passed to the Google API to get location coordinates-
# Convert station names into character strings
rail.data$Train.Source.Stn <- as.character(rail.data$Train.Source.Stn)
rail.data$Train.Destination.Stn <- as.character(rail.data$Train.Destination.Stn)
# Remove 'JN', which represents junction, from station names
rail.data$Train.Source.Stn <- gsub("JN", "", rail.data$Train.Source.Stn)
rail.data$Train.Destination.Stn <- gsub("JN", "", rail.data$Train.Destination.Stn)
# Append ", INDIA" to station names to remove any ambiguity in location identification by the API
rail.data$Train.Source.Stn <- paste(rail.data$Train.Source.Stn, ", INDIA", sep = "")
rail.data$Train.Destination.Stn <- paste(rail.data$Train.Destination.Stn, ", INDIA", sep = "")
# Get all unique stations including source and destination stations 
all.stations <- unique(c(rail.data$Train.Source.Stn,rail.data$Train.Destination.Stn))

Get longitude/latitude information for all station locations using the ‘geocode’ function from the ‘ggmap’ library, which calls the Google Maps API-
all.longitudes <- rep(NA_real_, length(all.stations))
all.latitudes <- rep(NA_real_, length(all.stations))

for (i in 1:length(all.stations)) {
  coordinates <- geocode(all.stations[i])
  all.longitudes[i] <- coordinates$lon
  all.latitudes[i] <- coordinates$lat
}
# Join coordinates with stations names 
all.locations <- as.data.frame(cbind(name=all.stations, lon=all.longitudes, lat=all.latitudes),stringsAsFactors=FALSE) 
all.locations$lon <- as.numeric(all.locations$lon)
all.locations$lat <- as.numeric(all.locations$lat)
# Get the total number of trains between two stations using 'group_by' from the 'dplyr' package
no.of.trains <- as.data.frame(rail.data[, c("Train.Source.Stn", "Train.Destination.Stn")] %>%
  group_by(Train.Source.Stn, Train.Destination.Stn) %>%
  summarise(count = n()))

# Sort data based upon the number of trains
no.of.trains <- no.of.trains[order(no.of.trains$count), ]
# Merge coordinates to rail data
no.of.trains$name <- no.of.trains$Train.Source.Stn
no.of.trains <- left_join (no.of.trains, all.locations) 
no.of.trains <- rename(no.of.trains, source.lon = lon, source.lat=lat )
no.of.trains$name <- no.of.trains$Train.Destination.Stn
no.of.trains <- left_join (no.of.trains, all.locations) 
no.of.trains <- rename(no.of.trains, dest.lon = lon, dest.lat=lat )

Draw the map of India using the ‘map’ function. The function allows restricting the drawing to a user-provided longitude/latitude window; I use this parameter to get the India map.

xlim <- c(67, 98)  
ylim <- c(7, 37)
map("world", col="lavender", fill=TRUE, bg="white", lwd=.1, xlim=xlim, ylim=ylim)

Set a color palette with different shades. This is required so that we can vary the prominence of the links: links representing more trains are more prominent, and vice-versa.
color.palette <- colorRampPalette(c("lavender", "red"))
all.colors <- color.palette(7)

# Get the maximum number of trains between any two stations- this is
# required to choose the right shade of color
max.count <- max(no.of.trains$count)

Loop over all the links in the data, and draw the corresponding stations and lines, with the right shade, on top of the map produced earlier-
for (i in 1:nrow(no.of.trains)) {
  points.lon <- c(no.of.trains$source.lon[i], no.of.trains$dest.lon[i])
  points.lat <- c(no.of.trains$source.lat[i], no.of.trains$dest.lat[i])
  # ceiling() keeps the index at 1 or above, so links with very few trains
  # map to the first (lavender) shade instead of an out-of-range index of 0
  color.index <- ceiling((no.of.trains$count[i] / max.count) * length(all.colors))
  lines(x = points.lon, y = points.lat, col = all.colors[color.index], lwd = .8)
  points(x = points.lon, y = points.lat, col = "blue", pch = 20, cex = .8)
}

As mentioned earlier, links with just 1 train are not visible on the map, as they are assigned the same color as the map background; otherwise the map gets messed up with an unmanageable number of links.

Thanks for now!
 

Thursday, 6 November 2014

How do we migrate?

Recently I came across world migration data publicly available on the United Nations site. You can find the data here- http://esa.un.org/wpp/Excel-Data/migration.htm. It shows the net number of migrants, that is, the number of immigrants minus the number of emigrants. It is expressed in thousands, and the numbers are available for different countries. Looking into the data, the first thought that came to my mind was to draw it on a world map- and that can be done very easily in R. Later in this post I describe the R code, but before that, let us see the world heat-map.




As expected, the population drift is from less developed to more developed regions.

Here is the R code for this-

#Required library- the fastest way to draw this chart is through the ‘rworldmap’ package

library(rworldmap)

#Load data- this loads the data into the R environment

migrantsData <- read.csv("Migrants_data.csv")

#Join data to map- this step essentially attaches a location (Long/Lat) to each country

spdf <- joinCountryData2Map(migrantsData,joinCode="UN", nameJoinColumn="Country.code",nameCountryColumn= "Country",verbose = TRUE)

#Draw the world map- the required function is ‘mapCountryData’. For more details, please refer to ‘http://cran.r-project.org/web/packages/rworldmap/rworldmap.pdf’

par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i")

mapParams <- mapCountryData( spdf, nameColumnToPlot="Migrants", addLegend=FALSE,numCats=7,colourPalette="terrain", mapTitle="Net Number of Migrants (Immigrants - Emigrants), 2005-2010 (Thousands)", oceanCol="lightblue")

do.call( addMapLegend, c(mapParams, legendWidth=0.5, legendMar = 4,legendLabels="all"))  

And that’s it! Similar code can be used for any other metric.


Thanks for now!

Tuesday, 7 October 2014

Dear Flipkart: You got it completely wrong!


At first, #BigBillionDay (http://www.deccanchronicle.com/141005/business-latest/article/will-flipkart-manage-make-huge-profits-its-big-billion-day) seemed a good strategic move for Flipkart, but sadly it ended with lots of unhappy customers. Flipkart planned it as a move to attract vast traffic to its site, but it failed where it matters most, i.e. meeting customers' expectations. It left no stone unturned to set up high expectations, but it seems nothing was done to meet them. The final result- unhappy customers. Let me support this claim with numbers. I compared customers' sentiment (derived from their tweets, as described in the first post) on the day before #BigBillionDay, 5th Oct'14, with customers' sentiment on #BigBillionDay itself, 6th Oct'14. It turns out that the proportion of negative sentiment saw a sharp jump from 20% to 55%- a relative increase of 175%. It is a big drift considering the change happened in just one day.
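For clarity, the 175% figure is simply the relative change in the negative-sentiment share- a quick check in R-

negative.before <- 0.20   # share of negative tweets on 5th Oct'14
negative.after  <- 0.55   # share of negative tweets on 6th Oct'14
(negative.after - negative.before) / negative.before * 100   # 175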


I hope Flipkart will overcome this and do wonders in the future, as it is just the start of the game.

Friday, 5 September 2014

Sentiment Analysis: Distribution of tweets length based upon the sentiment

In the last two posts, we worked with Twitter data for Flipkart and Amazon India and showed how to leverage it to capture customers' sentiments and draw word clouds.

In this post we explore an interesting pattern which is quite intuitive- but supporting it with data is the name of the game.

Did you ever observe that we try to convey our thoughts more eloquently under the effect of some sentiment, e.g. in a state of rage? I am sure you did. That in turn leads us to put forward our thoughts in more detail than we otherwise would. And that is the hypothesis we want to test in this study.

All the data used in this analysis is the same as in the earlier posts, i.e. we use tweets related to Flipkart and Amazon India. If you are interested in how to fetch these tweets in R, please refer to the earlier posts.

OK, I guess that is enough background- let us see the results-


Let us see what we did here. First, we classified each tweet into one of 7 buckets based upon the sentiment it carries (which we derived in the first post). If a tweet carries a strong positive sentiment, it falls in the last bucket in the graph above, 'Excellent'. On the other hand, if a tweet carries a strong negative sentiment, we put it in the first bucket, defined as 'Worst'. Any tweet which does not carry any sentiment falls under 'Neutral'. We defined two more levels of sentiment each between 'Worst' and 'Neutral', and between 'Neutral' and 'Excellent'. Further, we define the length of a tweet as the number of words present in it. So the plot above shows the average length of the tweets by sentiment category. A minimal code sketch of this computation follows below.
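Here is that sketch in R. The 'tweets' data frame and the bucket boundaries are hypothetical- the real sentiment scores come from the algorithm in the first post-

# Hypothetical tweets with sentiment scores from the earlier analysis
tweets <- data.frame(text = c("love the quick delivery, great service",
                              "worst experience ever, order cancelled without any notice"),
                     score = c(2, -3), stringsAsFactors = FALSE)

# Bucket the scores into the 7 sentiment levels
tweets$bucket <- cut(tweets$score,
                     breaks = c(-Inf, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, Inf),
                     labels = c("Worst", "Very Bad", "Bad", "Neutral",
                                "Good", "Very Good", "Excellent"))

# Length of a tweet = number of words present in it
tweets$length <- sapply(strsplit(tweets$text, "\\s+"), length)

# Average tweet length by sentiment bucket
aggregate(length ~ bucket, data = tweets, FUN = mean)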

Here are some interesting findings (findings? we had this as intuition :)) -
  1. The higher the degree of sentiment, the longer the tweet
  2. Tweets carrying negative sentiments are longer than tweets carrying positive sentiments of the same degree, e.g. if we compare 'Worst' with 'Excellent', and so on
That is it! 

I will try to see if I can find more hypotheses to verify. Please suggest if you have any.

Thanks for now!

Tuesday, 26 August 2014

Flipkart VS. Amazon.in: Word cloud using twitter posts

As promised in the last post, here I present word clouds generated from Flipkart and Amazon India tweets. A word cloud (http://en.wikipedia.org/wiki/Tag_cloud) is a graphical representation of word frequency which gives greater prominence to the words that appear more frequently in the source text. I have generated two sets of word clouds, where each set contains two clouds, one for Flipkart and another for Amazon. The first set of clouds is based upon all the words present in the extracted tweets, whereas the second set is based only upon the sentiment words present in those tweets. The corpus of tweets and the sentiment words used in this analysis are the same as in the analysis I shared in the last post.

Before showing the word clouds, here is a summary of the key steps involved (a minimal code sketch follows the list)-

  1. Read the tweets in R as described in the last post
  2. Perform some text processing. For this, I used the text mining package in R (http://cran.r-project.org/web/packages/tm/index.html)
    1. Generate a two-column structure from the tweets, where the first column contains words and the second column contains the frequency of each word across all tweets
    2. Convert all words to lower case
    3. Remove stop words
    4. Remove punctuation marks
    5. Get the top 100 most frequent words for the first set of clouds, and the top 100 sentiment words for the second set
  3. Generate the word cloud. I used the word cloud package in R (http://cran.r-project.org/web/packages/wordcloud/index.html) for this
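Here is a minimal sketch of steps 2 and 3, assuming a hypothetical character vector 'tweet.texts' holding the extracted tweets-

library(tm)
library(wordcloud)

# Build a corpus and apply the text processing described above
corpus <- Corpus(VectorSource(tweet.texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Word/frequency structure- each word with its total frequency across tweets
tdm <- TermDocumentMatrix(corpus)
word.freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Keep the top 100 most frequent words and draw the cloud
top.words <- head(word.freq, 100)
wordcloud(words = names(top.words), freq = top.words, random.order = FALSE)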
Coming to the results- first we will see the word clouds generated from all the words, which we referred to as the first set of clouds. Below is that for Flipkart-



Now the same for Amazon India-

As you see these word clouds, you can get a sense of what people are talking about, e.g. in the case of Flipkart, people are mostly talking about the Xiaomi Mi3 mobile phone, which has been a recent hit on Flipkart.

Now coming to the second set of word clouds, where we focus only upon the words carrying some sort of sentiment. Here is that for Flipkart-

And here is the same for Amazon India-

As one can see in both of the above word clouds, most of the prominent words carry positive sentiments. However, if you focus upon the words carrying negative sentiments, you will see that their prominence in the case of Flipkart is greater than for Amazon India, and this supports the outcome of the analysis we saw in the last post.

Thanks for now!

Wednesday, 20 August 2014

Flipkart VS. Amazon.in: Sentiment analysis using twitter posts


Flipkart and Amazon India are emerging as the two biggest players in the rapidly growing online retail industry in India. Although Amazon started its operations in India much later than Flipkart, it is giving Flipkart tough competition. Only the future will tell who surpasses whom in the long run, but it is evident that effectiveness in capturing customers' needs, and quickness in responding to them, are going to play a major role.

In this exercise I tried to capture customers' sentiments using their Twitter postings. I used R (http://www.r-project.org/) for this exercise. Below are the key steps of the analysis process (a minimal code sketch follows the list).

1.     Search for the presence of the Twitter handles (@Flipkart for Flipkart tweets and @amazonIN for Amazon India tweets) and scrape the tweets accordingly- I used the twitteR package in R (http://cran.r-project.org/web/packages/twitteR/index.html) to fetch the tweets

2.     Perform pre-processing, such as removing duplicate tweets

3.     Apply a sentiment analysis algorithm to place each tweet in one of two groups, i.e. positive sentiment or negative sentiment- I used a pretty simple algorithm for this which counts the occurrences of positive and negative sentiment words in each tweet. For the sentiment words I used publicly available dictionaries
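To illustrate the fetching and scoring, here is a minimal sketch. The tiny dictionaries are placeholders for the publicly available ones, and the commented-out twitteR calls show where the tweets would come from (your own API credentials are required)-

library(twitteR)

# Fetch tweets (requires your own Twitter API credentials)-
# setup_twitter_oauth(consumer.key, consumer.secret, access.token, access.secret)
# tweets <- twListToDF(searchTwitter("@Flipkart", n = 1500))

# Placeholder dictionaries- the real analysis uses publicly available ones
pos.words <- c("good", "great", "love", "awesome", "happy")
neg.words <- c("bad", "worst", "hate", "poor", "unhappy")

# Score = count of positive words minus count of negative words;
# a positive score means positive sentiment, and vice-versa
score.sentiment <- function(text) {
  words <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

score.sentiment("Great phone, love the price")        # 2 -> positive
score.sentiment("Worst delivery, very poor service")  # -2 -> negative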

So, as you can see, it is quite simple and fast. The only caveat is that the Twitter web API restricts the number of tweets one can access. Nevertheless, one can access thousands of tweets, which is good enough for a not-so-exhaustive analysis.

OK, now here is the stuff for which we did all this, i.e. the results. The conclusion is that both Flipkart and Amazon score impressively on customer sentiment, though Amazon performs slightly better. For Flipkart, around 64% of the tweets under analysis carry positive sentiments and 36% carry negative sentiments; in the case of Amazon these figures are 73% and 27% respectively. Below is the graph depicting these numbers.



In the next post, I will try to show word clouds supporting the above trends.

Thanks for now!