Friday, 5 September 2014

Sentiment Analysis: Distribution of tweet length by sentiment

In the last two posts, we worked with Twitter data for Flipkart and Amazon India and showed how to use it to capture customers' sentiments and draw word clouds.

In this post we will explore an interesting pattern that is quite intuitive. Backing it up with data, though, is the name of the game.

Have you ever noticed that we express our thoughts more eloquently under the influence of certain sentiments, for example rage? I am sure you have. That, in turn, leads us to put our thoughts forward in more detail than we otherwise would. This is the hypothesis we want to test in this study.

The data used in this analysis is the same as in the earlier posts, i.e. tweets related to Flipkart and Amazon India. If you are interested in how to fetch these tweets in R, please refer to the earlier posts.

OK, that's enough background, I guess. Let us see the results:

[Figure: average tweet length (number of words) by sentiment category, from 'Worst' to 'Excellent']

Let us see what we did here. First, we classified each tweet into one of 7 buckets based on the sentiment it carries (which we derived in the first post). A tweet that carries a strong positive sentiment falls in the last bucket in the graph above, 'Excellent'. A tweet that carries a strong negative sentiment goes in the first bucket, labelled 'Worst' in the graph. Any tweet that carries no sentiment falls under 'Neutral'. We defined two more levels of sentiment on each side: two between 'Worst' and 'Neutral', and two between 'Neutral' and 'Excellent'.

Further, we define the length of a tweet as the number of words in it. So the plot above shows the average tweet length for each sentiment category.
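
The bucketing and averaging described above can be sketched in a few lines. This is only an illustration in Python, not the code behind the plot (the earlier posts work in R). It assumes each tweet already has an integer sentiment score, and the cut-offs (0 for neutral, +/-1 and +/-2 for the intermediate levels, +/-3 and beyond for the extremes) are made up for the example. The post names only 'Worst', 'Neutral' and 'Excellent', so the four intermediate labels are placeholders too.

```python
from collections import defaultdict

# Only 'Worst', 'Neutral' and 'Excellent' are named in the post;
# the four intermediate labels below are hypothetical placeholders.
BUCKETS = ["Worst", "Very bad", "Bad", "Neutral", "Good", "Very good", "Excellent"]


def bucket_for(score):
    """Map an integer sentiment score to one of the 7 buckets.

    Assumed cut-offs: 0 is neutral, +/-1 and +/-2 are the intermediate
    levels, and anything at or beyond +/-3 lands in the extreme bucket.
    """
    clipped = max(-3, min(3, score))
    return BUCKETS[clipped + 3]


def tweet_length(text):
    """Length of a tweet as the number of whitespace-separated words."""
    return len(text.split())


def average_length_by_bucket(scored_tweets):
    """Return {bucket: mean word count} for an iterable of (text, score) pairs."""
    lengths = defaultdict(list)
    for text, score in scored_tweets:
        lengths[bucket_for(score)].append(tweet_length(text))
    # Keep the Worst -> Excellent order and skip buckets with no tweets.
    return {b: sum(lengths[b]) / len(lengths[b]) for b in BUCKETS if b in lengths}
```

Calling `average_length_by_bucket` on a list of `(text, score)` pairs returns the averages ordered from 'Worst' to 'Excellent', which is the shape of data a bar chart like the one above needs.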

Here are some interesting findings (findings? we had this as intuition :)):
  1. The higher the degree of sentiment, the longer the tweet.
  2. Tweets carrying negative sentiment are longer than tweets carrying positive sentiment when the degree of sentiment is in the same range, e.g. 'Worst' compared with 'Excellent', and so on.
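
Both findings can also be checked mechanically once the per-bucket averages are in hand. The sketch below is a hypothetical helper in Python, not part of the original analysis. It assumes the averages come as a dictionary keyed by bucket name, and that the bucket names are passed in order from most negative to most positive with the neutral bucket in the middle.

```python
def check_findings(avg_len, ordered_buckets):
    """Check both findings on a {bucket: average length} mapping.

    ordered_buckets runs from most negative to most positive, with the
    neutral bucket exactly in the middle (an assumption of this sketch).
    Returns (grows_with_degree, negative_longer).
    """
    mid = len(ordered_buckets) // 2
    neutral = ordered_buckets[mid]
    negative = ordered_buckets[:mid][::-1]  # mildest -> strongest negative
    positive = ordered_buckets[mid + 1:]    # mildest -> strongest positive

    def is_increasing(names):
        # Start from the neutral bucket, then walk outwards.
        values = [avg_len[neutral]] + [avg_len[n] for n in names]
        return all(a < b for a, b in zip(values, values[1:]))

    # Finding 1: length grows with the degree of sentiment on both sides.
    grows_with_degree = is_increasing(negative) and is_increasing(positive)
    # Finding 2: at the same degree, the negative bucket is longer than the positive one.
    negative_longer = all(avg_len[n] > avg_len[p] for n, p in zip(negative, positive))
    return grows_with_degree, negative_longer
```

Returning the two checks separately makes it easy to see which finding fails if another data set does not follow the pattern.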
That is it! 

I will try to find more hypotheses to verify. Please suggest any you have.

Thanks for now!
