With almost everyone on earth having access to technology, voicing opinions on the internet has become very common. New technologies emerge almost daily, and one that has stolen the spotlight is Twitter. According to statistics available online, Twitter has roughly 1.3 billion users, of whom 310 million are monthly active users who regularly log back in to share their thoughts at a rate of almost 500 million tweets per day. This volume of data makes Twitter well suited to text mining analyses that determine whether people hold a positive or a negative view of a given idea. These tweets can help us uncover valuable insights into users' thoughts.
However, mining Twitter data poses several challenges. Twitter imposes a limit of 140 characters per tweet, and much of that space is taken up by elements such as username tags, hashtags and URLs, which carry little information for sentiment analysis. This space limitation not only leaves tweets incomplete but also forces people to abbreviate words and sentences, which produces misspelled words and ungrammatical text. Moreover, classification algorithms generally fail to detect sarcasm, a very common human trait, so sarcastic negative tweets can be misclassified as positive.
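To make the noise problem concrete, the following is a minimal sketch of how username tags, hashtags and URLs might be stripped from a tweet before analysis. The function name `clean_tweet`, the regular expressions, and the choice to keep the word inside a hashtag are illustrative assumptions, not a description of a preprocessing pipeline specified in this study.

```python
import re

# Patterns for tweet elements treated here as noise for sentiment analysis.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def clean_tweet(text: str, keep_hashtag_words: bool = True) -> str:
    """Remove URLs and @mentions, handle hashtags, and normalise whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Assumption: the word inside a hashtag may still carry sentiment,
    # so by default only the '#' symbol is dropped.
    text = HASHTAG_RE.sub(r"\1" if keep_hashtag_words else " ", text)
    # Lowercase and collapse the runs of spaces left by the substitutions.
    return " ".join(text.lower().split())


if __name__ == "__main__":
    print(clean_tweet("@alice Loving the new phone! #happy http://t.co/abc123"))
```

Whether hashtag words should be kept or discarded is a design choice; the `keep_hashtag_words` flag is included only so that both options can be tried.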
More recent techniques such as Word2Vec map each word to a vector in a high-dimensional space (roughly 200 to 300 dimensions). These vectors capture the context in which a word appears, so that words used in similar contexts lie close together in the vector space. Our study aims to analyze tweets and classify their sentiment into two categories, i.e., positive and negative. Sentiment analysis is a Natural Language Processing task that analyzes text and its syntactic context, making it possible to identify the subjective information in Twitter posts. Standard algorithms for text classification include Gaussian Naive Bayes, Support Vector Classifiers and Logistic Regression.
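The idea that similar words lie close together in the vector space, and that word vectors can serve as classifier input, can be illustrated with a small sketch. The three-dimensional vectors in `TOY_EMBEDDINGS` are made-up values for illustration only (real Word2Vec vectors are learned from a corpus and have far more dimensions), and averaging word vectors into a single tweet vector is one common choice that we assume here, not necessarily the one adopted in this study.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical, hand-written embeddings used purely for illustration.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad": [-0.9, 0.1, 0.0],
    "movie": [0.0, 0.5, 0.5],
}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def tweet_vector(tokens: Sequence[str],
                 embeddings: Dict[str, List[float]],
                 dim: int) -> List[float]:
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


if __name__ == "__main__":
    # Words with similar meaning should score higher than opposites.
    print(cosine(TOY_EMBEDDINGS["good"], TOY_EMBEDDINGS["great"]))
    print(cosine(TOY_EMBEDDINGS["good"], TOY_EMBEDDINGS["bad"]))
    # Out-of-vocabulary tokens are simply skipped.
    print(tweet_vector(["great", "movie", "unseenword"], TOY_EMBEDDINGS, dim=3))
```

The resulting fixed-length tweet vector is what a classifier such as Logistic Regression would receive as its feature input.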
These classifiers have proved successful in various text classification problems, particularly those like ours that require only a binary output (positive or negative, 1 or 0). However, each word's vector representation lies in a high-dimensional space, which makes it challenging to determine which classifier best suits this representation. Thus, we propose to feed the word embeddings to several classifiers and to compare their performance under different parameter settings. Our final aim is to identify the best classifier to pair with Word2Vec. This project can be extended to applications such as recommender systems and review classification.
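The proposed comparison can be sketched as follows, assuming scikit-learn and NumPy are available. The feature matrix here is synthetic Gaussian data standing in for tweet embeddings, and the parameter grids, the helper names `make_toy_features` and `compare_classifiers`, and the use of cross-validated accuracy as the criterion are all illustrative assumptions rather than the settings used in this study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


def make_toy_features(n_per_class=100, dim=20, seed=0):
    """Synthetic stand-in for tweet embeddings: two shifted Gaussian clouds."""
    rng = np.random.default_rng(seed)
    pos = rng.normal(loc=0.5, scale=1.0, size=(n_per_class, dim))
    neg = rng.normal(loc=-0.5, scale=1.0, size=(n_per_class, dim))
    X = np.vstack([pos, neg])
    y = np.array([1] * n_per_class + [0] * n_per_class)  # 1 = positive
    return X, y


# Each candidate pairs an estimator with an (assumed) small parameter grid.
CANDIDATES = {
    "gaussian_nb": (GaussianNB(), {"var_smoothing": [1e-9, 1e-7, 1e-5]}),
    "svc": (SVC(), {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}),
    "log_reg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
}


def compare_classifiers(X, y, cv=5):
    """Grid-search each candidate; return best CV accuracy and parameters."""
    results = {}
    for name, (estimator, grid) in CANDIDATES.items():
        search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy")
        search.fit(X, y)
        results[name] = (search.best_score_, search.best_params_)
    return results


if __name__ == "__main__":
    X, y = make_toy_features()
    for name, (score, params) in compare_classifiers(X, y).items():
        print(f"{name}: accuracy={score:.3f} params={params}")
```

On real data, the synthetic matrix would be replaced by the embedded tweets, and a held-out test set would be needed to confirm that the best cross-validated classifier also generalizes.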
Any trending product, movie or place could be analyzed in this way to see whether it is received positively or negatively by the public.