The proposed method is to compute a score for each sentence based on the features extracted. Once the features are extracted, the data is assigned a score, and based on that score we can conclude whether the sentence falls on the positive or the negative side. If the score is above 0.5 the data is positive, and if it is below 0.5 it is negative.
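The 0.5 threshold described above can be sketched as a simple rule. This is a minimal illustration; the function name and the example scores are assumptions, not taken from the original method.

```python
def classify(score):
    # Scores above 0.5 are treated as positive, scores at or below as negative.
    return "positive" if score > 0.5 else "negative"

print(classify(0.8))  # positive
print(classify(0.3))  # negative
```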
1. TOKENIZATION
Tokenization is the process of breaking a sentence into its individual words, which are called tokens. In this way, the features in the data can be analyzed.
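Tokenization can be sketched as follows. For illustration this uses plain whitespace splitting; in practice nltk's word_tokenize handles punctuation more robustly.

```python
sentence = "It is a windy day today."
# Strip the trailing period and split on whitespace to obtain the tokens.
tokens = sentence.rstrip(".").split()
print(tokens)  # ['It', 'is', 'a', 'windy', 'day', 'today']
```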
2. STOP WORDS REMOVAL
Stop words are the most common words occurring in the data; for example, common grammatical words in the data are removed. The list of stop words can be imported from the natural language tool kit (nltk).
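Stop-word removal can be sketched like this. The small stop-word set below is illustrative only; nltk.corpus.stopwords.words("english") provides a full English list.

```python
# Illustrative stop-word set; nltk.corpus.stopwords gives a complete list.
stop_words = {"it", "is", "a", "to"}
tokens = ["it", "is", "a", "windy", "day", "today"]
# Keep only the tokens that are not stop words; these are the features.
features = [t for t in tokens if t not in stop_words]
print(features)  # ['windy', 'day', 'today']
```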
3. PART OF SPEECH TAGGING
This functionality tags each word with its part of speech: it marks whether the word is a noun, verb, adjective, adverb, etc. This helps when SentiWordNet is applied to the words.
4. SENTIWORDNET
SentiWordNet is a sophisticated feature that can be imported through the wordnet package, which is present by default in the natural language tool kit. Synset is a functionality that helps find the score of each word: we tag the word with its part of speech, and it gives us a score.
This algorithm is used to obtain the sentiment score of each piece of data in the dataset.
6. TF-IDF VECTORIZATION
As the name suggests, term frequency states the number of times a word has occurred in the dataset. Term Frequency – Inverse Document Frequency [13] also helps in retrieving the data and is widely used in text mining. The tf-idf value of a word increases with the number of times the word appears in a document, offset by how often the word appears across the whole dataset. The vectorizer can be imported from the sklearn.feature_extraction.text package of scikit-learn. Once the features are extracted, they can be used to train the classifier. For example, consider the following two sentences:
It is a windy day today.
It is going to rain today.
In both of these sentences, the stop words are removed and only the features are kept: “windy”, “day”, and “today” from the first sentence, and “going”, “rain”, and “today” from the second. The term frequency, that is, the number of times a term has occurred in the data set, is then calculated to measure how relevant the term is; for example, “today” has occurred two times.