Twitter US Airline Sentiment Analysis


This post was originally published at Towards Data Science

Feedback analysis with a LightGBM classifier.

The objective here is to analyze how travelers expressed their feelings about US airlines on Twitter in February 2015. Airlines could use this freely available data to provide better service to their customers. The dataset can be downloaded from here.

How can we analyze it?

I loaded the data, saved in a local directory, into Python with pandas:

import pandas as pd
tweets = pd.read_csv('Tweets.csv')

Let’s look at the features included in the dataset:

tweets.head()

The column of interest is “airline_sentiment”, and the task is to predict it from the text of travelers’ tweets. This is called sentiment analysis.

To get a better picture of the observations and features, we can run the following command, which reports each column’s data type and non-null count.

tweets.info()

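Let’s visualize how many tweets are negative, neutral, and positive.

import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(3,5))
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x='airline_sentiment', data=tweets, order=tweets.airline_sentiment.value_counts().index, palette='plasma')
plt.show()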

Most tweets are negative, which gives airlines free feedback they can respond to. We can also show the sentiment breakdown for each airline.

g = sns.FacetGrid(tweets, col='airline', col_wrap=3, height=5, aspect=0.7)
g = g.map(sns.countplot, 'airline_sentiment', order=tweets.airline_sentiment.value_counts().index, palette='plasma')
plt.show()
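As a numeric complement to the per-airline plots, the same breakdown can be tabulated with pandas. This is only a sketch: the rows below are made up for illustration, and only the column names "airline" and "airline_sentiment" come from the dataset.

```python
import pandas as pd

# Hypothetical rows standing in for Tweets.csv; only the two column
# names match the real dataset.
toy = pd.DataFrame({
    "airline": ["United", "United", "United", "Delta", "Delta", "Virgin America"],
    "airline_sentiment": ["negative", "negative", "neutral",
                          "positive", "negative", "positive"],
})

# Share of each sentiment within each airline (every row sums to 1).
sentiment_share = pd.crosstab(
    toy["airline"], toy["airline_sentiment"], normalize="index"
)
print(sentiment_share.round(2))
```

On the real data, replacing toy with tweets would give the proportions behind each panel of the plot.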

To do sentiment analysis, we need to import a few libraries. Since this is a classification problem, I use LGBMClassifier.

from lightgbm import LGBMClassifier

We need to convert these tweets (texts) to a matrix of token counts.

from sklearn.feature_extraction.text import CountVectorizer
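To make "a matrix of token counts" concrete, here is a small sketch on three made-up tweets (they are not taken from the dataset). Each row is a tweet, each column is a word from the learned vocabulary, and English stop words such as "the" are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up tweets, used only to illustrate the shape of the output.
toy_tweets = [
    "the flight was delayed",
    "flight cancelled and the next flight delayed",
    "friendly crew",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(toy_tweets)  # sparse matrix: tweets x vocabulary

# vocabulary_ maps each kept word to its column index.
flight_col = vectorizer.vocabulary_["flight"]
print(sorted(vectorizer.vocabulary_))
print(counts.toarray())
print("'flight' count in tweet 2:", counts[1, flight_col])
```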

The next step is to normalize the count matrix using tf-idf representation.

from sklearn.feature_extraction.text import TfidfTransformer
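The effect of the tf-idf step can be checked on a toy example. The sketch below uses made-up tweets and scikit-learn's default settings: words that occur in more tweets receive a lower idf weight, and each row of the result is scaled to unit L2 norm by default.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up tweets, used only for illustration.
toy_tweets = [
    "flight delayed",
    "flight cancelled",
    "friendly crew",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_tweets)

tfidf_step = TfidfTransformer()           # defaults: norm='l2', smooth_idf=True
tfidf = tfidf_step.fit_transform(counts)

# idf_ holds one weight per vocabulary column.
idf_flight = tfidf_step.idf_[count_vec.vocabulary_["flight"]]  # appears in 2 tweets
idf_crew = tfidf_step.idf_[count_vec.vocabulary_["crew"]]      # appears in 1 tweet
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

print("idf(flight) =", idf_flight, " idf(crew) =", idf_crew)
print("row norms:", row_norms)
```

Because the default TfidfTransformer already returns unit-length rows, a separate L2 normalization step applied directly afterwards leaves the matrix unchanged.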

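I used scikit-learn’s Pipeline class to chain all the steps together: the token counts are tf-idf weighted, normalized, reduced to 100 components with truncated SVD, and passed to the classifier.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import TruncatedSVD
# Each step is a (name, estimator) pair; the last step is the classifier
twitter_sentiment = Pipeline([('CVec', CountVectorizer(stop_words='english')),
('Tfidf', TfidfTransformer()),
('norm', Normalizer()),
('tSVD', TruncatedSVD(n_components=100)),
('lgb', LGBMClassifier(n_jobs=-1))])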

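Finally, cross_validate is used with the one-vs-rest ROC AUC metric (roc_auc_ovr).

%%time
from sklearn.model_selection import cross_validate
# 5-fold cross-validation of the whole pipeline, scored with one-vs-rest ROC AUC
cv_pred = cross_validate(twitter_sentiment,
tweets['text'],
tweets['airline_sentiment'],
cv=5,
scoring='roc_auc_ovr')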
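For readers unfamiliar with what cross_validate returns, the sketch below runs a similar pipeline on a small synthetic corpus and reads the per-fold scores from the "test_score" key. Everything here is an assumption for illustration: the texts are generated, LogisticRegression stands in for LGBMClassifier so the example runs without LightGBM installed, and n_components is reduced to fit the tiny vocabulary. The scores say nothing about the real dataset.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Synthetic three-class corpus (hypothetical, not from Tweets.csv).
words = {
    "negative": ["delayed", "cancelled", "lost", "rude"],
    "neutral": ["gate", "schedule", "seat", "terminal"],
    "positive": ["friendly", "comfortable", "smooth", "helpful"],
}
texts, labels = [], []
for label, vocab in words.items():
    for i in range(12):
        texts.append(f"flight {vocab[i % 4]} {vocab[(i + 1) % 4]}")
        labels.append(label)

toy_pipeline = Pipeline([
    ("CVec", CountVectorizer()),
    ("Tfidf", TfidfTransformer()),
    ("tSVD", TruncatedSVD(n_components=5, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),  # stand-in classifier
])

cv_toy = cross_validate(toy_pipeline, texts, labels, cv=3, scoring="roc_auc_ovr")

# cross_validate returns a dict; "test_score" holds one score per fold.
fold_scores = cv_toy["test_score"]
print("per-fold ROC AUC:", fold_scores)
print("mean ROC AUC:", fold_scores.mean())
```

Applied to the article's own run, the equivalent summary would be cv_pred['test_score'].mean().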

The results measured using ROC AUC are as follows.

The complete code can be accessed through this link.
