Predicting Sentiment of Employee Reviews


This post was originally published by Kamil Mysiak at Towards Data Science

In my previous articles, we learned how to scrape, process, and analyze employee reviews. Feel free to take a look and offer feedback. I would love to hear how you would improve the code, in particular how to dynamically overcome changes to the website’s HTML.

In this article, I would like to take our dataset a step further to solve a sentiment classification problem. Specifically, we will be assigning sentiment targets to each review and then using a binary classification algorithm to predict those targets.

We’ll be importing the raw employee reviews which we scraped earlier. I would suggest you review my previous article to understand how we were able to obtain this dataset.

First, let’s import our dataset and get processing. We only need the ‘rating’ and ‘rating_description’ columns for this analysis. For a more detailed explanation of text pre-processing, head over to my NLP preprocessing article.

Removing Contractions

Converting List of Words to String

The result of applying the “fix” function is a list of words and we need to convert the list back to a string.
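
The two steps above can be sketched as follows. This is only an illustration: a tiny hand-written lookup table (CONTRACTIONS, hypothetical and far from complete) stands in for the “fix” function so the example runs without extra packages, and the function names are my own.

```python
# Hypothetical, deliberately tiny lookup table standing in for the
# contraction-expanding "fix" function mentioned in the article.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
    "i'm": "I am",
    "we're": "we are",
}


def expand_contractions(text):
    """Expand known contractions word by word and return a list of words."""
    return [CONTRACTIONS.get(word.lower(), word) for word in text.split()]


def words_to_string(words):
    """Join the list of words back into a single string."""
    return " ".join(words)


review = "I'm happy here but we're understaffed and can't take time off"
expanded = expand_contractions(review)   # list of words
clean_review = words_to_string(expanded)  # back to one string
print(clean_review)
```

Applied to a dataframe, each function would typically be passed to the column’s apply method, with the output stored in a new column.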

Removing Non-English Reviews

First, we will use the ‘fasttext’ library to identify the language of each review. Use this link to download the pre-trained model. Once each review is labeled, we simply filter the dataframe to include only the English reviews.


Removing Punctuations and Special Characters

Converting To Lowercase


Removing any Numerical Characters (i.e., integers, floats)
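
A minimal sketch of the three cleaning steps (punctuation, lowercasing, numbers) using only regular expressions. The exact patterns are assumptions on my part; the original code may have used different ones.

```python
import re


def remove_punctuation(text):
    """Replace every character that is not a letter, digit or whitespace with a space."""
    return re.sub(r"[^\w\s]|_", " ", text)


def remove_numbers(text):
    """Drop standalone number tokens such as '3' or '2019'."""
    return re.sub(r"\b\d+\b", " ", text)


def clean(text):
    """Run the three steps in order and collapse repeated whitespace."""
    text = remove_punctuation(text)   # '3.5%' becomes '3 5 '
    text = text.lower()
    text = remove_numbers(text)       # the split float parts are removed here
    return " ".join(text.split())


print(clean("Great pay!!! Raised 3.5% after 2 years, #blessed"))
```

Because punctuation is stripped first, a float such as 3.5 is already split into two integer tokens by the time the number filter runs, so one integer pattern covers both cases.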


In order to apply lemmatization to our tokens we first need to identify the parts of speech for each word. Next, we need to convert the tags obtained from nltk to wordnet tags as those are the only tags accepted by our lemmatizer (WordNetLemmatizer()).
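
The tag conversion can be sketched as below. The function name is hypothetical, and falling back to noun for every other tag is a common convention that I am assuming here. The single-letter return values are the same strings that nltk exposes as wordnet.ADJ, wordnet.VERB, wordnet.ADV and wordnet.NOUN, so the example runs without downloading any nltk data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as produced by nltk.pos_tag) to a WordNet tag."""
    if tag.startswith("J"):
        return "a"  # adjective, same value as wordnet.ADJ
    if tag.startswith("V"):
        return "v"  # verb, same value as wordnet.VERB
    if tag.startswith("R"):
        return "r"  # adverb, same value as wordnet.ADV
    return "n"      # noun (wordnet.NOUN), used as the fallback


# Hand-written (word, tag) pairs standing in for the output of nltk.pos_tag.
tagged = [("employees", "NNS"), ("working", "VBG"),
          ("quickly", "RB"), ("great", "JJ")]
wordnet_tagged = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tagged)
```

Each (word, tag) pair can then be handed to WordNetLemmatizer().lemmatize(word, pos=tag).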

Removing Stopwords

Removing Words Less than 3 Characters Long
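
A short sketch of the last two filters. STOPWORDS is a hypothetical mini list used only so the example is self-contained; a real run would use a full stopword list such as the one shipped with nltk.

```python
# Hypothetical mini stopword list, for illustration only.
STOPWORDS = {"the", "and", "is", "are", "was", "but", "for", "with", "very"}


def remove_stopwords(words):
    """Keep only the words that are not in the stopword list."""
    return [word for word in words if word not in STOPWORDS]


def remove_short_words(words, min_len=3):
    """Drop words with fewer than min_len characters."""
    return [word for word in words if len(word) >= min_len]


tokens = "the pay is good but my manager was very busy".split()
filtered = remove_short_words(remove_stopwords(tokens))
print(filtered)
```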

What we are left with is a large dataframe which contains a column for each preprocessing step we performed. I prefer to work in this manner as we can easily see the changes before and after each step. It makes data validation much easier, if you ask me.

Determining Sentiment

We can utilize a library such as “TextBlob.Sentiment” to calculate the sentiment of each review but since we have the actual employee rating we can use this column instead.

The employees rated the company on a 5-point Likert scale, and over 75% of the ratings are positive (i.e., 5 or 4). Furthermore, a 5-point scale has a rating of 3 (neutral), which can be positive or negative. For simplicity, we are going to remove neutral ratings. Ratings of 1 and 2 will be negative and ratings of 4 and 5 will be positive.

Finally, we’ll filter the dataframe to include only the columns we’ll need for training our model.
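
A minimal sketch of the labeling and column filtering on a toy dataframe. The ‘rating’ and ‘rating_description’ column names come from the dataset described earlier; the ‘sentiment’ column name and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Toy data; the real dataframe holds the scraped and preprocessed reviews.
reviews = pd.DataFrame({
    "rating": [5, 4, 3, 2, 1],
    "rating_description": ["great place", "good benefits", "it is fine",
                           "poor management", "avoid"],
})

# Drop the neutral (3-star) reviews.
labeled = reviews[reviews["rating"] != 3].copy()

# Ratings of 4 or 5 become positive (1); ratings of 1 or 2 become negative (0).
labeled["sentiment"] = (labeled["rating"] >= 4).astype(int)

# Keep only the columns needed for training.
model_df = labeled[["rating_description", "sentiment"]].reset_index(drop=True)
print(model_df)
```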

Visualize the Data

Let’s take a second and visualize our data based on how positive or negative each word is (i.e., its sentiment). To accomplish this task, we need to calculate the frequency of each unique word in our vocabulary as it appears in positive sentiment reviews and in negative sentiment reviews. The final result will resemble the list below, where the word ‘work’ appears 425 times in negative sentiment reviews and 4,674 times in positive sentiment reviews.

Here is the result the code below will produce

In order to calculate the list above, we first sort the reviews by sentiment in descending order and reset the index. Next, we can see that we have 3,570 positive and 1,039 negative sentiment reviews. Now we split our dataset into positive and negative reviews and then create an array of 3,570 ones and 1,039 zeros.

Next, we create an empty dictionary named “frequencies” which will house the output. We then iterate over the reviews zipped with the targets array (ones for the 3,570 positive reviews, followed by zeros for the negative reviews). As we iterate over each word in each review, we create a tuple composed of the word and its sentiment (i.e., 1 or 0). Because the reviews are sorted by sentiment, only the words in the first 3,570 reviews are paired with a positive label; all the words in the remaining reviews are paired with a negative label. The end result is a dictionary whose keys are (word, sentiment) tuples and whose values are the number of times that word appears in reviews with that sentiment.
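
The construction of the dictionary can be sketched on a handful of toy reviews. The function name and the sample reviews are hypothetical, and integer labels are used here in place of the 1.0/0.0 floats a numpy array would hold.

```python
from collections import defaultdict


def build_frequencies(reviews, targets):
    """Count how often each word occurs in reviews of each sentiment.

    Returns a dict mapping (word, sentiment) tuples to counts.
    """
    frequencies = defaultdict(int)
    for review, target in zip(reviews, targets):
        for word in review.split():
            frequencies[(word, target)] += 1
    return dict(frequencies)


# Toy reviews, already sorted with the positive ones first.
positive_reviews = ["great work environment", "good work life balance"]
negative_reviews = ["too much work", "poor management"]

reviews = positive_reviews + negative_reviews
targets = [1] * len(positive_reviews) + [0] * len(negative_reviews)

frequencies = build_frequencies(reviews, targets)
print(frequencies[("work", 1)], frequencies[("work", 0)])
```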

For example, the word “work” appears 425 times across all negative reviews.

Now the last thing we need to do is sum the positive and negative counts for each word from the “frequencies” dictionary.

First, we take the word from each key in the frequencies dictionary, creating a list of all the words (stored in “words”). Next, we initialize the count variables pos_count and neg_count. If the tuple key (word, 1.0) appears in the frequencies dictionary (e.g., (‘work’, 1.0)), we look up the value for that key and the “pos_count” variable becomes the number of times the word appears in positive reviews. The same is done with (word, 0.0) for the negative reviews. Finally, we append the word, pos_count, and neg_count to the “data” list.
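
A small sketch of this summing step. The counts for ‘work’ are the ones quoted in the article; the other entries are made up so that some words are missing one of the two keys.

```python
# (word, sentiment) -> count. Only the 'work' counts come from the article.
frequencies = {
    ("work", 1): 4674,
    ("work", 0): 425,
    ("great", 1): 3,   # hypothetical
    ("poor", 0): 2,    # hypothetical
}

# Unique words taken from the dictionary keys.
words = sorted({word for word, _ in frequencies})

data = []
for word in words:
    # A missing key simply means the word never appeared with that sentiment.
    pos_count = frequencies.get((word, 1), 0)
    neg_count = frequencies.get((word, 0), 0)
    data.append([word, pos_count, neg_count])

print(data)
```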

As you can see, if we plot all the words the plot becomes a bit messy, so let’s select a few random words. Since there are more positive reviews, the plot is skewed towards the positive end.

Feature Extraction

Instead of going down the route of One-Hot Encoding or CountVectorizing, which would create an enormous sparse matrix (if you want to visualize a sparse matrix, check this out), we are going to create a vectorized feature set of just two features. The first feature will be the sum of the negative frequencies from the “frequencies” dictionary for every unique word in the review. The second feature will be the sum of the positive frequencies from the “frequencies” dictionary for every unique word in the review.

For example, looking at the first feature (i.e., negative totals), let’s assume we have a review with the following words [‘work’, ‘apple’, ‘contractor’]. Looking at the previously calculated “frequencies” dictionary, the feature 1 value for this review would be 739 (425 + 279 + 35 = 739).

First, we create a zero-filled 1×2 NumPy array named “x”. Looping through each word in a review, if (word, 0) (e.g., (‘work’, 0)) appears as a key in the frequencies dictionary, we look up its value and add it to the first column of “x”. If the key does not exist, zero is added. The for-loop then checks whether the same word with positive sentiment (e.g., (‘work’, 1)) appears in the frequencies dictionary. If so, its value is added to the second column of “x”. Once the loop over a review finishes, the x[0,0] and x[0,1] values are written to that review’s row in the corresponding columns of the “X” array.
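
A sketch of the feature extraction, following the feature definitions given earlier (negative total in the first column, positive total in the second). The function name is hypothetical. The negative counts reuse the 425 + 279 + 35 example, assuming the three numbers belong to ‘work’, ‘apple’ and ‘contractor’ in that order; the positive counts for ‘apple’ and ‘contractor’ are invented.

```python
import numpy as np


def extract_features(words, frequencies):
    """Return a 1x2 array: [sum of negative counts, sum of positive counts]."""
    x = np.zeros((1, 2))
    for word in set(words):  # every unique word in the review
        x[0, 0] += frequencies.get((word, 0), 0)  # negative frequency
        x[0, 1] += frequencies.get((word, 1), 0)  # positive frequency
    return x


frequencies = {
    ("work", 0): 425, ("apple", 0): 279, ("contractor", 0): 35,
    ("work", 1): 4674,
    ("apple", 1): 300, ("contractor", 1): 50,  # hypothetical values
}

reviews = [["work", "apple", "contractor"], ["work"]]

# One row per review, two feature columns.
X = np.zeros((len(reviews), 2))
for i, words in enumerate(reviews):
    X[i, :] = extract_features(words, frequencies)

print(X)
```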

We can see in the resulting dataframe below, the words in the first review had a sum of 6,790 for the negative words and 54,693 for the positive words.

Train Test Split
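
A minimal sketch of the split using scikit-learn. The 80/20 split, the stratification on the target and the random seed are my assumptions, and the random feature matrix merely stands in for the two-column feature set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 reviews, two features, 80 positive and 20 negative.
rng = np.random.default_rng(0)
X = rng.integers(0, 1000, size=(100, 2)).astype(float)
y = np.array([1] * 80 + [0] * 20)

# stratify=y keeps the positive/negative ratio the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape, y_test.mean())
```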

Classifier Evaluation

We are going to evaluate six classifiers: Logistic Regression, Random Forest, KNN, Naive Bayes, Support Vector Classifier, and Gradient Boosting Classifier. We definitely want our model to favor accurate prediction of negative reviews (True Positives), as those give us insights into organizational problems. That said, we do not want an overly “picky” model which will only predict a review to be negative if it’s absolutely certain. We ultimately want a more balanced model which favors accurate negative review predictions (True Positives) but also does a good job correctly classifying positive reviews. Therefore, from a metric perspective, we want high recall to correctly identify as many of the truly negative reviews as possible, but also good precision in order to minimize false positives (positive reviews flagged as negative). Furthermore, a high AUC score would indicate the model separates positive and negative reviews well across classification thresholds.
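
To make the two metrics concrete, here is a small worked example that treats negative reviews as the class of interest. The counts are hypothetical and the function name is my own.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of the reviews flagged negative, how many truly are
    recall = tp / (tp + fn)     # of the truly negative reviews, how many were flagged
    return precision, recall


# Hypothetical counts, with "negative review" as the positive class:
tp = 80  # negative reviews correctly flagged as negative
fp = 40  # positive reviews wrongly flagged as negative
fn = 20  # negative reviews the model missed

precision, recall = precision_recall(tp, fp, fn)
print(round(precision, 3), round(recall, 3))
```

Lowering the number of false positives raises precision, while lowering the number of missed negative reviews raises recall.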

In order to obtain a more accurate model evaluation and avoid overfitting, we will use Stratified K-Fold Cross Validation. This way each model will be evaluated k times on a different split of training and test data from our original training data. Metrics will be calculated at each fold and then averaged across all the folds. We will compare the average training and test metrics to determine which classifier might be overfitting. Furthermore, because the negative class is underrepresented in the training data, we’ll use the SMOTE technique at each fold to oversample the minority class and balance the target. Let’s take a look at how our classifiers performed.
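
The evaluation loop can be sketched roughly as follows. To keep the example dependent on scikit-learn and NumPy only, plain random oversampling (duplicating minority rows) stands in for SMOTE, which would be swapped in at the marked line. The synthetic data, the choice of Logistic Regression and the five folds are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold


def oversample_minority(X, y, rng):
    """Duplicate random minority-class rows until both classes are the same size."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])


# Synthetic imbalanced data: roughly 25% negative (0), 75% positive (1).
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.25, 0.75], random_state=0)

rng = np.random.default_rng(0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

train_scores, test_scores = [], []
for train_idx, test_idx in skf.split(X, y):
    # Balance only the training part of the fold (SMOTE would go here).
    X_tr, y_tr = oversample_minority(X[train_idx], y[train_idx], rng)
    model = LogisticRegression().fit(X_tr, y_tr)
    # Recall for the negative-review class (label 0).
    train_scores.append(recall_score(y_tr, model.predict(X_tr), pos_label=0))
    test_scores.append(
        recall_score(y[test_idx], model.predict(X[test_idx]), pos_label=0))

print(round(np.mean(train_scores), 3), round(np.mean(test_scores), 3))
```

Oversampling inside the loop matters: the test part of each fold keeps its original class balance, so the averaged test score is not inflated by synthetic or duplicated rows.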

Unfortunately, Random Forest and KNN are severely overfitting the data. Naive Bayes and SVC have very high test recall but very low precision, which means a large number of false positives. Finally, Logistic Regression and Gradient Boosting Classifier have very similar scores, but the latter has slightly higher precision and AUC. It would seem it is a slightly more balanced model.

Hyperparameter Tuning

Gradient Boosting Classifier has a wide range of hyperparameters we can tune. Typically, n_estimators (i.e., number of trees), max_depth, and learning_rate are thought of as the most important parameters. We are going to examine learning_rate, n_estimators, max_depth, min_samples_split and min_samples_leaf.

Keep in mind these hyperparameter ranges were the result of multiple iterations. We would begin with a wide range (e.g., range(10, 1000, 100)) and narrow down to a more specific range (e.g., range(800, 1000, 1)) based on the obtained scores. It seems a learning rate of 0.01 and n_estimators (i.e., number of trees) of 959 is best.
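
A coarse first pass of such a search might look like the sketch below. The grid is kept deliberately small so it runs quickly, the data is synthetic, and the use of GridSearchCV with AUC as the scoring metric is my assumption about how the search could be set up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic two-feature data standing in for the review features.
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Coarse grid; a later pass would narrow the range around the best value.
param_grid = {
    "learning_rate": [0.01, 0.1],
    "n_estimators": list(range(10, 100, 40)),  # 10, 50, 90
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```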

Now we turn to max_depth, the maximum depth to which each decision tree can be built. Increasing the depth enables the model to capture more information (increased complexity), but there are diminishing returns: with too many levels the model will begin to overfit.

It seems a max_depth of 2 is optimal. As we increase the depth further, we can see the model quickly begins to overfit.
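
One way to see this effect is to compare training and validation AUC as the depth grows, as in the sketch below. The data is synthetic (with label noise added so overfitting is possible) and the depth range is an arbitrary choice, so the numbers will not match the article’s.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic noisy data standing in for the review features.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

results = {}
for depth in range(1, 7):
    model = GradientBoostingClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    results[depth] = (train_auc, val_auc)
    print(depth, round(train_auc, 3), round(val_auc, 3))
```

A widening gap between the training and validation columns as the depth increases is the sign of overfitting to look for.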

Min_samples_split is the minimum number of samples required to split an internal node. In other words, if we set this parameter to 2, a node will need at least two records/reviews in order to split into two nodes. Higher values help prevent overfitting, as they force the decision tree to require more records before it splits. More splits = more depth = increased complexity = overfitting.

Min_samples_leaf is the minimum number of samples required to form a leaf node. In other words, each leaf must have at least min_samples_leaf reviews that it classifies as positive or negative. It seems 26 reviews is the optimal number.

Model Evaluation

Applying our optimized model to our held-out test data, we can see a marginal improvement. Overall, the AUC went up slightly (0.728 to 0.734) along with our recall for the positive review class (74% to 76%). We were able to increase our true positive count by 11 reviews. Our model is still misclassifying a large number of positive reviews as negative (217 False Positives). That said, overall it is doing a fairly good job correctly classifying true positive reviews (76%) and true negative reviews (71%).
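
A comparison of this kind can be sketched as follows. The tuned values (learning rate 0.01, 959 trees, max_depth 2, min_samples_leaf 26) are the ones reported in this article; min_samples_split is left at its default because the tuned value is not stated. The data is synthetic, so the scores will differ from the ones quoted here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic two-feature data standing in for the review features (assumption).
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "base": GradientBoostingClassifier(random_state=0),
    "optimized": GradientBoostingClassifier(
        learning_rate=0.01, n_estimators=959, max_depth=2,
        min_samples_leaf=26, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    cm = confusion_matrix(y_test, model.predict(X_test))
    results[name] = (auc, cm)
    print(name, round(auc, 3))
    print(cm)  # rows: true class, columns: predicted class
```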

Optimized Model

Base Model

I hope you found this article helpful and informative in your data science ventures. As always, I welcome any and all feedback.
