Build SMS Spam Classification Model using Naive Bayes & Random Forest

This post was originally published at Towards Data Science

If you are into data science and looking for starter projects, SMS spam classification is one you should work on! In this tutorial, we will go step by step from importing libraries to making predictions with the full model and, finally, measuring the accuracy of the model.

A good text classifier categorizes large sets of text documents in a reasonable time frame and with acceptable accuracy, and provides classification rules that are human-readable so they can be fine-tuned. In some application domains, quick training is a further asset. Many techniques and algorithms for automatic text categorization have been devised.

The text classification task can be defined as assigning category labels to new documents based on the knowledge a classification system gained at the training stage. In the training phase, we are given a set of documents with class labels attached, and a classification system is built using a learning method. Classification is an important task in both the data mining and machine learning communities; however, most of the learning approaches used in text categorization come from machine learning research.

For this project, I will be using Google Colab, but you can also use a Python notebook for the same purpose.

Importing of Libraries

First, we import the required libraries: pandas, matplotlib, numpy, and sklearn.
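A minimal sketch of the imports, assuming the sklearn submodules that the rest of this tutorial relies on. The try block around the Drive mount is my own addition so that the cell also runs outside Colab, and the mount path is the usual Colab default rather than anything specific to this project.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import feature_extraction, model_selection, naive_bayes, metrics
from sklearn.ensemble import RandomForestClassifier

# Colab only: mount Google Drive so a dataset stored there can be read.
# Outside Colab the google.colab package is missing and the mount is skipped.
try:
    from google.colab import drive
    drive.mount("/content/drive")
except ImportError:
    drive = None
```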


Note: the Google Drive mount at the end of the import snippet is only needed on Google Colab, where it lets me read the dataset stored in my Drive. You can remove it if you are not using Colab.

Importing the dataset

The dataset is uploaded to my GitHub repo, which can be found here.

After downloading the dataset, we import it using pandas' read_csv function.
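Here is a small sketch of the loading step. Since the real file lives at your own path, I use a few made-up rows in an in-memory buffer instead. The layout (v1, v2, then three empty trailing columns), the file name spam.csv, and the latin-1 encoding in the comment are assumptions on my side, so adjust them to your copy of the data.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: made-up rows in the assumed layout
# (v1 = label, v2 = message, followed by three empty columns).
sample_csv = io.StringIO(
    "v1,v2,,,\n"
    "ham,Are we still meeting for lunch?,,,\n"
    "spam,WINNER! Claim your free prize now,,,\n"
    "ham,Call me when you get home,,,\n"
)

# With the real file you would pass your own path instead, for example
# pd.read_csv("path/to/spam.csv", encoding="latin-1")  # hypothetical path
dataset = pd.read_csv(sample_csv)
print(dataset.head())
```

pandas gives header cells that are empty a name starting with "Unnamed", which is where the extra columns in the output come from.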


Note: Please use your own path for the dataset.

Now that we have imported the dataset, let's check whether it was read in the correct format by using the head() function.


From the above dataset snippet, I see that we have columns which we don't require! So now comes the task of cleaning and reformatting the data so that we can use it to build our model.

Data Cleaning & Exploration

Now we have to remove the unnamed columns. To do so, we use the drop function.
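A minimal sketch of the drop step on a toy frame. The three "Unnamed" column names are an assumption about how the raw file loads, so check dataset.columns on your side first.

```python
import pandas as pd

# Toy frame in the assumed raw layout (made-up messages).
dataset = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["See you at the office tomorrow", "Free cash offer, call now"],
    "Unnamed: 2": [None, None],
    "Unnamed: 3": [None, None],
    "Unnamed: 4": [None, None],
})

# Keep only the label and message columns.
dataset = dataset.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
print(dataset.columns.tolist())
```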


Now, the next task is to rename the columns v1 and v2 to label and message respectively!
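The rename can be sketched with pandas' rename method, again on a toy frame with made-up messages:

```python
import pandas as pd

# Toy frame that already has the unnamed columns removed.
dataset = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["Thanks for the help today", "Claim your free reward today"],
})

# Map the cryptic column names to descriptive ones.
dataset = dataset.rename(columns={"v1": "label", "v2": "message"})
print(dataset.columns.tolist())
```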


Additionally, we can take a quick look at the data (it's an optional step, but it's always good to do some data exploration 😛).
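One possible way to do this exploration, shown on a toy frame. The choice of describe() and a per-label summary is my assumption of a reasonable first look, not the only option.

```python
import pandas as pd

# Toy frame with made-up messages.
dataset = pd.DataFrame({
    "label": ["ham", "ham", "ham", "spam", "spam"],
    "message": [
        "Are we still meeting for lunch",
        "Call me when you get home",
        "Thanks for the help today",
        "Win a free prize now",
        "Win a free prize now",
    ],
})

# Overall summary: row count, number of unique values, most frequent value.
print(dataset.describe())

# The same summary for the messages, split by label.
per_class = dataset.groupby("label")["message"].describe()
print(per_class)
```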


Next, we want to know how many messages are ham and how many are spam in our dataset. For that:


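Explanation: Here we use the value_counts method of pandas with sort=True, which orders the classes from most to least frequent, and draw the counts as a bar plot. The bar colors, green and red, are assigned in that order, so with more ham than spam the ham bar is green and the spam bar is red.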
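A sketch of the counting and plotting step on toy labels. The plot title is a placeholder of mine, and the counts are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy labels: more ham than spam, as in the real dataset.
dataset = pd.DataFrame({"label": ["ham"] * 5 + ["spam"] * 2})

# sort=True orders the classes by frequency, most frequent first.
counts = dataset["label"].value_counts(sort=True)

# Colors are applied in bar order: first bar green, second bar red.
ax = counts.plot(kind="bar", color=["green", "red"])
ax.set_title("Messages per class")
plt.show()
```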

The output should look similar to this: a bar chart with a tall bar for ham and a much shorter one for spam.

We see that we have far more ham messages than spam messages. In this tutorial, I will continue with the dataset as it is, without any oversampling or undersampling.

So first let me encode spam and not spam messages as 1 and 0 respectively.

dataset["label"] = dataset["label"].map({'spam': 1, 'ham': 0})
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, dataset['label'], test_size=0.70, random_state=42)


Now, the second line of the above code snippet uses sklearn's train_test_split function to split the data into training and test sets. Here I have set the test size to 70 percent of the whole dataset. (You can change this as you wish.)
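For a runnable version of these two lines, the feature matrix X has to exist first. The sketch below assumes X is a bag-of-words matrix built with sklearn's CountVectorizer (default settings); the toy messages are made up.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy data: four made-up messages per class, repeated to get 40 rows.
spam = ["win a free prize now", "free cash offer call now",
        "claim your free reward today", "urgent winner call now to claim"]
ham = ["are we still meeting for lunch", "call me when you get home",
       "see you at the office tomorrow", "thanks for the help today"]
dataset = pd.DataFrame({
    "label": ["spam"] * 20 + ["ham"] * 20,
    "message": spam * 5 + ham * 5,
})

# Encode the labels: spam -> 1, ham -> 0.
dataset["label"] = dataset["label"].map({"spam": 1, "ham": 0})

# Assumption: X is a word-count matrix with one row per message.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(dataset["message"])

# 70 percent of the rows go to the test set, the remaining 30 to training.
X_train, X_test, y_train, y_test = train_test_split(
    X, dataset["label"], test_size=0.70, random_state=42
)
print(X_train.shape, X_test.shape)
```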

BONUS: Don't know about splitting a dataset and its benefits? Read this article of mine, where I explain it all!

Now I will use the Multinomial Naive Bayes algorithm!


Besides accuracy, I have also included recall and precision, to assess more accurately how well my model is performing.
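A minimal sketch of one Multinomial Naive Bayes fit with these measures. The toy messages, the CountVectorizer features, and alpha=1.0 are assumptions for illustration only.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy data: made-up messages, 1 = spam, 0 = ham.
spam = ["win a free prize now", "free cash offer call now",
        "claim your free reward today", "urgent winner call now to claim"]
ham = ["are we still meeting for lunch", "call me when you get home",
       "see you at the office tomorrow", "thanks for the help today"]
messages = spam * 5 + ham * 5
labels = [1] * 20 + [0] * 20

X = CountVectorizer().fit_transform(messages)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.70, random_state=42
)

# alpha is the additive smoothing parameter of the model.
bayes = MultinomialNB(alpha=1.0)
bayes.fit(X_train, y_train)
pred = bayes.predict(X_test)

train_accuracy = bayes.score(X_train, y_train)
test_accuracy = bayes.score(X_test, y_test)
# zero_division=0 avoids a warning if no message is predicted as spam.
test_recall = metrics.recall_score(y_test, pred, zero_division=0)
test_precision = metrics.precision_score(y_test, pred, zero_division=0)
print(train_accuracy, test_accuracy, test_recall, test_precision)
```

Recall tells us what share of the real spam was caught, while precision tells us what share of the messages flagged as spam really were spam.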

Now, for different values of alpha, I will make a table of the measures Train Accuracy, Test Accuracy, Test Recall, and Test Precision.


Now we look up the index of the row with the best Test Precision, as that is the measure I am most concerned about here. Note that precision is not always the right measure to evaluate a model with. It always depends on your use case!
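The alpha table and the best-precision lookup can be sketched as follows. The list of alpha values and the toy data are placeholders of mine; on the real dataset you would probably try a finer grid.

```python
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy data: made-up messages, 1 = spam, 0 = ham.
spam = ["win a free prize now", "free cash offer call now",
        "claim your free reward today", "urgent winner call now to claim"]
ham = ["are we still meeting for lunch", "call me when you get home",
       "see you at the office tomorrow", "thanks for the help today"]
X = CountVectorizer().fit_transform(spam * 5 + ham * 5)
labels = [1] * 20 + [0] * 20
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.70, random_state=42
)

# One row of measures per alpha value.
rows = []
for alpha in [0.01, 0.1, 0.5, 1.0, 5.0, 10.0]:
    bayes = MultinomialNB(alpha=alpha).fit(X_train, y_train)
    pred = bayes.predict(X_test)
    rows.append({
        "alpha": alpha,
        "Train Accuracy": bayes.score(X_train, y_train),
        "Test Accuracy": bayes.score(X_test, y_test),
        "Test Recall": metrics.recall_score(y_test, pred, zero_division=0),
        "Test Precision": metrics.precision_score(y_test, pred, zero_division=0),
    })
models = pd.DataFrame(rows)
print(models)

# Index of the row with the highest test precision (first one on ties).
best_index = models["Test Precision"].idxmax()
print(models.loc[best_index])
```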


Next, I will use RandomForestClassifier with n_estimators set to 100 (you can change this as you like to get the best results).
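A minimal sketch of the Random Forest step on toy data. The random_state argument is my addition to make the run repeatable, and the messages and CountVectorizer features are assumptions as before.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy data: made-up messages, 1 = spam, 0 = ham.
spam = ["win a free prize now", "free cash offer call now",
        "claim your free reward today", "urgent winner call now to claim"]
ham = ["are we still meeting for lunch", "call me when you get home",
       "see you at the office tomorrow", "thanks for the help today"]
X = CountVectorizer().fit_transform(spam * 5 + ham * 5)
labels = [1] * 20 + [0] * 20
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.70, random_state=42
)

# A forest of 100 decision trees.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
```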


At the end of the above code snippet, I fit my model on X_train and y_train.

Now, let's see the predictions. I will use the predict function and also calculate the precision, recall, F-score, and accuracy measures.
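The prediction and scoring step might look like the sketch below. It uses the individual scoring functions from sklearn.metrics, which is one of several ways to get these numbers, and the toy data is made up, so the values it prints say nothing about the real dataset.

```python
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy data: made-up messages, 1 = spam, 0 = ham.
spam = ["win a free prize now", "free cash offer call now",
        "claim your free reward today", "urgent winner call now to claim"]
ham = ["are we still meeting for lunch", "call me when you get home",
       "see you at the office tomorrow", "thanks for the help today"]
X = CountVectorizer().fit_transform(spam * 5 + ham * 5)
labels = [1] * 20 + [0] * 20
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.70, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict on the held-out messages and score the predictions.
pred = rf.predict(X_test)
precision = metrics.precision_score(y_test, pred, zero_division=0)
recall = metrics.recall_score(y_test, pred, zero_division=0)
f_score = metrics.f1_score(y_test, pred, zero_division=0)
accuracy = metrics.accuracy_score(y_test, pred)
print(precision, recall, f_score, accuracy)

# Rows are the true classes (ham, spam), columns the predicted ones.
print(metrics.confusion_matrix(y_test, pred))
```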


Model Evaluation


Thus we see that our model's accuracy is approximately 96 percent, which I think is pretty decent. Its precision is also close to 1, again a decent value.

In my next article, I will use NLP and a neural network and explain how we can get a more accurate model!
