A step-by-step tutorial for conducting Sentiment Analysis


This post was originally published by Zijing Zhu at Towards Data Science

Part 1: preprocessing text data

It is estimated that 80% of the world’s data is unstructured, so deriving information from unstructured data is an essential part of data analysis. Text mining is the process of deriving valuable insights from unstructured text data, and sentiment analysis is one application of text mining. It uses natural language processing and machine learning techniques to understand and classify subjective emotions in text data. In business settings, sentiment analysis is widely used for understanding customer reviews, detecting spam in emails, and so on. This article is the first part of a tutorial that introduces the specific techniques used to conduct sentiment analysis with Python. To better illustrate the procedures, I will use one of my projects as an example, in which I conduct news sentiment analysis on WTI crude oil futures prices. I will present the important steps along with the corresponding Python code.

Some background information

Crude oil futures prices have large short-run fluctuations. While the long-run equilibrium price of any product is determined by demand and supply conditions, short-run price fluctuations reflect market confidence and expectations toward the product. In this project, I use crude-oil-related news articles to capture this constantly updating market confidence and these expectations, and I predict changes in crude oil futures prices by conducting sentiment analysis on the news articles. Here are the steps to complete this analysis:

1. Collecting data: web scraping news articles
2. Preprocessing text data

In this article, I will discuss the second step, preprocessing the text data. If you are interested in the other steps, please follow the links to read more (coming up).

Preprocessing text data

I use tools from NLTK, spaCy, and regular expressions to preprocess the news articles. To import the libraries and load the pre-built spaCy model, you can use the following code:

import spacy
import nltk

# Initialize the spaCy 'en' model, keeping only the components needed for lemmatization,
# and create the engine (in newer spaCy versions, load 'en_core_web_sm' instead of 'en')
nlp = spacy.load('en', disable=['parser', 'ner'])

Afterwards, I use pandas to read in the data:
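For example, assuming the scraped articles were saved to a CSV file (the file name below is hypothetical), the data can be read like this:

import pandas as pd

# hypothetical file name; the scraped articles are assumed to contain "Subject" and "Body" columns
df = pd.read_csv('crude_oil_news.csv')
print(df[['Subject', 'Body']].head())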

The "Subject" and "Body" columns are the ones I will apply the text preprocessing procedures to. I preprocessed the news articles following standard text mining procedures to extract useful features from the news contents, including tokenization, removing stopwords, and lemmatization.

Tokenization

The first step of preprocessing text data is to break every sentence into individual words, which is called tokenization. Taking individual words rather than sentences breaks down the connections between words. However, it is a common method used to analyze large sets of text data. It is efficient and convenient for computers to analyze text data by examining which words appear in an article and how many times they appear, and this is often sufficient to give insightful results.

Take the first news article in my dataset as an example:

You can use the NLTK tokenizer:
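Here is a minimal sketch with NLTK, assuming text holds the body of the first article (the punkt tokenizer models need to be downloaded once):

from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models, only needed once
text = df.iloc[0]['Body']  # body of the first news article
tokens = word_tokenize(text)
print(tokens[:20])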

Or you can use spaCy; remember that nlp is the spaCy engine defined above. Note that you need to convert each token to a string variable:
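A comparable sketch with spaCy, converting each token back to a string:

text = df.iloc[0]['Body']  # body of the first news article
# nlp(text) yields spaCy Token objects, so each token is converted to a string
tokens = [str(token) for token in nlp(text)]
print(tokens[:20])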

After tokenization, each news article is transformed into a list of words, symbols, digits, and punctuation. You can also specify whether to transform every word into lowercase. The next step is to remove useless information, for example, symbols, digits, and punctuation. I will use spaCy combined with regular expressions to remove them.

import re

# tokenization and removing punctuation
words = [str(token) for token in nlp(text) if not token.is_punct]
# remove digits and other symbols except "@" -- kept to help remove email addresses
words = [re.sub(r"[^A-Za-z@]", "", word) for word in words]
# remove websites and email addresses
words = [re.sub(r"\S+com", "", word) for word in words]
words = [re.sub(r"\S+@\S+", "", word) for word in words]
# remove empty spaces
words = [word for word in words if word != ' ']

After applying the transformations above, this is what the original news article looks like:

Stopwords

After these transformations, the news article is much cleaner, but we still see some words we do not want, for example, “and”, “we”, etc. The next step is to remove these useless words, namely, the stopwords. Stopwords are words that frequently appear in many articles but carry little meaning on their own. Examples of stopwords are ‘I’, ‘the’, ‘a’, ‘of’. These are words that will not affect the understanding of an article if removed. To remove the stopwords, we can import the stopword list from the NLTK library. In addition, I also include other lists of stopwords that are widely used in economic analysis, covering dates and times, more general words that are not economically meaningful, etc. This is how I construct the list of stopwords:

#import other lists of stopwords
with open('StopWords_GenericLong.txt', 'r') as f:
    x_gl = f.readlines()
with open('StopWords_Names.txt', 'r') as f:
    x_n = f.readlines()
with open('StopWords_DatesandNumbers.txt', 'r') as f:
    x_d = f.readlines()

#import nltk stopwords
nltk.download('stopwords')  # only needed once
stopwords = nltk.corpus.stopwords.words('english')

#combine all stopwords
stopwords.extend(x.rstrip() for x in x_gl)
stopwords.extend(x.rstrip() for x in x_n)
stopwords.extend(x.rstrip() for x in x_d)

#change all stopwords into lowercase
stopwords_lower = [s.lower() for s in stopwords]

and then exclude the stopwords from the news articles:

words = [word.lower() for word in words if word.lower() not in stopwords_lower]

Applying this to the previous example, this is how it looks:

Lemmatization

After removing stopwords, along with symbols, digits, and punctuation, each news article is transformed into a list of meaningful words. However, to count the appearances of each word, it is essential to remove grammatical inflections and transform each word into its original form. For example, if we want to calculate how many times the word ‘open’ appears in a news article, we need to count the appearances of ‘open’, ‘opens’, and ‘opened’ together. Thus, lemmatization is an essential step of text transformation. Another way of converting words to their original form is called stemming. Here is the difference between them:

Lemmatization reduces a word to its original lemma, while stemming strips a word down to its linguistic root. I choose lemmatization over stemming because, after stemming, some words become hard to understand. For interpretation purposes, the lemma is better than the linguistic root.
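To make the difference concrete, here is a small sketch comparing spaCy’s .lemma_ attribute with NLTK’s PorterStemmer (the example words are chosen only for illustration):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
examples = "opened studies increasingly"

# lemmatization with spaCy typically gives: open, study, increasingly
print([token.lemma_ for token in nlp(examples)])
# stemming with NLTK typically gives: open, studi, increasingli
print([stemmer.stem(word) for word in examples.split()])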

As shown above, lemmatization is very easy to implement with spaCy: I simply call the .lemma_ attribute on each token produced by the spaCy engine. After lemmatization, each news article is transformed into a list of words that are all in their original forms. The news article now looks like this:

Summarizing the steps

Let’s summarize the steps in one function and apply the function to all articles:

def text_preprocessing(str_input):
    # tokenization, removing punctuation, and lemmatization
    words = [token.lemma_ for token in nlp(str_input) if not token.is_punct]

    # remove symbols, websites, email addresses
    words = [re.sub(r"[^A-Za-z@]", "", word) for word in words]
    words = [re.sub(r"\S+com", "", word) for word in words]
    words = [re.sub(r"\S+@\S+", "", word) for word in words]
    words = [word for word in words if word != ' ']
    words = [word for word in words if len(word) != 0]

    # remove stopwords
    words = [word.lower() for word in words if word.lower() not in stopwords_lower]

    # combine the list back into one string
    string = " ".join(words)
    return string

The function above, text_preprocessing(), combines all the text preprocessing steps. Here is the output for the first news article:

Before applying it to all news articles, it is important to test the function on a few randomly selected articles and see how it performs, using the code below:

import random

# pick a random article (randint is inclusive at both ends, so subtract 1 from the upper bound)
index = random.randint(0, df.shape[0] - 1)
text_preprocessing(df.iloc[index]['Body'])

If there are extra words you want to exclude for this particular project, or some extra redundant information you want to remove, you can always revise the function before applying it to all news articles (a small sketch of this follows the example below). Here is a randomly selected news article before and after tokenization, stopword removal, and lemmatization.

news article before preprocessing

news article after preprocessing
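As mentioned above, one simple way to revise the function is to extend the stopword list with project-specific terms before rerunning the check; the words below are purely hypothetical examples:

# hypothetical project-specific terms, e.g. boilerplate from news wires
stopwords_lower.extend(['reuters', 'bloomberg', 'reporting', 'editing'])
# rerun the check on a random article after revising the stopword list
print(text_preprocessing(df.iloc[index]['Body']))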

If all looks great, you can apply the function to all news articles:

df['news_cleaned'] = df['Body'].apply(text_preprocessing)
df['subject_cleaned'] = df['Subject'].apply(text_preprocessing)

Some remarks

Text preprocessing is a very important part of text mining and sentiment analysis. There are many ways to preprocess unstructured data and make it readable for computers for further analysis. In the next step, I will discuss the vectorizer I used to transform the text data into a sparse matrix so that it can be used as input for quantitative analysis.

If your analysis is simple and does not require much customization in preprocessing the text data, vectorizers usually have built-in functions to conduct the basic steps, such as tokenization and removing stopwords. Alternatively, you can write your own function and pass your customized function to the vectorizer so that you can preprocess and vectorize your data at the same time. If you go this way, your function needs to return a list of tokenized words rather than a long string. Personally, however, I prefer to preprocess the text data before vectorization. That way, I can keep monitoring the performance of my function, and it is actually faster, especially if you have a large dataset.
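To sketch that alternative, here is how a customized tokenizer could be passed to a vectorizer, assuming scikit-learn’s TfidfVectorizer; note that the function supplied as the tokenizer must return a list of tokens rather than a single string:

from sklearn.feature_extraction.text import TfidfVectorizer

def tokenizer_for_vectorizer(str_input):
    # same steps as text_preprocessing(), but returning a list of tokens instead of one string
    words = [token.lemma_ for token in nlp(str_input) if not token.is_punct]
    words = [re.sub(r"[^A-Za-z@]", "", word) for word in words]
    return [word.lower() for word in words
            if len(word) > 0 and word.lower() not in stopwords_lower]

# preprocess and vectorize in one step; the result is a sparse document-term matrix
vectorizer = TfidfVectorizer(tokenizer=tokenizer_for_vectorizer, lowercase=False)
X = vectorizer.fit_transform(df['Body'])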

I will discuss the transformation process in my next article. Thank you for reading!
