Stock trend prediction from News Sentiment

Published by FirstAlign

Publicly listed companies sell their shares on the stock market. Of the many terms associated with the stock market, two are central to this blog: the closing price and the opening price.

The closing price is the price at which the last share is sold in a trading session before the stock market closes, and the opening price is the price of the first share sold after the market opens. If the opening price is higher than the closing price, the stock has dropped by the end of the day; if the closing price is higher than the opening price, the stock has risen. In this blog we predict this rise and fall of a stock based on the news.

Let’s start.


The first thing we need is the data, which we took from Kaggle. The dataset contains daily news snippets and abstracts related to Apple Inc., along with the stock price at market open and close. We will use the news snippets to predict whether the stock has risen or fallen since the morning. The dataset covers 2006 to 2016.

The snapshot of the dataset is given below.

Dataset snapshot

The news snippets and tick data together provide the features and labels for our model. As we can see, the news column in the second row shows NaN, which represents a missing value, so we must check whether our dataset contains any other missing values.

Missing value count

The above picture shows that our dataset has 194 missing values in the news column, which we need to deal with. As the news column is of string type, we remove all rows containing missing values, as shown below.

Dropping off missing values
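The two steps above (counting and then dropping missing values) can be sketched with pandas. This is a minimal illustration on a toy DataFrame, not the original notebook code; the column names `news`, `open` and `close` are assumptions based on how the columns are described in this post.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle data; column names and values are assumptions.
df = pd.DataFrame({
    "news": ["Apple unveils new iPhone", np.nan, "Apple shares slide on weak sales"],
    "open": [100.0, 101.5, 102.0],
    "close": [101.0, 101.0, 100.5],
})

# Count the missing values in every column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Drop the rows whose news text is missing, then renumber the index.
df_clean = df.dropna(subset=["news"]).reset_index(drop=True)
print(len(df_clean), "rows remain")
```

Passing `subset=["news"]` restricts the check to the news column, so a row is only removed when its text is missing.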

Data Pre-Processing

Once we have dealt with the missing values, it's time to pre-process the news column so that it can be converted into a format we can pass to the machine learning algorithm. Pre-processing follows a few steps:

  1. Remove all characters except letters.
  2. Remove the stop words from the text, the high-frequency words which have little role in deciding the class of a text.
  3. Apply stemming to reduce ambiguity: reduce the words to their root so that a single version of each word is visible across the entire dataset.
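The three steps above could look roughly like the sketch below. It assumes NLTK's `PorterStemmer` for stemming and scikit-learn's built-in `ENGLISH_STOP_WORDS` list for stop words; the post does not say which stemmer or stop-word list was actually used, and `clean_text` is a hypothetical helper name.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = PorterStemmer()


def clean_text(text):
    """Keep letters only, drop stop words, and stem the remaining words."""
    # Step 1: replace every non-letter character with a space and lowercase.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text).lower()
    # Step 2: remove stop words (assumed list: scikit-learn's English set).
    tokens = [w for w in letters_only.split() if w not in ENGLISH_STOP_WORDS]
    # Step 3: reduce each remaining word to its stem.
    return " ".join(stemmer.stem(w) for w in tokens)


print(clean_text("The shares are rising in 2016!"))
```

Applied to a pandas column, this would typically be called as `df["news"].apply(clean_text)`.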

Once the text is cleaned and processed, we apply a CountVectorizer to create a sparse matrix of words with their respective frequencies, which forms the full feature set.

This is done, as shown below.

Applying CountVectorizer
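As a small illustration of what the vectorizer produces, here is a sketch on a three-document toy corpus of already-stemmed text. The corpus is invented for the example, and default `CountVectorizer` settings are assumed since the post does not list the parameters used.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented, already-cleaned snippets standing in for the news column.
corpus = [
    "appl share rise",
    "appl share fall",
    "new iphon launch",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: rows = documents, columns = words

# The learned vocabulary maps each word to its column index.
print(sorted(vectorizer.vocabulary_))
# Dense view of the word counts, for inspection only.
print(X.toarray())
```

Each row of `X` is one news snippet and each column counts how often one vocabulary word occurs in it.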

Now that we have features, we can create labels. In this case the labels are 1 and 0: 1 represents a rise of the stock and 0 a fall. The difference between the close and open prices decides the label: a negative difference means a fall, a positive difference a rise.

Creating labels from the open and close column of the dataset
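A minimal sketch of this labelling rule on toy prices is shown below. The column names `open` and `close` are assumptions, as is the choice to label a day with no change as 0, which the post does not specify.

```python
import pandas as pd

# Toy prices; the third day closes exactly where it opened.
df = pd.DataFrame({
    "open": [100.0, 101.5, 102.0],
    "close": [101.0, 101.0, 102.0],
})

# 1 when the close is above the open (rise), otherwise 0 (fall).
# Assumption: an unchanged day (difference of zero) is also labelled 0.
df["label"] = (df["close"] - df["open"] > 0).astype(int)
print(df)
```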

Now that we have features and labels, we can split them for model evaluation. In this case we used a 40/60 split, meaning 40% of the data is used for testing and 60% for training.

Splitting data for training and testing
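One common way to get such a split is scikit-learn's `train_test_split` with `test_size=0.4`. The sketch below uses toy arrays in place of the real feature matrix and labels, and the `random_state` value is an assumption added only to make the result repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and labels standing in for the vectorized news and rise/fall labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 40% of the rows go to the test set, the remaining 60% to the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)
print(len(X_train), "training rows,", len(X_test), "test rows")
```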

Applying Model

We can now apply the algorithms to create the predictive models. For this case we used three supervised algorithms: Logistic Regression, Naïve Bayes and Random Forest. The code snippets for applying each are shown below.

Logistic regression classifier
Naïve Bayes classifier
Random Forest Classifier
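For readers without the screenshots, fitting the three classifiers with scikit-learn might look like the sketch below. The data is random toy word counts. `MultinomialNB` is an assumed choice of Naïve Bayes variant, and the hyperparameters (`max_iter`, `n_estimators`, `random_state`) are illustrative, since the post does not state them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy word-count features and alternating labels; not the real dataset.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 3, size=(40, 8))
y_train = np.array([0, 1] * 20)
X_test = rng.integers(0, 3, size=(10, 8))

# Hyperparameters below are illustrative assumptions.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)                # learn from the training split
    predictions[name] = model.predict(X_test)  # predict rise (1) or fall (0)
    print(name, predictions[name])
```

All three estimators share the same `fit` and `predict` interface, which is why they can be trained in a single loop.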

Model Evaluation

After applying the algorithms, we evaluated each model on four metrics: accuracy, precision, recall and F1 score.
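To make these four metrics concrete, here is a small pure-Python sketch that computes them from true and predicted labels, treating 1 (rise) as the positive class. The label lists are invented for illustration and `classification_metrics` is a hypothetical helper; in practice scikit-learn's `accuracy_score`, `precision_score`, `recall_score` and `f1_score` do the same job.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # rises predicted as rises
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # falls predicted as falls
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # falls predicted as rises
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # rises predicted as falls

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```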

Based on the accuracy we concluded that the Random Forest and Logistic Regression were the algorithms that performed best.

Below are the results of each algorithm.

Results for Logistic Regression
Results for Naïve Bayes
Results for Random Forest


In this blog we have predicted the rise or fall of a stock based on news snippets taken from a Kaggle dataset containing stock data for Apple from 2006 to 2016. We analyzed the open and close tick data alongside the news column from this dataset and applied three machine learning algorithms for prediction (Logistic Regression, Naïve Bayes and Random Forest).[1] [2]

Based on the evaluation, Random Forest proved the best algorithm, but Logistic Regression wasn’t far behind. We looked at the profit and loss of the stock over time and related this to keywords within the news of the period. This was used to calculate a Sentiment Value, used as a determining factor in the rise and fall of a stock over time. This article is just one step towards understanding the relationship between stocks and news.

The complete code is available on GitHub. We hope you enjoyed the article. Until next time, stay tuned and happy coding ❤
