Simplicity is key
“Simplicity is the keynote of all true elegance.” – Coco Chanel
*Disclaimer: I am assuming that whoever has the ability to comprehend and execute the content of this article is savvy enough to perform robust back-testing on every corner of their trading pipeline before actually running it in production. However, there are some considerations this article does not take into account (spread, slippage and transaction costs, among others), so it is not to be considered financial advice. It is to be considered an educational step towards better-performing results.
The goal of this article is to predict if the stock price for Alphabet (GOOGL) will be higher or lower at the end of any given day, using news from the closing time of the previous day’s trading session (4 pm EST for the NYSE) to before market hours of that day (9:30 am EST).
By applying Sentiment Analysis to news reports we will create numerical features and join them to our stock data by using the Proximity Blend Algorithm.
I will add a link to the notebook at the bottom of the article, so you can replicate this study case.
This is a dataset of daily candles for the Alphabet (GOOGL) stock price. We’ll use the last 2 years.
We want to predict each day’s ‘change’: the log return between a given day’s closing price and its opening price.
df_alphabet['change'] = np.log(df_alphabet['close']) - np.log(df_alphabet['open'])
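The log return is used here instead of the raw percentage because log returns add up across days and are nearly identical to the simple return for small moves. A quick check with made-up prices:

```python
import numpy as np

# For small moves, the log return is almost identical to the simple return
open_price, close_price = 100.0, 101.0

log_return = np.log(close_price) - np.log(open_price)
simple_return = (close_price - open_price) / open_price

print(round(log_return, 5), round(simple_return, 5))  # → 0.00995 0.01
```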
Let’s look at the open and close prices plotted against the change.
If we zoom in, we can see that the green points are above the ‘zero change’ line when the closing price is higher than the opening price, and vice versa. In other words, the green dots above the line are the days where investing would have generated a positive return.
Our goal is to predict if the change will be higher than zero for any given day using information from before that day so that we can decide to buy at the opening and sell at the close.
This is what the change distribution looks like:
743 times change is positive (53%)
639 times change is negative (47%)
So the coin is already ‘tilted’ in our favor.
Let’s create a success/failure target feature by binarizing the change with a small handicap (0.003) to compensate for slippage, spread and transaction costs, so that a predicted success corresponds to a positive net return.
df_alphabet['target'] = [1 if ch > 0.003 else 0 for ch in df_alphabet['change']]
Now we have a much more ‘inconvenient’ coin.
551 times target is 1 (39%)
832 times target is 0 (61%)
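The split above can be reproduced with `value_counts`; a minimal sketch, using made-up ‘change’ values as stand-ins for `df_alphabet['change']`:

```python
import pandas as pd

# Hypothetical stand-ins for df_alphabet['change']
changes = pd.Series([0.010, -0.002, 0.001, 0.005, -0.004])

# Binarize with the 0.003 handicap, exactly as above
target = (changes > 0.003).astype(int)

# Class proportions: how often the handicapped target is 1 vs 0
print(target.value_counts(normalize=True))
```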
Our goal is to detect successes at a rate that generates income: we want to filter the days and find positive-return days to invest in, with precision above 0.5.
To put it simply, a supervised ML model can be seen as a dependent variable explained by independent variables. When applying sentiment analysis, the independent variables come from text.
In our case (and in many cases), that predicted sentiment ‘ ŝ ’ goes on to feed a second ‘dependent’ model built for a particular use case (such as stock prediction), where x is our target variable.
So, before we build a dependent model to predict our target, we need to create an independent model to classify news sentiment, so we can use its output as input.
Sentiment Analysis is the art of quantifying subjective information.
The ‘sentiment’ representation can be seen as a multi-dimensional space filled with vectors facing many directions (as seen in ‘Multi-Dimensional Sentiment Analysis with Learned Representations’).
But for the purpose of this article, we will simplify and reduce sentiment to a numerical feature for every news headline.
Every headline will have a ‘sentiment’ feature representing how positive or negative the news is with respect to a potential stock movement.
Now, let’s take a look at some news (aggregated by day) that mention Google.
The subjective part is our opinion of which headlines are positive or negative for the stock price. We need to manually label a set of news for the model to be trained with that knowledge.
I manually labeled over 500 news headlines from over 1,300 days of ‘google’ news (that’s right, I earned my up-vote), assigning each a sentiment of -1, 0 or 1.
Now, we need a model that can learn to classify sentiment on new texts, so we can automate this job and instantly use the sentiment as ML fuel for our target.
To accomplish this, we need to transform the text into a numerical format so we can apply ML. For this we need to tokenize and vectorize the text.
Tokenizing is splitting strings into words or groups of words (called n-grams). For example, let’s create the 1- and 2-word n-grams for one text:
Vectorizing is creating a sparse matrix out of an array of texts, where the columns are the n-grams and each row counts the occurrences of each n-gram in one text.
Let’s tokenize the texts and add them as numerical features:
We added 1550 columns to our sentiment dataframe. Now, let’s look at the top correlated (negative and positive) tokens with ‘sentiment’.
Of course, these don’t mean much by themselves so let’s create a machine learning model that combines them to predict sentiment.
First, a simple model to detect ‘positive’ news, training on 365 headlines and testing on 157.
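The article doesn’t pin down which model was used, so treat this only as a sketch of one reasonable setup: logistic regression over the n-gram counts, with made-up headlines and labels in place of the real 365/157 split:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical sample; the study trained on 365 headlines, tested on 157
train_texts = [
    "google beats earnings expectations",
    "google launches popular new product",
    "google faces antitrust lawsuit",
    "google misses revenue estimates",
]
train_is_positive = [1, 1, 0, 0]  # 1 = manually labeled 'positive'

# Vectorize and classify in one pipeline
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                      LogisticRegression())
model.fit(train_texts, train_is_positive)

print(model.predict(["google launches new product"]))
```

The same pipeline, refit on the inverted labels, gives the ‘negative’ detector.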
It correctly detected 7 positive news, incorrectly labeled 1 and missed 28. It correctly discarded 121 ‘not positive’ news.
Now, the exact same process, this time to detect ‘negative’ news.
It correctly detected 12, incorrectly labeled 7 as ‘negative’ and missed 28 ‘negative’ news. It correctly filtered out 110 ‘not negative’ news.
There was no overlap of classification between the two models (no news were both labeled ‘positive’ and ‘negative’).
Though there is huge room for improvement (pre-processing the data, model/parameter optimization, stemming and lemmatization, adding more training data, semantic analysis, running it through a neural network, etc.), this is good and simple enough for our main purpose.
Price Prediction with News
Let’s take a look at the sentiment along with the price.
The green lines are ‘positive’ news and the red ones ‘negative’.
Let’s remove the price curve.
At first glance, it is hard to see the relationship because there are too many ‘change’ points.
Let’s blend the news again and calculate an average sentiment for the last 24 hours using our ‘sentiment’ model.
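The Proximity Blend Algorithm itself isn’t spelled out in the article, so here is only a simplified stand-in: average the headline scores that arrive between one 4 pm close and the next open, and attach that mean to the next session (the column names and timestamps are hypothetical, and weekends/holidays are ignored):

```python
import pandas as pd

# Hypothetical per-headline sentiment scores with timestamps
news = pd.DataFrame({
    "time": pd.to_datetime([
        "2020-03-02 18:00",  # after Monday's close -> informs Tuesday
        "2020-03-03 08:00",  # pre-market Tuesday   -> informs Tuesday
        "2020-03-03 09:00",  # pre-market Tuesday   -> informs Tuesday
    ]),
    "sentiment": [1, -1, 1],
})

# Anything published at or after the 4 pm close counts toward the next session
news["trade_date"] = news["time"].dt.normalize()
after_close = news["time"].dt.hour >= 16
news.loc[after_close, "trade_date"] += pd.Timedelta(days=1)

# Mean sentiment available before each day's open
daily_sentiment = news.groupby("trade_date")["sentiment"].mean()
print(daily_sentiment)
```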
Let’s try to predict if the ‘change’ will be positive (higher than our 0.003 handicap) with the variables we now have from the day before.
Again, trying out a simple model, training with a prior (chronologically) train set of 1078 observations and testing with a posterior one of 304.
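A sketch of that chronological split (the model is unspecified in the article, so logistic regression and random features stand in for the real ones):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical feature matrix (e.g. daily sentiment + candle features) and target
X = rng.normal(size=(1382, 3))
y = (rng.random(1382) > 0.61).astype(int)  # roughly the 39% positive rate above

# Chronological split: fit on the past, evaluate on the future, never shuffle
X_train, X_test = X[:1078], X[1078:]
y_train, y_test = y[:1078], y[1078:]

clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

Shuffling here would leak future information into the training set, which is exactly what a chronological split avoids.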
Our model correctly predicted 85 price increments (change > 0.003), it wrongly accepted 65 ‘non-increment’ cases. It missed 50 increments and it correctly filtered out 104 ‘non-increment’ cases.
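From those confusion counts, the precision clears the 0.5 bar set earlier:

```python
# Confusion counts reported above
tp, fp, fn, tn = 85, 65, 50, 104

precision = tp / (tp + fp)  # of the days we'd invest in, fraction that paid off
recall = tp / (tp + fn)     # of all profitable days, fraction we caught

print(round(precision, 3), round(recall, 3))  # → 0.567 0.63
```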
So, given that there are so many factors that influence stock prices, getting these results with just news headlines and the information contained in the candles is quite remarkable.
Remember, there is no free money out there. Every investment involves a risk. It is always important to understand not just the math and the models but what those models represent.
You can download the notebook to reproduce all of this here.