Collaborative filtering in PyTorch


This post was originally published by Neel Iyer at Towards Data Science

Building a Neural Network with Embeddings for Movie Recommendations

Collaborative filtering is a tool that companies are increasingly using. Netflix uses it to recommend shows for you to watch. Facebook uses it to recommend who you should be friends with. Spotify uses it to recommend playlists and songs. It’s incredibly useful in recommending products to customers.

In this post, I construct a collaborative filtering neural network with embeddings to understand how users would feel towards certain movies. From this, we can recommend movies for them to watch.

The dataset is taken from here. This code is loosely based off the fastai notebook.

First, let’s load the ratings and movies data.

import pandas as pd
ratings = pd.read_csv('ratings.csv')
movies = pd.read_csv('movies.csv')

Next, let’s get rid of the annoyingly complex user ids. We can make do with plain old integers. They’re much easier to handle.

# Map each original user id to a contiguous integer index
u_uniq = ratings.userId.unique()
user2idx = {o: i for i, o in enumerate(u_uniq)}
ratings.userId = ratings.userId.apply(lambda x: user2idx[x])

Then we’ll do the same thing for movie ids as well.
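The same remapping for movie ids might look like the sketch below. It assumes the ratings frame has a movieId column; the tiny in-memory frame stands in for ratings.csv, and the names m_uniq and movie2idx are illustrative rather than taken from the original code.

```python
import pandas as pd

# Tiny stand-in for ratings.csv (assumed columns: userId, movieId, rating)
ratings = pd.DataFrame({
    "userId": [10, 10, 42, 42],
    "movieId": [1193, 661, 1193, 3408],
    "rating": [5.0, 3.0, 4.0, 4.5],
})

# Map each original movie id to a contiguous integer index,
# in order of first appearance
m_uniq = ratings.movieId.unique()
movie2idx = {o: i for i, o in enumerate(m_uniq)}
ratings["movieId"] = ratings["movieId"].map(movie2idx)
```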

We’ll need to get the number of users and the number of movies.
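A minimal sketch of that count, assuming the ids have already been remapped to contiguous integers. The variable names n_users and n_movies and the small frame are my own placeholders.

```python
import pandas as pd

# Stand-in for the remapped ratings frame
ratings = pd.DataFrame({
    "userId": [0, 0, 1, 2],
    "movieId": [0, 1, 0, 2],
    "rating": [5.0, 3.0, 4.0, 4.5],
})

# Number of distinct users and movies; these become the row counts
# of the two embedding matrices
n_users = int(ratings.userId.nunique())
n_movies = int(ratings.movieId.nunique())
```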


First, let’s create some random weights. We need to call super().__init__() in the constructor. This allows us to avoid calling the base class explicitly. This makes the code more maintainable.

These weights will be uniformly distributed between 0 and 0.05. The trailing underscore in uniform_ denotes an in-place operation.
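Here is a small sketch of that initialisation. The class name TinyEmbedding and the sizes are hypothetical and only there to show super().__init__() and the in-place uniform_ call together.

```python
import torch
import torch.nn as nn

class TinyEmbedding(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, n_rows, n_cols):
        super().__init__()  # initialise nn.Module without naming it explicitly
        self.emb = nn.Embedding(n_rows, n_cols)
        # Trailing underscore: the existing weight tensor is overwritten in place
        self.emb.weight.data.uniform_(0, 0.05)

model = TinyEmbedding(4, 3)
w = model.emb.weight.data
```

Every entry of w now lies between 0 and 0.05, and no new tensor was allocated for the weights.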

Next, we add our Embedding matrices and latent factors.

We’re creating an embedding matrix for our user ids and our movie ids. An embedding is basically an array lookup. When we multiply our one-hot encoded user ids by our weights, most calculations cancel to 0 (0 * number = 0). All we’re left with is a particular row in the weight matrix, which is exactly what an array lookup would return.

So we don’t need the matrix multiply and we don’t need the one-hot encoded array. Instead, we can just do an array lookup. This reduces memory usage and speeds up the neural network. It also reveals the intrinsic properties of the categorical variables. This idea was applied in a recent Kaggle competition and achieved 3rd place.
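To make the equivalence concrete, here is a small sketch comparing the two routes. The matrix sizes and the random weights are arbitrary stand-ins, not values from the model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_users, n_factors = 5, 3
weights = torch.rand(n_users, n_factors)  # stand-in embedding matrix

user_idx = torch.tensor([2, 4])

# Route 1: one-hot encode the ids, then matrix multiply
one_hot = F.one_hot(user_idx, num_classes=n_users).float()
via_matmul = one_hot @ weights

# Route 2: plain row lookup
via_lookup = weights[user_idx]

same = torch.allclose(via_matmul, via_lookup)
```

Both routes pick out rows 2 and 4 of the weight matrix, but the lookup never builds the one-hot matrix.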

The size of these embedding matrices will be determined by n_factors, which sets the number of latent factors the model learns for each user and each movie.

Latent factors are immensely useful in our network. They reduce the need for feature engineering. For example, if user 554 likes Tom Cruise and Tom Cruise appears in a movie, user 554 will probably like the movie. Tom Cruise appearing in a movie would be a latent feature. We didn’t specify it before training. It just showed up. And we’re glad that it did.

Finally, we’ll need to add our forward function.

As the name of this class would suggest, we’re doing a dot product of embedding matrices.

users,movies = cats[:,0],cats[:,1] gives us a minibatch of users and movies. We only look at categorical variables for embeddings. conts refers to continuous variables.

This minibatch size will be determined by the batch size that you set. According to this paper, a large batch size can actually compromise the quality of the model. But according to this paper, a large batch size increases the quality of the model. There is no consensus at the moment. Many people are reporting contradictory results. So feel free to experiment with a batch size of your choosing.

From that minibatch, we want to do an array lookup in our embedding matrix.

self.u(users),self.m(movies) allows us to do that array lookup. This lookup is less computationally intensive than a matrix multiply of a one-hot encoded matrix and a weight matrix.

(u*m).sum(1).view(-1, 1) is a dot product of the user and movie embeddings: it multiplies them element-wise and sums across the factors, returning a single number for each user-movie pair. This is the predicted rating for that movie.
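Putting the pieces described above together, the whole model might look like the sketch below. The class name EmbeddingDot and the default of 50 factors are assumptions on my part; the attribute names u and m and the forward body follow the snippets quoted in the text.

```python
import torch
import torch.nn as nn

class EmbeddingDot(nn.Module):  # class name is an assumption
    def __init__(self, n_users, n_movies, n_factors=50):
        super().__init__()
        # One row of latent factors per user and per movie
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)
        self.u.weight.data.uniform_(0, 0.05)
        self.m.weight.data.uniform_(0, 0.05)

    def forward(self, cats, conts=None):
        # Column 0 holds user indices, column 1 holds movie indices
        users, movies = cats[:, 0], cats[:, 1]
        u, m = self.u(users), self.m(movies)
        # Row-wise dot product, reshaped to one prediction per row
        return (u * m).sum(1).view(-1, 1)

model = EmbeddingDot(n_users=3, n_movies=4, n_factors=5)
cats = torch.tensor([[0, 1], [2, 3]])  # two (user, movie) pairs
preds = model(cats)
```

preds has one row per (user, movie) pair, so its shape here is (2, 1).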

Next, we need to create a ColumnarModelData object.

Then I’ll set up an optimiser. I’ll use stochastic gradient descent for this. optim.SGD implements stochastic gradient descent. Stochastic gradient descent is computationally less intensive per update than full-batch gradient descent. This is because each update estimates the gradient from a randomly selected minibatch rather than from the whole dataset.

We could also use optim.Adam. That combines ideas from RMSProp and momentum, which gives each parameter an adaptive learning rate. But this paper shows that the solutions derived from SGD generalize far better than the solutions obtained from Adam. Plus it doesn’t take that long to train anyway, so SGD isn’t a bad option.
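Setting up either optimiser is a one-liner. In this sketch the learning rates, momentum and weight decay are illustrative assumptions, and the nn.Embedding is just a stand-in for the collaborative filtering model.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Embedding(10, 4)  # stand-in for the collaborative filtering model

# Hyperparameter values here are assumptions, not tuned settings
opt = optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-5, momentum=0.9)

# Alternative with per-parameter adaptive step sizes
opt_adam = optim.Adam(model.parameters(), lr=1e-3)
```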

Then we fit for 3 epochs.

fit(model, data, 3, opt, F.mse_loss)

MSE loss is simply mean squared error loss. This is calculated automatically.
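For anyone without the fastai fit helper and ColumnarModelData to hand, roughly the same training can be sketched in plain PyTorch. This is a stand-in, not the fastai implementation: the DataLoader replaces the model data object, the ratings are synthetic, and every size and hyperparameter is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

class EmbeddingDot(nn.Module):
    def __init__(self, n_users, n_movies, n_factors):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)
        self.u.weight.data.uniform_(0, 0.05)
        self.m.weight.data.uniform_(0, 0.05)

    def forward(self, cats):
        return (self.u(cats[:, 0]) * self.m(cats[:, 1])).sum(1).view(-1, 1)

# Synthetic (user, movie) index pairs with random ratings in [0.5, 5)
n_users, n_movies, n_rows = 20, 30, 256
cats = torch.stack([torch.randint(0, n_users, (n_rows,)),
                    torch.randint(0, n_movies, (n_rows,))], dim=1)
y = torch.rand(n_rows, 1) * 4.5 + 0.5
loader = DataLoader(TensorDataset(cats, y), batch_size=64, shuffle=True)

model = EmbeddingDot(n_users, n_movies, n_factors=8)
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

epoch_losses = []
for epoch in range(3):
    total = 0.0
    for xb, yb in loader:
        loss = F.mse_loss(model(xb), yb)  # mean of squared errors in the batch
        opt.zero_grad()
        loss.backward()
        opt.step()
        total += loss.item() * len(xb)
    epoch_losses.append(total / n_rows)  # average MSE over the epoch
```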

Fastai creates a neural net automatically behind the scenes. You can call a collab_learner which automatically creates a neural network for collaborative filtering. Fastai also has options for introducing Bias and dropout through this collab learner.

Bias is very useful. We need to find user bias and movie bias. User bias would account for people who give high ratings for every movie. Movie bias would account for movies that tend to receive high ratings from most users, regardless of who is rating them. Fastai adds in bias automatically.
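To show what adding bias means in practice, here is a rough PyTorch sketch of a dot-product model with one learned bias per user and per movie. It is not fastai’s implementation; the class name DotWithBias and all sizes are hypothetical.

```python
import torch
import torch.nn as nn

class DotWithBias(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, n_users, n_movies, n_factors=50):
        super().__init__()
        self.u = nn.Embedding(n_users, n_factors)
        self.m = nn.Embedding(n_movies, n_factors)
        self.ub = nn.Embedding(n_users, 1)   # one bias per user
        self.mb = nn.Embedding(n_movies, 1)  # one bias per movie

    def forward(self, users, movies):
        dot = (self.u(users) * self.m(movies)).sum(1, keepdim=True)
        # The biases shift the prediction independently of the latent factors
        return dot + self.ub(users) + self.mb(movies)

model = DotWithBias(n_users=3, n_movies=4, n_factors=5)
users = torch.tensor([0, 2])
movies = torch.tensor([1, 3])
preds = model(users, movies)
```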

Using fastai we can create a collab learner easily:

Interestingly, fastai notes that you should increase the y_range slightly. A sigmoid function is used to ensure that the final output is between the numbers specified in y_range. The issue is that a sigmoid only approaches its bounds asymptotically, so the model could never predict the exact minimum or maximum rating. So we’ll need to increase our y_range slightly. Fastai recommends increasing by 0.5.
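A quick worked example of why the padding helps. I’m assuming ratings top out at 5, and the helper names scaled_sigmoid and raw_score_needed are mine, not fastai’s.

```python
import math

def scaled_sigmoid(x, y_range):
    """Squash a raw score x into the open interval (low, high)."""
    low, high = y_range
    return low + (high - low) / (1.0 + math.exp(-x))

def raw_score_needed(target, y_range):
    """Raw score whose scaled sigmoid equals target (inverse of the above)."""
    low, high = y_range
    p = (target - low) / (high - low)
    return math.log(p / (1.0 - p))

# With an upper bound of exactly 5, even a large raw score stays below 5
near_top = scaled_sigmoid(10.0, (0.0, 5.0))

# Padding the upper bound to 5.5 makes a prediction of 5 reachable
# with a modest raw score of ln(10)
needed = raw_score_needed(5.0, (0.0, 5.5))
```

With the tight range the network would have to push its raw output towards infinity to predict a 5. With the padded range a raw score of about 2.3 is enough.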

I’m using the suggested learning rate here with a small amount of weight decay. This is the combination that I found to work really well.

We can train some more.

We finally get an MSE of 0.784105. But it’s a very bumpy ride. Our loss jumps up and down considerably. That said, 0.784105 is actually a better score than the LibRec system for collaborative filtering. They were getting 0.91**2 = 0.83 MSE.

It’s also actually slightly better than the model that fastai created in their collaborative filtering lesson. They were getting 0.814652 MSE.

A few ideas for improving on this:

  1. We can adjust the size of the embedding by sending in a dictionary called emb_szs. This could be a useful parameter to adjust.
  2. Content-based recommendation. Collaborative filtering is just one method of building a recommendation system. Other methods could be more useful. A content-based system is something I’m keeping in mind. That could look at metadata such as cast, crew, genre and director to make recommendations. I think some kind of hybrid solution would be optimal. This would combine a content-based recommendation system and a collaborative filtering system.
  3. Collaborative filtering is largely undermined by the cold-start problem. To overcome this we could potentially look at the user’s metadata. For example, we could look at things like gender, age, city and the time they accessed the site, essentially everything they entered on the sign-up form. Building a model on that data could be tricky, but if it works well it could be useful.