This post was originally published by Paul Hetherington at Medium [AI]
Natural Language Processing (NLP) is a large area of research with many relevant applications for businesses. Taking in arbitrary text and extracting sentiment, performing translation, or powering auto-suggest/correct are some typical use cases, but the applications are of course endless. Converting text into something as simple as a list of numbers seems like a tremendously abstract concept, and this is one of my favourite parts of machine learning (ML): you don't need to go into amazing depth on language itself to produce some useful models. That's what this article will demonstrate. InduNet V1.0 reads a description of a company (or any text, for that matter) and then tells you which industries the company operates in.
This article is intended to explain some of the basics of NLP in ML and to produce a compact industry predictor. All model training was performed on the Neuro API; check it out!
Recently I had the idea to help aid client prospecting at Neuro by running some NLP processing on a given company's description to see if it would be a good fit for our product offering. 'Good fit' is a very loose term, so to better refine what I mean in this example, let's say 'good fit' means that a company operates in relevant industries: artificial intelligence and machine learning.
To summarise what I want to achieve, I'll describe it in terms of the inputs and outputs of the model we will cover later:
- Input — A company description
- Output — A list of industries that the company works in
One of the hardest parts of ML is getting and processing data. In ML you have a 'model' that you train. Think of the model as a child, and the data as the textbook they use in school. If the textbook (data) is bad, then the child will inevitably not pass their exams. The equivalent of a model passing its exams in our use case is a correct list of the industries a company operates in. Good data in our case means two main things.
- Not biased. If the dataset contained only Agriculture-based companies, that's not very useful for detecting companies in the Quantum space.
- Lots of it. Not all companies describe artificial intelligence in the same way, so we need as many examples as possible.
Thankfully there are big catalogues of companies out there; we will be using Crunchbase! One web crawler (amongst other things) later, and it's very simple to get that data.
The final dataset contained 17.8k companies across 713 industries. Crunchbase gives two descriptions per company, long and short; the short one was used in my dataset. The longest short description was 29 words, and the total number of distinct words was 27.5k. Company names were included in the vocabulary list. Unfortunately, I cannot make the dataset public, but everything you need from the vocabulary and industry lists is included in the model at the end.
Making the data useful
Now that we have all of our data, we need to convert it into a numerical format for our ML model. To do this we assign each word in our vocabulary list a number: as an example, 'the' becomes 1362 in our dataset. It's as simple as that. This covers the inputs to the network; now we must handle the outputs.
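The word-to-number step above can be sketched in a few lines. This is an illustrative toy, not the article's actual pipeline: the vocabulary, index values, and the choice to reserve 0 for padding/unknown words are all assumptions.

```python
# Toy sketch of building a vocabulary and encoding text as numbers.
# Index 0 is reserved for padding and unknown words (an assumption).
def build_vocab(descriptions):
    vocab = {}
    for text in descriptions:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def encode(text, vocab, max_len=29):
    # Map each word to its number, truncate, then pad to a fixed length
    # (29 is the longest short description in the dataset).
    ids = [vocab.get(w, 0) for w in text.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))

vocab = build_vocab(["the company builds farming robots"])
encoded = encode("the farming company", vocab)  # first ids, then zero padding
```

Padding every description to the same length (29 here) is what lets us batch variable-length sentences into one fixed-size input tensor.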
A company can have several industries that it operates in, so our model output must be able to handle this. This is a multi-label classification problem, meaning that we must be able to detect multiple classes (industries) per company. The output from the model will be a list of length 713. If a company operates in a given industry, the corresponding position in the list will equal 1, else 0. For example, say we decide item number 45 in the list corresponds to Agriculture: a company only in the Agriculture space will have a 1 in position 45 and 0 everywhere else.
The final size of our dataset is the following:
- Input size [17.8k, 29]. Total of 17.8k companies, maximum description of 29 words.
- Output size [17.8k, 713]. Total of 17.8k companies, 713 industries.
In ML there are several common ways to process sequential language (sentences/text) inputs. There are a few considerations that I've found help with a high-level understanding of an NLP model: you don't always know how many words you need to process (a sentence could have 4 or 64 words), and the model must have context on a given word. Context means that the model must know what came before the word, or even the sentence before. Here's an example: 1. 'sad' and 2. 'he was not sad'. If a model looks only at an individual word without context, it will likely produce poor results: it will see 'sad' in both examples and assume the input is negative, but example 2 actually suggests the sentence is positive. Context matters; enter the Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN).
LSTMs use the concept of memory, which for us translates to context. When a number (the number we assigned to a word previously) is passed into an LSTM cell, the cell stores some information about the input before performing some arithmetic on it and then outputs a result. When another number comes in, the LSTM cell's stored information is mixed with the new input. As the model trains, every LSTM cell slowly optimises the way it remembers each number, or word. We will be using LSTMs.
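This carrying-forward of memory is visible directly in PyTorch's `nn.LSTM` API: the hidden and cell state returned after one step can be fed back in with the next input, so later words "see" what came before. The sizes below (embedding dimension 100, hidden size 32) are picked just for the demonstration.

```python
# Toy demonstration of LSTM state: the (hidden, cell) tuple returned
# after processing one word is passed in with the next word, which is
# how earlier inputs influence later outputs.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=32, batch_first=True)

step1 = torch.randn(1, 1, 100)   # first word's embedding vector
step2 = torch.randn(1, 1, 100)   # second word's embedding vector

out1, state = lstm(step1)        # state = (hidden, cell), the "memory"
out2, _ = lstm(step2, state)     # second step mixes memory with new input
```

In practice you rarely loop word by word; passing the whole padded sequence at once (as the model later in this article does) lets the LSTM unroll the steps internally.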
We have almost everything we need to begin constructing and training the model. To increase the chances of our model successfully detecting industries, there's one more bit of processing that will drastically increase the quality of the inputs to the model.
Back when we assembled our data, we assigned numbers to each of our words. Imagine that 'the' has a value of 5 and 'house' has a value of 6. Mathematically they are very close, but in reality they are not similar; perhaps it would make more sense if 'flat' or 'apartment' had a value of 5 instead. What I'm alluding to here is that the numbers assigned to words will impact the way the model learns, and an arbitrary assignment can therefore negatively impact the learning. To help with this, a principle called embedding is used. An embedding is a layer inside the model that transforms the single number per word into a list of numbers (a vector). This vector describes how a word relates to all other words, thus removing our previous problem.
To continue the aforementioned example, 'house' is no longer numerically close to 'the' but instead closer to 'flat'/'apartment'. As the model trains, it also learns how to better create these vectors and perform the embedding, but I will not go into detail on that here.
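In PyTorch this is just `nn.Embedding`: a lookup table from word index to a learned vector. The sizes below mirror the article (a vocabulary of roughly 27.5k words, 100-dimensional embeddings); the word indices are illustrative.

```python
# Sketch of an embedding layer: each word index becomes a learned
# 100-dimensional vector, so "closeness" is learned rather than
# dictated by the arbitrary integer IDs.
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=27_500, embedding_dim=100)

word_ids = torch.tensor([[1362, 5, 6]])  # e.g. 'the', and two other words
vectors = embedding(word_ids)            # one 100-dim vector per word
```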
Now we're ready! We'll be building this model with PyTorch, and the syntax is very similar in TensorFlow. All compute-heavy tasks will be carried out on the Neuro API, otherwise it would fry my laptop and take 100x as long.
```python
import torch
import torch.nn as nn

class InduNet_1(nn.Module):
    def __init__(self, hidden_vector_dim, lstm_layers, dropout, max_token,
                 embedding_size, output_classes, padding_idx,
                 vocab_tokens, industry_dict):
        super().__init__()
        # Store all of the model init states
        self.hidden_vector_dim = hidden_vector_dim
        self.lstm_layers = lstm_layers
        self.embedding_size = embedding_size
        self.max_token = max_token
        self.output_classes = output_classes
        # The vocab list is inside of the model
        # (so you can access it after without needing the file),
        # along with the industry list.
        self.vocab_tokens = vocab_tokens
        self.industry_dict = industry_dict
        # Create all of the layers. Embedding default
        # is 100 which is very small, but works pretty
        # well with the dataset.
        self.input_embedding = nn.Embedding(max_token,
                                            embedding_size,
                                            padding_idx=padding_idx)
        self.rnn = nn.LSTM(embedding_size,
                           hidden_vector_dim,
                           num_layers=lstm_layers,
                           batch_first=True)
        self.linear = nn.Linear(hidden_vector_dim,
                                output_classes)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_tokens):
        embedded_tokens = self.input_embedding(input_tokens)
        output, (hidden, cell) = self.rnn(embedded_tokens)
        lstm_output = self.dropout(output)
        prediction = self.linear(lstm_output)
        return prediction
```
We will now initialise this model and compile it to the Neuro API using npu. The vocab_tokens and industry_dict used below are available directly through the model when exported from the API. You can find out how to get them below. The model has a total size of 41.29MB.
```python
# vocab_tokens/industry_dict is our list (dict) of numbers
# associated with words/industries
model = InduNet_1(713, 2, 0.01, 17832,
                  embedding_size=100,
                  output_classes=713,
                  padding_idx=2,
                  vocab_tokens=vocab_tokens,
                  industry_dict=industry_dict)

npu.api(my_api_token)
npu_model = npu.compile(model)
```
We’re now ready to begin training our model! This was a tricky part of the project and took the longest. As there are many industries to classify and a very large vocabulary list I had the most success using a very small batch size of 8, with a learning rate of 0.0001 in the Adam optimiser. Binary Cross Entropy with logits using a mean reduction was also used.
```python
def multi_classification_loss(pred, label):
    pred = pred[:, -1, :]
    torch_loss = torch.nn.BCEWithLogitsLoss(reduction='mean')
    batch_loss = torch_loss(pred.float(), label.float())
    return batch_loss

trained_model = npu.train(
    npu_model,
    train_data=train_data,
    val_data=val_data,
    loss=multi_classification_loss,
    batch_size=8,
    epochs=15,
    optim=npu.optim.Adam(lr=0.0001),
)
```
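Once trained, turning raw model outputs into industry names follows from the loss above: the model emits one logit per industry, so we take the last time step, apply a sigmoid, and threshold. This is a hypothetical inference sketch, not code from the project; the `predict_industries` helper, the 0.5 threshold, and the assumption that `industry_dict` is keyed by industry name are all mine.

```python
# Hypothetical inference helper: sigmoid the per-industry logits at the
# final time step and keep the industries that clear the threshold.
import torch

def predict_industries(model, token_ids, industry_dict, threshold=0.5):
    model.eval()
    with torch.no_grad():
        logits = model(torch.tensor([token_ids]))[:, -1, :]
        probs = torch.sigmoid(logits).squeeze(0)
    names = list(industry_dict)  # assumes dict keys are industry names
    return [name for name, p in zip(names, probs) if p > threshold]
```

Because BCEWithLogitsLoss treats each of the 713 outputs independently, any number of industries can clear the threshold at once, which is exactly the multi-label behaviour we wanted.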