POS tagging using RNN

This post was originally published by Tanya Dayanand at Medium [AI]

Learn how to use RNNs to tag words in an English corpus with their part-of-speech (POS) tag

Photo by Angèle Kamp on Unsplash

The classical way of doing POS tagging is to use some variant of the Hidden Markov Model. Here we'll see how to do it with recurrent neural networks (RNNs). The original RNN architecture has several variants; one of them, the bidirectional RNN, can read sequences in the 'reverse order' as well and has proven to boost performance significantly.

We'll then cover two important variants of the RNN that have made it possible to train large networks on real datasets. Although RNNs can solve a variety of sequence problems, their architecture itself is their biggest enemy because of the exploding and vanishing gradient problems that occur during training. These problems are addressed by two popular gated RNN architectures: the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). We'll look at all of these models here with respect to POS tagging.

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, or simply POS tagging. The NLTK library has a number of corpora that contain words and their POS tags. I will be using the POS-tagged corpora treebank, conll2000, and brown from NLTK to demonstrate the key concepts. To get to the code directly, an accompanying notebook is published on Kaggle.

The corpora use the universal tagset, which contains 12 coarse-grained tags: NOUN (nouns), VERB (verbs), ADJ (adjectives), ADV (adverbs), PRON (pronouns), DET (determiners and articles), ADP (prepositions and postpositions), NUM (numerals), CONJ (conjunctions), PRT (particles), '.' (punctuation), and X (other).

POS-tagging
  1. Preprocess data
  2. Word Embeddings
  3. Vanilla RNN
  4. LSTM
  5. GRU
  6. Bidirectional LSTM
  7. Model Evaluation

Let's begin by importing the necessary libraries and loading the dataset. This is a requisite step in every data analysis process (the complete code can be viewed here). We'll load the data from three well-known tagged corpora and take the union of those.

# Importing and loading the data
# load the POS-tagged corpora from NLTK
# (you may first need nltk.download('treebank'), nltk.download('brown'),
#  nltk.download('conll2000') and nltk.download('universal_tagset'))
from nltk.corpus import treebank, brown, conll2000

treebank_corpus = treebank.tagged_sents(tagset='universal')
brown_corpus = brown.tagged_sents(tagset='universal')
conll_corpus = conll2000.tagged_sents(tagset='universal')

# merge the three corpora into one master list of tagged sentences
tagged_sentences = treebank_corpus + brown_corpus + conll_corpus

As part of preprocessing, we'll perform several steps: dividing the data into words and tags, vectorising X and Y, and padding the sequences.

Let’s look at the data first. For each of the words below, there is a tag associated with it.

# let's look at the data
tagged_sentences[7]

Divide data into words (X) and tags (Y)

Since this is a many-to-many problem, each data point is a different sentence from the corpora. Each data point has multiple words in the input sequence; this is what we will refer to as X. Each word has its corresponding tag in the output sequence; this is what we will refer to as Y. Sample dataset:

X = []  # store input sequences (words)
Y = []  # store output sequences (tags)

for sentence in tagged_sentences:
    X_sentence = []
    Y_sentence = []
    for entity in sentence:
        X_sentence.append(entity[0])  # entity[0] contains the word
        Y_sentence.append(entity[1])  # entity[1] contains the corresponding tag

    X.append(X_sentence)
    Y.append(Y_sentence)

num_words = len(set([word.lower() for sentence in X for word in sentence]))
num_tags  = len(set([tag.lower() for sentence in Y for tag in sentence]))

print("Total number of tagged sentences: {}".format(len(X)))
print("Vocabulary size: {}".format(num_words))
print("Total number of tags: {}".format(num_tags))

# let's look at the first data point
# this is one data point that will be fed to the RNN
print('sample X: ', X[0], '\n')
print('sample Y: ', Y[0], '\n')

# In this many-to-many problem, the length of each input and output sequence must be the same.
# Since each word is tagged, the length of the input sequence must equal the length of the output sequence.
print("Length of first input sequence  : {}".format(len(X[0])))
print("Length of first output sequence : {}".format(len(Y[0])))

The next thing to figure out is how to feed these inputs to an RNN. To give words as input to any neural network, we essentially have to convert them into numbers, either as word embeddings or as one-hot vectors, i.e., a numeric vector for each word. To start, we'll encode the input and output: each word in the corpus gets a unique integer id in the input data. For the output data (the Y matrix of tags), we have twelve POS tags; treating each of them as a class, each POS tag is converted into a one-hot encoding of length twelve. We'll use the Tokenizer() class from the Keras library to encode the text sequences as integer sequences.

Vectorise X and Y

from keras.preprocessing.text import Tokenizer

# encode X
word_tokenizer = Tokenizer()              # instantiate tokeniser
word_tokenizer.fit_on_texts(X)            # fit tokeniser on data

# use the tokeniser to encode the input sequences
X_encoded = word_tokenizer.texts_to_sequences(X)

# encode Y
tag_tokenizer = Tokenizer()
tag_tokenizer.fit_on_texts(Y)
Y_encoded = tag_tokenizer.texts_to_sequences(Y)

# look at the first encoded data point
print("** Raw data point **", "\n", "-"*100, "\n")
print('X: ', X[0], '\n')
print('Y: ', Y[0], '\n')
print()
print("** Encoded data point **", "\n", "-"*100, "\n")
print('X: ', X_encoded[0], '\n')
print('Y: ', Y_encoded[0], '\n')

Make sure that each sequence of input and output is of the same length.

Pad sequences

The sentences in the corpus are not all of the same length. Before we feed the input into the RNN model, we need to fix the length of the sentences: we cannot dynamically allocate the memory required to process each sentence in the corpus when they are of different lengths. Therefore, the next step after encoding the data is to define the sequence length. We need to either pad short sentences or truncate long sentences to a fixed length. This fixed length, however, is a hyperparameter.

from keras.preprocessing.sequence import pad_sequences

# Pad each sequence to MAX_SEQ_LENGTH using Keras' pad_sequences() function.
# Sentences longer than MAX_SEQ_LENGTH are truncated.
# Sentences shorter than MAX_SEQ_LENGTH are padded with zeroes.
# Truncation and padding can either be 'pre' or 'post'.
# For padding we are using 'pre', that is, add zeroes on the left side.
# For truncation we are using 'post', that is, truncate a sentence from the right side.

# sequences longer than 100 tokens will be truncated
MAX_SEQ_LENGTH = 100
X_padded = pad_sequences(X_encoded, maxlen=MAX_SEQ_LENGTH, padding="pre", truncating="post")
Y_padded = pad_sequences(Y_encoded, maxlen=MAX_SEQ_LENGTH, padding="pre", truncating="post")

# print the first sequence
print(X_padded[0], "\n"*3)
print(Y_padded[0])

A better way than one-hot vectors to represent text is word embeddings. Currently, each word and each tag is encoded as an integer. We'll use a more sophisticated technique, word embeddings, to represent the input words (X).

However, to represent each tag in Y, we’ll simply use one-hot encoding scheme since there are only 12 tags in the dataset and the LSTM will have no problems in learning its own representation of these tags.

To use word embeddings, you can go for either of the following models:

  1. word2vec model
  2. GloVe model

We're using the word2vec model here; both are very effective at representing words, so you can try both and see which one works better.

The embedding matrix has dimensions (VOCABULARY_SIZE, EMBEDDING_DIMENSION).

Use word embeddings for input sequences (X)

import numpy as np
from gensim.models import KeyedVectors

# load word2vec vectors using gensim's KeyedVectors
path = '../input/wordembeddings/GoogleNews-vectors-negative300.bin'
word2vec = KeyedVectors.load_word2vec_format(path, binary=True)

# each word in the word2vec model is represented by a 300-dimensional vector
EMBEDDING_SIZE  = 300
VOCABULARY_SIZE = len(word_tokenizer.word_index) + 1

# create an empty embedding matrix and a word-to-index mapping
embedding_weights = np.zeros((VOCABULARY_SIZE, EMBEDDING_SIZE))
word2id = word_tokenizer.word_index

# copy vectors from the word2vec model for the words present in the corpus
for word, index in word2id.items():
    try:
        embedding_weights[index, :] = word2vec[word]
    except KeyError:
        pass

Use one-hot encoding for output sequences (Y)

from keras.utils import to_categorical

# use Keras' to_categorical function to one-hot encode the padded tag sequences
Y = to_categorical(Y_padded)

All the data preprocessing is now complete. Let's now jump to the modeling part by splitting the data into train, validation, and test sets, as sketched below.
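The notebook's exact split sizes aren't reproduced here, so the following is a minimal sketch, assuming scikit-learn's train_test_split and illustrative split ratios; the variable names X_train, Y_train, X_validation, Y_validation, X_test, Y_test and NUM_CLASSES are the ones used by the model code that follows.

from sklearn.model_selection import train_test_split

# illustrative split ratios - adjust as needed
TEST_SIZE = 0.15
VALID_SIZE = 0.15

# carve out the test set first, then a validation set from the remaining training data
X_train, X_test, Y_train, Y_test = train_test_split(X_padded, Y, test_size=TEST_SIZE, random_state=4)
X_train, X_validation, Y_train, Y_validation = train_test_split(X_train, Y_train, test_size=VALID_SIZE, random_state=4)

# number of output classes = depth of the one-hot tag encoding (12 tags + 1 for padding)
NUM_CLASSES = Y.shape[2]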

Before using an RNN, we must make sure the dimensions of the data are what the RNN expects. In general, an RNN expects the following shapes:

Shape of X: (#samples, #timesteps, #features)

Shape of Y: (#samples, #timesteps, #features)

There can be variations in the shape you feed to an RNN depending on the type of architecture. Since the problem we're working on has a many-to-many architecture, both the input and the output include the number of timesteps, which is simply the sequence length. Notice, though, that the tensor X doesn't have the third dimension, the number of features. That's because we're going to use word embeddings before feeding the data to the RNN, so there is no need to specify the third dimension explicitly.

When you use the Embedding() layer in Keras, the training data is automatically converted to (#samples, #timesteps, #features), where #features is the embedding dimension (and note that the Embedding layer is always the very first layer of the network). While using the embedding layer, we therefore only need to shape the data as (#samples, #timesteps), which is what we have done. However, you would need to shape it as (#samples, #timesteps, #features) if you did not use the Embedding() layer in Keras.
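As a quick sanity check (a minimal sketch, assuming the split above has been run), you can print the array shapes before building the model:

# X is 2-D: (#samples, #timesteps); the Embedding layer will add the feature dimension
print('X_train shape:', X_train.shape)

# Y is 3-D: (#samples, #timesteps, #features), one one-hot tag vector per word
print('Y_train shape:', Y_train.shape)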

Next, let’s build the RNN model. We’re going to use word embeddings to represent the words. Now, while training the model, you can also train the word embeddings along with the network weights. These are often called the embedding weights. While training, the embedding weights will be treated as normal weights of the network which are updated in each iteration.

In the next few sections, we will try the following three RNN models:

  • RNN with arbitrarily initialized, untrainable embeddings: In this model, we will initialize the embedding weights arbitrarily. Further, we’ll freeze the embeddings, that is, we won’t allow the network to train them.
  • RNN with arbitrarily initialized, trainable embeddings: In this model, we’ll allow the network to train the embeddings.
  • RNN with trainable word2vec embeddings: In this experiment, we’ll use word2vec word embeddings and also allow the network to train them further.

Uninitialized fixed embeddings

Let's start with the first experiment: a vanilla RNN with arbitrarily initialized, untrainable embeddings. For this RNN we won't use pre-trained word embeddings; we'll use randomly initialized embeddings, and we won't update the embedding weights.

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, LSTM, GRU, Bidirectional, TimeDistributed, Dense

# create architecture
rnn_model = Sequential()

# create embedding layer - usually the first layer in text problems
rnn_model.add(Embedding(input_dim=VOCABULARY_SIZE,    # vocabulary size - number of unique words in data
                        output_dim=EMBEDDING_SIZE,    # length of vector with which each word is represented
                        input_length=MAX_SEQ_LENGTH,  # length of input sequence
                        trainable=False))             # False - don't update the embeddings

# add an RNN layer which contains 64 RNN cells
# return_sequences=True - return the whole sequence; False - return only the output at the end of the sequence
rnn_model.add(SimpleRNN(64, return_sequences=True))

# add a time-distributed layer (an output at each timestep)
rnn_model.add(TimeDistributed(Dense(NUM_CLASSES, activation='softmax')))

# compile model
rnn_model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['acc'])

# check summary of the model
rnn_model.summary()

# fit model
rnn_training = rnn_model.fit(X_train, Y_train, batch_size=128, epochs=10,
                             validation_data=(X_validation, Y_validation))

We can see that after ten epochs the model gives a fairly decent accuracy of approximately 95%, and the training curve shows healthy growth.
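The accuracy curve itself isn't reproduced here; a minimal sketch for plotting it from the History object returned by fit() (assuming matplotlib is available and the history keys match the 'acc' metric used above) would be:

import matplotlib.pyplot as plt

# plot training and validation accuracy recorded during model.fit()
plt.plot(rnn_training.history['acc'], label='train accuracy')
plt.plot(rnn_training.history['val_acc'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()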

Uninitialized trainable embeddings

Next, let's try the second model: an RNN with arbitrarily initialized, trainable embeddings. Here, we'll allow the embeddings to be trained with the network. The only change is setting the parameter trainable to True (trainable = True), as sketched below; everything else remains the same as above. On checking the model summary, we can see that all the parameters have become trainable, i.e., the trainable params are equal to the total params.
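A minimal sketch of that change (an illustrative reconstruction, not the notebook's verbatim code; the rest of the architecture is identical to the previous model):

# same architecture as before, but the embedding weights are now trainable
rnn_model = Sequential()
rnn_model.add(Embedding(input_dim=VOCABULARY_SIZE,
                        output_dim=EMBEDDING_SIZE,
                        input_length=MAX_SEQ_LENGTH,
                        trainable=True))   # True - update the embedding weights during training
rnn_model.add(SimpleRNN(64, return_sequences=True))
rnn_model.add(TimeDistributed(Dense(NUM_CLASSES, activation='softmax')))
rnn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])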

# check summary of the model
rnn_model.summary()

On fitting the model, the accuracy grows significantly, reaching approximately 98.95% once the embedding weights are allowed to train. The embeddings therefore have a significant effect on how the network performs.

We'll now try the word2vec embeddings and see whether they improve the model.

Using pre-trained embedding weights

Let’s now try the third experiment — RNN with trainable word2vec embeddings. Recall that we had loaded the word2vec embeddings in a matrix called ‘embedding_weights’. Using word2vec embeddings is just as easy as including this matrix in the model architecture.

The network architecture is the same as above, but instead of starting with an arbitrary embedding matrix, we'll use the pre-trained embedding weights (weights = [embedding_weights]) coming from word2vec, as sketched below. The accuracy in this case goes up even further, to approximately 99.04%.
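A minimal sketch of this third model (an illustrative reconstruction; only the Embedding layer changes):

# same architecture, but initialised with the pre-trained word2vec embedding matrix
rnn_model = Sequential()
rnn_model.add(Embedding(input_dim=VOCABULARY_SIZE,
                        output_dim=EMBEDDING_SIZE,
                        input_length=MAX_SEQ_LENGTH,
                        weights=[embedding_weights],   # pre-trained word2vec vectors
                        trainable=True))               # allow further fine-tuning
rnn_model.add(SimpleRNN(64, return_sequences=True))
rnn_model.add(TimeDistributed(Dense(NUM_CLASSES, activation='softmax')))
rnn_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])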

The results improved marginally in this case. That’s because the model was already performing very well. You’ll see much more improvements by using pre-trained embeddings in cases where you don’t have such a good model performance. Pre-trained embeddings provide a real boost in many applications.

To solve the vanishing gradient problem, many attempts have been made to tweak vanilla RNNs so that the gradients don't die when sequences get long. The most popular and successful of these attempts has been the Long Short-Term Memory network, or LSTM. LSTMs have proven to be so effective that they have almost replaced vanilla RNNs.

One of the fundamental differences between an RNN and an LSTM is that an LSTM has an explicit memory unit which stores information relevant for learning some task. In a standard RNN, the only way the network remembers past information is by updating its hidden state over time; it does not have an explicit memory to store information.

On the other hand, in LSTMs, the memory units retain pieces of information even when the sequences get really long.

Next, we'll build an LSTM model instead of an RNN. We just need to replace the RNN layer with an LSTM layer.

# create architecture
lstm_model = Sequential()

lstm_model.add(Embedding(input_dim=VOCABULARY_SIZE,    # vocabulary size - number of unique words in data
                         output_dim=EMBEDDING_SIZE,    # length of vector with which each word is represented
                         input_length=MAX_SEQ_LENGTH,  # length of input sequence
                         weights=[embedding_weights],  # word embedding matrix
                         trainable=True))              # True - update the embedding weight matrix during training

# add an LSTM layer which contains 64 LSTM cells
# return_sequences=True - return the whole sequence; False - return only the output at the end of the sequence
lstm_model.add(LSTM(64, return_sequences=True))
lstm_model.add(TimeDistributed(Dense(NUM_CLASSES, activation='softmax')))

# compile model
lstm_model.compile(loss='categorical_crossentropy',
                   optimizer='adam',
                   metrics=['acc'])

# check summary of the model
lstm_model.summary()

# fit model
lstm_training = lstm_model.fit(X_train, Y_train, batch_size=128, epochs=10,
                               validation_data=(X_validation, Y_validation))

The LSTM model also provided a marginal improvement. However, in other tasks such as language translation, image captioning, or time-series forecasting, an LSTM may give a significant boost in performance.

Keeping in mind the computational expense and the problem of overfitting, researchers have tried to come up with alternative structures for the LSTM cell. The most popular of these alternatives is the Gated Recurrent Unit (GRU). Being a simpler model than the LSTM, a GRU is easier to train. LSTMs and GRUs have almost completely replaced standard RNNs in practice because they're more effective and faster to train than vanilla RNNs (despite their larger number of parameters).

Let’s now build a GRU model. We’ll then also compare the performance of the RNN, LSTM, and the GRU model.

# create architecture
gru_model = Sequential()

gru_model.add(Embedding(input_dim=VOCABULARY_SIZE,     # vocabulary size - number of unique words in data
                        output_dim=EMBEDDING_SIZE,     # length of vector with which each word is represented
                        input_length=MAX_SEQ_LENGTH,   # length of input sequence
                        weights=[embedding_weights],   # word embedding matrix
                        trainable=True))               # True - update the embedding weight matrix during training

# add a GRU layer which contains 64 GRU cells
# return_sequences=True - return the whole sequence; False - return only the output at the end of the sequence
gru_model.add(GRU(64, return_sequences=True))
gru_model.add(TimeDistributed(Dense(NUM_CLASSES, activation='softmax')))

# compile model
gru_model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['acc'])

# check summary of the model
gru_model.summary()

There is a reduction in the number of parameters in the GRU compared to the LSTM. We therefore get a significant boost in computational efficiency with hardly any loss in model performance.

gru_training = gru_model.fit(X_train, Y_train, batch_size=128, epochs=10, validation_data=(X_validation, Y_validation))

The accuracy of the model remains about the same as the LSTM's, but the training time of the LSTM is greater than that of the GRU and the RNN. This was expected, since the recurrent layer of an LSTM and a GRU has roughly 4x and 3x the parameters of a plain RNN layer, respectively.
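You can check this parameter comparison directly from the models (a minimal sketch, assuming the three models built above are still in memory; layer index 1 is the recurrent layer, since the Embedding layer sits at index 0):

# compare the size of the recurrent layer in each model
print('SimpleRNN params:', rnn_model.layers[1].count_params())
print('LSTM params     :', lstm_model.layers[1].count_params())
print('GRU params      :', gru_model.layers[1].count_params())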

Before building the last model, consider two kinds of sequence tasks. For example, when you want to assign a sentiment score to a piece of text (say a customer review), the network can see the entire review text before assigning it a score. On the other hand, in a task such as predicting the next word given the previous few typed words, the network does not have access to words at future time steps while predicting the next word.

These two types of tasks are called offline and online sequence processing respectively.

Now, there is a neat trick you can use with offline tasks — since the network has access to the entire sequence before making predictions, why not use this task to make the network ‘look at the future elements in the sequence’ while training, hoping that this will make the network learn better?

This is the idea exploited by what is called bidirectional RNNs.

By using bidirectional RNNs, it is almost certain that you'll get better results. However, bidirectional RNNs take almost double the time to train, since the number of parameters of the network increases. There is therefore a tradeoff between training time and performance. The decision to use a bidirectional RNN depends on the computing resources you have and the performance you are aiming for.

Finally, let’s build one more model — a bidirectional LSTM and compare its performance in terms of accuracy and training time as compared to the previous models.

# create architecture
bidirect_model = Sequential()
bidirect_model.add(Embedding(input_dim=VOCABULARY_SIZE,
                             output_dim=EMBEDDING_SIZE,
                             input_length=MAX_SEQ_LENGTH,
                             weights=[embedding_weights],
                             trainable=True))
bidirect_model.add(Bidirectional(LSTM(64, return_sequences=True)))
bidirect_model.add(TimeDistributed(Dense(NUM_CLASSES, activation='softmax')))

# compile model
bidirect_model.compile(loss='categorical_crossentropy',
                       optimizer='adam',
                       metrics=['acc'])

# check summary of the model
bidirect_model.summary()

You can see that the number of parameters has gone up significantly compared to the unidirectional LSTM.

bidirect_training = bidirect_model.fit(X_train, Y_train, batch_size=128, epochs=10, validation_data=(X_validation, Y_validation))

The bidirectional LSTM did increase the accuracy substantially (considering that the accuracy was already hitting the roof). This shows the power of bidirectional LSTMs. However, the increased accuracy comes at a cost: the training time was almost double that of a normal LSTM network.

Below is a quick summary of each of the four models we tried. We can see a clear trend as we move from one model to the next.

loss, accuracy = rnn_model.evaluate(X_test, Y_test, verbose=1)
print("Loss: {0},\nAccuracy: {1}".format(loss, accuracy))

loss, accuracy = lstm_model.evaluate(X_test, Y_test, verbose=1)
print("Loss: {0},\nAccuracy: {1}".format(loss, accuracy))

loss, accuracy = gru_model.evaluate(X_test, Y_test, verbose=1)
print("Loss: {0},\nAccuracy: {1}".format(loss, accuracy))

loss, accuracy = bidirect_model.evaluate(X_test, Y_test, verbose=1)
print("Loss: {0},\nAccuracy: {1}".format(loss, accuracy))
