Understanding Language Models in NLP

Natural language applications such as chatbots and machine translation would not be possible without language models.

According to Neural Network Methods in Natural Language Processing (page 105), “Language modelling is the task of assigning a probability to sentences in a language. Besides assigning a probability to each sequence of words, the language models also assign a probability for the likelihood of a given word (or a sequence of words) to follow a sequence of words”.

For example, when someone types “Hi, how are you?” into a chatbot, there are many possible answers, and the language model helps decide which reply to give.

Let’s take another example: in machine translation, a sentence can have numerous possible translations. With the help of a language model, the likelihood of each candidate is used to decide which translation to pick.

Let’s see how the language model works.

How a language model works

A language model predicts the next word in a sequence by calculating how probable each candidate word is, given the words that came before. It does this with the help of an algorithm that learns the rules of context from natural language text.

Language model predicting the next word

In the above example, we can see there are a number of possibilities for the next word; the language model will select the one with the highest probability.
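
The selection step can be sketched in a few lines of Python. The candidate words and their probabilities below are made up for illustration and do not come from a trained model; `most_likely_next_word` is a hypothetical helper name.

```python
# Hypothetical next-word probabilities for the context "how are"
# (illustrative values only, not the output of a real model).
next_word_probs = {
    "you": 0.62,
    "things": 0.21,
    "they": 0.12,
    "we": 0.05,
}


def most_likely_next_word(probs):
    """Return the candidate word with the highest probability."""
    return max(probs, key=probs.get)


print(most_likely_next_word(next_word_probs))  # prints: you
```

Always taking the top candidate is only one possible strategy; a model could also sample from the distribution instead.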

Of course algorithms vary from model to model, so let’s review them.

Types of Language Model

There are two main types of language model:

  1. Statistical Language Models
  2. Neural Language Models

Statistical Language Model (SLM)

A statistical language model is a probability distribution P(S) over all possible sentences. If S is a sequence of text, then statistical language modelling is all about estimating P(S), the probability of that sentence.

Because a statistical language model deals with the joint probability of the words in a sentence, it has to compute P(W1, W2, W3, W4) for a four-word sentence. This is done with the chain rule, which breaks the joint probability into terms of the form P(next word | previous words). An example of the chain rule is given below.
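
For the four-word case, the chain rule expands as follows. This is written in LaTeX notation and uses the W1 to W4 naming from the text.

```latex
P(W_1, W_2, W_3, W_4) = P(W_1)\, P(W_2 \mid W_1)\, P(W_3 \mid W_1, W_2)\, P(W_4 \mid W_1, W_2, W_3)
```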

Chain rule example from Introduction to statistical language models and their applications.

While there are different statistical language models, N-grams are among the most common.

Using N-grams, we can calculate the probability of a whole sentence by breaking it into words and calculating the conditional probability of each word given the words that precede it.

We apply the chain rule of probability to the entire sequence, as shown below:
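
For a sequence of n words, the decomposition can be written as a single product. The index i and the length n are notation introduced here for illustration.

```latex
P(W_1, W_2, \ldots, W_n) = \prod_{i=1}^{n} P(W_i \mid W_1, \ldots, W_{i-1})
```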

Chain rule of probability for N-grams

With a Markov assumption, we do not have to look far into the past to predict a future unit: to predict the next word, we can condition on no previous words, on the one previous word, or on the two previous words. These choices give the unigram, bigram and trigram models respectively. The equation for the trigram model is shown below.
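
In LaTeX notation, the trigram approximation replaces the full history with the two preceding words. The symbols follow the W naming used earlier; the index i and the length n are introduced here for illustration.

```latex
P(W_i \mid W_1, \ldots, W_{i-1}) \approx P(W_i \mid W_{i-2}, W_{i-1})
\qquad\Longrightarrow\qquad
P(W_1, \ldots, W_n) \approx \prod_{i=1}^{n} P(W_i \mid W_{i-2}, W_{i-1})
```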

Equation of Trigram (N-grams where n=3)
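
A minimal sketch of a trigram model in Python, assuming each probability is estimated by simple counting (the count of the trigram divided by the count of its two-word context). The toy corpus, the `<s>` and `</s>` padding markers, and all function names are assumptions made for illustration. There is no smoothing, so any unseen trigram gets probability zero.

```python
from collections import Counter


def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts in tokenized sentences."""
    trigram_counts = Counter()
    context_counts = Counter()
    for tokens in sentences:
        # Two start markers give the first real word a full context;
        # the end marker lets the model learn where sentences stop.
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            trigram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return trigram_counts, context_counts


def trigram_prob(word, context, trigram_counts, context_counts):
    """Count-based estimate of P(word | context); 0.0 if the context is unseen."""
    if context_counts[context] == 0:
        return 0.0
    return trigram_counts[context + (word,)] / context_counts[context]


def sentence_prob(tokens, trigram_counts, context_counts):
    """Multiply the trigram probabilities of every word in the sentence."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    prob = 1.0
    for i in range(2, len(padded)):
        context = (padded[i - 2], padded[i - 1])
        prob *= trigram_prob(padded[i], context, trigram_counts, context_counts)
    return prob


# Toy corpus, invented for this example.
corpus = [
    "i like green tea".split(),
    "i like black tea".split(),
    "i like green apples".split(),
]
tri, ctx = train_trigram_counts(corpus)
print(trigram_prob("green", ("i", "like"), tri, ctx))
print(sentence_prob("i like green tea".split(), tri, ctx))
```

In this corpus "green" follows "i like" in two of three sentences, so its estimated probability is two thirds. Real systems usually add smoothing or back-off so that unseen trigrams do not zero out a whole sentence.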

Neural Language Model (NLM)

Neural networks have opened up many new directions for language modelling. Using neural networks for language models, we can achieve results that were previously impossible with classical techniques.

“Nonlinear neural network models solve some of the shortcomings of traditional language models: they allow conditioning on increasingly large context sizes with only a linear increase in the number of parameters. They can alleviate the need for manually designing back-off orders, and they support generalization across different contexts”.
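
To make the scaling claim concrete, here is a rough Python sketch that compares the two as the context grows. It assumes a feed-forward neural language model with one hidden layer that concatenates the embeddings of the context words; the article does not specify an architecture, so this layout and the sizes (`vocab_size`, `embed_dim`, `hidden_dim`) are hypothetical. The N-gram figure is the number of possible N-grams, an upper bound on the table size rather than what a real corpus would contain.

```python
def ngram_table_size(vocab_size, context_size):
    """Upper bound on distinct n-grams: every context times every next word."""
    return vocab_size ** (context_size + 1)


def feedforward_lm_params(vocab_size, context_size, embed_dim, hidden_dim):
    """Parameter count of a one-hidden-layer feed-forward language model."""
    embeddings = vocab_size * embed_dim
    # The hidden layer reads the concatenated context embeddings.
    hidden = (context_size * embed_dim) * hidden_dim + hidden_dim  # weights + biases
    output = hidden_dim * vocab_size + vocab_size  # weights + biases
    return embeddings + hidden + output


vocab_size, embed_dim, hidden_dim = 10_000, 50, 100  # hypothetical sizes
for context_size in (1, 2, 3, 4):
    print(
        context_size,
        ngram_table_size(vocab_size, context_size),
        feedforward_lm_params(vocab_size, context_size, embed_dim, hidden_dim),
    )
```

Under these assumptions, each extra context word adds a fixed `embed_dim * hidden_dim` parameters to the neural model, while the number of possible N-grams is multiplied by the vocabulary size.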

The paper Recurrent neural network-based language model showed empirically that an NLM performed considerably better than N-gram models; however, the high computational complexity of NLMs is considered a drawback.

Limitations of Language Models

Language models do have their limitations. In Exploring the Limits of Language Modeling, researchers discuss some of these constraints and support them with experiments and observations.

  1. Size Matters: A major limitation of language models is the amount of data they need for training. To create a good enough model, we need a massive dataset. This is a problem for both collection and processing, because the more extensive the data, the more computation is required.
  2. Regularization Importance: Even relatively small models show a tendency to overfit. Regularization is needed, for example dropout applied to the non-recurrent connections.
  3. Importance of Sampling: These models give different results with different sampling methods. The researchers used Noise Contrastive Estimation (NCE) and importance sampling (IS). IS performed better than NCE in their experiments, which shows that the choice of sampling method should always be considered in language modelling.
  4. Ensembles are essential: Sometimes models can be weak on their own. Combining several models into an ensemble and using them in unison can improve accuracy and performance.
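
As an illustration of the ensemble idea, a minimal Python sketch that combines the next-word distributions of several models by averaging them with equal weights. This is one simple way to build an ensemble and not necessarily the method used in the paper; the two model outputs and the `ensemble_average` helper are hypothetical.

```python
def ensemble_average(distributions):
    """Average several next-word distributions with equal weights."""
    vocabulary = set()
    for dist in distributions:
        vocabulary.update(dist)
    n = len(distributions)
    # A word missing from one model's output counts as probability 0.0 there.
    return {
        word: sum(dist.get(word, 0.0) for dist in distributions) / n
        for word in vocabulary
    }


# Hypothetical outputs of two models for the same context.
model_a = {"tea": 0.6, "coffee": 0.3, "water": 0.1}
model_b = {"tea": 0.2, "coffee": 0.7, "water": 0.1}

combined = ensemble_average([model_a, model_b])
print(max(combined, key=combined.get))  # prints: coffee
```

Each model alone prefers a different word; the average settles on the word with the highest combined probability.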

Hope you enjoyed this article. Stay tuned until next time, and happy coding ❤

