Data augmentation in NLP

Published by FirstAlign


Goals of the article

  1. Understand what data augmentation in NLP is
  2. Discuss various data augmentation techniques in NLP
  3. Summarize the techniques

In this article, we will review techniques for data augmentation in NLP.

What is Data Augmentation in NLP?

Any supervised machine learning task needs a labeled dataset, but in reality most data is raw. To get labeled data, we either have to collect the data and annotate it manually, or obtain it from a source that has already annotated it. Both approaches have problems: manually annotating thousands of records is very time consuming, and getting data from an external third party is not always easy, as the data can be proprietary and expensive, while freely available data often doesn't suit the purpose. We need another way to solve these problems, and a similar difficulty is solved in computer vision with the help of data augmentation.

An Example

So if you want to train a model to classify images of roses and lilies but have very limited data, you can modify each image slightly so that it preserves the label while serving as a new sample.

For example, the picture of a rose can be shifted slightly to the left, zoomed in on, or flipped horizontally. None of these changes affects the label, yet the technique generates many samples from a single existing one.

This is how data augmentation works in computer vision, but the technique has pitfalls: flipping an image vertically, for instance, produces an upside-down rose that would never appear in a real photo, so care needs to be taken with the augmentation process.

Similarly, in text augmentation we have a dataset with a limited amount of data. We augment it keeping the context in mind so that the label is preserved.

For example, take a resume: it usually has multiple sections, each independent of the others. So we can create multiple resumes for the same class just by rearranging the sections.

Modifications made to a single image of a rose for data augmentation

Example of data augmentation in text via NLP

Sentence 1: The rose is a beautiful flower.

The modified version of Sentence 1: The rose is a pretty flower.

The above example shows how we can take a single sentence and create many variants without changing its label.

Various Data Augmentation Techniques in NLP

There are many ways data can be augmented; here are some of the techniques, which we will discuss one by one.

  1. Sentence Shuffling
  2. QWERTY Keyboard Error Injection
  3. Thesaurus-based substitution
  4. Word embedding substitution 
  5. Masked Language Model

Sentence Shuffling

Sentence shuffling is one of the simplest ways to carry out data augmentation. In this technique, the sentences in a text are shuffled to create multiple versions of the same document. The data augmentation image below shows an example of sentence shuffling.

Sentence shuffling Example
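As a minimal sketch, sentence shuffling can be implemented as below. Splitting on full stops is a simplifying assumption; a real pipeline would use a proper sentence tokenizer (for example, nltk's sent_tokenize).

```python
import random

def shuffle_sentences(text, seed=None):
    """Shuffle the sentences of `text` to create a new variant.

    Naive splitting on full stops is an assumption; a real
    pipeline would use a proper sentence tokenizer.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.Random(seed).shuffle(sentences)
    return ". ".join(sentences) + "."

original = "Roses are red. Lilies are white. Both are flowers."
augmented = shuffle_sentences(original, seed=42)
```

Each seed gives a different ordering, so one short text yields several new training samples with the same label.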

QWERTY Keyboard Error Injection

This method uses the distance between keys on a QWERTY keyboard layout to simulate the mistakes made during typing. For example, if I intend to type “very”, this method can produce “bery”, because “v” and “b” are next to each other on the keyboard and are easy to confuse. The picture below shows an example of QWERTY keyboard error injection.

QWERTY keyboard error injection Example
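To make this concrete, here is a small sketch. The neighbour map below covers only a handful of letters and is an assumption for illustration, not a full keyboard layout.

```python
import random

# Partial map of adjacent keys on a QWERTY keyboard; an assumption
# covering only a few letters, not the full layout.
QWERTY_NEIGHBOURS = {
    "v": "cbfg",
    "b": "vngh",
    "e": "wrsd",
    "r": "etdf",
    "y": "tugh",
}

def inject_typo(word, rng=None):
    """Replace one character of `word` with a QWERTY-adjacent key."""
    rng = rng or random.Random()
    positions = [i for i, ch in enumerate(word) if ch in QWERTY_NEIGHBOURS]
    if not positions:
        return word  # no character we know how to corrupt
    i = rng.choice(positions)
    typo = rng.choice(QWERTY_NEIGHBOURS[word[i]])
    return word[:i] + typo + word[i + 1:]
```

With “very” as input, this can yield “bery” among other single-key slips, mimicking the noise found in real user-typed text.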

Thesaurus-based substitution

Thesaurus-based substitution randomly selects a word from a sentence and then looks up synonyms for that word in a lexical database such as WordNet. An example of this type of data augmentation is shown below:

Thesaurus-based substitution Example
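A sketch of the idea, using a tiny hand-rolled thesaurus as a stand-in; a real implementation would query a resource such as WordNet instead of this hard-coded dictionary.

```python
import random

# A tiny hand-rolled thesaurus used purely for illustration; a real
# implementation would query a lexical database such as WordNet.
THESAURUS = {
    "beautiful": ["pretty", "lovely", "gorgeous"],
    "flower": ["bloom", "blossom"],
}

def thesaurus_substitute(sentence, rng=None):
    """Pick one replaceable word at random and swap in a synonym."""
    rng = rng or random.Random()
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in THESAURUS]
    if not candidates:
        return sentence
    i = rng.choice(candidates)
    words[i] = rng.choice(THESAURUS[words[i].lower()])
    return " ".join(words)
```

Running this repeatedly over the same sentence produces several label-preserving variants.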

Word embedding substitution

In this technique, some words in a sentence are replaced using word embeddings such as Word2Vec. The replacement words are the nearest neighbours of the original word in the embedding space. For example, the word “beautiful” may have neighbours such as “pretty”, “lovely”, “attractive”, and “alluring”. We can choose any number of close neighbours and use them to form new sentences.

Word Embedding substitution Example
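The nearest-neighbour lookup can be sketched as below. The 3-dimensional “embeddings” here are invented toy values so the example is self-contained; in practice the vectors would come from a pretrained model such as Word2Vec or GloVe.

```python
import math

# Toy 3-dimensional "embeddings" invented for illustration; real
# vectors would come from a pretrained model such as Word2Vec.
EMBEDDINGS = {
    "beautiful": [0.90, 0.10, 0.00],
    "pretty":    [0.85, 0.15, 0.00],
    "lovely":    [0.80, 0.20, 0.05],
    "keyboard":  [0.00, 0.10, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_neighbour(word):
    """Return the other vocabulary word closest in embedding space."""
    vec = EMBEDDINGS[word]
    return max(
        (w for w in EMBEDDINGS if w != word),
        key=lambda w: cosine(vec, EMBEDDINGS[w]),
    )

sentence = "The rose is a beautiful flower"
augmented = sentence.replace("beautiful", nearest_neighbour("beautiful"))
```

With a real embedding model the same pattern applies, but the neighbour list is drawn from a vocabulary of hundreds of thousands of words.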

Masked Language Model

Masking selects a word in a sentence to hide, then uses a transformer model such as BERT to predict it. The model predicts the masked word based on the context of the sentence. This can yield multiple predictions and thus several new sentences. An example is given below:

Example for Masked Language Model
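The shape of the approach can be sketched as follows. The predictions here are a hard-coded stand-in so the example runs without downloading a model; in practice you would get them from a real masked language model, for example HuggingFace's fill-mask pipeline.

```python
# Stand-in predictions so this sketch runs without downloading a model.
# In practice you would use a real masked language model, e.g.:
#     from transformers import pipeline
#     fill = pipeline("fill-mask", model="bert-base-uncased")
MOCK_PREDICTIONS = {
    "The rose is a [MASK] flower.": ["beautiful", "pretty", "lovely"],
}

def augment_with_mlm(sentence, target):
    """Mask `target` in `sentence`, then emit one variant per prediction."""
    masked = sentence.replace(target, "[MASK]", 1)
    fills = MOCK_PREDICTIONS.get(masked, [])
    return [masked.replace("[MASK]", fill) for fill in fills]

variants = augment_with_mlm("The rose is a beautiful flower.", "beautiful")
```

Because the model conditions on the whole sentence, the fills tend to be context-appropriate, which is what makes this technique stronger than a plain thesaurus lookup.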

In Summing Up

In this article, we have discussed data augmentation, its role in NLP, and given an overview of the various techniques.

Each technique discussed has a particular use case, with no single technique able to be applied as a general-purpose solution.

Sentence shuffling and thesaurus-based substitution are simple and can be used for datasets containing short sentences, while QWERTY keyboard error injection suits scenarios where typographical errors are common, such as user-typed text. Word embedding substitution and the masked language model are more advanced and are used where the context of the sentence must also be considered.

Hope you enjoyed the article; stay tuned for the next one. Happy coding ❤
