Published by FirstAlign
Goals of the article
- Understand what data augmentation is in NLP
- Discuss various data augmentation techniques in NLP
In this article we will review the techniques for Data Augmentation in NLP.
What is Data Augmentation in NLP?
Any supervised machine learning task needs a labeled dataset, but in reality most data is raw. To get labeled data we either have to collect data and annotate it manually, or obtain it from a source that has already annotated it. Both approaches have problems: manually annotating thousands of records is very time consuming, and getting data from an external third party is not always easy, as it can be proprietary and expensive, while freely available data often doesn't suit the purpose. We need some other technique to solve these problems. A similar difficulty is solved in computer vision with the help of Data Augmentation.
So if you want to train a model to classify images of roses and lilies but have very limited data, you can modify each image slightly so that it preserves the label and can serve as a new sample.
For example, the picture of a rose can be shifted slightly to the left, zoomed in on, or flipped left to right. None of these changes affects the label, but the technique generates many samples from a single existing one.
This is how data augmentation works in computer vision, but the technique has pitfalls: flipping an image vertically would show an upside-down rose that doesn't occur in reality, so care needs to be taken with the augmentation process.
Similarly, in text augmentation we have a dataset with a limited amount of data, and we augment it while keeping the context in mind so that the label is preserved.
For example, take a resume: it usually has multiple sections, each independent of the others. We can therefore create multiple resumes for the same class just by rearranging the sections.
Example of Data Augmentation in Text via NLP
Sentence 1: The rose is a beautiful flower.
Modified version of Sentence 1: The rose is a pretty flower.
The above example shows how we can take a single sentence and create many variants without changing its label.
Various Data Augmentation Techniques in NLP
There are many ways data can be augmented; here are some of the techniques, which we will discuss one by one.
- Sentence Shuffling
- QWERTY Keyboard Error Injection
- Thesaurus-based substitution
- Word embedding substitution
- Masked Language Model
Sentence Shuffling
Sentence shuffling is one of the simplest ways to carry out data augmentation. In this technique, the sentences in a text are shuffled to create multiple versions of the same document while preserving its label.
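A minimal sketch of sentence shuffling in Python (the function name and the naive period-based sentence splitting are illustrative; a real pipeline would use a proper sentence tokenizer such as nltk or spaCy):

```python
import random

def shuffle_sentences(text, seed=None):
    """Split a text into sentences, shuffle them, and rejoin them.

    Splitting on "." is a naive stand-in for real sentence tokenization.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng = random.Random(seed)  # seeded for reproducibility
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."

# Shuffling resume-like sections keeps the overall label intact.
text = "Skilled in Python. Led a team of five. Holds an MSc in CS."
print(shuffle_sentences(text, seed=42))
```

Each distinct ordering of the sentences becomes a new training sample with the same label.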
QWERTY Keyboard Error Injection
This method uses the distance between keys on a QWERTY keyboard layout to simulate the mistakes made during typing. For example, when typing "very", this method can produce "bery", since "v" and "b" are next to each other on the keyboard and are therefore easy to mistype.
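A simple sketch of this idea (the neighbour map below is partial and illustrative; a full implementation would cover the whole keyboard, and libraries such as nlpaug ship ready-made versions):

```python
import random

# Partial map of QWERTY neighbours; extend for full keyboard coverage.
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "b": "vghn", "c": "xdfv", "d": "serfcx",
    "e": "wsdr", "i": "ujko", "n": "bhjm", "o": "iklp",
    "r": "edft", "s": "awedxz", "t": "rfgy", "v": "cfgb",
    "y": "tghu",
}

def inject_typo(word, seed=None):
    """Replace one random character with an adjacent QWERTY key."""
    rng = random.Random(seed)
    positions = [i for i, ch in enumerate(word)
                 if ch.lower() in QWERTY_NEIGHBOURS]
    if not positions:
        return word  # nothing we know how to mistype
    i = rng.choice(positions)
    typo = rng.choice(QWERTY_NEIGHBOURS[word[i].lower()])
    return word[:i] + typo + word[i + 1:]

print(inject_typo("very", seed=1))  # e.g. "bery" or "vety"
```

This produces samples resembling real-world typos, which helps models that must handle noisy user input.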
Thesaurus-based Substitution
Thesaurus-based substitution selects a random word from a sentence and then looks up synonyms for that word in a lexical database.
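A toy sketch of the idea (the hard-coded `THESAURUS` dict stands in for a real lexical database such as WordNet, which you could query via `nltk.corpus.wordnet`):

```python
import random

# Toy thesaurus standing in for a real lexical database like WordNet.
THESAURUS = {
    "beautiful": ["pretty", "lovely", "attractive"],
    "quick": ["fast", "rapid", "speedy"],
}

def thesaurus_substitute(sentence, seed=None):
    """Replace each word that has a thesaurus entry with a random synonym.

    Note: trailing punctuation on a replaced word is dropped in this
    simplified sketch.
    """
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        key = word.lower().strip(".,!?")
        if key in THESAURUS:
            out.append(rng.choice(THESAURUS[key]))
        else:
            out.append(word)
    return " ".join(out)

print(thesaurus_substitute("The rose is a beautiful flower.", seed=3))
```

Running this with different seeds yields several label-preserving variants of the same sentence.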
Word embedding substitution
In this technique, some words in a sentence are replaced using word embeddings such as Word2Vec. The replacements are the nearest neighbors of the original word in embedding space. For example, the word "beautiful" has neighboring words such as "pretty", "lovely", "attractive" and "alluring". We can choose any number of close neighbors and use them to form new sentences.
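A minimal sketch of nearest-neighbor lookup by cosine similarity (the tiny 3-dimensional `EMBEDDINGS` dict is illustrative; in practice you would load pretrained Word2Vec or GloVe vectors, e.g. via gensim):

```python
import math

# Toy 3-d embeddings; real vectors have hundreds of dimensions.
EMBEDDINGS = {
    "beautiful": [0.90, 0.80, 0.10],
    "pretty":    [0.85, 0.82, 0.12],
    "lovely":    [0.88, 0.75, 0.15],
    "car":       [0.10, 0.20, 0.90],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbours(word, k=2):
    """Return the k words most similar to `word` by cosine similarity."""
    target = EMBEDDINGS[word]
    scored = [(cosine(target, vec), w)
              for w, vec in EMBEDDINGS.items() if w != word]
    return [w for _, w in sorted(scored, reverse=True)[:k]]

print(nearest_neighbours("beautiful"))  # similar adjectives, not "car"
```

Substituting a word with one of its top neighbors keeps the sentence semantically close to the original, so the label is usually preserved.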
Masked Language Model
Masking selects a word in a sentence to hide, and a transformer model such as BERT is used to predict it. The model predicts the masked word based on the context of the sentence. This can yield multiple predictions and hence several new sentences.
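Since running BERT requires the Hugging Face transformers library and a downloaded model (e.g. `pipeline("fill-mask")`), here is a self-contained toy stand-in that mimics the fill-mask idea by matching the context around `[MASK]` against a tiny corpus and counting what fills the gap; all names and the corpus are illustrative:

```python
from collections import Counter

# Toy corpus; a real masked language model learns these patterns
# from billions of words rather than a handful of sentences.
CORPUS = [
    "the rose is a beautiful flower",
    "the rose is a pretty flower",
    "the rose is a lovely flower",
]

def predict_mask(masked, top_k=2):
    """Predict candidates for the single [MASK] token by matching the
    surrounding context against the corpus and counting the fillers."""
    left, right = (part.strip() for part in masked.split("[MASK]"))
    counts = Counter()
    for sent in CORPUS:
        if sent.startswith(left) and sent.endswith(right):
            middle = sent[len(left):len(sent) - len(right)].strip()
            if middle:
                counts[middle] += 1
    return [w for w, _ in counts.most_common(top_k)]

print(predict_mask("the rose is a [MASK] flower", top_k=3))
```

Each predicted filler produces a new context-consistent sentence, which is exactly how a real fill-mask pipeline is used for augmentation.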
Summing Up
In this article, we discussed data augmentation, its role in NLP, and gave an overview of the various techniques.
Each technique discussed has a particular use case, with no single technique able to be applied as a general-purpose solution.
Sentence shuffling and thesaurus-based substitution are simple and can be used for datasets containing short sentences, while QWERTY-based error injection suits realistic scenarios where typographical errors are common. Word embedding substitution and the masked language model are more advanced and are used where the context of the sentence must also be considered.
Hope you enjoyed the article. Stay tuned for the next one, and happy coding ❤