Basics in NLP data annotation

Published by FirstAlign

This is the era of artificial intelligence and machine learning. The applications we use in our daily lives have gone from purely mobile to highly intelligent, but to gain this intelligence they require a lot of data. This data is used to train machine learning models.

To create a training dataset, however, the data must first be annotated. In this article, we will review what data annotation is and why it is used, and walk through some hands-on use cases of data annotation tools in NLP.

What is Data Annotation?

Data annotation is the procedure of labeling data, which is available in various formats such as voice, images, video, and text. Each format is labeled in its own way. In this article we will look at annotating textual data for NLP operations such as text classification, named entity recognition (NER), and part-of-speech (POS) tagging.

The data is labeled differently for each operation. Here are three different ways of tagging:

If we are tagging data for classification, we will take a complete text block and create a label for that entire block.

If we are labeling a dataset for POS [1] tagging, we will instead tag individual words, labeling each word with its part of speech.

For named entity recognition, we will tag words or phrases according to the named entity they refer to, such as a person, location, or organization.
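
To make the difference concrete, here is a minimal sketch of how the three kinds of labels might be represented in Python. The sentence, the tag names, and the dictionary layout are illustrative assumptions, not a format required by any particular tool.

```python
# Illustrative only: the sentence, tag sets, and field names are assumptions.
text = "Apple opened a store in Paris"
tokens = text.split()

# 1. Text classification: one label for the whole block of text.
classification = {"text": text, "label": "business"}

# 2. POS tagging: one label per word.
pos_tags = ["PROPN", "VERB", "DET", "NOUN", "ADP", "PROPN"]
pos_annotation = list(zip(tokens, pos_tags))

# 3. Named entity recognition: labels on spans of text,
#    given here as (start, end, entity) character offsets.
entities = [(0, 5, "ORGANIZATION"), (24, 29, "LOCATION")]
ner_annotation = [(text[start:end], entity) for start, end, entity in entities]

print(classification)
print(pos_annotation)
print(ner_annotation)
```

The same sentence carries one label in the first case, six in the second, and two in the third, which is why each task needs its own annotation pass.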

Why is Data Annotation Used?

Labeled data is a requirement of supervised machine learning, and it simplifies classification tasks. The annotations and labels create a reference library from which a model learns, which improves the accuracy of the model's output.
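
As a rough illustration of how labels act as a reference, the sketch below builds a toy word-count scorer from a handful of labeled sentences. This is a deliberately simple stand-in for a real supervised model; the sentences, the function names `train` and `predict`, and the scoring rule are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical labeled examples; in practice these come from annotators.
labeled = [
    ("I love this phone", "positive"),
    ("great camera and battery", "positive"),
    ("terrible screen", "negative"),
    ("I hate the battery", "negative"),
]

def train(examples):
    """Count how often each word appears under each label."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def predict(counts, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)

model = train(labeled)
print(predict(model, "love the camera"))
print(predict(model, "terrible battery"))
```

Without the labels there would be nothing to count against, which is the sense in which annotation is a prerequisite for supervised learning.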

An Example of Data Annotation

Let us now perform a text annotation exercise. To perform the annotation, we are going to use Doccano, an open-source project.

Doccano provides annotation features for text classification, sequence labeling, and sequence-to-sequence tasks. With it, labeled data can be created for sentiment analysis, named entity recognition, text summarization, and so on. Just create a project, upload data, and start annotating. You can build a dataset in hours.

A live demo example can be seen here (https://doccano.herokuapp.com/demo/text-classification/).

We will go through this step by step below to demonstrate how we can annotate text with a ‘positive’ or ‘negative’ label for use in sentiment analysis.

Doccano interface for sentiment annotation

In the picture above we can see the text to be labeled, along with two buttons for tagging it as positive or negative. The screenshot shows the sample data provided by Doccano.
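
Once the text has been labeled, the annotations are typically exported for training. The sketch below assumes a JSON Lines export in which each line has a `text` field and a `label` list; that layout, the sample sentences, and the helper name `load_labeled_examples` are assumptions for illustration, so check the export of your own project before relying on it.

```python
import json
from collections import Counter

# Hypothetical export: one JSON object per line with "text" and "label" fields.
exported = """\
{"text": "The battery lasts all day.", "label": ["positive"]}
{"text": "The screen cracked within a week.", "label": ["negative"]}
{"text": "Great value for the price.", "label": ["positive"]}
"""

def load_labeled_examples(jsonl_text):
    """Parse JSON Lines text into a list of (text, label) pairs."""
    examples = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Each record is assumed to carry exactly one sentiment label.
        examples.append((record["text"], record["label"][0]))
    return examples

examples = load_labeled_examples(exported)
label_counts = Counter(label for _, label in examples)
print(label_counts)
```

Counting the labels is a quick way to spot an unbalanced dataset before training on it.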

Interface for Labeling Named Entity Recognition

In this interface, we can select a span of text and click on an entity type such as person, location, organization, event, date, and others to label the selection with that entity.
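
Span annotations like these are commonly stored as character offsets into the text. The sketch below assumes each record holds a `label` list of `[start, end, entity]` triples; the record layout, the sample sentence, and the helper name `spans_to_entities` are assumptions for illustration rather than a guaranteed export format.

```python
# Hypothetical record: "label" holds [start, end, entity] character offsets.
record = {
    "text": "Barack Obama visited Berlin in 2013.",
    "label": [[0, 12, "person"], [21, 27, "location"], [31, 35, "date"]],
}

def spans_to_entities(record):
    """Turn offset-based labels into (surface text, entity) pairs."""
    text = record["text"]
    pairs = []
    for start, end, entity in record["label"]:
        # Reject spans that fall outside the text or are empty.
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"invalid span: {start}-{end}")
        pairs.append((text[start:end], entity))
    return pairs

for surface, entity in spans_to_entities(record):
    print(f"{surface!r} is labeled as {entity}")
```

Storing offsets instead of the selected words keeps the annotation unambiguous when the same word appears more than once in a sentence.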

Doccano uses a very simple interface to make labeling easy. You can run Doccano locally with three dependencies:

  1. Git
  2. Docker
  3. Docker Compose

After installing these dependencies, simply follow the installation guidelines given in the Doccano GitHub repository.
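
Before following those guidelines, it can help to confirm that the tools are actually available on your machine. The sketch below checks the system PATH for each one. The executable names `git`, `docker`, and `docker-compose` are assumptions (recent Docker releases may provide Compose as a `docker compose` subcommand instead), and `check_tools` is a hypothetical helper written for this article.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = ["git", "docker", "docker-compose"]

def check_tools(tools):
    """Return a dict mapping each tool name to True if it is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}

status = check_tools(REQUIRED_TOOLS)
for tool, found in status.items():
    print(f"{tool}: {'found' if found else 'missing'}")
```

A tool reported as missing should be installed, or added to the PATH, before continuing with the setup.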

You can run Doccano in one of two modes: development or production.

Why use Doccano?

The reasons for using Doccano to annotate textual data are:

  1. It is easy to use
  2. It is open source
  3. It provides collaborative annotation
  4. It has multi-language support
  5. It has mobile support

Conclusion

In this article we have discussed data annotation, its different types, and why it is needed. We used an open-source tool (Doccano) to show data annotation examples in practice, including its features and the reasons for using it.

I hope you have enjoyed the article. Until the next one, stay tuned and happy coding ❤
