Stemming in NLP: A quick guide to Stemming and its types

Published by FirstAlign

For natural language processing and natural language understanding, there are a lot of steps that going on in the background.

One of those steps is Stemming. In this blog we will look to understand Stemming, review various examples and we will see how stemming works by, looking at various types of ‘stemmers’.

What is Stemming?

All the words have their dictionary version, that is changed sometimes for grammatical reasons, for example, a “Question” is a base word for a sentence in which we are inquiring something but we can add pluralization or an “s” at the end to show more than one question. By adding a suffix we can add more context to the word, however in the use of Stemming we remove that suffix from a to reduce the word to its root word, i.e. converting word into its dictionary version. The picture below is an example of stemming;

Stemming example from NLP Standford

Why stemming is performed?

Inflection is a process of word formation in which we change the word to add grammatical meaning such as tenses, gender, case, form, etc. The main aim for performing stemming is to reduce these inflectional forms, so whenever we are searching or performing information retrieval, we type a certain keyword, stem it to it’s root, and find all the words which have the same root word. This way we can perform searching and information retrieval in a more efficient way.

There are different Stemmers through which we can perform stemming. These are;

  1. The Porter Stemmer algorithm
  2. The Lovins Stemmer algorithm
  3. The Dawson Stemmer algorithm
  4. The Krovetz Stemmer algorithm

Let’s discuss them one by one.

The Porter Stemmer algorithm

Porter’s Stemmer removes the inflectional and morphological ending from the words in English. The most common use of this type of stemmer is in Information Retrieval System for normalization. This stemmer was first proposed in the stemming algorithm then different versions of this stemmer are being created and used till today. This stemmer was originally written in BCPL. This stemmer is implemented using various libraries in different programming languages below given is an example of the use of porter stemmer in Python using the NLTK library.

Porter Stemmer Example using NLTK

The above example shows the inflectional endings in the words “loved” and “Pythons” were removed as expected.

The Lovins Stemmer algorithm

The Lovins Stemmer removes the longest suffix in the word, then uses this stemmed word, re-codes it to make it a valid word. It is one of the oldest stemmers. Lovins stemmer is bigger than Porter because it has a huge list of endings. Because of this builtin list it is faster. In this stemmer there is a trade-off between size and speed. There are 294 endings, 29 conditions, and 35 transformation rules utilized in Lovins.

Below is the example for the Lovins stemmer. It is been implemented in various programming languages. Below is the python code implementation of the Lovins Stemmer.

The Dawson Stemmer algorithm

The Dawson Stemmer works much the same as Lovins, it just extends the approach by making the list of suffix’s large. These suffixes are stored and organized by last letters, and their length making retrieval easier and therefore more optimized. They are structured as a set of separated character trees for quick access. The Dawson Stemmer has more than 1000 suffix’s in the list.

Below is a generic algorithm for the Dawson Stemmer.

  1. Get the input word.
  2. Get the matching suffix;
    • the suffic pool is reverse indexed by the length;
    • the suffix pool is reverse indexed by the last character.
  3. Remove the longest suffix from the word with an exact match.
  4. Recode the word using a mapping table.
  5. Convert stem into a valid word.

When working with the Dawson Stemmer problem can occur as is it very complicated to implement. Due to the bigger size of the suffix list, it is however faster and provides better results.

The Krovetz Stemmer algorithm

The Krovetz Stemmer is very much simpler than the others. It stems three very common inflectional derivations, firstly it converts the plurals in the singular, secondly it converts past tense into the present tense, and thirdly it deletes “ing” in the ending of the words. It was proposed by Robert Krovetz in 1993. It is very lightweight and simple to implement, but this simplicity makes it limited and not very efficient for a large number of documents. Py Krovetz is a python implementation of this stemmer, and an example of the results can be seen below;

Example for Krovetz Stemmer from NEU Lecture on Stemming

The Lancaster Stemmer algorithm

Lancaster stemmer is an iterative stemmer in which rules are saved externally. It is a very simple, but heavy duty stemmer due to iterative nature. In the Lancaster stemmer there is a table of 120 defined rules. These apply to the last letter of the word that either we need to remove, or replace. If there is no rule it will terminate. There are two additional conditions for termination, 1. where the word starts with vowel and two characters are left, and 2. if the word starts with consonant and has three letters.

Lancaster Example from NLTK

In summing up, several observations

In this post, we have discussed ‘what is stemming’ in NLP and how stemmers work. What are the different types including their relative pros and cons. None of them can be concluded as a’one size fits all’ solution, all them were well suited for different situations.

  • The Porter’s Stemmer is light and produced the best results, but the stem it produced do not always contain valid words.
  • The Lovin Stemmer was fast, but it took a lot of space, didn’t have all the suffixes in the list, and so it was also prone to error due to this limitation.
  • The Dawson Stemmer is also fast, has bigger suffix list than Lovins, but it is very complex and takes up a lot of space.
  • The Krovetz Stemmer was the simplest among them all. It is easy to implement, fast and doesn’t take much of space, but due to the simplicity it misses a lot of words. Words where stemming is required but it doesn’t perform.
  • The Lancaster Stemmer is small and simple, but heavy duty due to iterative nature.

From all this, we can conclude that choosing a stemmer to use depends on a series of criteria that may change over time. A single stemmer cannot be utilized for all purposes, thus deriving a combination strategy would be the optimal solution depending upon the business case of the moment.

Hope you enjoyed the article until next article stay tuned and Happy Coding ❤

References

Published by FirstAlign

Click here to connect with us

Spread the word

Related posts