Introduction to Stemming

Updated on Mar 29, 2024 16:04 IST

This article covers stemming, a text-preprocessing technique, and walks through the main stemming algorithms and approaches.

Keyword stemming treats search terms that derive from the same root word, or stem, as related. A stemmer is a program that reduces a word to that stem. Search engines often use stemming so that a query also matches other inflected forms of its words.

What is Stemming?

Stemming is a text-preprocessing technique that removes affixes from words to reduce them to their root (base) form. The affixes removed include suffixes such as -ed, -ize, and -s and prefixes such as de- and mis-. The resulting stem is not always a meaningful word by itself; it can be just a part of the word.

For example:

Amusing, Amusement, and Amused all reduce to the stem Amus, which is not itself an English word:

  • Amusing – Amus
  • Amusement – Amus
  • Amused – Amus
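
The idea of stripping affixes can be illustrated with a deliberately naive sketch. The suffix list and the minimum stem length below are assumptions chosen for this example only; this is not how any real stemmer is implemented.

```python
# Toy suffix stripper: NOT a real stemming algorithm, just an illustration.
SUFFIXES = ("ement", "ing", "ed")  # hypothetical rule list, longest first


def naive_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["Amusing", "Amusement", "Amused"]:
    print(w, "->", naive_stem(w))
# Amusing -> amus
# Amusement -> amus
# Amused -> amus
```

Real stemmers use far more rules and conditions than this, which is why their output can differ from a simple suffix cut.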

Application of Stemming

Stemming is primarily used in tagging systems, indexing, tag mapping, SEO, web search, and information retrieval. For example, a search for play may also return results for player and playing, because play is the root of both terms.
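
Matching a query against documents through shared stems can be sketched as a small index keyed by stem. The function `toy_stem` is a hypothetical stand-in written for this example, not a real stemmer, and the documents are made up.

```python
from collections import defaultdict


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each stem to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


docs = {
    "d1": "best football player",
    "d2": "children playing outside",
    "d3": "weather report",
}
index = build_index(docs)

# The query is stemmed with the same function before the lookup.
print(sorted(index[toy_stem("play")]))  # ['d1', 'd2']
```

The important design point is that documents and queries go through the same stemmer, so that they meet at the same key.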

Typical errors of stemming 

There are two main types of errors in stemming:

  • Over-stemming
  • Under-stemming

Over-stemming: Occurs when too much of a word is removed, so that the stem loses the word's meaning or words with different meanings are reduced to the same stem.

For example:

‘wander’ → ‘wand’

‘news’ → ‘new’

‘universal’, ‘universe’, ‘universities’, and ‘university’ → ‘univers’

Under-stemming: Occurs when too little is removed, so that words sharing the same root are reduced to different stems.

For example:

‘knavish’ → ‘knavish’ (the suffix is left in place)

‘data’ → ‘dat’

‘datum’ → ‘datu’

NOTE: Data and datum share the same root, yet they are reduced to two different stems.
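
Both error types can be detected mechanically once there is a reference grouping of related words. In the sketch below the stems are copied from the examples in this section, while the `CONCEPT` mapping is a hypothetical gold standard invented for the illustration.

```python
from collections import defaultdict

# Stems copied from the examples in this section (not computed here).
STEMS = {
    "universal": "univers",
    "universe": "univers",
    "university": "univers",
    "data": "dat",
    "datum": "datu",
}

# Hypothetical gold standard: which words a reader would call related.
CONCEPT = {
    "universal": "universal",
    "universe": "universe",
    "university": "university",
    "data": "datum",
    "datum": "datum",
}


def stemming_errors(stems, concept):
    """Return (over_stemmed, under_stemmed) groups of words."""
    by_stem, by_concept = defaultdict(set), defaultdict(set)
    for word, stem in stems.items():
        by_stem[stem].add(word)
        by_concept[concept[word]].add(word)
    # Over-stemming: one stem covers words from more than one concept.
    over = [sorted(ws) for ws in by_stem.values()
            if len({concept[w] for w in ws}) > 1]
    # Under-stemming: one concept is split across more than one stem.
    under = [sorted(ws) for ws in by_concept.values()
             if len({stems[w] for w in ws}) > 1]
    return over, under


over, under = stemming_errors(STEMS, CONCEPT)
print("over-stemmed:", over)    # [['universal', 'universe', 'university']]
print("under-stemmed:", under)  # [['data', 'datum']]
```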

Stemming Algorithms/ Approaches 

NLTK is a Python library for Natural Language Processing. It provides several stemming algorithms, including:

  • Porter Stemmer
  • Snowball Stemmer
  • Lancaster Stemmer

Porter’s Stemmer algorithm 

The Porter stemmer is a suffix-stripping algorithm. It applies predefined rules to convert words into their root forms.

  • The algorithm removes or replaces well-known suffixes of English words
  • It is known for its speed and simplicity
  • It is widely used in data mining and information retrieval

Porter stemming rules (a sample, covering plural suffixes):

SSES → SS

IES → I

SS → SS

S →

For example:

word: stem 

program: program

programs: program

programming: program
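
The four suffix rules listed above can be written out directly. This is a minimal sketch of that single rule group only; the full Porter algorithm applies several further steps, which is why a form such as "programming" is not handled here. The function name is an assumption made for this example.

```python
def porter_step_1a(word: str) -> str:
    """Apply the four plural rules; the longest matching suffix wins."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I
    if word.endswith("ss"):
        return word        # SS   -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]   # S    -> (removed)
    return word


for w in ["caresses", "ponies", "caress", "programs"]:
    print(w, ":", porter_step_1a(w))
# caresses : caress
# ponies : poni
# caress : caress
# programs : program
```

The rules are tested from the longest suffix to the shortest, so "caress" is protected by the SS rule before the bare S rule can fire.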

Code:

# Import the Porter stemmer from NLTK
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Words to be stemmed
words = ["program", "programs", "programmer", "programmers"]

for w in words:
    print(w, ":", ps.stem(w))

Output:

program : program
programs : program
programmer : programm
programmers : programm

The code above imports the PorterStemmer class from NLTK and creates a stemmer object. Its stem() method strips known English suffixes from each word.

Code: Stemming a sentence

To stem a sentence, first tokenize it (split it into words) and then apply the stemming algorithm to each token.

from nltk.stem import PorterStemmer
# Import the tokenizer; word_tokenize relies on NLTK's Punkt tokenizer data,
# which may have to be downloaded once with nltk.download()
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

sentence = "programmer programs in a programming languages"
words = word_tokenize(sentence)

for w in words:
    print(w, ":", ps.stem(w))

Output:

programmer : programm
programs : program
in : in
a : a
programming : program
languages : languag
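
Because word_tokenize needs NLTK's tokenizer data to be available, a regular expression can serve as a rough substitute for simple English text. The sketch below assumes that splitting on runs of ASCII letters is good enough for the input; the helper name `stem_sentence` is introduced here for illustration only.

```python
import re

from nltk.stem import PorterStemmer


def stem_sentence(sentence: str) -> list[str]:
    """Split on runs of letters, then stem each token with Porter."""
    stemmer = PorterStemmer()
    # Crude tokenizer: ASCII letters only, punctuation and digits are dropped.
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [stemmer.stem(token) for token in tokens]


print(stem_sentence("programmer programs in a programming languages"))
# ['programm', 'program', 'in', 'a', 'program', 'languag']
```

This trades tokenization quality for fewer dependencies; contractions, hyphenated words, and non-English text need a proper tokenizer.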

Snowball Stemmer Algorithm

  • NLTK provides the SnowballStemmer class, which implements the Snowball stemming algorithms
  • It supports a number of languages besides English (around 15, depending on the NLTK version)
  • Its English stemmer is also known as Porter2, because it is an improved version of the Porter stemmer

Common Rules of Snowball Stemmer

ILY → ILI

LY → (removed)

SS → SS

S → (removed)

ED → E or (removed)

For example:

word: stem

hated: hate

university: univers

easily: easili

singing: sing

Code:

from nltk.stem import SnowballStemmer

# Create a Snowball stemmer for French
french_stemmer = SnowballStemmer("french")

print(french_stemmer.stem("Bonjoura"))

Output:

bonjour
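
The relationship between Snowball's English stemmer (Porter2) and the original Porter algorithm can be checked in a few lines. The sketch relies on the `languages` attribute and the "porter" option of NLTK's SnowballStemmer; the test word is the one used in NLTK's own documentation.

```python
from nltk.stem import SnowballStemmer

# Language names accepted by the constructor (the set depends on the NLTK version).
print(SnowballStemmer.languages)

porter2 = SnowballStemmer("english")   # the Porter2 algorithm
original = SnowballStemmer("porter")   # the original Porter algorithm

print(porter2.stem("generously"))   # generous
print(original.stem("generously"))  # gener
```

Here Porter2 leaves a readable stem where the original algorithm cuts further, which is one reason it is usually preferred for English.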

Lancaster Stemmer Algorithm

The Lancaster stemmer is based on the Lancaster (Paice/Husk) stemming algorithm.

  • It is fast and the most aggressive of the three stemmers covered here
  • It can reduce the vocabulary of a corpus considerably
  • It is not suitable when distinctions between related words need to be preserved

Code:

# Import the Lancaster stemmer from NLTK
from nltk.stem import LancasterStemmer

ls = LancasterStemmer()

# Words to be stemmed
words = ["program", "programs", "programmer", "programmers"]

for w in words:
    print(w, ":", ls.stem(w))

Output:

program : program
programs : program
programmer : program
programmers : program
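
To see how the three stemmers differ in aggressiveness, they can be run side by side on the same word list. This is a rough comparison sketch; the word list is the one used in this article, and the count of distinct stems is only an informal proxy for how strongly each stemmer merges word forms.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}
words = ["program", "programs", "programmer", "programmers"]

results = {}
for name, stemmer in stemmers.items():
    stems = [stemmer.stem(w) for w in words]
    results[name] = stems
    # Fewer distinct stems means the stemmer merged more word forms.
    print(f"{name:<10} {stems}  distinct={len(set(stems))}")
```

On a real corpus the same loop can be run over the whole vocabulary to estimate how much each stemmer shrinks it.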

Conclusion

This article covered the concept of stemming in text preprocessing, along with the Porter, Snowball, and Lancaster stemmers. In our next blog, we will learn about the lemmatization technique in text preprocessing.
