Introduction to Stemming

Updated on Mar 29, 2024 16:04 IST

This article covers stemming, a text-processing technique, and the main stemming algorithms and approaches.

Keyword stemming reduces search terms to a common root word, or stem. A stemmer is a program that performs this reduction. Search engines often use stemming in their algorithms to match a query against related forms of the same word, which is not the same as matching true synonyms.

What is Stemming?

Stemming is a text-preprocessing technique that removes affixes from words to reduce them to a root or base form. Typical affixes include suffixes such as -ed, -ize, and -s and prefixes such as de- and mis-. The resulting stem is not always a meaningful word by itself; it can be just a part of the word.

For example:

The stem of Amusing, Amusement, and Amused is Amus:

Amusing – Amus
Amusement – Amus
Amused – Amus
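
As a rough illustration of affix removal, the sketch below strips a single suffix from each word. The suffix list and the strip_suffix helper are hypothetical and chosen only to reproduce the example above; real stemmers apply far more rules and conditions.

```python
# Hypothetical suffix list, chosen only for this illustration.
SUFFIXES = ("ement", "ing", "ed")


def strip_suffix(word: str) -> str:
    """Lowercase the word and remove the longest matching suffix, if any."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["Amusing", "Amusement", "Amused"]:
    print(w, "->", strip_suffix(w))
```

All three words end up as the same string, which is what lets later processing treat them as one term.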

Application of Stemming

Stemming is primarily used in tagging systems, indexing, tag mapping, SEO, web search results, and information retrieval. For example, a search for play on Google can also return results for player and playing, because both terms share the root play.
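
To make the search example concrete, here is a minimal sketch of a stem-based index. The toy_stem function is a hypothetical stand-in for a real stemmer, and the three sample documents are invented for illustration; this is not how any particular search engine works.

```python
from collections import defaultdict


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Invented sample documents.
documents = {
    1: "best football player",
    2: "kids playing outside",
    3: "weather report",
}

# Index every document under the stems of its words.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.split():
        index[toy_stem(token)].add(doc_id)


def search(query: str) -> set:
    """Return the ids of documents containing a word with the query's stem."""
    return set(index.get(toy_stem(query), set()))


print(sorted(search("play")))  # [1, 2]
```

Because both the documents and the query are stemmed the same way, a query for play finds the documents that mention player and playing.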

Typical errors of stemming

There are two main types of errors in stemming:

Over-stemming
Under-stemming

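Over-stemming: Occurs when too much of a word is removed, so that the stem loses its meaning or words with different meanings are reduced to the same stem.

For example:

‘wander’ → ‘wand’

‘news’ → ‘new’

‘universal’, ‘universe’, ‘universities’, and ‘university’ → ‘univers’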

Under-stemming: Occurs when too little is removed, so that words sharing the same root are reduced to different stems.

For example:

‘knavish’ → ‘knavish’

‘data’ → ‘dat’

‘datum’ → ‘datu’

NOTE: Both data and datum have the same root, yet they end up with two separate stems.
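
One way to spot both error types is to compare the stems with a hand-made grouping of words by meaning. The sketch below uses the stems quoted in the examples above; the concept labels and the find_errors helper are assumptions made for illustration.

```python
from collections import defaultdict

# Stems as quoted in the examples above.
stems = {
    "universal": "univers",
    "universe": "univers",
    "universities": "univers",
    "university": "univers",
    "data": "dat",
    "datum": "datu",
}

# Hand-assigned meaning groups (an assumption of this sketch).
concepts = {
    "universal": "universe",
    "universe": "universe",
    "universities": "university",
    "university": "university",
    "data": "datum",
    "datum": "datum",
}


def find_errors(stems, concepts):
    """Return (over-stemmed stems, under-stemmed concepts)."""
    concepts_per_stem = defaultdict(set)
    stems_per_concept = defaultdict(set)
    for word, stem in stems.items():
        concepts_per_stem[stem].add(concepts[word])
        stems_per_concept[concepts[word]].add(stem)
    # Over-stemming: one stem covers words from several concepts.
    over = sorted(s for s, c in concepts_per_stem.items() if len(c) > 1)
    # Under-stemming: one concept is split across several stems.
    under = sorted(c for c, s in stems_per_concept.items() if len(s) > 1)
    return over, under


over, under = find_errors(stems, concepts)
print("over-stemmed stems:", over)        # ['univers']
print("under-stemmed concepts:", under)   # ['datum']
```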

Stemming Algorithms/ Approaches

NLTK is a Python library for Natural Language Processing. It provides several stemming algorithms, including:

Porter Stemmer
Snowball Stemmer
Lancaster Stemmer

Porter’s Stemmer algorithm

The Porter stemmer is a suffix-stripping algorithm. It uses predefined rules to convert words into their root forms.

The algorithm removes and replaces well-known suffixes of English words
The Porter stemmer is known for its speed and simplicity
It is mostly used in data mining and information retrieval

Porter Stemming Rules (first step, plural suffixes):

SSES → SS

IES → I

SS → SS

S → (removed)
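
The four rules above can be sketched as a small function. The name porter_step_1a is hypothetical, and the code covers only these four rules as listed; a full Porter stemmer has several further steps, and library implementations may add exceptions for short words.

```python
def porter_step_1a(word: str) -> str:
    """Apply the first matching plural rule; longer suffixes are tried first."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]  # IES -> I
    if word.endswith("ss"):
        return word       # SS -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]  # S -> removed
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

Rule order matters: checking SS before S is what keeps a word like caress from losing its final letter.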

For example:

word: stem

program: program

programs: program

programming: program

Code:

# Import the Porter stemmer from NLTK
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Words to be stemmed
words = ["program", "programs", "programmer", "programmers"]

for w in words:
    print(f"{w} : {ps.stem(w)}")

Output:

program : program
programs : program
programmer : programm
programmers : programm

The code above imports the PorterStemmer class from the NLTK library and uses it to strip known English suffixes from each word.

Code: Stemming from a sentence

To stem a sentence, first tokenize it (split it into words), then apply the stemming algorithm to each token.

from nltk.stem import PorterStemmer

# Import the tokenizer; word_tokenize needs NLTK's Punkt data,
# which may have to be fetched once with nltk.download("punkt")
# ("punkt_tab" on newer NLTK releases)
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

sentence = "programmer programs in a programming languages"
words = word_tokenize(sentence)

for w in words:
    print(f"{w} : {ps.stem(w)}")

Output:

programmer : programm
programs : program
in : in
a : a
programming : program
languages : languag
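
If downloading tokenizer data is inconvenient, the tokenize-then-stem steps can be wrapped in a small helper that uses a regular expression instead. The stem_sentence function is a hypothetical name, and the regex tokenizer is a simplification that keeps only alphabetic runs, so it is a sketch and not a replacement for word_tokenize.

```python
import re

from nltk.stem import PorterStemmer


def stem_sentence(sentence: str) -> list[str]:
    """Split a sentence into alphabetic tokens and stem each one."""
    stemmer = PorterStemmer()
    # Simple regex tokenizer (an assumption of this sketch); it drops
    # punctuation and digits, unlike word_tokenize.
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [stemmer.stem(token) for token in tokens]


print(stem_sentence("programmer programs in a programming languages"))
```

The returned list lines up with the stems in the output above, one entry per token.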

Snowball Stemmer Algorithm

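NLTK provides the SnowballStemmer class, which implements the Snowball stemming algorithms
Supports English plus more than a dozen other languages (the supported names are listed in SnowballStemmer.languages)
The English Snowball stemmer is also known as the Porter2 algorithm, as it is an improved version of the Porter stemmer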

Common Rules of Snowball Stemmer

ILY → ILI

LY → (removed)

SS → SS

S → (removed)

ED → E or (removed)

For example:

word: stem

hated: hate

university: univers

easily: easili

singing: sing
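
The examples above can be checked with NLTK's English Snowball stemmer. This is a minimal sketch; the variable name english_stemmer is arbitrary.

```python
from nltk.stem import SnowballStemmer

# "english" selects the Porter2 rules.
english_stemmer = SnowballStemmer("english")

for word in ["hated", "university", "easily", "singing"]:
    print(f"{word}: {english_stemmer.stem(word)}")

# Language names accepted by the SnowballStemmer constructor.
print(SnowballStemmer.languages)
```

Printing SnowballStemmer.languages is a quick way to see which language names your installed NLTK version accepts.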

Code:

from nltk.stem import SnowballStemmer

# Create a stemmer for French
french_stemmer = SnowballStemmer("french")

print(french_stemmer.stem("Bonjoura"))

Output:

bonjour

Lancaster Stemmer Algorithm

A word stemmer based on the Lancaster (Paice/Husk) stemming algorithm.

The Lancaster stemmer is fast and the most aggressive of the three stemmers discussed here
It reduces the vocabulary of your corpus to a great extent
It is not suitable when finer distinctions between words must be preserved

Code:

# Import the Lancaster stemmer from NLTK
from nltk.stem import LancasterStemmer

ls = LancasterStemmer()

# Words to be stemmed
words = ["program", "programs", "programmer", "programmers"]

for w in words:
    print(f"{w} : {ls.stem(w)}")

Output:

program : program
programs : program
programmer : program
programmers : program
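
To see how the three stemmers differ on the same input, they can be run side by side. The compare helper below is a hypothetical convenience function, and the word list is the one used in the examples above.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}


def compare(words):
    """Return {word: {stemmer_name: stem}} for side-by-side inspection."""
    return {
        w: {name: stemmer.stem(w) for name, stemmer in stemmers.items()}
        for w in words
    }


results = compare(["program", "programs", "programmer", "programmers"])
for word, stems in results.items():
    print(word, stems)
```

Running a comparison like this on a sample of your own vocabulary is a practical way to choose a stemmer: the more words collapse to one stem, the more aggressive the algorithm.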

Conclusion

This article covered the concept of stemming in text preprocessing. In our next blog, we will look at lemmatization, another text-preprocessing technique.

About the Author

Shiksha Online

