Introduction to Stemming
This article covers stemming, a text-processing technique, and walks through several stemming algorithms and approaches.
Keyword stemming produces search terms that are derived from a common root word, or stem. A stemmer is a program that performs this derivation. Search engines have long used stemming to treat words that share a stem as related terms when matching queries.
What is Stemming?
Stemming is a text-preprocessing technique that removes affixes from words to reduce them to a root or base form. These affixes include suffixes such as -ed, -ize, and -s, and prefixes such as de- and mis-. The resulting stem is not always a meaningful word on its own; it may be just a fragment of the original word.
For example:
The stem of Amusing, Amusement, and Amused is Amus:
- Amusing – Amus
- Amusement – Amus
- Amused – Amus
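To make the idea concrete, here is a minimal sketch of suffix stripping in plain Python. The suffix list and the `naive_stem` helper are assumptions made up for this illustration; real stemmers such as Porter apply many more rules and conditions.

```python
# Toy suffix stripper: an illustration only, not a real stemming algorithm.
# The suffix list is a hypothetical choice, ordered longest first.
SUFFIXES = ("ement", "ing", "ed", "s")

def naive_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["Amusing", "Amusement", "Amused"]:
    print(w, "->", naive_stem(w))
```

All three words reduce to the same fragment, "amus", which is not itself an English word.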
Applications of Stemming
Stemming is primarily used in tagging systems, indexing, tag mapping, SEO, web search results, and information retrieval. For example, a search for "play" may also return results containing "playing" or "played", because these words reduce to the same stem, "play".
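The search use case can be sketched as a tiny inverted index keyed by stems. This is only an illustration under stated assumptions: `toy_stem`, `build_index`, and `search` are hypothetical helpers, and `toy_stem` stands in for a real stemmer.

```python
from collections import defaultdict

def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs: dict) -> dict:
    """Map each stem to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index

def search(index: dict, query: str) -> set:
    """Look up the query by its stem rather than by its exact form."""
    return index.get(toy_stem(query), set())

docs = {
    1: "children playing in the park",
    2: "she played the piano",
    3: "a quiet evening",
}
index = build_index(docs)
print(sorted(search(index, "play")))  # [1, 2]
```

Because both the documents and the query are stemmed, "play" matches documents that contain only "playing" or "played".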
Typical Errors of Stemming
There are two main types of errors you will find in stemming:
- Overstemming
- Understemming
Over-stemming: Occurs when too much of a word is removed, so that unrelated words collapse to the same stem.
For example:
‘wander’ → ‘wand’
‘news’ → ‘new’
‘universal’, ‘universe’, ‘universities’, and ‘university’ → ‘univers’
Under-stemming: Occurs when words that share a root are not reduced to the same stem.
For example:
‘knavish’ → ‘knavish’
‘data’ → ‘dat’
‘datum’ → ‘datu’
NOTE: Data and datum share the same root, yet they are reduced to two different stems.
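Both error types can be checked mechanically if you have words grouped by the concept they should share. The sketch below is a rough illustration: the stems are copied from the examples above, while the concept groups and the `find_errors` helper are assumptions introduced here.

```python
from collections import defaultdict

# Hypothetical "gold" groups: words in one group should share a stem.
concept_groups = {
    "university": ["university", "universities"],
    "universe": ["universe", "universal"],
    "data": ["data", "datum"],
}

# Stems taken from the examples in this section.
stems = {
    "university": "univers", "universities": "univers",
    "universe": "univers", "universal": "univers",
    "data": "dat", "datum": "datu",
}

def find_errors(groups: dict, stems: dict):
    """Return (under-stemmed concepts, over-stemmed stems with their concepts)."""
    under = []
    stem_to_concepts = defaultdict(set)
    for concept, words in groups.items():
        group_stems = {stems[w] for w in words}
        if len(group_stems) > 1:       # one concept, several stems
            under.append(concept)
        for s in group_stems:
            stem_to_concepts[s].add(concept)
    # One stem shared by several concepts signals over-stemming.
    over = {s: sorted(c) for s, c in stem_to_concepts.items() if len(c) > 1}
    return sorted(under), over

under, over = find_errors(concept_groups, stems)
print("Under-stemmed:", under)  # ['data']
print("Over-stemmed:", over)    # {'univers': ['universe', 'university']}
```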
Stemming Algorithms/ Approaches
NLTK is a Python library for natural language processing. It provides several stemming algorithms, including:
- Porter Stemmer
- Snowball Stemmer
- Lancaster Stemmer
Porter’s Stemmer algorithm
The Porter stemmer is a suffix-stripping algorithm that uses predefined rules to convert words into their root forms.
- The algorithm removes and replaces well-known suffixes of English words
- The Porter stemmer is known for its speed and simplicity
- Widely used in data mining and information retrieval
Porter Stemming Rules (plural suffixes):
SSES → SS
IES → I
SS → SS
S → (removed)
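The four rules above can be sketched in plain Python as follows. This covers only this one group of plural rules, not the full Porter algorithm, and the function name `porter_step_1a` is a label chosen for this illustration.

```python
def porter_step_1a(word: str) -> str:
    """Apply the four plural rules listed above, trying the longest suffix first."""
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            # Only the first matching rule is applied.
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The SS → SS rule looks redundant, but it stops the final S → (removed) rule from turning "caress" into "cares".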
For example, the Porter stemmer produces the following stems:
Word : Stem
program : program
programs : program
programming : program
Code:
# Import the Porter stemmer from NLTK
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Words to be stemmed
words = ["program", "programs", "programmer", "programmers"]
for w in words:
    print(w, ":", ps.stem(w))
Output:
program : program
programs : program
programmer : programm
programmers : programm
The code above imports the PorterStemmer class from the NLTK library, creates an instance, and calls its stem() method on each word to strip known English suffixes.
Code: Stemming a sentence
To stem a sentence, first tokenize it (split it into words), then apply the stemming algorithm to each token.
from nltk.stem import PorterStemmer
# Import the tokenizer; it needs NLTK's Punkt data, e.g. nltk.download('punkt')
# ('punkt_tab' on newer NLTK releases)
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
sentence = "programmer programs in a programming languages"
words = word_tokenize(sentence)
for w in words:
    print(w, ":", ps.stem(w))
Output:
programmer : programm
programs : program
in : in
a : a
programming : program
languages : languag
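The tokenize-then-stem pattern does not depend on a particular tokenizer or stemmer. As a rough sketch, the hypothetical helper `stem_sentence` below splits text with a simple regular expression and accepts any stemming function; `strip_plural` is a placeholder used only so the example runs without NLTK.

```python
import re
from typing import Callable, List, Tuple

def stem_sentence(sentence: str, stem: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Tokenize with a simple regex, then pair each token with its stem."""
    tokens = re.findall(r"[A-Za-z]+", sentence)  # keeps alphabetic runs only
    return [(token, stem(token)) for token in tokens]

def strip_plural(word: str) -> str:
    """Placeholder stemmer for the demo: strips a single trailing 's'."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

pairs = stem_sentence("The programmer writes programs in two languages.", strip_plural)
for token, stemmed in pairs:
    print(token, ":", stemmed)
```

With NLTK installed, the stem method of a PorterStemmer instance could be passed as the `stem` argument instead of the placeholder.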
Snowball Stemmer Algorithm
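- NLTK provides the SnowballStemmer class, which implements the Snowball stemming algorithms
- Supports more than a dozen languages in addition to English (see SnowballStemmer.languages)
- The English Snowball stemmer is also known as the Porter2 stemming algorithm, as it is an improved version of the Porter stemmer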
Common Rules of Snowball Stemmer
ILY → ILI
LY → (removed)
SS → SS
S → (removed)
ED → E or (removed)
For example:
Word Stem
hated hate
university univers
easily easili
singing sing
Code:
from nltk.stem import SnowballStemmer

# Create a Snowball stemmer for French
french_stemmer = SnowballStemmer("french")
print(french_stemmer.stem("Bonjoura"))
Output:
bonjour
Lancaster Stemmer Algorithm
The Lancaster stemmer is a word stemmer based on the Lancaster (Paice/Husk) stemming algorithm.
- The Lancaster stemmer is fast and the most aggressive of the three stemming algorithms
- Reduces the vocabulary of your corpus to a great extent
- Not suitable when finer distinctions between words must be preserved
Code:
# Import the Lancaster stemmer from NLTK
from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()

# Words to be stemmed
words = ["program", "programs", "programmer", "programmers"]
for w in words:
    print(w, ":", lancaster.stem(w))
Output:
program : program
programs : program
programmer : program
programmers : program
Conclusion
This article covered the concept of stemming in text preprocessing. The next blog will cover lemmatization, another text-preprocessing technique.