Text Pre-processing For Spam Filtering (With Codes)
by Sahib Singh
“Garbage in, garbage out” is one of the key fundamentals of machine learning, and natural language processing (NLP) is no exception. NLP is the branch of data science in which natural language data is fed to machine learning algorithms for various purposes, such as:
- Sentiment Analysis
- Spam filtering
- Named entity recognition
- Part of speech tagging etc.
In this blog, I will cover some very useful text pre-processing techniques, and we will see how text pre-processing can make a difference in your final results.
The pre-processing steps we will cover in this blog are:
- Lower casing
- Removal of punctuations
- Removal of stop words
- Removal of frequent words
- Stemming
- Lemmatization
- Conversion of Emoticons to words
- Removal of URLs
- Removal of HTML tags
- Spelling Correction
- Removal of rare words
Remember that we do not have to use every text pre-processing technique on our data; for each use case we have to carefully select the techniques that work best for us.
For example, in the case of sentiment analysis it is better not to remove emoticons, since emojis convey very important information; instead, it is useful to convert each emoticon back into text.
For this blog, we will pick up an email spam filtering dataset where the goal is to classify whether an email is spam or not. Let’s look at our data:
import numpy as np
import pandas as pd

df = pd.read_csv('/content/train-dataset.csv')

# shuffling all our data
df = df.sample(frac=1)

# reading only Message_body and Label
df = df[['Message_body', 'Label']]
df
1. Lower Casing
- It is a text pre-processing technique where all words are lowercased so that words like ‘cat’ and ‘CAT’ are treated the same way. This technique comes in handy when we use Bag of Words or TF-IDF to make features out of our natural language data.
- It might not be helpful for part-of-speech tagging (where nouns can be differentiated based on the case of the text) or sentiment analysis (where capitalized words generally depict anger), so for those tasks it is recommended not to use this technique in your text pre-processing.
# creating a lowercased copy of the message text
df['clean_msg'] = df['Message_body'].apply(lambda x: x.lower())
2. Removal of Punctuations
- It is also a text pre-processing technique where we remove unnecessary punctuation symbols, because their presence does not add any meaning to our text data. For example, ‘yippee’ and ‘yippee!’ both convey happiness and excitement, so the exclamation mark is of no use here.
- We will remove the punctuation marks listed in string.punctuation.
- But you can always add more symbols based on your use case.
# library that contains punctuation
import string

# list of all punctuation characters
print(string.punctuation)

# defining the function to remove punctuation
def remove_punctuation(text):
    punctuationfree = "".join([i for i in text if i not in string.punctuation])
    return punctuationfree

# storing the punctuation-free text
df['clean_msg'] = df['clean_msg'].apply(lambda x: remove_punctuation(x))
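As a quick, purely illustrative check, the function strips the exclamation mark from the earlier ‘yippee!’ example (the second string is my own example):

print(remove_punctuation("yippee!"))          # -> yippee
print(remove_punctuation("Hello, world!!!"))  # -> Hello world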
3. Removal of Stopwords in Text Pre-processing
- Stopwords are a set of words that do not add value to a text, for example ‘a’, ‘an’, ‘the’. These words occur very frequently in our text data but are of little use. Many libraries have compiled stop-word lists for various languages, and we can use them directly; for any specific use case we can also add our own set of stop words to the list.
a. List of stop words in NLTK

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
", ".join(stopwords.words('english'))
b. Code for removal of stop words
Before removing stop words, we need to tokenize the text.
# defining a function for tokenization
import re

# whitespace tokenizer
from nltk.tokenize import WhitespaceTokenizer

def tokenization(text):
    tk = WhitespaceTokenizer()
    return tk.tokenize(text)

# applying the function to the column to create tokens
df['tokenised_clean_msg'] = df['clean_msg'].apply(lambda x: tokenization(x))
Now stop words removal
# importing the nltk library
import nltk
nltk.download('stopwords')

# stop words present in the library
stopwords = nltk.corpus.stopwords.words('english')

# defining the function to remove stopwords from tokenized text
def remove_stopwords(text):
    output = [i for i in text if i not in stopwords]
    return output

# applying the function for removal of stopwords
df['cleaned_tokens'] = df['tokenised_clean_msg'].apply(lambda x: remove_stopwords(x))
4. Removal of Frequent Words
We have already removed stop words, but in some cases it is also worth removing the most frequent words in the corpus itself, since they carry little useful information. The most frequent words in our corpus are:
from collections import Counter

cnt = Counter()
for text in df["cleaned_tokens"].values:
    for word in text:
        cnt[word] += 1

cnt.most_common(10)
from collections import Counter

cnt = Counter()
for text in df["cleaned_tokens"].values:
    for word in text:
        cnt[word] += 1

FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])

def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in text if word not in FREQWORDS])
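The snippet above only defines the helper. As a minimal sketch of applying it (the column name msg_wo_freqwords is my own, hypothetical choice; note that the helper joins the surviving tokens back into a single string, so we keep the result in a separate column rather than overwriting the token list):

# applying the helper to the tokenised messages (sketch)
# msg_wo_freqwords is a hypothetical column name used here for illustration
df['msg_wo_freqwords'] = df['cleaned_tokens'].apply(lambda x: remove_freqwords(x))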
5. Stemming in Text Pre-processing
Stemming is a text standardization technique where a word is reduced to its stem/base form. Example: “jabbing” → “jab” and “kicking” → “kick”. The main aim of stemming is to reduce the vocabulary size before feeding the text into a machine learning model.
# importing the stemming function from the nltk library
from nltk.stem.porter import PorterStemmer

# defining the object for stemming
porter_stemmer = PorterStemmer()

# defining a function for stemming
def stemming(text):
    stem_text = [porter_stemmer.stem(word) for word in text]
    return stem_text

# applying the function for stemming
df['cleaned_tokens'] = df['cleaned_tokens'].apply(lambda x: stemming(x))
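To see the stemmer’s behaviour on the example words mentioned above, here is a quick, purely illustrative check:

# quick check of the Porter stemmer on the example words from above
print(porter_stemmer.stem("jabbing"))   # -> jab
print(porter_stemmer.stem("kicking"))   # -> kick
print(porter_stemmer.stem("copying"))   # -> copi (see the disadvantage below)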
The disadvantage of stemming is that sometimes the stemmed word loses its meaning. Example: “copying” → “copi”, and there is no word “copi” in the English vocabulary.
Also, the Porter stemmer works only for English. If we are working with other languages, we can use the Snowball stemmer, which supports several languages.
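For illustration, here is a minimal sketch of NLTK's SnowballStemmer; the choice of French and the example word are my own assumptions:

from nltk.stem.snowball import SnowballStemmer

# SnowballStemmer supports several languages; French is used here only as an example
french_stemmer = SnowballStemmer("french")
print(french_stemmer.stem("continuer"))   # stem of a French verb
print(SnowballStemmer.languages)          # tuple of supported languages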
6. Lemmatization
- Lemmatization is very similar to stemming, with the difference that the word is reduced to its lemma, a form that actually has meaning in the language. Because of this extra lookup, lemmatization is generally slower than stemming.
- Let us use the WordNetLemmatizer in nltk to lemmatize our sentences.
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])
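A quick illustration of what the lemmatizer does (the words are assumed examples; note that WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed):

# without a POS tag, WordNet treats each word as a noun
print(lemmatizer.lemmatize("mice"))              # -> mouse
print(lemmatizer.lemmatize("running"))           # -> running (treated as a noun)
print(lemmatizer.lemmatize("running", pos="v"))  # -> run (treated as a verb)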
7. Conversion of Emoticons in Text Pre-processing
- We know that the use of emoticons on social media keeps increasing, so it is better to convert these emoticons into natural text so that we can get some useful content out of them.
- This method can be useful for some use cases.
- For implementation refer to the notebook at the end.
# EMOTICONS is a dictionary mapping each emoticon to its textual description
# (see the notebook at the end for the full dictionary)
def convert_emoticons(text):
    for emot in EMOTICONS:
        text = re.sub(u'(' + emot + ')', "_".join(EMOTICONS[emot].replace(",", "").split()), text)
    return text

text = "Hello :-) :-)"
convert_emoticons(text)
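The EMOTICONS dictionary itself is defined in the notebook; as a rough, hand-made stand-in (illustrative only, with just two entries of my own), it maps regex-escaped emoticons to their textual descriptions:

# a tiny, hypothetical stand-in for the full EMOTICONS dictionary
EMOTICONS = {
    r":\-\)": "Happy face smiley",
    r":\-\(": "Frown, sad",
}

print(convert_emoticons("Hello :-) :-)"))  # -> Hello Happy_face_smiley Happy_face_smiley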
8. Removal of URLs
- The next pre-processing step is to remove any URLs present in the data. For example, if we are doing a news data analysis, there is a very good chance that the articles contain URLs, and we will probably need to remove them for further analysis. We can use the code snippet below to do that.
- In this example, we remove http/https links as well as www links.
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)
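A quick check on an assumed example sentence:

# removing an http link and a www link (the URLs are made up)
print(remove_urls("Read more at https://example.com/article or www.example.com"))
# -> "Read more at  or "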
9. Remove HTML tags
While scraping data from different websites, there is a very high chance that we pick up HTML tags along with the text, and it is useful to remove those tags before any further processing. We will use regular expressions to remove the HTML tags.
def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

text = """<div> <h1> Data</h1> <p> News articles</p> """
print(remove_html(text))
Output:
Data News articles
10. Spelling Correction
- Typos are very common on social media and on other platforms where users unintentionally type wrong spellings, and it is extremely useful to correct those spellings so that you can make a better analysis of the textual data.
- If we are interested in writing a spelling corrector of our own, we can probably start with the famous code from Peter Norvig.
from spellchecker import SpellChecker

spell = SpellChecker()

def correct_spellings(text):
    corrected_text = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
            corrected_text.append(spell.correction(word))
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)
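This uses the pyspellchecker package (installable with pip install pyspellchecker). A quick sketch of the function in action, with a deliberately misspelled sentence of my own:

# correcting a deliberately misspelled sentence
print(correct_spellings("speling correctin is usefull"))
# expected to print something like: spelling correction is useful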
11. Removal of Rare words
This approach is very specific to the use case: words that appear only a handful of times in the corpus often add noise rather than signal, so removing them can, in some cases, improve your scores on the chosen metrics.
from collections import Counter

cnt = Counter()
for text in df["cleaned_tokens"].values:
    for word in text:
        cnt[word] += 1

n_rare_words = 10
Rare_words = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]])

def remove_rarewords(text):
    """custom function to remove the rare words"""
    return " ".join([word for word in str(text).split() if word not in Rare_words])
So these are some of the text pre-processing techniques you can use whenever you are dealing with textual data. Depending on your data, the right set of techniques will change, and you can also create your own text pre-processing steps.
Why Are They Useful?
To understand the impact of these techniques we will pick up a problem and will compare the results with and without text pre-processing.
The problem we will look at today is spam filtering in emails, and we will use the Bag of Words model that we described in detail over here. Briefly, Bag of Words is a vectorisation algorithm where a text document is converted into a vector by counting how many times each word appears in the document.
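As a minimal sketch of that idea, using scikit-learn's CountVectorizer (the two toy messages are my own examples):

from sklearn.feature_extraction.text import CountVectorizer

# two toy messages to illustrate the Bag of Words representation
docs = ["free entry win win prize", "see you at the meeting tomorrow"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # word counts per document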
For this problem, I used a few of the text pre-processing techniques from above, and we got a good improvement in our results.
Results Time
- When we did not use any text pre-processing, we got an accuracy of roughly 88%.
- When we used text pre-processing, we got an accuracy of roughly 92%, an improvement of about 4.5% over the baseline, which is fair enough for this dataset.
Hope this blog helps you understand what text pre-processing is, and how you can understand your data and build your own text pre-processing pipeline.
I have attached both the notebooks below with and without text pre-processing for your convenience.