Tokenization in NLP | Techniques to Apply Tokenization in Python

Atul Harsha
Senior Manager Content
Updated on Jan 7, 2023 13:42 IST

In this blog, we will discuss the process of tokenization, which involves breaking down a string of text into smaller units called tokens. This is an important step in natural language processing tasks.

Tokenization is the process of breaking a text into smaller pieces called “tokens.” These tokens can be words, phrases, or even letters or numbers. Tokenization is usually the first step in natural language processing (NLP), the field concerned with enabling computers to analyze human language. It gives later processing steps discrete units to work with, which makes it possible to analyze what words mean and how they relate to each other in a sentence. It can be like taking a puzzle apart and putting it together again, except with words instead of puzzle pieces!

Here’s an example of tokenization:

Sentence: “The quick brown fox jumps over the lazy dog.”

If we were to tokenize this sentence, we might break it down into the following tokens:

  • “The”
  • “quick”
  • “brown”
  • “fox”
  • “jumps”
  • “over”
  • “the”
  • “lazy”
  • “dog”

Each of these tokens is a smaller piece of the larger sentence. (The final period is left out here; many tokenizers keep punctuation marks as separate tokens.) We can then use these tokens to analyze the sentence. For example, we might use the tokens to identify the subject of the sentence (“fox”) or the verb (“jumps”).

Why is Tokenization Essential for NLP?

Tokenization is essential for natural language processing (NLP) because it allows computers to understand and analyze human language. Without tokenization, it would be difficult for computers to identify the individual words and phrases in a sentence and understand their meaning.

Here are a few specific reasons why tokenization is essential for NLP:

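  1. Preprocessing text data: cleaning steps such as lowercasing or removing stop words operate on individual tokens.
  2. Building a vocabulary: the set of distinct tokens in a corpus defines the vocabulary a model works with.
  3. Training a language model: language models take sequences of tokens, not raw strings, as input.
  4. Supporting real-world NLP applications such as language translation, text summarization, and sentiment analysis.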
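
To make the vocabulary-building step concrete, here is a minimal sketch using only the Python standard library. The function name build_vocabulary and the three-sentence corpus are made up for illustration, and the whitespace split stands in for whichever tokenizer you actually use.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each distinct lowercase token to an integer id, most frequent first."""
    counts = Counter()
    for text in texts:
        # Whitespace splitting is a stand-in for a real tokenizer.
        counts.update(text.lower().split())
    # most_common() sorts by frequency; equally frequent tokens keep first-seen order.
    return {token: idx for idx, (token, _) in enumerate(counts.most_common())}

# Hypothetical toy corpus
corpus = ["the quick brown fox", "the lazy dog", "the fox jumps"]
vocab = build_vocabulary(corpus)
print(vocab)
```

A model would then see each sentence as a sequence of these integer ids instead of raw text.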

Types of Tokenization

Several types of tokenization are commonly used in natural language processing (NLP). Each is described below with an example:

1. Word Tokenization:

A piece of text is divided into individual words. For example,

  • Sentence: “The quick brown fox jumps over the lazy dog”
  • Tokens: “The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”.

2. Sentence Tokenization:

This technique involves breaking down a piece of text into individual sentences. For example,

  • Paragraph: “The quick brown fox jumps over the lazy dog. It was a sunny day. The fox was very happy.”
  • Tokens: “The quick brown fox jumps over the lazy dog.”, “It was a sunny day.”, “The fox was very happy.”
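
As a rough illustration of sentence tokenization, the sketch below splits text after a period, exclamation point, or question mark that is followed by whitespace. The function name split_sentences is made up for this example, and the rule is deliberately naive: it would wrongly split after abbreviations such as “Dr.”, which is why dedicated sentence tokenizers exist.

```python
import re

def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the punctuation mark attached to its sentence.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

paragraph = ("The quick brown fox jumps over the lazy dog. "
             "It was a sunny day. The fox was very happy.")
sentences = split_sentences(paragraph)
print(sentences)
```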

3. N-gram Tokenization:

N-gram tokenization involves creating contiguous sequences of n words from a piece of text. With n = 2, the sequences are called bigrams. For example,

  • Sentence: “The quick brown fox jumps over the lazy dog”
  • Bigram tokens: “The quick”, “quick brown”, “brown fox”, “fox jumps”, “jumps over”, “over the”, “the lazy”, “lazy dog”.
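
One way to generate n-grams in plain Python is to slide a window of size n over a list of word tokens. This is a minimal sketch; the helper name ngrams is chosen here for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, each joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The quick brown fox jumps over the lazy dog".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sentence of 9 words yields 8 bigrams and 7 trigrams. If n is larger than the number of tokens, the result is an empty list.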

4. Stemming:

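Strictly speaking, stemming is a normalization step applied to tokens rather than a way of splitting text. It reduces a word to its base form, or stem, usually by stripping suffixes, so the resulting stem is not always a dictionary word.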

  • For example, the stem of the word “jumps” is “jump”, and the stem of the word “jumping” is also “jump”.
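
The idea of suffix stripping can be sketched in a few lines. The rule set below is a toy written for this example, not the Porter algorithm or any library's stemmer, and the name naive_stem is hypothetical.

```python
def naive_stem(word):
    """Strip one common English suffix; a toy rule set, not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["jumps", "jumping", "jumped", "jump"]:
    print(word, "->", naive_stem(word))
```

All four words reduce to “jump”. The same rules turn “studies” into “studi”, which shows that a stem does not have to be a real word.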

5. Lemmatization:

Lemmatization is similar to stemming, but it reduces a word to its dictionary form, or lemma, while also taking into account the word’s part of speech. Lemmatization is often used in NLP because it can produce more meaningful and accurate tokens than stemming.

  • For example, the lemma of the verb “jumps” is “jump”, and the lemma of the noun “jumps” is also “jump”. The part of speech matters for a word like “meeting”, whose lemma is “meet” as a verb but “meeting” as a noun.
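
The role of the part of speech can be illustrated with a tiny lookup table. Everything in this sketch is hypothetical: the LEMMAS table is hand-written for the example, and real lemmatizers rely on large dictionaries and morphological rules instead.

```python
# Hypothetical hand-written lookup table, keyed by (word, part of speech).
LEMMAS = {
    ("jumps", "VERB"): "jump",
    ("jumps", "NOUN"): "jump",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}

def lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print(lemmatize("meeting", "VERB"))
print(lemmatize("meeting", "NOUN"))
```

The same surface form maps to different lemmas depending on the tag passed in, which is something a suffix-stripping stemmer cannot do.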

6. Whitespace Tokenization:

This technique involves dividing a piece of text into tokens based on whitespace characters, such as spaces, tabs, and newline characters. For example,

  • Sentence: “The quick brown fox jumps over the lazy dog”
  • Word tokens: “The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”.

7. Punctuation Tokenization:

This technique treats punctuation marks, such as periods, commas, and exclamation points, as token boundaries in addition to whitespace, and keeps each mark as a separate token. For example,

  • Sentence: “The quick brown fox jumps over the lazy dog!”
  • Tokens: “The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”, “!”.
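
A possible way to get this behavior with Python’s built-in re module is shown below. It is a sketch, with the function name tokenize_with_punctuation chosen for illustration; it returns each run of word characters and each single punctuation mark as its own token.

```python
import re

def tokenize_with_punctuation(text):
    """Return words and punctuation marks as separate tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize_with_punctuation("The quick brown fox jumps over the lazy dog!")
print(tokens)
```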

8. Regular Expression Tokenization:

This technique uses a regular expression pattern to divide a text into tokens.

  • For example, you could use the regular expression “\w+” to tokenize a piece of text into words, or “\d+” to extract the numbers it contains.
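
The two patterns can be tried out with re.findall from the standard library. In this short sketch the sample sentence is made up; note that the backslashes are part of the patterns \w+ and \d+.

```python
import re

text = "Order 66 shipped 3 boxes on day 12."
words = re.findall(r"\w+", text)    # runs of letters, digits and underscores
numbers = re.findall(r"\d+", text)  # runs of digits only
print(words)
print(numbers)
```

Because \w also matches digits, the numbers appear in both lists; only the second pattern isolates them.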

Different Techniques to Apply Tokenization in Python

Tokenization Using Split Function

Python’s built-in str.split() method is a very simple way to tokenize a string. It splits a string into a list of substrings based on a specified delimiter. For example:

 
string = "This is a sentence. Here is another one."
tokens = string.split()
print(tokens)

Output:

 
['This', 'is', 'a', 'sentence.', 'Here', 'is', 'another', 'one.']

By default, the split() function uses whitespace characters as the delimiter, but you can specify a different delimiter if you want. For example:

 
string = "This,is,a,sentence. Here is another one."
tokens = string.split(",")
print(tokens)

Output:

 
['This', 'is', 'a', 'sentence. Here is another one.']

Tokenization Using Regular Expressions (RegEx)

Regular expressions (RegEx) can also be used for tokenization in Python. RegEx is a powerful tool for matching patterns in text, and it can be used to extract tokens from a string based on specific patterns.

To use RegEx for tokenization, you will need to import the re module and use a function such as re.split() or re.findall(). For example:

 
import re
string = "This is a sentence. Here is another one."
# The \s+ pattern matches one or more whitespace characters,
# so the string is split on any run of whitespace.
tokens = re.split(r'\s+', string)
print(tokens)

Output:

 
['This', 'is', 'a', 'sentence.', 'Here', 'is', 'another', 'one.']

You can also extract tokens directly instead of splitting on delimiters. For example, the following pattern uses re.findall() to return each run of word characters, and each remaining run of non-whitespace characters such as punctuation, as its own token:

import re
string = "This is a sentence. Here is another one."
# \w+ matches a run of word characters (letters, digits, and underscores);
# \S+ then matches any other run of non-whitespace characters, such as punctuation.
tokens = re.findall(r'\w+|\S+', string)
print(tokens)

Output:

 
['This', 'is', 'a', 'sentence', '.', 'Here', 'is', 'another', 'one', '.']

Tokenization Using NLTK

The Natural Language Toolkit (NLTK) is a popular Python library for natural language processing that provides a number of tools for tokenization. To use NLTK for tokenization, you will first need to install the library (for example, with pip install nltk) and download its Punkt tokenizer data. Once installed, you can use the word_tokenize function from the nltk.tokenize module to tokenize a string. For example:

 
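import nltk
from nltk.tokenize import word_tokenize
# word_tokenize needs the Punkt tokenizer data, downloaded once;
# newer NLTK releases name this resource 'punkt_tab'.
nltk.download('punkt')
string = "This is a sentence. Here is another one."
tokens = word_tokenize(string)
print(tokens)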

Output:

 
['This', 'is', 'a', 'sentence', '.', 'Here', 'is', 'another', 'one', '.']

You can also use the sent_tokenize function to split the string into a list of sentences and the regexp_tokenize function to split the string using a regular expression. For example:

 
import nltk
string = "This is a sentence. Here is another one."
# Tokenize into sentences
sents = nltk.sent_tokenize(string)
print(sents)
# Tokenize using a regular expression (\w+ matches runs of word characters)
tokens = nltk.regexp_tokenize(string, r'\w+')
print(tokens)

Output:

 
['This is a sentence.', 'Here is another one.']
['This', 'is', 'a', 'sentence', 'Here', 'is', 'another', 'one']

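Tokenization Using spaCy

spaCy is a popular Python library for natural language processing that provides a fast and efficient way to tokenize text. In addition to the library itself, you need an English pipeline such as en_core_web_sm, which can be installed with python -m spacy download en_core_web_sm. Here’s an example of how to use spaCy for tokenization: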

 
# Import the spacy module and load the small English pipeline
# (install it first with: python -m spacy download en_core_web_sm)
import spacy
nlp = spacy.load("en_core_web_sm")
# Create a Doc object from the text
text = "This is a sentence. Here is another one."
doc = nlp(text)
# Iterate over the tokens in the Doc and print their text
for token in doc:
    # Each token is an object with various properties and methods
    # The `text` attribute returns the token's text
    print(token.text)
# Iterate over the sentences in the Doc
for sent in doc.sents:
    # Each sentence is a Span object with various properties and methods
    # The `text` attribute returns the sentence's text
    print(sent.text)
# Iterate over the noun chunks in the Doc
for chunk in doc.noun_chunks:
    # Each noun chunk is a Span object with various properties and methods
    # The `text` attribute returns the noun chunk's text
    print(chunk.text)

Output (the noun chunks at the end depend on the loaded model and may differ between versions):

 
This
is
a
sentence
.
Here
is
another
one
.
This is a sentence.
Here is another one.
This
a sentence
Here
another one

Conclusion

Tokenization is the process of breaking down a string of text into individual tokens, which can be words, punctuation marks, or other smaller units of text. In Python, there are several ways to perform tokenization, including the split() method, regular expressions (RegEx), the Natural Language Toolkit (NLTK), and spaCy. The split() method and regular expressions need no extra dependencies but leave the handling of punctuation and edge cases to you, while NLTK and spaCy provide tokenizers built for natural language at the cost of installing a library and its data. The appropriate method depends on the specific needs of your application. Whichever you choose, tokenization is usually the first pre-processing step before further analysis of text data.

About the Author
Atul Harsha
Senior Manager Content

Experienced AI and Machine Learning content creator with a passion for using data to solve real-world challenges. I specialize in Python, SQL, NLP, and Data Visualization. My goal is to make data science engaging an...