Tokenization in NLP | Techniques to Apply Tokenization in Python
In this blog, we will discuss the process of tokenization, which involves breaking down a string of text into smaller units called tokens. This is an important step in natural language processing tasks.
Tokenization is the process of breaking a text down into smaller pieces called “tokens.” These tokens can be words, phrases, or even individual letters or numbers. Tokenization is widely used in natural language processing (NLP), which is how computers understand and analyze human language. It helps computers work out the meaning of words and how they relate to each other in a sentence. Think of it as taking a puzzle apart and putting it back together again, except with words instead of puzzle pieces!
Here’s an example of tokenization:
Sentence: “The quick brown fox jumps over the lazy dog.”
If we were to tokenize this sentence, we might break it down into the following tokens:
- “The”
- “quick”
- “brown”
- “fox”
- “jumps”
- “over”
- “the”
- “lazy”
- “dog”
Each of these tokens is a smaller piece of the larger sentence. We can then use these tokens to analyze the sentence and understand its meaning. For example, we might use them to identify the subject of the sentence (“fox”) or the verb (“jumps”). Tokenization is a useful tool for helping computers understand and analyze human language.
Why is Tokenization Essential for NLP?
Tokenization is essential for natural language processing (NLP) because it allows computers to understand and analyze human language. Without tokenization, it would be difficult for computers to identify the individual words and phrases in a sentence and understand their meaning.
Here are a few specific reasons why tokenization is essential for NLP:
- Preprocessing text data.
- Building a vocabulary (see the short sketch after this list).
- Training a language model.
- Supporting real-world NLP applications such as language translation, text summarization, and sentiment analysis.
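To make the vocabulary point concrete, here is a minimal sketch that tokenizes a couple of sentences with a plain split() and counts the unique tokens. The sentences, variable names, and the use of collections.Counter are illustrative choices, not part of any particular library’s API.
from collections import Counter

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The fox was very happy.",
]

# Tokenize each sentence with a simple whitespace split (lowercased)
tokens = [word.lower() for sentence in sentences for word in sentence.split()]

# Count token frequencies to build a small vocabulary
vocab = Counter(tokens)
print(vocab.most_common(5))
Notice that a plain whitespace split leaves punctuation attached to words (“dog.”), which is one reason the dedicated tokenizers covered below are usually preferred.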
Types of Tokenization
There are several different types of tokenization that are commonly used in natural language processing (NLP). Here are some examples of different types of tokenization, along with explanations and examples:
1. Word Tokenization:
This technique divides a piece of text into individual words. For example,
- Sentence: “The quick brown fox jumps over the lazy dog”
- Tokens: “The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”.
2. Sentence Tokenization:
This technique involves breaking down a piece of text into individual sentences. For example,
- Paragraph: “The quick brown fox jumps over the lazy dog. It was a sunny day. The fox was very happy.”
- Tokens: “The quick brown fox jumps over the lazy dog.”, “It was a sunny day.”, “The fox was very happy.”
3. N-gram Tokenization:
N-gram tokenization involves creating contiguous sequences of n words from a piece of text (a bi-gram uses n = 2), as sketched in the code after this example.
- Sentence: “The quick brown fox jumps over the lazy dog”
- Token bi-gram: “The quick”, “quick brown”, “brown fox”, “fox jumps”, “jumps over”, “over the”, “the lazy”, “lazy dog”.
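As a rough sketch of how such n-grams can be generated, the snippet below slides a window of n adjacent word tokens across the sentence; the function name make_ngrams is just an illustrative choice.
def make_ngrams(text, n=2):
    # Split on whitespace, then join each run of n adjacent tokens
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The quick brown fox jumps over the lazy dog"
print(make_ngrams(sentence, n=2))
# ['The quick', 'quick brown', 'brown fox', 'fox jumps', ...]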
4. Stemming:
Strictly speaking, stemming is a normalization step applied to tokens rather than a way of splitting text, but it is usually discussed alongside tokenization. It reduces a word to its base form, or stem.
- For example, the stem of the word “jumps” is “jump”, and the stem of the word “jumping” is also “jump”.
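Here is a minimal sketch using NLTK’s PorterStemmer (assuming NLTK is installed); other stemmers exist, but this is the most commonly cited one.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of the same word reduce to the same stem
for word in ["jumps", "jumping", "jumped"]:
    print(word, "->", stemmer.stem(word))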
5. Lemmatization:
Lemmatization is similar to stemming, but it reduces a word to its dictionary base form (lemma) while also taking the word’s part of speech into account. It is often used in NLP because it can produce more meaningful and accurate tokens than stemming.
- For example, the lemma of the verb “jumps” is “jump”, and the lemma of the adjective “better” is “good”, a mapping a simple stemmer cannot recover.
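A minimal sketch using NLTK’s WordNetLemmatizer (assuming NLTK is installed; the WordNet data must be downloaded once, and the pos argument tells the lemmatizer which part of speech to assume):
import nltk
from nltk.stem import WordNetLemmatizer

# The WordNet corpus is needed the first time this is run
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("jumps", pos="v"))    # jump
print(lemmatizer.lemmatize("jumping", pos="v"))  # jump
print(lemmatizer.lemmatize("better", pos="a"))   # good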
6. White space Tokenization:
This technique involves dividing a piece of text into tokens based on white space characters, such as spaces, tabs, and newline characters. For example,
- Sentence: “The quick brown fox jumps over the lazy dog”
- Word tokens: “The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”.
7. Punctuation Tokenization:
This technique involves dividing a piece of text into tokens based on punctuation marks, such as periods, commas, and exclamation points. For example,
- Sentence : “The quick brown fox jumps over the lazy dog!”
- Tokens: “The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”, “!”.
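One convenient way to get this behaviour is NLTK’s wordpunct_tokenize, which separates runs of word characters from runs of punctuation (a quick sketch, assuming NLTK is installed):
from nltk.tokenize import wordpunct_tokenize

sentence = "The quick brown fox jumps over the lazy dog!"
print(wordpunct_tokenize(sentence))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '!']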
8. Regular expression Tokenization:
This technique uses a regular expression pattern to divide a text into tokens.
- For example, you could use the regular expression “\w+” to tokenize a piece of text into words, or “\d+” to tokenize it into numbers.
Different Techniques to Apply Tokenization in Python
Tokenization Using Split Function
Python’s built-in split() string method is a very simple and easy way to tokenize a string. It splits a string into a list of substrings based on a specified delimiter. For example:
string = "This is a sentence. Here is another one."tokens = string.split()print(tokens)
Output:
['This', 'is', 'a', 'sentence.', 'Here', 'is', 'another', 'one.']
By default, the split() function uses whitespace characters as the delimiter, but you can specify a different delimiter if you want. For example:
string = "This,is,a,sentence. Here is another one."tokens = string.split(",")print(tokens)
Output:
['This', 'is', 'a', 'sentence. Here is another one.']
Tokenization Using Regular Expressions (RegEx)
Regular expressions (RegEx) can also be used for tokenization in Python. RegEx is a powerful tool for matching patterns in text, and it can be used to extract tokens from a string based on specific patterns.
To use RegEx for tokenization, you will need to import the re module and use the re.split() function. For example:
import re
string = "This is a sentence. Here is another one."
# The \s+ RegEx pattern matches one or more whitespace characters and
# splits the string on any sequence of whitespace characters
tokens = re.split(r'\s+', string)
print(tokens)
Output:
['This', 'is', 'a', 'sentence.', 'Here', 'is', 'another', 'one.']
You can modify the RegEx pattern to tokenize the string in different ways. For example, the following code uses re.findall() to pull out runs of word characters and keep punctuation as separate tokens:
import re
string = "This is a sentence. Here is another one."
# The \w+ pattern matches any sequence of word characters (letters, digits, and underscores),
# while \S+ picks up remaining non-whitespace characters such as punctuation
tokens = re.findall(r'\w+|\S+', string)
print(tokens)
Output:
['This', 'is', 'a', 'sentence', '.', 'Here', 'is', 'another', 'one', '.']
Tokenization Using NLTK
The Natural Language Toolkit (NLTK) is a popular Python library for natural language processing that provides a number of tools for tokenization. To use NLTK for tokenization, you will first need to install the library and download the ‘punkt’ tokenizer data with nltk.download('punkt'). Once installed, you can use the word_tokenize function from the nltk.tokenize module to tokenize a string. For example:
import nltk
string = "This is a sentence. Here is another one."tokens = nltk.word_tokenize(string)print(tokens)
Output:
['This', 'is', 'a', 'sentence', '.', 'Here', 'is', 'another', 'one', '.']
You can also use the sent_tokenize function to split the string into a list of sentences and the regexp_tokenize function to split the string using a regular expression. For example:
import nltk
string = "This is a sentence. Here is another one."
# Tokenize into sentences
sents = nltk.sent_tokenize(string)
print(sents)

# Tokenize using a regular expression
tokens = nltk.regexp_tokenize(string, r'\w+')
print(tokens)
Output:
['This is a sentence.', 'Here is another one.']
['This', 'is', 'a', 'sentence', 'Here', 'is', 'another', 'one']
Tokenization Using Spacy
SpaCy is a popular Python library for natural language processing that provides a fast and efficient way to tokenize text. Here’s an example of how to use SpaCy for tokenization:
# Import the spacy module and load the small English language model
# (install it first with: python -m spacy download en_core_web_sm)
import spacy
nlp = spacy.load("en_core_web_sm")
# Create a Doc object from the text
text = "This is a sentence. Here is another one."
doc = nlp(text)

# Iterate over the tokens in the Doc and print their text
for token in doc:
    # Each token is an object with various properties and methods
    # The `text` attribute returns the token's text
    print(token.text)

# Iterate over the sentences in the Doc
for sent in doc.sents:
    # Each sentence is a Span object with various properties and methods
    # The `text` attribute returns the sentence's text
    print(sent.text)

# Iterate over the noun chunks in the Doc
for chunk in doc.noun_chunks:
    # Each noun chunk is a Span object with various properties and methods
    # The `text` attribute returns the noun chunk's text
    print(chunk.text)
Output:
This
is
a
sentence
.
Here
is
another
one
.
This is a sentence.
Here is another one.
This
a sentence
Here
another one
Conclusion
Tokenization is the process of breaking down a string of text into individual tokens, which can be words, punctuation marks, or other smaller units of text. In Python, there are several ways to perform tokenization, including using the split() function, regular expressions (RegEx), the Natural Language Toolkit (NLTK), and SpaCy. Each of these methods has its own advantages and disadvantages, and the appropriate method will depend on the specific needs of your application. Regardless of the method you choose, tokenization is an important step in many natural language processing tasks, and is often used to pre-process text data before further analysis.