Tokenization in NLP | Techniques to Apply Tokenization in Python
In this blog, we will discuss the process of tokenization, which involves breaking down a string of text into smaller units called tokens. This is an important step in natural language processing tasks.
Tokenization is the process of breaking a text down into smaller pieces called “tokens.” These tokens can be words, phrases, or even individual letters or numbers. Tokenization is widely used in natural language processing (NLP), which is how computers understand and analyze human language. It helps computers work out the meaning of words and how they relate to each other in a sentence. Think of it as taking a puzzle apart and putting it back together again, except with words instead of puzzle pieces!
Here’s an example of tokenization:
Sentence: “The quick brown fox jumps over the lazy dog.”
If we were to tokenize this sentence, we might break it down into the following tokens:
- “The”
- “quick”
- “brown”
- “fox”
- “jumps”
- “over”
- “the”
- “lazy”
- “dog”
Each of these tokens is a smaller piece of the larger sentence. We can then use these tokens to analyze the sentence and understand its meaning. For example, we might use them to identify the subject of the sentence (“fox”) or the verb (“jumps”). Tokenization is a useful tool for helping computers understand and analyze human language.
Why is Tokenization Essential for NLP?
Tokenization is essential for natural language processing (NLP) because it allows computers to understand and analyze human language. Without tokenization, it would be difficult for computers to identify the individual words and phrases in a sentence and understand their meaning.
Here are a few specific reasons why tokenization is essential for NLP:
- Preprocessing text data.
- Building a vocabulary (see the short sketch after this list).
- Training a language model.
- Supporting real-world NLP applications such as language translation, text summarization, and sentiment analysis.
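To make the vocabulary point concrete, here is a minimal sketch that tokenizes a couple of sentences with a plain split() and counts the unique tokens. The sentences, variable names, and the use of collections.Counter are illustrative choices, not part of any particular library’s API.
from collections import Counter

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "The fox was very happy.",
]

# Tokenize each sentence with a simple whitespace split (lowercased)
tokens = [word.lower() for sentence in sentences for word in sentence.split()]

# Count token frequencies to build a small vocabulary
vocab = Counter(tokens)
print(vocab.most_common(5))
Notice that a plain whitespace split leaves punctuation attached to words (“dog.”), which is one reason the dedicated tokenizers covered below are usually preferred.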
Types of Tokenization
There are several different types of tokenization that are commonly used in natural language processing (NLP). Here are some examples of different types of tokenization, along with explanations and examples:
1. Word Tokenization:
This technique divides a piece of text into individual words. For example,
- Sentence: “The quick brown fox jumps over the lazy dog”
- Tokens: “The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”.
2. Sentence Tokenization:
This technique involves breaking down a piece of text into individual sentences. For example,
- Paragraph: “The quick brown fox jumps over the lazy dog. It was a sunny day. The fox was very happy.”
- Tokens: “The quick brown fox jumps over the lazy dog.”, “It was a sunny day.”, “The fox was very happy.”
3. N-gram Tokenization:
N-gram tokenization involves creating contiguous sequences of n words from a piece of text (a bi-gram uses n = 2), as sketched in the code after this example.
- Sentence: “The quick brown fox jumps over the lazy dog”
- Token bi-gram: “The quick”, “quick brown”, “brown fox”, “fox jumps”, “jumps over”, “over the”, “the lazy”, “lazy dog”.
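As a rough sketch of how such n-grams can be generated, the snippet below slides a window of n adjacent word tokens across the sentence; the function name make_ngrams is just an illustrative choice.
def make_ngrams(text, n=2):
    # Split on whitespace, then join each run of n adjacent tokens
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The quick brown fox jumps over the lazy dog"
print(make_ngrams(sentence, n=2))
# ['The quick', 'quick brown', 'brown fox', 'fox jumps', ...]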
4. Stemming:
Strictly speaking, stemming is a normalization step applied to tokens rather than a way of splitting text, but it is usually discussed alongside tokenization. It reduces a word to its base form, or stem.
- For example, the stem of the word “jumps” is “jump”, and the stem of the word “jumping” is also “jump”.
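Here is a minimal sketch using NLTK’s PorterStemmer (assuming NLTK is installed); other stemmers exist, but this is the most commonly cited one.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of the same word reduce to the same stem
for word in ["jumps", "jumping", "jumped"]:
    print(word, "->", stemmer.stem(word))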
5. Lemmatization:
Lemmatization is similar to stemming, but it reduces a word to its dictionary base form (lemma) while also taking the word’s part of speech into account. It is often used in NLP because it can produce more meaningful and accurate tokens than stemming.
- For example, the lemma of the verb “jumps” is “jump”, and the lemma of the adjective “better” is “good”, a mapping a simple stemmer cannot recover.
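A minimal sketch using NLTK’s WordNetLemmatizer (assuming NLTK is installed; the WordNet data must be downloaded once, and the pos argument tells the lemmatizer which part of speech to assume):
import nltk
from nltk.stem import WordNetLemmatizer

# The WordNet corpus is needed the first time this is run
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("jumps", pos="v"))    # jump
print(lemmatizer.lemmatize("jumping", pos="v"))  # jump
print(lemmatizer.lemmatize("better", pos="a"))   # good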
6. White space Tokenization:
This technique involves dividing a piece of text into tokens based on white space characters, such as spaces, tabs, and newline characters. For example,
- Sentence: “The quick brown fox jumps over the lazy dog”
- Word tokens: “The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”.
7. Punctuation Tokenization:
This technique involves dividing a piece of text into tokens based on punctuation marks, such as periods, commas, and exclamation points. For example,
- Sentence : “The quick brown fox jumps over the lazy dog!”
- Tokens: “The”, “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”, “!”.
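One convenient way to get this behaviour is NLTK’s wordpunct_tokenize, which separates runs of word characters from runs of punctuation (a quick sketch, assuming NLTK is installed):
from nltk.tokenize import wordpunct_tokenize

sentence = "The quick brown fox jumps over the lazy dog!"
print(wordpunct_tokenize(sentence))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '!']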
8. Regular expression Tokenization:
This technique uses a regular expression pattern to divide a text into tokens.
- For example, you could use the regular expression “\w+” to tokenize a piece of text into words, or “\d+” to tokenize it into numbers.
Different Techniques to Apply Tokenization in Python
Tokenization Using Split Function
Python’s built-in split() string method is a very simple and easy way to tokenize a string. It splits a string into a list of substrings based on a specified delimiter. For example:
string = "This is a sentence. Here is another one."tokens = string.split()print(tokens)
Output:
['This', 'is', 'a', 'sentence.', 'Here', 'is', 'another', 'one.']
By default, the split() function uses whitespace characters as the delimiter, but you can specify a different delimiter if you want. For example:
string = "This,is,a,sentence. Here is another one."tokens = string.split(",")print(tokens)
Output:
['This', 'is', 'a', 'sentence. Here is another one.']
Tokenization Using Regular Expressions (RegEx)
Regular expressions (RegEx) can also be used for tokenization in Python. RegEx is a powerful tool for matching patterns in text, and it can be used to extract tokens from a string based on specific patterns.
To use RegEx for tokenization, you will need to import the re module and use the re.split() function. For example:
import re
string = "This is a sentence. Here is another one."
# The \s+ RegEx pattern matches one or more whitespace characters and
# splits the string on any sequence of whitespace characters
tokens = re.split(r'\s+', string)
print(tokens)
Output:
['This', 'is', 'a', 'sentence.', 'Here', 'is', 'another', 'one.']
You can modify the RegEx pattern to tokenize the string in different ways. For example, the following code uses re.findall() to pull out runs of word characters and keep punctuation as separate tokens:
import re
string = "This is a sentence. Here is another one."
# The \w+ pattern matches any sequence of word characters (letters, digits, and underscores),
# while \S+ picks up remaining non-whitespace characters such as punctuation
tokens = re.findall(r'\w+|\S+', string)
print(tokens)
Output:
['This', 'is', 'a', 'sentence', '.', 'Here', 'is', 'another', 'one', '.']
Tokenization Using NLTK
The Natural Language Toolkit (NLTK) is a popular Python library for natural language processing that provides a number of tools for tokenization. To use NLTK for tokenization, you will first need to install the library and download the ‘punkt’ tokenizer data with nltk.download('punkt'). Once installed, you can use the word_tokenize function from the nltk.tokenize module to tokenize a string. For example:
import nltk
string = "This is a sentence. Here is another one."tokens = nltk.word_tokenize(string)print(tokens)
Output:
['This', 'is', 'a', 'sentence', '.', 'Here', 'is', 'another', 'one', '.']
You can also use the sent_tokenize function to split the string into a list of sentences and the regexp_tokenize function to split the string using a regular expression. For example:
import nltk
string = "This is a sentence. Here is another one."
# Tokenize into sentences
sents = nltk.sent_tokenize(string)
print(sents)

# Tokenize using a regular expression
tokens = nltk.regexp_tokenize(string, r'\w+')
print(tokens)
Output:
['This is a sentence.', 'Here is another one.']
['This', 'is', 'a', 'sentence', 'Here', 'is', 'another', 'one']
Tokenization Using Spacy
SpaCy is a popular Python library for natural language processing that provides a fast and efficient way to tokenize text. Here’s an example of how to use SpaCy for tokenization:
# Import the spacy module and load the small English language model
# (install it first with: python -m spacy download en_core_web_sm)
import spacy
nlp = spacy.load("en_core_web_sm")
# Create a Doc object from the text
text = "This is a sentence. Here is another one."
doc = nlp(text)

# Iterate over the tokens in the Doc and print their text
for token in doc:
    # Each token is an object with various properties and methods
    # The `text` attribute returns the token's text
    print(token.text)

# Iterate over the sentences in the Doc
for sent in doc.sents:
    # Each sentence is a Span object with various properties and methods
    # The `text` attribute returns the sentence's text
    print(sent.text)

# Iterate over the noun chunks in the Doc
for chunk in doc.noun_chunks:
    # Each noun chunk is a Span object with various properties and methods
    # The `text` attribute returns the noun chunk's text
    print(chunk.text)
Output:
This
is
a
sentence
.
Here
is
another
one
.
This is a sentence.
Here is another one.
This
a sentence
Here
another one
Conclusion
Tokenization is the process of breaking down a string of text into individual tokens, which can be words, punctuation marks, or other smaller units of text. In Python, there are several ways to perform tokenization, including using the split() function, regular expressions (RegEx), the Natural Language Toolkit (NLTK), and SpaCy. Each of these methods has its own advantages and disadvantages, and the appropriate method will depend on the specific needs of your application. Regardless of the method you choose, tokenization is an important step in many natural language processing tasks, and is often used to pre-process text data before further analysis.