What is the Rabin-Karp Algorithm?
Before we explore the Rabin-Karp algorithm, let’s first understand what string pattern matching algorithms are.
A string pattern matching algorithm is an algorithm that is used to find specific patterns or substrings within larger strings of text. These algorithms work by comparing the characters in a pattern to the characters in a text and identifying matches between the two.
These algorithms have a wide range of applications in computer science, including:
- Text editing and search: String pattern matching algorithms are used to search for specific words or phrases within a document, and can be used to implement features like “find and replace” in text editors.
- Text analysis: String pattern matching algorithms can be used to extract information from large bodies of text, such as counting the frequency of specific words or identifying common patterns in written language.
- Data compression: String pattern matching algorithms can be used to compress data by identifying and replacing repetitive patterns with shorter representations.
- Data validation: String pattern matching algorithms can be used to validate inputs such as passwords, email addresses, or IP addresses.
- Plagiarism detection: String pattern matching algorithms can be used to detect plagiarism by comparing text across multiple documents and identifying matching patterns.
- Speech recognition: String pattern matching algorithms can be used to identify patterns in speech, such as phonemes or words, to enable speech recognition.
- Bioinformatics: String pattern matching algorithms can be used to identify patterns in DNA sequences, such as genes or regulatory regions.
There are several different types of string pattern matching algorithms, each with their own strengths and weaknesses. The most basic and well-known algorithm is the “Naive” or “Brute Force” algorithm, which simply checks every possible position in the text to see if the pattern appears there.
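As a point of reference, the brute-force approach can be sketched as follows. This is a minimal illustration, not code from a particular library; the class name `NaiveSearch` and the method name `indexOf` are hypothetical.

```java
final class NaiveSearch {
    // Returns the index of the first occurrence of pattern in text, or -1 if absent.
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        // Try every starting position where the pattern could still fit.
        for (int i = 0; i + m <= n; i++) {
            int j = 0;
            // Compare character by character until a mismatch or the pattern ends.
            while (j < m && text.charAt(i + j) == pattern.charAt(j)) {
                j++;
            }
            if (j == m) {
                return i; // all m characters matched at position i
            }
        }
        return -1;
    }
}
```

Because every position may require up to m comparisons, this sketch does O(n*m) work in the worst case, which is the cost the algorithms discussed next try to reduce.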
Other string pattern matching algorithms
Other string pattern matching algorithms include the Knuth-Morris-Pratt (KMP) algorithm, which preprocesses the pattern to reduce the number of comparisons needed; the Boyer-Moore algorithm, which uses character-matching heuristics such as a “bad-character” table to skip over positions in the text that cannot match; and the Rabin-Karp algorithm, which compares the hash value of the pattern with the hash value of the current text window, so that a character-by-character comparison is needed only when the two hash values agree.
We will now look at the Rabin-Karp algorithm in more detail.
Introduction to the Rabin-Karp Algorithm
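To make the idea of "preprocessing the pattern" concrete, here is a minimal sketch of the prefix table commonly used by KMP and of a search that uses it. The class and method names (`KmpSketch`, `prefixTable`, `indexOf`) are hypothetical, and this is one common formulation rather than the only one.

```java
final class KmpSketch {
    // lps[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    static int[] prefixTable(String pattern) {
        int m = pattern.length();
        int[] lps = new int[m];
        int len = 0; // length of the currently matched prefix
        for (int i = 1; i < m; i++) {
            // Fall back to shorter prefixes until one can be extended.
            while (len > 0 && pattern.charAt(i) != pattern.charAt(len)) {
                len = lps[len - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(len)) {
                len++;
            }
            lps[i] = len;
        }
        return lps;
    }

    // Returns the index of the first occurrence of pattern in text, or -1.
    static int indexOf(String text, String pattern) {
        int m = pattern.length();
        if (m == 0) {
            return 0;
        }
        int[] lps = prefixTable(pattern);
        int matched = 0; // number of pattern characters matched so far
        for (int i = 0; i < text.length(); i++) {
            // On a mismatch, reuse the table instead of re-reading the text.
            while (matched > 0 && text.charAt(i) != pattern.charAt(matched)) {
                matched = lps[matched - 1];
            }
            if (text.charAt(i) == pattern.charAt(matched)) {
                matched++;
            }
            if (matched == m) {
                return i - m + 1;
            }
        }
        return -1;
    }
}
```

The text index `i` never moves backwards here, which is how the table reduces the number of comparisons.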
The Rabin-Karp algorithm is a string-matching algorithm that uses a hash function to quickly identify if a given pattern appears in a text.
The algorithm’s main advantage is that it can quickly rule out most positions in the text by comparing hash values, without comparing the pattern character by character at each position. This makes it well-suited for certain types of problems, such as searching for a specific string in a large document or looking for plagiarism in a set of papers.
Working of the Rabin-Karp Algorithm
Here’s an overview of how the Rabin-Karp algorithm works:
- First, a hash value is calculated for the pattern. This is done by treating the characters in the pattern as integers, and then performing some mathematical operations on them to produce a single number (the hash value). The exact operations used will depend on the hash function being used.
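- Next, the algorithm slides a window of the same length as the pattern over the text one character at a time, starting from the first character. At each position, it calculates the hash value of the current window. With a rolling hash, this value is derived from the previous window’s hash in constant time rather than recomputed from scratch.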
- If the hash value for the substring of the text matches the hash value for the pattern, the algorithm will compare the substring to the pattern character by character to see if they are an exact match.
- If the substring and the pattern are not an exact match, the algorithm continues scanning the text, calculating new hash values for the substrings as it goes.
- If the substring and the pattern are an exact match, the algorithm reports that the pattern has been found in the text at the current position.
Code:
public static void search(String text, String pattern) {
    int d = 256; // number of characters in the input alphabet
    int q = 103; // prime modulus
    int n = text.length();
    int m = pattern.length();
    // Nothing to do if the pattern is empty or longer than the text
    if (m == 0 || m > n) {
        return;
    }
    int h = 1;
    int p = 0; // hash value for pattern
    int t = 0; // hash value for the current window of text
    // Compute h = d^(m-1) % q without overflow
    for (int i = 0; i < m - 1; i++) {
        h = (h * d) % q;
    }
    // Calculate the hash value of the pattern and of the first window of text
    for (int i = 0; i < m; i++) {
        p = (d * p + pattern.charAt(i)) % q;
        t = (d * t + text.charAt(i)) % q;
    }
    // Slide the pattern over the text one position at a time
    for (int i = 0; i <= n - m; i++) {
        // Compare characters one by one only if the hash values match
        if (p == t) {
            int j;
            for (j = 0; j < m; j++) {
                if (text.charAt(i + j) != pattern.charAt(j)) {
                    break;
                }
            }
            if (j == m) {
                System.out.println("Pattern found at index " + i);
            }
        }
        // Calculate the hash value for the next window of text:
        // remove the leading character, add the trailing character
        if (i < n - m) {
            t = (d * (t - text.charAt(i) * h) + text.charAt(i + m)) % q;
            // t may be negative here, so convert it to a positive value
            if (t < 0) {
                t = t + q;
            }
        }
    }
}
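The constant-time window update in the code above can be sanity-checked against a hash recomputed from scratch. The sketch below reuses the same `d = 256` and `q = 103`; the class and method names (`RollingHashCheck`, `hashFromScratch`, `rollingHashes`) are hypothetical and exist only for this illustration.

```java
final class RollingHashCheck {
    static final int D = 256; // alphabet size, as in the code above
    static final int Q = 103; // prime modulus, as in the code above

    // Hash of text[start .. start+m-1], computed from scratch with Horner's rule.
    static int hashFromScratch(String text, int start, int m) {
        int value = 0;
        for (int i = 0; i < m; i++) {
            value = (D * value + text.charAt(start + i)) % Q;
        }
        return value;
    }

    // Hashes of all windows of length m, each derived from the previous one in O(1).
    static int[] rollingHashes(String text, int m) {
        int n = text.length();
        if (m == 0 || m > n) {
            return new int[0];
        }
        int h = 1; // D^(m-1) mod Q
        for (int i = 0; i < m - 1; i++) {
            h = (h * D) % Q;
        }
        int[] hashes = new int[n - m + 1];
        hashes[0] = hashFromScratch(text, 0, m);
        for (int i = 0; i < n - m; i++) {
            // Drop the leading character, shift by D, append the next character.
            int t = (D * (hashes[i] - text.charAt(i) * h) + text.charAt(i + m)) % Q;
            if (t < 0) {
                t += Q; // Java's % can return a negative remainder
            }
            hashes[i + 1] = t;
        }
        return hashes;
    }
}
```

For any text and window length, `rollingHashes(text, m)[i]` should equal `hashFromScratch(text, i, m)`, which is what makes the per-window cost independent of m.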
Example:
Let’s say we have a text “ABABDABACDABABCABAB” and a pattern “ABABCABAB”. We want to find the index of the first occurrence of the pattern in the text using the Rabin-Karp algorithm.
Here’s how the algorithm would work step by step:
- First, we calculate a hash value for the pattern using a hash function.
- Next, we scan the text one character at a time, starting from the first character. At each position in the text, we calculate the hash value for the substring of the text that is the same length as the pattern (9 characters). In this case, we first calculate the hash value of the substring “ABABDABAC” (at index 0 of the text).
- Then we compare the calculated hash value of the substring to the hash value of the pattern. If they match, we compare the substring with the pattern character by character to see if they are an exact match. If the hash values do not match, or the characters differ, we move to the next index (index 1 of the text) and calculate the hash value of the substring “BABDABACD”.
- We repeat the above step for every index until we find a match. In this case, we repeat the process until we reach index 10 of the text, where the substring “ABABCABAB” matches the pattern “ABABCABAB”, so we return index 10 as the starting index of the pattern in the text.
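The walkthrough above returns the first matching index rather than printing every match. A minimal variant along those lines is sketched below, assuming the same `d = 256` and `q = 103` as the earlier code; the class and method names (`RabinKarpFirstMatch`, `firstIndex`) are hypothetical.

```java
final class RabinKarpFirstMatch {
    // Returns the index of the first occurrence of pattern in text, or -1.
    static int firstIndex(String text, String pattern) {
        final int d = 256; // alphabet size
        final int q = 103; // prime modulus
        int n = text.length();
        int m = pattern.length();
        if (m == 0) {
            return 0;
        }
        if (m > n) {
            return -1;
        }
        int h = 1; // d^(m-1) mod q
        for (int i = 0; i < m - 1; i++) {
            h = (h * d) % q;
        }
        int p = 0; // hash of the pattern
        int t = 0; // hash of the current window
        for (int i = 0; i < m; i++) {
            p = (d * p + pattern.charAt(i)) % q;
            t = (d * t + text.charAt(i)) % q;
        }
        for (int i = 0; i <= n - m; i++) {
            // Verify characters only when the hashes agree.
            if (p == t && text.regionMatches(i, pattern, 0, m)) {
                return i;
            }
            if (i < n - m) {
                t = (d * (t - text.charAt(i) * h) + text.charAt(i + m)) % q;
                if (t < 0) {
                    t += q;
                }
            }
        }
        return -1;
    }
}
```

Calling `RabinKarpFirstMatch.firstIndex("ABABDABACDABABCABAB", "ABABCABAB")` returns 10, in line with the example.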
Time complexity:
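The worst-case running time is O((n-m+1)*m). This happens when the hash value of every window matches the hash value of the pattern, so each of the n-m+1 windows is also compared character by character.
The average-case and best-case running time is O(m+n), where m is the length of the pattern and n is the length of the text.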
Space complexity: O(1)
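The worst case can be observed by counting the character comparisons made during verification. The sketch below is an instrumented variant of the search, again assuming `d = 256` and `q = 103`; `WorstCaseDemo` and `countComparisons` are hypothetical names used only for this illustration.

```java
final class WorstCaseDemo {
    // Counts the character comparisons made while verifying hash matches.
    static long countComparisons(String text, String pattern) {
        final int d = 256; // alphabet size
        final int q = 103; // prime modulus
        int n = text.length();
        int m = pattern.length();
        if (m == 0 || m > n) {
            return 0;
        }
        int h = 1; // d^(m-1) mod q
        for (int i = 0; i < m - 1; i++) {
            h = (h * d) % q;
        }
        int p = 0;
        int t = 0;
        for (int i = 0; i < m; i++) {
            p = (d * p + pattern.charAt(i)) % q;
            t = (d * t + text.charAt(i)) % q;
        }
        long comparisons = 0;
        for (int i = 0; i <= n - m; i++) {
            if (p == t) {
                for (int j = 0; j < m; j++) {
                    comparisons++;
                    if (text.charAt(i + j) != pattern.charAt(j)) {
                        break;
                    }
                }
            }
            if (i < n - m) {
                t = (d * (t - text.charAt(i) * h) + text.charAt(i + m)) % q;
                if (t < 0) {
                    t += q;
                }
            }
        }
        return comparisons;
    }
}
```

With a text of ten 'a' characters and the pattern "aaa", every one of the 8 windows matches, so the method returns 8 * 3 = 24 comparisons, which is exactly (n-m+1)*m.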
Implementation:
Text = a | a | a | a | a | b
Pattern = a | a | b
For the sake of simplicity, let’s assume addition as our hash function, with a=1, b=2, c=3, d=4, e=5, ..., z=26.
Hash value of pattern = (1+1+2) = 4
- Hash value of substring [0-2] = aaa = (1+1+1) = 3
3 != 4 -> No match
- Hash value of substring [1-3] = aaa = (1+1+1) = 3
3 != 4 -> No match
- Hash value of substring [2-4] = aaa = (1+1+1) = 3
3 != 4 -> No match
- Hash value of substring [3-5] = aab = (1+1+2) = 4
4 = 4 -> Match (the hash values agree, and the character-by-character check confirms that the pattern occurs at index 3)
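The simplified additive hash from this walkthrough can be sketched in a few lines. The letter values follow the assumption above (a=1, ..., z=26, lowercase letters only); the class and method names (`AdditiveHashDemo`, `value`, `hash`, `windowHashes`) are hypothetical.

```java
final class AdditiveHashDemo {
    // Letter value as assumed in the walkthrough: a=1, b=2, ..., z=26.
    static int value(char c) {
        return c - 'a' + 1;
    }

    // Additive hash: the sum of the letter values.
    static int hash(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += value(s.charAt(i));
        }
        return sum;
    }

    // Hashes of all windows of length m. Each one is obtained from the
    // previous one by subtracting the outgoing letter and adding the incoming one.
    static int[] windowHashes(String text, int m) {
        int n = text.length();
        if (m == 0 || m > n) {
            return new int[0];
        }
        int[] out = new int[n - m + 1];
        out[0] = hash(text.substring(0, m));
        for (int i = 1; i <= n - m; i++) {
            out[i] = out[i - 1] - value(text.charAt(i - 1)) + value(text.charAt(i + m - 1));
        }
        return out;
    }
}
```

For the text "aaaaab" and window length 3, `windowHashes` yields 3, 3, 3, 4, and `hash("aab")` is 4, as in the steps above. Note that `hash("aba")` is also 4: a plain sum ignores character order, so different strings can collide, which is why a hash match is always followed by a character-by-character check.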
Author: Juhi Priya