Matching Strings: An Introduction to Knuth-Morris-Pratt (KMP) Algorithm
Knuth-Morris-Pratt or KMP is a linear string matching algorithm with low time complexity of O(n+m), where n is the string length and m is the pattern length. In this article, we will what KMP algorithm is, how it works, algorithm for LPP table.
In this article, we will understand the working of the KMP string-matching algorithm. We will also solve an example using the KMP algorithm. Before starting this article, we will first understand the need for the KMP algorithm and what were the problems of the naïve string-matching algorithm.
Table of Content
- Introduction
- What is the Need for the KMP Algorithm
- What is KMP Algorithm
- Working of KMP Algorithm with an Example
- Algorithm for the LLP Table
- Step-by-Step KMP Algorithm
Introduction
KMP is a linear string-matching algorithm with low time complexity. It is the solution to the drawbacks of previous text-matching algorithms. You probably have used the advantages of the string-matching algorithm in your life. For example, while working on Microsoft applications like Microsoft Word, you might use the “Find” option to identify similar words. A pattern-matching algorithm supports the functionality of this Microsoft Word feature. In this article, we will understand the working of the KMP string-matching algorithm. We will also solve an example using the KMP algorithm.
To begin with, we should first understand why the KMP algorithm was needed and what the naive string-matching algorithms had problems with.
Know more about Data Structure and Algorithms
Must Check: Data Structure and Algorithms Online Courses and Certificates
Best-suited Data Structures and Algorithms courses for you
Learn Data Structures and Algorithms with these high-rated online courses
What is the Need for the KMP Algorithm?
To match a pattern and a string an application use string-matching algorithms. The simplest algorithm is the basic or naïve string-matching algorithm.
The basic algorithms have some drawbacks and are not effective enough to be used for string-matching applications.
- The basic string-matching algorithms consume more time.
- They use back-tracking whenever there is a mismatch in the string character and a matching pattern character.
- With the increasing length of the sample pattern and original string, it is not feasible to repeat the matching process whenever there is a mismatch.
- The repeated process consumes more memory and time.
- Their time complexity is O(n*m), where n is the length of the text or string and m is the length of the pattern.
To overcome the problems with the naïve algorithm. There was a need for a fruitful and effective string-matching algorithm. This requirement gives birth to the KMP algorithm.
What is KMP Algorithm?
KMP is the fastest string-matching algorithm. Its full form is Knuth Morris Pratt Algorithm. The KMP algorithm got this name from the name of its inventors. In 1970, James H Morris, Vaughan Pratt, and Donald Knuth invented this linear string-matching algorithm.
The KMP is the only string-matching algorithm with a time complexity of O(n+m), where n is the string length and m is the pattern length.
In string-matching algorithms, there are two terminologies: sting or text and pattern.
- String or text is the original string that is used for matching.
- The pattern is the sample text which is to be matched.
Working of KMP Algorithm with an Example
In this section, we will discuss the working of the KMP algorithm with an example. In the KMP algorithm, an index number is used to track each character of the string and pattern. A variable is used to trace the index numbers.
The matching of each character of text and pattern occurs between two strings.
There is a pie table or Longest Proper Prefix (LPP) table. For making an NPP table, there are two terms known as prefixes and suffixes in a string.
There is a pie table or Longest Proper Prefix (LPP) table. For making an NPP table, there are two terms prefixes, and suffixes in a string.
For example, in a string:
a b a b
- The prefix for the above string can be: a, ab, or aba. The prefix excludes the last string character.
- The suffix for the above string can be: b, ba, or bab. It excludes the first character of the string.
Consider an example to make an LPP table.
String = ababcb
And the LPP table is shown below:
Index value: 1 2 3 4 5
a | b | a | b | d |
0 | 0 | 1 | 2 | 0 |
Sample string
LLP values
- We check that, in the string the first “a” comes for the first time so its value is 0.
- “b” comes the first time in the string so its value is 0.
- The next character in the string is “a” it is already present with an index value of 1 so its value will be 1.
- The next character “b” is repeating and it is present at index value 2, so its value will be 2.
- String character “c” comes the first time in the string, so its value will be 0.
This is how you can make an LPP table by using the same index values for repeating characters.
Working of the Knuth Morris Pratt Algorithm
String = ababcabcab
We take two variables “i” and “j” respectively for the index values of the string and pattern. The value of variable j is 0. With every value of variable i, the next value of j will be used for comparison.
- When i = 1, j= 0.
- When i =2, j = 1.
- When i = 3, j = 2.
- When i = 4, j = 3.
- When i = 5, j = 4.
- So, the value of variable i = 5 and j at 5 denotes the value 0 as per LLP table value of character d. The value of variable i will increase and it does not start from the starting point like in a naïve string-matching algorithm.
- Now, i = 6, j = 0
- When i = 7, j = 1.
- When i = 8, j = 2.
- It is a mismatch at j = 2. Looking at its value in the LLP table. The value of j = 0 and i will increase to the next value.
- When i = 9, j = 0.
Comparing the next value with the “i” value.
- When i =10, j = 1.
In this way, we can match strings.
Algorithm for the LLP Table
- Take two variables i and j to trace the pattern.
- Initially, i <- 0, and j <- 2.
- Set LLP [0] = 0. The LLP value of the first character is always 0.
- if pattern [i] == pattern [j+1]
LLP [q] <- j+1
- if pattern [i]! = pattern[j+1]
LLP [q] <- 0
Step by Step Knuth Morris Pratt Algorithm
- Take two variables x and y to trace the index values of the string and pattern respectively.
- x <- 1, y <- 0, len <- length of the string.
- While x < string[len]
If string [x] == pattern [y + 1]
It is a match.
x <- x+1 and y <- y+1.
- else If string [x]!= pattern [y+1]
It is a mismatch.
x <- x+1 and y <- LLP[y]
Conclusion
The KMP algorithm is used in plagiarism, DNA sequencing, checking word spelling, and many more. It is the fastest string-matching algorithm with an O(m+n) time complexity. The KMP algorithm is a little harder to understand than the naïve algorithm.
Contributed By: Sonal Meenu Singh
FAQs
How does the KMP algorithm work?
The KMP algorithm works by precomputing a "failure function" or "longest prefix suffix" array for the pattern string. This array helps determine the number of characters to skip when a mismatch occurs during the matching process. By utilizing this information, the algorithm avoids rechecking previously matched characters.
What is the time complexity of the KMP algorithm?
The time complexity of the KMP algorithm is O(n + m), where n is the length of the text string and m is the length of the pattern string. It achieves linear time complexity by avoiding unnecessary character comparisons during the matching process.
Are there any limitations to the KMP algorithm?
The KMP algorithm is primarily designed for exact string matching. It does not handle approximate or fuzzy matching scenarios, where slight variations in the pattern are allowed. For approximate matching, other algorithms like the Boyer-Moore algorithm or the Rabin-Karp algorithm may be more suitable.
This is a collection of insightful articles from domain experts in the fields of Cloud Computing, DevOps, AWS, Data Science, Machine Learning, AI, and Natural Language Processing. The range of topics caters to upski... Read Full Bio