KMP Algotirhm

Introduction

Presume we have a string S (length is m) and a pattern P (length is n) given as follow:

1
2
3
index: 1  2  3  4  5  6  7  8  9  10 11 12
S: a b a b a a b a a b a c
P: a b a a b a c

Now we want to find out is there a sub-string in S that matchs P exactly, which in this case is S[6:12].

That is what KMP algorithm does.

Proper Prefix & Proper Suffix

Before learning KMP, let’s make this clear first: what is proper prefix and proper suffix?

Proper prefix: All the characters in a string, with one or more cut off the end. “S”, “Sn”, “Sna”, and “Snap” are all the proper prefixes of “Snape”.

Proper suffix: All the characters in a string, with one or more cut off the beginning. “agrid”, “grid”, “rid”, “id”, and “d” are all proper suffixes of “Hagrid”.

Partial Match Table

The key to KMP, it’s the Partial Match Table. Let’s see a example.

1
2
3
char:  | a | b | a | a | b | a | c |
index: | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 1 | 2 | 3 | 0 |

I see, there’s a table. But what do these values mean?

The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.

For example, at index 1, we have a sub-string “a”. Its proper prefix and proper suffix are both null, so it’s value is 0.

At index 2, we have a sub-string “ab”, with proper prefix “a”, and proper suffix “b”, and they do not match each other, there’s another 0.

At index 3, we have a sub-string “aba”, with proper prefix “a” and “ab”, and proper suffix “a”, “ba”. We can see that they both contains “a”, which we call “match”. And the length of “a” is 1, so its value is 1.

At index 4, we have a sub-string “abab”, proper prefix “a”, “ab”, “aba”, and proper suffix “a”, “aa”, “baa”. They both cotains “a”, which legth is 1, so there’s 1s in its value.

etc.

How to use Next Array

The next array is baes on the pattern.

What is cofusing is that on the internet, most blogs and videos, next array and PMT table are just the same thing, but in Wangdao’s text book, the PMT is different from next array.

Well, I’ll just make them be the same.

In bruce force searching, we search the S one letter by one letter, like this:

1
2
3
4
5
6
7
8
9
10
First run:
S: a b a b a a b a a b a c
P: a b a a <- mismatch
Second run:
S: a b a b a a b a a b a c
P: a <- mismatch
Third run:
S: a b a b a a b a a b a c
P: a b a a b a c <- mismatch
...

Obviously, it is very slow, and its time complexity is $O(mn)$.

Check this:

1
2
3
4
5
6
7
8
9
First run:
S: a b a |b| a a b a a b a c
P: a b a |a| <- mismatch
Second run:
S: a b a b a a b a |a| b a c
P: a b a a b a |c| <- mismatch
Third run:
S: a b a b a a b a a b a c
P: a b a a b a c <- match!

We can see that after mismatch, P is not going forward one letter by one letter., but jumps forward.

And through the next array we can know where should P jump to.

In the first run, P[4] and S[4] mismatch, and next[4 - 1] is 1 which means in the substring from P[0:4], the prefix and subfix of 1 letter are the same, so we move P and make it match the S with its proper prefix instead the subfix that mismatched.

In the second run, P[7] and S[9] mismatch, and next[7 - 1] is 3, which means the first and the last 3 letters in the substring from P[0:7] are the same, in this case is “aba”. So we move P forward till the first 3 letters take place of the last 3 letters.

etc.

Construct a Next Array

1
2
3
4
5
6
7
8
9
10
11
12
void get_next(String P, int next[]) {
int i = 1, j = 0;
next[0] = 0;
while(i < P.length) {
if(j==0 || P[i] == P[j]) {
i++; j++;
next[i] = j;
} else {
j = next[j - 1];
}
}
}

Practice

Acwing: 831.KMP字符串

References

Youtube: Abdul Bari - Knuth-Morris-Pratt KMP String Matching Algorithm

Zhihu: 阮行止 - 如何更好地理解和掌握 KMP 算法?

Blog: The Knuth-Morris-Pratt Algorithm in my own words

作者

Moodles

发布于

2021-07-08

更新于

2022-01-28

许可协议