Algorithm Question Substring Search KMP
Table of Contents
Question
LeetCode has this question Implement strStr().
Returns the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.
Tags: Two Pointers, String.
Let’s denote haystack length N
, needle length M
, character table size R
(256 for extended ASCII).
Java’s String
class method indexOf()
uses brute force algorithm O(MN).
Examples:
haystacsflksdjflkshhaystackneeneeneedle
needle
naslkfjskjlhhhh
needle
Method 1 2D DFA
KMP Knuth Morris Pratt O(N) time complexity.
2D array DFA[256][M] gives the next character index to match against i.
Example: needle “ABABAC”. DFA[r][j] where r < R, j < M.
j : 0 1 2 3 4 5 6
state: 0 0 1 2 3 0
r A B A B A C
A 1 1 3 1 5 1
B 0 2 0 4 0 4
C 0 0 0 0 0 6
D 0 0 0 0 0 0
Steps to build DFA:
- copy mismatched cases (first column all 0).
- set matched character to go to next state (first column DFA[needle.charAt(0)][0] = 1).
- start state 0, update state for j in [1, M-1].
DFA[needle.charAt(0)][0] = 1;
int state = 0;
for (int j = 1; j < M; j++) {
for (int r = 0; r < 256; r++) DFA[r][j] = DFA[r][state];
DFA[needle.charAt(j)][j] = j + 1;
state = DFA[needle.charAt(j)][state];
}
Note DFA[’D’][5] = 0 but DFA[‘B’][5] = 4.
for (int i = 0, j = 0; i < N && j < M; i++) {
//no backup, so can increment i
j = DFA[haystack.charAt(i)][j];
}
if (j == M) return i - M;
else return -1;
Method 2: 1D Restart Table Array
restart[M] restart[j - 1] gives which character in needle to match against index i in haystack when a mismatch happens at index j of needle.
A B A B A C
restart: 0 0 1 2 3 0
Compare suffix needle[j, M-1] with needle, longest common prefix’s length.
int state = 0;
for (int j = 1; j < M;) {
if (needle.charAt(j) == needle.charAt(state)) {
restart[j++] = ++state;
} else if (state == 0) j++;
else state = restart[state - 1];
}
To compare, now has to back up, cannot operate on stream like standard input without buffering.
for (int i = 0, j = 0; i < N - M && j < M;) {
if (needle.charAt(j) == haystack.charAt(i)) {
j++;
i++;
}
else if (j == 0) i++;
else j = restart[j - 1];
}
if (j == M) return i - j;
else return -1;
Question: worst case operations 2N, practical 1.1 N. Why 2N?
comments powered by Disqus