Algorithm Question Substring Search KMP

Fri, Aug 5, 2016
Category: algorithm Tags: [java] [string]

Question

LeetCode has this question Implement strStr().

Returns the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.

Tags: Two Pointers, String.

Let’s denote haystack length N, needle length M, character table size R (256 for extended ASCII).

Java’s String class method indexOf() uses brute force algorithm O(MN).

Examples:

haystacsflksdjflkshhaystackneeneeneedle
         needle
naslkfjskjlhhhh
needle

Method 1 2D DFA

KMP Knuth Morris Pratt O(N) time complexity.

2D array DFA[256][M] gives the next character index to match against i.

Example: needle “ABABAC”. DFA[r][j] where r < R, j < M.

j    :  0 1 2 3 4 5 6
state:  0 0 1 2 3 0
r       A B A B A C
A       1 1 3 1 5 1 
B       0 2 0 4 0 4
C       0 0 0 0 0 6
D       0 0 0 0 0 0

Steps to build DFA:

  1. copy mismatched cases (first column all 0).
  2. set matched character to go to next state (first column DFA[needle.charAt(0)][0] = 1).
  3. start state 0, update state for j in [1, M-1].
DFA[needle.charAt(0)][0] = 1;
int state = 0;
for (int j = 1; j < M; j++) {
  for (int r = 0; r < 256; r++) DFA[r][j] = DFA[r][state];
  DFA[needle.charAt(j)][j] = j + 1;    
  state = DFA[needle.charAt(j)][state];
}

Note DFA[’D’][5] = 0 but DFA[‘B’][5] = 4.

for (int i = 0, j = 0; i < N && j < M; i++) {
  //no backup, so can increment i
  j = DFA[haystack.charAt(i)][j];
}
if (j == M) return i - M;
else return -1;

Method 2: 1D Restart Table Array

restart[M] restart[j - 1] gives which character in needle to match against index i in haystack when a mismatch happens at index j of needle.

          A B A B A C
restart:  0 0 1 2 3 0

Compare suffix needle[j, M-1] with needle, longest common prefix’s length.

int state = 0;
for (int j = 1; j < M;) {
  if (needle.charAt(j) == needle.charAt(state)) {
    restart[j++] = ++state;
  } else if (state == 0) j++;
  else state = restart[state - 1];
}

To compare, now has to back up, cannot operate on stream like standard input without buffering.

for (int i = 0, j = 0; i < N - M && j < M;) {
  if (needle.charAt(j) == haystack.charAt(i)) {
    j++;
    i++;
  }
  else if (j == 0) i++;
  else j = restart[j - 1]; 
}
if (j == M) return i - j;
else return -1;

Question: worst case operations 2N, practical 1.1 N. Why 2N?

comments powered by Disqus