string - Python: Rabin-Karp algorithm hashing -
i implementing rabin-karp algorithm fun. came across pseudocode:
    rabin -karp -matcher (t, p, d, q)     1 n = t.length     2 m = p.length     3 h = d^(m-1) mod q     4 p=0     5 t= 0     6 = 1 m     / preprocessing     /     7 p = (dp + p [i]) mod q     8 t = (dt + t [i]) mod q     9 s = 0 n-m     / matching     /     10     if p == t     11         if p [1... m] == t [s + 1...s + m]     12             print “pattern occurs shift” s     13     if s < n-m     14         t  = (d(t-t[s + 1]h) + t [s + m + 1]) mod q i implemented in python 2.7 so:
def rabin_karp_matcher(text, pattern, d, q):     n = len(text)     m = len(pattern)     h = pow(d,m-1)%q     p = 0     t =0     result = []     in range(m): # preprocessing         p = (d*p+ord(pattern[i]))%q         t = (d*t+ord(text[i]))%q     s in range(n-m):         if p == t: # check character character             match = true             in range(m):                 if pattern[i] != text[s+i]:                     match = false                     break             if match:                 result = result + [s]         if s < n-m:                 t = (d*(t-ord(text[s+1])*h)+ord(text[s+m]))%q #index out of bounds here     return result where result list containing index in text of pattern.
my code failing find 26 in 3141592653589793 , suspect has hashcode defined line 14 of pseudocode. can please this.
i passed in following paramters:
p = "26" t = "3141592653589793" d = 257 q = 11
p , t must strings/arrays of chars
here working version of code:
def rabin_karp_matcher(text, pattern, d, q):     n = len(text)     m = len(pattern)     h = pow(d,m-1)%q     p = 0     t = 0     result = []     in range(m): # preprocessing         p = (d*p+ord(pattern[i]))%q         t = (d*t+ord(text[i]))%q     s in range(n-m+1): # note +1         if p == t: # check character character             match = true             in range(m):                 if pattern[i] != text[s+i]:                     match = false                     break             if match:                 result = result + [s]         if s < n-m:             t = (t-h*ord(text[s]))%q # remove letter s             t = (t*d+ord(text[s+m]))%q # add letter s+m             t = (t+q)%q # make sure t >= 0     return result print (rabin_karp_matcher ("3141592653589793", "26", 257, 11)) print (rabin_karp_matcher ("xxxxx", "xx", 40999999, 999999937)) the output is:
[6] [0, 1, 2, 3] on first step, check whether text[0..m] == pattern. on second step, want check whether text[1..m+1] == pattern. remove text[0] hash (at moment multiplied precomputed h): t = (t-h*ord(text[s]))%q. , then, add text[m] it: t = (t*d+ord(text[s+m]))%q. on next step, remove text[1] , add text[m+1], , on. t = (t+q)%q line there because negative number modulo q yields remainder in range (-q; 0], , want in range [0; q).
note want check total of n-m+1 substrings, not n-m, hence correct loop for s in range(n-m+1). checked second example (finding "xx" in "xxxxx").
also worth noting:
- the line - h = pow(d,m-1)%qmay slow if- mlarge. better take result modulo- qafter each of- m-2multiplications.
- this algorithm still o(nm) in worst case. - text="a"*100000,- pattern="a"*50000, find 50001 positions substring of text matches pattern, , check them character-by-character. if expect code work fast in such extreme cases, should skip character-by-character comparison , find way deal false positives (i.e., hashes equal strings not). picking large prime number- qmay reduce probability of false positive acceptable level.
Comments
Post a Comment