string - Python: Rabin-Karp algorithm hashing -
i implementing rabin-karp algorithm fun. came across pseudocode:
rabin -karp -matcher (t, p, d, q) 1 n = t.length 2 m = p.length 3 h = d^(m-1) mod q 4 p=0 5 t= 0 6 = 1 m / preprocessing / 7 p = (dp + p [i]) mod q 8 t = (dt + t [i]) mod q 9 s = 0 n-m / matching / 10 if p == t 11 if p [1... m] == t [s + 1...s + m] 12 print “pattern occurs shift” s 13 if s < n-m 14 t = (d(t-t[s + 1]h) + t [s + m + 1]) mod q
i implemented in python 2.7 so:
def rabin_karp_matcher(text, pattern, d, q): n = len(text) m = len(pattern) h = pow(d,m-1)%q p = 0 t =0 result = [] in range(m): # preprocessing p = (d*p+ord(pattern[i]))%q t = (d*t+ord(text[i]))%q s in range(n-m): if p == t: # check character character match = true in range(m): if pattern[i] != text[s+i]: match = false break if match: result = result + [s] if s < n-m: t = (d*(t-ord(text[s+1])*h)+ord(text[s+m]))%q #index out of bounds here return result
where result list containing index in text of pattern.
my code failing find 26 in 3141592653589793 , suspect has hashcode defined line 14 of pseudocode. can please this.
i passed in following paramters:
p = "26" t = "3141592653589793" d = 257 q = 11
p , t must strings/arrays of chars
here working version of code:
def rabin_karp_matcher(text, pattern, d, q): n = len(text) m = len(pattern) h = pow(d,m-1)%q p = 0 t = 0 result = [] in range(m): # preprocessing p = (d*p+ord(pattern[i]))%q t = (d*t+ord(text[i]))%q s in range(n-m+1): # note +1 if p == t: # check character character match = true in range(m): if pattern[i] != text[s+i]: match = false break if match: result = result + [s] if s < n-m: t = (t-h*ord(text[s]))%q # remove letter s t = (t*d+ord(text[s+m]))%q # add letter s+m t = (t+q)%q # make sure t >= 0 return result print (rabin_karp_matcher ("3141592653589793", "26", 257, 11)) print (rabin_karp_matcher ("xxxxx", "xx", 40999999, 999999937))
the output is:
[6] [0, 1, 2, 3]
on first step, check whether text[0..m] == pattern
. on second step, want check whether text[1..m+1] == pattern
. remove text[0]
hash (at moment multiplied precomputed h
): t = (t-h*ord(text[s]))%q
. , then, add text[m]
it: t = (t*d+ord(text[s+m]))%q
. on next step, remove text[1]
, add text[m+1]
, , on. t = (t+q)%q
line there because negative number modulo q
yields remainder in range (-q; 0]
, , want in range [0; q)
.
note want check total of n-m+1
substrings, not n-m
, hence correct loop for s in range(n-m+1)
. checked second example (finding "xx" in "xxxxx").
also worth noting:
the line
h = pow(d,m-1)%q
may slow ifm
large. better take result moduloq
after each ofm-2
multiplications.this algorithm still o(nm) in worst case.
text="a"*100000
,pattern="a"*50000
, find 50001 positions substring of text matches pattern, , check them character-by-character. if expect code work fast in such extreme cases, should skip character-by-character comparison , find way deal false positives (i.e., hashes equal strings not). picking large prime numberq
may reduce probability of false positive acceptable level.
Comments
Post a Comment