string - Python: Rabin-Karp algorithm hashing -
i implementing rabin-karp algorithm fun. came across pseudocode:
rabin -karp -matcher (t, p, d, q) 1 n = t.length 2 m = p.length 3 h = d^(m-1) mod q 4 p=0 5 t= 0 6 = 1 m / preprocessing / 7 p = (dp + p [i]) mod q 8 t = (dt + t [i]) mod q 9 s = 0 n-m / matching / 10 if p == t 11 if p [1... m] == t [s + 1...s + m] 12 print “pattern occurs shift” s 13 if s < n-m 14 t = (d(t-t[s + 1]h) + t [s + m + 1]) mod q i implemented in python 2.7 so:
def rabin_karp_matcher(text, pattern, d, q): n = len(text) m = len(pattern) h = pow(d,m-1)%q p = 0 t =0 result = [] in range(m): # preprocessing p = (d*p+ord(pattern[i]))%q t = (d*t+ord(text[i]))%q s in range(n-m): if p == t: # check character character match = true in range(m): if pattern[i] != text[s+i]: match = false break if match: result = result + [s] if s < n-m: t = (d*(t-ord(text[s+1])*h)+ord(text[s+m]))%q #index out of bounds here return result where result list containing index in text of pattern.
my code failing find 26 in 3141592653589793 , suspect has hashcode defined line 14 of pseudocode. can please this.
i passed in following paramters:
p = "26" t = "3141592653589793" d = 257 q = 11
p , t must strings/arrays of chars
here working version of code:
def rabin_karp_matcher(text, pattern, d, q): n = len(text) m = len(pattern) h = pow(d,m-1)%q p = 0 t = 0 result = [] in range(m): # preprocessing p = (d*p+ord(pattern[i]))%q t = (d*t+ord(text[i]))%q s in range(n-m+1): # note +1 if p == t: # check character character match = true in range(m): if pattern[i] != text[s+i]: match = false break if match: result = result + [s] if s < n-m: t = (t-h*ord(text[s]))%q # remove letter s t = (t*d+ord(text[s+m]))%q # add letter s+m t = (t+q)%q # make sure t >= 0 return result print (rabin_karp_matcher ("3141592653589793", "26", 257, 11)) print (rabin_karp_matcher ("xxxxx", "xx", 40999999, 999999937)) the output is:
[6] [0, 1, 2, 3] on first step, check whether text[0..m] == pattern. on second step, want check whether text[1..m+1] == pattern. remove text[0] hash (at moment multiplied precomputed h): t = (t-h*ord(text[s]))%q. , then, add text[m] it: t = (t*d+ord(text[s+m]))%q. on next step, remove text[1] , add text[m+1], , on. t = (t+q)%q line there because negative number modulo q yields remainder in range (-q; 0], , want in range [0; q).
note want check total of n-m+1 substrings, not n-m, hence correct loop for s in range(n-m+1). checked second example (finding "xx" in "xxxxx").
also worth noting:
the line
h = pow(d,m-1)%qmay slow ifmlarge. better take result moduloqafter each ofm-2multiplications.this algorithm still o(nm) in worst case.
text="a"*100000,pattern="a"*50000, find 50001 positions substring of text matches pattern, , check them character-by-character. if expect code work fast in such extreme cases, should skip character-by-character comparison , find way deal false positives (i.e., hashes equal strings not). picking large prime numberqmay reduce probability of false positive acceptable level.
Comments
Post a Comment