Tuesday, March 4, 2014

In today's post we discuss RabinKarp algorithm for string matching based on mecr lectures. We match a string pattern P  of length m in a target string T of length n. The pattern matches O(mn) times when we go character by character in the target string. This can be improved with two good ideas:
one is that on a mismatch we try to skip ahead and we can skip upto m characters. So we have O(m) extra space and O(m + n)  worst case time. This is still an improvement.
second is that we do approximate matches where if there is a mismatch, we know for sure but if there is a match, it may or may not be so. This way we can skip ahead and definitive mis-matches.
We implement the approximation for instance with a hash match. We take strings at index 0 to n - m and hash m characters to see if the hash matches that of the pattern.
If a portion of the string matches, then their portion of hash should match. But we look at the opposite  way here. If the hash of the match is different then we know that the string is not the same as the pattern and we can skip ahead. So while we do a little bit more work in the hashing and consecutively our worst case is more than the first approach but in most cases our hash function will work very well. We just have a hash function that matches our needs which are:
1) we should pre-compute a hash of the pattern
2) we compute hash of string at index i to i+m-1for different i
3) the hash function should be such that
h(T[i, i+m-1]) = g(T[i]) "+" g(T[i+1]) "+" .... g(T[i+m-1])
where + allows us to compute
h(T[i+1, i+m]) = g(T[i+1]) + g(T[i +2)) + .... g(T[i+m])
with these properties it takes a constant time to compute
h(T[i+1, i+m]) from h(T[i, i+m-1])
In addition if this hash function has very few collisions, then we have a fast running time of O(n+m)
As an example, we can take a hash function h(string.letter) as 4h(string) + h(letter)
if a string is made up of DNA letters ACGT
h(ACT) = 4h(AC) + h(T) and so on.


No comments:

Post a Comment