Wednesday, November 27, 2013

There are some interesting properties of Pearsons' hashing function we brought up in the previous post. We will discuss some variations now:
1) When compared with the hashing that updates the hash in each step as h[I] = h[I-1] + c[I] mod table_size, the function does not separate anagrams and demonstrates a significant deviation from uniformity.
2) That said, the non-uniform distribution of the additive hashing function above does not necessarily confer a large performance penalty.
3) If the range of the input characters can be limited, a smaller auxiliary table may be used. i.e alphanumeric input
4) In some cases, the hash values may be required to be greater than the eight-bit representations To increase it to sixteen bit values, we could do the following:
apply h to the string, called the result H1, 
add 1 modulo 256 to the first character of the string,
apply h to the modified string to get H2,
concatenate H1 and H2 to get a  16-bit index.
Note that when this hashing is applied to the same spell-checker words as the original hash function there was marginal improvement collisions indicating that both perform just about the same.
5) Collision handling can be improved by using a succession of hash indices. The original hashing function is well suited for this. By repeatedly incrementing the first character of the input string, modulo 256, one causes the hash index returned by h to pass through all 256 possible hash index values since strings differing by one character cannot produce the same hashing value.
6) This hashing algorithm lends itself to perfect and minimal hashing. Recall that perfect hashing is one in which there are no collisions and minimal hashing is one in which the integers onto which the particular list of words mapped form a contiguous set. Minimal perfect hashing is useful when a predetermined set of high frequency words are expected as input and the resulting hash has to be used as an index into an array relating those words. We can use this hashing algorithm to take a set of 31 common English words and map them onto unique integers between 1 and 31, in alphabetical order by using a special case of the table T.
7) The advantages of Pearson's hashing function include the following:
there is no restriction on the length of the string
the length does not need to be computed beforehand
there is very little computation for each character of the input being hashed
similar string are not likely to collide
 The disadvantages are that the output value ranges for the hash are in powers of 2 and to get others is more complicated  and more auxiliary memory is required due to the table used.

No comments:

Post a Comment