Tuesday, November 26, 2013

Pearson's function for fast hashing of variable-length text strings is an interesting hashing technique.
It proposes a hashing function that takes as input a word W of n characters C1, C2, ... Cn, each character represented by one byte, and returns an integer in the range 0 to 255. A table T of 256 randomish bytes is used in the process, both for an indexed memory read and to perform an exclusive-or against the input. The length of the string does not need to be known in advance, because the loop simply runs until it reaches the null terminator at the end of the string.
The hash is computed as follows:
h[0] = 0;
for (int i = 1; i <= n; i++)
  h[i] = T[ h[i-1] xor C[i] ];
return h[n];
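A minimal C++ sketch of the same loop might look like this; the table construction is an assumption (Pearson's scheme only requires T to be some permutation of the 256 byte values), and pearson_hash is a name picked for illustration:

#include <algorithm>
#include <array>
#include <cstdint>
#include <numeric>
#include <random>
#include <string>

// Build a table T holding a permutation of 0..255. Any permutation will do;
// this one is produced by shuffling with a fixed seed so results are repeatable.
static std::array<std::uint8_t, 256> make_table()
{
    std::array<std::uint8_t, 256> t{};
    std::iota(t.begin(), t.end(), 0);
    std::mt19937 gen(2013);
    std::shuffle(t.begin(), t.end(), gen);
    return t;
}

static const std::array<std::uint8_t, 256> T = make_table();

// Pearson hash: h starts at 0 and each character is folded in through T.
std::uint8_t pearson_hash(const std::string& s)
{
    std::uint8_t h = 0;
    for (unsigned char c : s)
        h = T[h ^ c];
    return h;
}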
There are two desirable properties of this algorithm that come from cryptographic checksums and message authentication codes. First, small changes to the data result in large and seemingly random changes to the checksum. Second, the effect of changing one part of the data must not be cancelled by an easily computed change to some other part, and this algorithm preserves that property as well.
The downside is that the function is length-reducing and built on exclusive-ORs of substrings. Which particular random T is used has no noticeable effect; on the other hand, manipulating T to intentionally disperse the results for short, similar strings turned out to be a poor choice.
h can be evaluated on the following points:
h[n] behaves like a random value
if two input strings differ by a single bit, and therefore in a single character, they will never collide: since T is a permutation, the difference introduced at that position propagates through every remaining step (see the short check after this list)
h was applied to a spelling-checker dictionary of 26,662 words and the resulting output values were spread reasonably evenly
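As a quick illustration of the no-collision point, the pearson_hash sketch above can be run on two strings that differ in a single character (this snippet assumes that sketch, including its table T, is already in scope):

#include <cassert>
#include <iostream>

int main()
{
    // "hello" and "hallo" differ in exactly one character, so the
    // permutation property of T guarantees two different hash values.
    std::uint8_t a = pearson_hash("hello");
    std::uint8_t b = pearson_hash("hallo");
    assert(a != b);
    std::cout << int(a) << " " << int(b) << "\n";
    return 0;
}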
We may recall the textbook example for hashing:
template<class T> size_t Hash<T>::operator()(const T& key) const
{
   size_t res = 0;
   size_t len = sizeof(T);                               // hash every byte of the object
   const char* p = reinterpret_cast<const char*>(&key);  // view the key as raw bytes
   while (len--) res = (res << 1) ^ *p++;                // shift and fold each byte in
   return res;
}
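The operator above would normally sit inside a small function-object class template; the wrapper and the call below are just one assumed way to round the snippet out into a complete program:

#include <cstddef>
#include <iostream>

// Assumed enclosing class template for the out-of-class definition quoted above.
template<class T> struct Hash
{
   size_t operator()(const T& key) const;
};

template<class T> size_t Hash<T>::operator()(const T& key) const
{
   size_t res = 0;
   size_t len = sizeof(T);
   const char* p = reinterpret_cast<const char*>(&key);
   while (len--) res = (res << 1) ^ *p++;   // fold each raw byte into the result
   return res;
}

int main()
{
   int n = 1990;
   std::cout << Hash<int>()(n) << "\n";   // hashes the sizeof(int) raw bytes of n
   return 0;
}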
