Sunday, March 28, 2021

Hashes

 A play on hash

The idea behind hashing is that an entity’s representation can be mapped to a value in a set with the nice side effect that those values have a uniform random distribution even when the entity’s representations do not. This allows us to count the number of distinct elements by finding prefixes of leading zeros in a binary representation of length sufficient to capture a number of possible values. For example, if k is the length of the longest sequence of leading zeros in a binary representation of length n, then Math.pow(10, k) represents the number of unique elements because on average k zeros will occur once every those many elements. We size the count of distinct elements with the help of the prefix. We ignore the bias and the outliers because they are known and can be corrected by a harmonic mean and constant correction factor respectively.

If we use m hash functions generating m hashes for the same entity’s representation and each map to a bit position starting from an all zero-bit array, then the presence of even a single zero at the hashing of an entity’s representation immediately discounts that entity as a member of the set.

If we could go a step further to find all bit positions where the hashes come out to be zero, then we can identify at least one unique hash in Math.pow(2,m) hashes, where the corresponding element is not part of the set by inverting those positions to 1 bit and keeping the others as 0 bit.


No comments:

Post a Comment