Cluster computing

Saturday, March 27, 2021

The difference between a bloom filter and HyperLogLog:

A bloom filter is used when we must determine whether a member is definitely not in a set. It can return more than what should be in a set but this is tolerated when we need to know that the member is not part of the set. We will see how we can put this to use but let us first review this data structure. Members can be added to the dataset but not removed. Instead of having a large hash table that does not fit into memory, it works with just enough memory and an error-free hash that provides uniform random distribution. An empty bloom filter is a bit array of m bits that is all set to zero. There are k hash functions that hash some of the members to one of the m array positions. m is proportional to k. When a member is added, each of the k hash functions maps it to their corresponding positions, setting those bits to 1. In the test for whether a member is in a set, the k hash functions are re-evaluated. If any of the hash functions return 0, the member is definitely not in the set. If all the positions are set to 1, then either the member is part of the set or the bits have been set for some other member when it was inserted. The choice of error-free hash functions must be such that as m and k increase, there is little or no correlation between different bit-fields of such a hash. In some cases, a single hash function with k offsets to initial values suffices. In the large data sets, some examples of such hash functions include the enhanced double hashing or triple hashing functions which tend to use random numbers based on two or three hash-values. The removal of values is discouraged by the hashing functions because it is not easy to tell which positions are to be undone.

In key-value storage, bloom filters are helpful to determine whether a key exists. Values can exist on disk but the bloom filter itself can exist in memory.

A HyperLogLog is used to determine the number of distinct members in a multi-set. Determining the exact number of members in a multiset takes a lot of memory. Instead, this algorithm uses significantly less memory to determine an approximate number of members. A membership of more than a billion can be determined with just about a couple of Kilobyte of memory. This works by assigning a number to each of the members, converting the numbers to binary, and then determining the maximum number of leading zeros. If there are n bits to represent a number, then there are Math.pow(2,n) number of members in a superset of all members. A hash function can be applied to each member in the original multiset to get transformed and uniformly distributed random numbers with just as many of them as there were in the original multiset. There is a large variance possible in this estimation but it can be reduced by splitting the multiset into numerous subsets, estimating the number of leading zeros for each subset, and using a harmonic mean to combine these estimates. Variance measures how far a set of numbers is spread from its mean value and a harmonic mean is a special kind of mean that can be applied to rates.

Cluster computing

Saturday, March 27, 2021

No comments:

Post a Comment