Cluster computing

Tuesday, September 3, 2013

In this post, I'm going to talk about hash tables. Hash tables are popular because looking up a data record takes constant time on average. A hash function is used to generate a fixed-length representation of the actual data record. Hash functions can make a table lookup faster and can help detect duplicate or similar records. Two objects that are equal always have equal hash codes; two objects with equal hash codes are only candidates for being equal and still have to be compared. Hashes don't retain the original data.
In a hash table, the hash function maps the search key to an index, which gives the place where the data is stored. The index only identifies a bucket, since the range of key values is typically much larger than the range of hash values. Each bucket therefore corresponds to a set of records.
Duplicate records are found by going through the set of records in the same bucket. This scan is necessary because hashing only guarantees that identical records end up in the same bucket; distinct records can land there too. When two distinct records share a bucket, that is called a collision. The fewer collisions a hash function produces, the better the table performs.
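To make the bucket idea concrete, here is a minimal sketch of a hash table that keeps a list of records per bucket. The class name ChainedHashTable and its methods are made up for this post, and the demo assumes CPython, where small integers hash to themselves.

```python
class ChainedHashTable:
    """Toy hash table: each bucket is a list of (key, value) records."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # Map the key's hash code onto one of the available buckets.
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        """Store the record; return False if the key was already present."""
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # duplicate key: overwrite in place
                return False
        bucket.append((key, value))
        return True

    def lookup(self, key):
        # Only the records in one bucket are scanned, not the whole table.
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return None


table = ChainedHashTable(num_buckets=4)
table.insert(1, "a")
table.insert(5, "b")  # assumption: hash(1) == 1 and hash(5) == 5, so both land in bucket 1
```

Keys 1 and 5 collide here, so the lookup has to compare keys inside the bucket to tell them apart.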
Incidentally, geometric hashing can be used to find points that are close together in a plane or in three-dimensional space. Here the hash function is interpreted as a partition of that space into a grid of cells. The return value of the function is typically a tuple of two or more indices, one per dimension of the space.
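A rough sketch of the grid idea for points in a plane follows. The function names cell_of, build_grid and neighbours are hypothetical, and the search assumes the radius is no larger than the cell size so that checking the surrounding 3x3 block of cells is enough.

```python
import math
from collections import defaultdict


def cell_of(point, cell_size):
    """Hash a 2-D point to the (column, row) of the grid cell containing it."""
    x, y = point
    return (math.floor(x / cell_size), math.floor(y / cell_size))


def build_grid(points, cell_size):
    grid = defaultdict(list)
    for p in points:
        grid[cell_of(p, cell_size)].append(p)
    return grid


def neighbours(grid, point, cell_size, radius):
    """Return stored points within `radius` of `point` (assumes radius <= cell_size)."""
    cx, cy = cell_of(point, cell_size)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            # .get avoids creating empty cells in the defaultdict
            for q in grid.get((cx + dx, cy + dy), []):
                if q != point and math.dist(point, q) <= radius:
                    found.append(q)
    return found


points = [(0.1, 0.1), (0.9, 0.2), (1.1, 0.2), (5.0, 5.0)]
grid = build_grid(points, cell_size=1.0)
near = neighbours(grid, (0.9, 0.2), cell_size=1.0, radius=0.5)
```

Note that (0.9, 0.2) and (1.1, 0.2) sit in different cells even though they are close, which is why the neighbouring cells are scanned as well.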
Universal hashing is a scheme in which there is a family of hash functions to choose from, and a function is picked at random such that any two distinct keys collide with probability at most 1/n, where n is the number of distinct hash values desired. However, for a particular data set it could have more collisions than a special-purpose hash function.
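One well-known universal family for integer keys is h(k) = ((a*k + b) mod p) mod n, with p a prime larger than any key and a, b chosen at random. The sketch below uses that family as an illustration; the names make_universal_hash and collision_rate are made up, and keys are assumed to be integers in the range 0 to p - 1.

```python
import random

P = 2_147_483_647  # the prime 2**31 - 1; keys are assumed to lie in [0, P)


def make_universal_hash(num_buckets, rng=random):
    """Pick one function at random from the ((a*k + b) mod P) mod n family."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)

    def h(key):
        return ((a * key + b) % P) % num_buckets

    return h


def collision_rate(x, y, num_buckets, trials=20000, seed=0):
    """Estimate how often two fixed, distinct keys collide over random picks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_universal_hash(num_buckets, rng)
        if h(x) == h(y):
            hits += 1
    return hits / trials


rate = collision_rate(12345, 67890, num_buckets=16)
```

With 16 buckets the estimated rate should come out close to 1/16. The guarantee is over the random choice of function, not over the keys.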
Hash functions are also used in cryptography, not to encrypt data but to produce a compact fingerprint of it that guards against tampering. If the data were modified, it would be hard to make the modified data produce the same hash.
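As a small illustration of tamper detection, the sketch below uses SHA-256 from Python's standard hashlib module. The helper names digest and is_unmodified and the sample message are made up, and the check is only meaningful if the published digest itself comes from a trusted source.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    """Fixed-length fingerprint of the data (64 hex characters for SHA-256)."""
    return hashlib.sha256(data).hexdigest()


def is_unmodified(data: bytes, expected_digest: str) -> bool:
    # compare_digest does a constant-time comparison of the two hex strings
    return hmac.compare_digest(digest(data), expected_digest)


original = b"transfer 100 to alice"
published = digest(original)  # assumption: stored somewhere an attacker cannot change

tampered = b"transfer 900 to alice"
```

Here is_unmodified(original, published) holds, while the tampered message fails the check even though only one character changed.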

Cryptographic hash functions such as SHA-1 map inputs very evenly across the entire range of hash values. Therefore, they can serve as good general-purpose hash functions, although they are slower to compute than hash functions designed for table lookup. Note that the requirement is a uniform spread, not randomness. A good randomizing function is generally a good choice of hash function, but a hash function does not have to be random to spread keys uniformly.
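To see the even spread, one can reduce SHA-1 digests to a small number of buckets and count how many keys land in each. This is only a sketch: the helper name sha1_bucket and the "user-N" keys are invented for the example.

```python
import hashlib
from collections import Counter


def sha1_bucket(key: str, num_buckets: int) -> int:
    """Use the SHA-1 digest of the key as a (slow but well-spread) table hash."""
    raw = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(raw, "big") % num_buckets


# Count how many of 8000 made-up keys fall into each of 8 buckets.
counts = Counter(sha1_bucket(f"user-{i}", 8) for i in range(8000))
```

Each bucket should hold roughly 1000 keys, even though the keys themselves are highly regular.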

Hash functions have to be deterministic. Given the same key, they should produce the same hash every time. This does not rule out keys derived from things that change, such as the time of day or a memory address, as long as the hash depends only on the key's current value. Whenever a key changes, the record can generally be rehashed under its new key.
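A minimal sketch of rehashing after a key change, using a plain Python dict as the hash table. The helper names bucket_of and rekey and the sample reading are hypothetical. One caveat: Python's built-in hash() for strings is deterministic within a single process but may differ between runs.

```python
def bucket_of(key, num_buckets=8):
    # Same key, same bucket, for as long as this process runs.
    return hash(key) % num_buckets


def rekey(table, old_key, new_key):
    """Rehash a record whose key changed: remove it, then insert under the new key."""
    table[new_key] = table.pop(old_key)


readings = {"09:00": 21.5}          # keyed by time of day
rekey(readings, "09:00", "09:05")   # the key changed, so the record is rehashed
```

Removing and reinserting is what keeps the record findable: the table only ever sees a key whose hash matches the bucket it is stored in.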
