Cluster computing

There are three types of joins : Nested Loop Joins, Hash Joins and Sort-Merge joins.
In a nested loop join, one of the tables is scanned and for every row, a look up is performed on the second table and the matching rows are retrieved. The tables can be labeled as outer and inner tables. The inner table lookup can be fast based on the presence of indexes. Let us take two tables employees and contractors, employees has 100,000 rows and contractors are 40,000 rows.
Then a table scan on the contractors together with a clustered index seek on employees is an example of a nested join.
The scan count of the employees table is 100,000 and that of the contractors is 1 for a join on id.
In a hash join, a hash table is created in memory for one of the inputs called the build input and searched for matching values from the other input, the probe input. If the build input does not fit in memory, it is partitioned recursively until it fits in memory. The probe input is hashed and used for matches. Hash joins are better when there is a significant difference between the size of the tables.
In a hash join of the example tables, the contractors table is used as the build input and the employees table is used with a clustered index scan.
The scan count of the employees table is 1 and that of the contractors is also 1.
In a sort merge join, the tuples of one input is examined, then the second sorted input is then scanned until a row with a match is found and then this is repeated for every row of the first input. Here the larger input dominates the cost. In a merge join, the inputs are sorted on the join column and the sorted pair is merged into one, eliminating the column values that do not match from the sorted result set.
This type of join requires both inputs to be in memory as opposed to hash join, where only a smaller input is in memory. The merge join is more efficient if the inputs are already sorted. Note that inequality joins are not handled by this method.
For the given example, the sorting is done with the ids of the contractor table and it is scanned together with clustered index scan of the employees table.
The scan count for the employees table is 1 and that for the contractor table is also 1.

Cluster computing

Monday, January 20, 2014

No comments:

Post a Comment