Wednesday, May 21, 2014

Tonight we will look at a data structure called the trie.
A trie is an ordered tree data structure that can store a dynamic set or an associative array where the keys are usually strings. We search by prefixes, and hence tries are also referred to as prefix trees. Unlike a binary search tree, a node in a trie does not store its key directly; instead, the key is determined by the node's position in the tree, which spells out the prefix. All the descendants of a node share a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes. This lets a trie be stored in a space-optimized form, referred to as a compact prefix tree or radix tree.
The term trie comes from retrieval. Because searches follow prefixes, we can view the trie as a deterministic finite automaton without loops: transitions from one node to another are forward only.
While characters can be used to represent the keys, this is not always necessary. Keys can instead be stored as bits, in which case we have a bitwise trie. Fixed-size keys such as integers or memory addresses are convenient to store this way.
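As a rough illustration, here is a minimal bitwise trie sketch in C++, assuming unsigned 32-bit keys and branching on one bit per level, most significant bit first (InsertBits and ContainsBits are illustrative names, not from the original post):

#include <cstdint>
#include <memory>

// Bitwise trie sketch: the radix is 2, so each node has only two children,
// and we branch on one bit of the key per level.
struct BitNode {
    std::unique_ptr<BitNode> child[2];
    bool isKey = false;   // true if a key ends at this node
};

void InsertBits(BitNode* node, std::uint32_t key) {
    for (int b = 31; b >= 0; b--) {
        int i = (key >> b) & 1;               // next bit, MSB first
        if (!node->child[i])
            node->child[i] = std::make_unique<BitNode>();
        node = node->child[i].get();
    }
    node->isKey = true;
}

bool ContainsBits(const BitNode* node, std::uint32_t key) {
    for (int b = 31; b >= 0; b--) {
        int i = (key >> b) & 1;
        if (!node->child[i]) return false;    // null pointer: key absent
        node = node->child[i].get();
    }
    return node->isKey;
}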
A trie is very often used as a replacement for a hash table, and it has several advantages. Lookup is prefix based: we traverse down from the root in O(m) time, where m is the length of the key, and this bound holds even in the worst case. Insertions and deletions are also cheap.
There is no chaining, and consequently there are no collisions.
This comes in very useful for applications such as autocomplete and spell checking.
Tries can provide automatic ordering of keys, which comes in useful for autocomplete and predictive text. They are also useful for approximate string matching.
Radix sort is a good example. Tries are to radix sort what binary search trees are to quicksort.
In a trie over an alphabet of size r (the radix), each node has an array of r child pointers. When we store words in the tree, we keep the letters on the edges and read a word back by walking down the tree.
When we insert, we terminate each word with a $ marker, because some strings are prefixes of others. To do a lookup, we just walk down the tree letter by letter and then check whether the final node carries the $ that ends the word. If we reach a null pointer, we know that the key is not there.
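Here is a minimal sketch of such a trie in C++, assuming lowercase a-z keys (so r = 26) and using a boolean flag on each node in place of the $ end-of-word marker:

#include <array>
#include <memory>
#include <string>

// Minimal trie sketch: r = 26 child pointers per node (lowercase a-z).
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child;
    bool isWord = false;   // plays the role of the $ end-of-word marker
};

void Insert(TrieNode* root, const std::string& key) {
    TrieNode* node = root;
    for (char c : key) {
        int i = c - 'a';
        if (!node->child[i])
            node->child[i] = std::make_unique<TrieNode>();
        node = node->child[i].get();
    }
    node->isWord = true;   // mark that a key ends here ($)
}

// Walk down letter by letter; a null pointer means the key is absent.
// Both operations take O(m) time for a key of length m.
bool Search(const TrieNode* root, const std::string& key) {
    const TrieNode* node = root;
    for (char c : key) {
        int i = c - 'a';
        if (!node->child[i]) return false;
        node = node->child[i].get();
    }
    return node->isWord;
}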
There are a couple of ways of making tries more practical, and they mainly optimize storage: we can compress paths that don't branch into a single edge, and only split an edge when a new key diverges partway along it, as in the sketch below.
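A sketch of what a compressed (radix) trie node and its lookup might look like in C++; the edge-splitting step on insert is omitted, and the names are illustrative:

#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Compressed (radix) trie sketch: a non-branching chain of nodes collapses
// into one edge whose label is the whole substring.
struct RadixNode {
    std::map<std::string, std::unique_ptr<RadixNode>> edges;  // label -> child
    bool isWord = false;
};

// Lookup consumes a whole edge label at a time instead of one character.
bool Lookup(const RadixNode* node, const std::string& key, std::size_t pos = 0) {
    if (pos == key.size()) return node->isWord;
    for (const auto& [label, child] : node->edges)
        if (key.compare(pos, label.size(), label) == 0)
            return Lookup(child.get(), key, pos + label.size());
    return false;   // no edge matches the rest of the key
}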
Radix sort comes in two main flavors: most significant digit/character first (MSD) and least significant digit/character first (LSD).
In the most-significant-first (MSD) radix sort, we sort by the most significant digit/character to get the buckets. Then we recursively sort each bucket, starting with the next most significant digit/character, and so on.
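A recursive MSD sketch in C++ for ASCII strings (MsdSort and CharAt are illustrative names; bucket 0 collects strings that are exhausted at position d):

#include <cstddef>
#include <string>
#include <vector>

// Character at position d, or -1 if the string is shorter than d + 1.
static int CharAt(const std::string& s, std::size_t d) {
    return d < s.size() ? (unsigned char)s[d] : -1;
}

// Bucket by the character at position d, then recursively sort each
// non-empty bucket on position d + 1 and concatenate in bucket order.
void MsdSort(std::vector<std::string>& a, std::size_t d = 0) {
    if (a.size() <= 1) return;
    std::vector<std::vector<std::string>> buckets(257);  // chars -1..255
    for (auto& s : a)
        buckets[CharAt(s, d) + 1].push_back(std::move(s));
    a.clear();
    for (std::size_t b = 0; b < buckets.size(); b++) {
        if (b > 0) MsdSort(buckets[b], d + 1);  // bucket 0 is already done
        for (auto& s : buckets[b])
            a.push_back(std::move(s));
    }
}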
In the least-significant-first (LSD) radix sort, we sort by the least significant digit/character to get the buckets, then concatenate them and repeat with the next digit. This method is trickier to follow, but it's easy to implement. The reason we can go from least significant onwards is that each pass is stable: it preserves the relative order established by the previous pass. Correctness can be proved by induction on the number of passes.
void InsertionSort(int A[], int n, int d);
int Digit(int x, int d);

// LSD radix sort: one stable pass per decimal digit, least significant first.
void RadixSort(int A[], int n, int digits)
{
    for (int d = 1; d <= digits; d++)
        InsertionSort(A, n, d);
}

// Insertion sort keyed on digit d only; insertion sort is stable, so ties
// keep the order established by the earlier, less significant passes.
void InsertionSort(int A[], int n, int d)
{
    for (int i = 1; i < n; i++)
    {
        int x = A[i];
        int j = i - 1;
        while (j >= 0 && Digit(A[j], d) > Digit(x, d))
        {
            A[j + 1] = A[j];
            j = j - 1;
        }
        A[j + 1] = x;
    }
}

// Helper (an added assumption, not in the original post): the d-th least
// significant decimal digit of x, with d = 1 for the ones digit; x >= 0.
int Digit(int x, int d)
{
    while (--d > 0)
        x /= 10;
    return x % 10;
}
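A quick usage sketch, assuming the functions above are in scope and the keys have at most three decimal digits:

#include <cstdio>

int main()
{
    int a[] = {329, 457, 657, 839, 436, 720, 355};
    int n = sizeof a / sizeof a[0];
    RadixSort(a, n, 3);   // three passes: ones, tens, hundreds
    for (int i = 0; i < n; i++)
        std::printf("%d ", a[i]);   // prints: 329 355 436 457 657 720 839
    return 0;
}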
