Cluster computing

Tuesday, April 16, 2013

Chapter 24 DB systems textbook

To build a decision tree:
Input : node n, parition D, split selection method S
Output: decision tree for D rooted at node n

Top-Down decision tree induction schema:
BuildTree(Node n, data partition D, split selection method S)
   Apply S to D to find the splitting criterion
   if ( a good splitting criterion is found)
       Create two children nodes n1 and n2 of n
       Partition D into D1 and D2
       BuildTree(n1, D1, S)
       BuildTree(n2, D2, S)
   endif

Clustering:
Input : node n, partition D, split selection method S
Output: decision tree for D rooted at node n

Top-Down decision Tree induction Schema:
BuildTree(Node n, data partition D, split selection method S)
(1a) Make a scan over D and construct the AttributeValue Classlabels (AVC) group of n in-memory.
(1b) Apply S to the AVC group to find the splitting criterion

A partitioning clustering algorithm partitions the data into k groups such that some criterion that evaluates the clustering quality is optimized. k is specified by the user. A hierarchical clustering algorithm generates a sequence of partitions of the records. Starting with a partition in which each cluster consists of one single record, the algorithm merges two partitions in each step until one single partition remains in the end.

Monday, April 15, 2013

keyword clustering

Wartena 08 describes topic detection by clustering methods. The paper suggests that the topic for the whole text can be found based on a probabilistic latent semantic analysis. Keywords are clustered based on statistical similarity measure. The conditional probability distributions that indicate the latent topic are the clusters that this approach finds. Clustering is based on a hierarchical top down method which requires that the center is recomputed in each step. Two elements are selected that have the largest distance and are used as seeds for the clustering. Next all other items are assigned to the center closest to one of the two seeds. After all the items are assigned, the centers are recomputed. The new centers serve as new seeds and the process is repeated until two centers are converged. If the clusters are larger than a threshold, the whole procedure is repeated on that cluster and thus we find a binary tree of clusters. The choice of distance functions used in this paper for two terms t and s are the cosine similarity of the document distribution, the cosine similarity of vector tf, idf values of keywords, the Jensen Shannon distributions between the document distributions and the Jensen Shannon distributions between the term distributions pt and ps.
When clustering using any distance function, the cluster centers could be chosen among the data points and then centers recomputed in each step.

Saturday, April 13, 2013

interview question answers continued

Q: How do you update the game of a black and white coin game such as Othello ?
A: In the update mode, find the longest sequence of the coin opposite of the color to be played. If this sequence has options to be bounded, place the coins at one of the ends. Choice of ends can be based on the side that has odd number of open spaces. The set of sequence can be based on the state of the game. and can generally be identified by expanding out from the center.

Q: Given the root of a binary search tree along with a node inside it, find the successor node for the given node.
A: The structure of a binary search tree lets us determine the successor of a node without ever comparing keys. If the right subtree of node x is non-empty, then the successor of x is the leftmost node of the right subtree.If the right subtree of node x is empty and x has a successor y, then y is the lowest ancestor of x whose left child is also an ancestor of x.

Q: Given a lot of intervals [ai, bi], you have to find a point with the intersection of most number of intervals.
A: We can maintain a count of inclusions in intervals for each point as we traverse the numberline from the minimum ai to the maximum bi.

Friday, April 12, 2013

interview question answers continued

Q: Give the four basics of object oriented Programming.
A: These are some of the tenets of object oriented programming
1. Encapsulation - define a concept by keeping the data and its operations together
2. Inheritance - defines an "is-a" relationship between derived and base class.
3. Composition - defines a "has-a" relationship between a container and contained class.
4. Polymorphism - defines the same method for different contexts

Q: Given two integers m, n, find the first integer r that is divisible by both m and n.
A: Find the factors of m and n. From the factors choose two or more factors such that the product is bigger than both m and n and less than or equal to m * n.

Q: List and compare the various sorting algorithms
A: 1. Bubble sort : make multiple pass and fix the order of elements. It has an average O(n^2) complexity/
2. Selection sort : finds the minimum and swaps with the first and repeat for consecutives. Another O(n^2) algorithm
3. Insertion sort : played like the cards in a deck, take elements one by one and insert them in position in the sorted list. Average O(n^2)
4. Heap sort : Use the array as a heap and use the heap algorithm to sort. O(nlogn) complexity
5. Merge sort : bottom up merge between pairs O(nlogn) complexity
6. Quick sort : divide and conquer based on partitioning on a pivot value. O(nlogn) complexity

Wednesday, April 10, 2013

interview question answers continued

Q: Given an integer array find if there exists three integers that add up to zero.
A: This is another integer array problem where we are interested in a particular sum. Since we are looking for three integers the naiive solution looks O(n^3) but this is not the case. First we could just sort the elements, then for each element we encounter we know the complement that makes the sum zero.For each complement, we know the fractions has to be earlier than the number we have just encountered. Thus we can proceed better in this case if the elements are sorted.

Q: Given a set of login and logout pairs of users, find the number of users online at any given point of time.
A: From the start of the list of logins and logouts, populate a dictionary of users and their active logins by incrementing a counter for every login of that user and decrement for every logout of that user until the last entry. This gives the number of active logins for each user.

Q: Another integer array is given where there are repeating zeros. Compact the array by erasing the zeros.
A: The problem hints on using the same storage to get the resultant array. So we can lean on multiple passes on the array if we want. Hence, we could make one pass to find the index of all occurances of zero. Then for each occurance, shift all the elements following it. We could shift more than one at the same time if the occurances are consecutive. Another improvement could be to shift only the elements upto the next occurance and keep the zeros consecutive at the next occurance.

Q: Given a set of non-overlapping integer range and an input integer, now design a data structure to store them and have operations of insert, delete and search
A: A double dimensional array seems sufficient for the operations where the integer ranges could traverse one or more of the jagged array elements. For each element we keep track of the end of a range and the start of another.

Q:How can you tell if a given string is of the form aZb where a is the reverse of b ?
A: If a and b can be empty strings almost all strings can satisfy that form otherwise compare start and end index elements and update the indexes on a match.

Monday, April 8, 2013

interview question answers continued

Q: In Java differentiate between final, finally and finalize ?
A: The keyword final in Java are used to denote that which cannot be inherited further. The finally is used for mandatory invocation after try catch exception handling and the finalize is used by the runtime for garbage collection of objects.

Q. Given two arrays, how will you find the intersection ?
A. If intersections is defined as the set of elements that appear in both arrays in no particular order, and both the arrays can be arbitrarily large or have elements from an arbitrary range, then one approach could be to sort both arrays and walk down both arrays element by element and put the common ones in a bag until one of the arrays gets over.

Q. Given two squares by their four corners coordinates, find the coordinates of their intersection.
A. Take the minimum and maximum of all x coordinates and the same for y coordinates. If these are of the same square, one is contained in the other. Otherwise they are either overlapping or disjoint, which you can tell from the diagnoally opposite ends of the min and max. If they do overlap, we already have the co-ordinates for two diaognally opposite corners and other two are common points to both squares.

Q. What is the difference between a hash table and a hash map.
A. A hash table stores values based on the hash computed from the value. A hash map stores key value pairs based on the hash of the key. In C++ a hash map is used where < is not defined or unsuitable for the intended key.   A good hash function is a pre-requisite for both or there can be collisions for some entries.   For example :
template<class T> size_t Hash<T>::operator() (const T& key) const
{
size_t ers = 0;
size_t len = sizeof(T);
const char* p = reinterpret_case<const char*> (&key);
while (len--) res = (res << 1) ^*p++;
return res;
}

Q: What is the significance of the term "dead beef"
A: Initialized memory is usually set to 0xEF bit pattern to indicate the object at this memory has not been used. Similarly 0xCC is used to denote uninitialized memory or no man's land.

Q: Write a C code to find the depth of a binary tree.
A: int GetDepth(Node* root)
     {
       if (root == null) return 0;
       int depth = 1;
       if (root->left == null && root->right == null) return depth;
       int subtree = GetDepth(root->left);
       if (subtree >= depth) depth = subtree+ 1;
       subtree = GetDepth(root->right);
       if(subtree >= depth) depth = subtree + 1;
       return depth;
     }

Sunday, April 7, 2013

interview question answers continued

Q: Given an array of integers ( both positive and negative in random order ), find the subarray with the largest sum.
A: This has a surprisingly simple solution which keeps track of a local and global sum. If the inclusion of the next number does not improve our sum so far, we reset our sum to this number and proceed for the length of the array.

Q: How are cookies passed in the HTTP protocol ?
A: The server sends a set-cookie:name=value to the client, in response to which the client sends a cookie: name=value in its request header.

Q: How will you search for a word in a very large database ?
A: There are builtin operators that lets you search for text such as Contains, like, and fulltext. If we don't know the table we need to look in, we can use full text search.

Q: Explain the functioning or working of a search engine like Google :
A: As explained in the anatomy of a search engine paper, here are some of the steps:
1. A URL server sends lists of URLs to be fetched to the crawlers.
2. Several distributed crawlers download web pages based on this list.
3. The web pages that are fetched are then sent to the store server.
4. The store server then compresses and stores the web pages in a repository.
5. Every web page has an associated ID number called a docID which is parsed whenever a new URL is parsed out of a web page
6. An indexer and sorter, reads the repository, uncompresses the documents and parses them to index the words occurrences called hits.The indexer distributes these words into a set of barrels. It also stores the links in an anchor file
7. The URL resolver reads the anchors file and converts relative URL into absolute URL and into docID. It also generates a database of URL pairs to compute page ranks.
8. The sorter takes the barrels sorted by docID and resorts them by wordID to generate the inverted index.The sorter also produces a list of wordIDs and offsets into the inverted index.
9. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher.
10. The searcher is run by a web server and uses the lexicon built by the DumpLexicon together with the inverted index and page ranks to answer queries.