Thursday, October 16, 2014

#codingexercise

int getMatch(const vector<int>& v, int i)
{
    // return the index of the first occurrence of i, or -1 if absent
    for (size_t k = 0; k < v.size(); k++)
        if (v[k] == i) return (int)k;
    return -1;
}
In this article, we investigate kernel methods, in continuation of our readings on the use of different machine learning techniques in web applications. We can think of these as advanced classifiers in the set of classifiers that includes decision trees, Bayesian classifiers and neural networks. A typical example of a dataset for which this kind of classifier is used is matchmaking people for a dating site. There are many variables, both numerical and nominal, and many non-linear relationships. One thing we should realize is that a dataset and a classifier cannot be arbitrarily paired. We have to preprocess the data and select the best algorithm to get good results.

In the matchmaking example we want to build a classifier on, we recognize several different variables such as age, habits, interests, wants children, location, and so on. Instantly we see how complex this dataset is and how simplifying our assumptions are for what we model the dataset to be. Let us say we write a function that can take the population and generate a dataset with values assigned against the variables. This could simply be reading a preprocessed data file. The next step is to take these variables in pairs and make a scatter plot to see how correlated they are. For example, an age difference may be more tolerated as we grow older. If we build a decision tree classifier on this scatter plot, we will have split the data on numerical boundaries. This boundary that splits the data into different categories is called the decision boundary, and it helps to visualize how divided the data is. Next, we can build a linear classifier that finds the average of all the data in each class, takes that as a center point, and then classifies new points by the nearest center point. By drawing a line through the averages in the scatter plot, we can take new data for matching and predict the match based on which average it is closer to. While closeness of a point to the average was measured with Euclidean distance earlier, we can now improve on it with vectors and the dot-product. The dot-product is a scalar quantity and can be positive or negative based on the cosine of the angle between the vectors. And with this way to find the closeness, we now have a linear classifier, as in the sketch below.
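Here is a minimal sketch of that dot-product classifier. It assumes the two class averages avg0 and avg1 have already been computed from the training data; the names are illustrative.

#include <vector>
using namespace std;

double dot(const vector<double>& a, const vector<double>& b)
{
    double sum = 0.0;
    for (size_t i = 0; i < a.size(); i++) sum += a[i] * b[i];
    return sum;
}

// Classify by the sign of x.(avg0 - avg1) - (|avg0|^2 - |avg1|^2)/2:
// positive means x is closer to avg0, negative means closer to avg1.
int dpClassify(const vector<double>& x,
               const vector<double>& avg0, const vector<double>& avg1)
{
    double b = (dot(avg0, avg0) - dot(avg1, avg1)) / 2.0;
    double y = dot(x, avg0) - dot(x, avg1) - b;
    return (y > 0) ? 0 : 1;
}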

However, the data may not always present a dividing line for a linear classifier to work effectively. In such a case, we use kernel methods to extend the classifier to non-linear classification.

But before we jump into the kernel methods, let us quickly see how to work with variables that are not as simple as age to put on a scatter plot. For example, interests can be diverse, and we may end up creating a new variable for every interest. To overcome this, we represent the interests in a hierarchy and then use the structure to compute the closeness of two sets of interests. This way a single, much smaller variable captures how many close interests two people share. Similarly, location could be represented by many different things such as street, city, state, country, etc. To normalize location, we could use a geocoding API that translates locations into latitudes and longitudes and then compute the distance in miles. Lastly, we can scale the dataset to a higher or lower resolution to enable applying a linear classifier.
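As an illustration of that last step, here is a minimal sketch that rescales every numeric variable into [0, 1]; the rows-of-doubles representation is an assumption, not part of the original exercise.

#include <vector>
#include <algorithm>
#include <limits>
using namespace std;

// Rescale each column of the dataset into [0, 1] so that no single
// variable (say, distance in miles) dominates the dot product.
vector<vector<double>> scaleData(const vector<vector<double>>& rows)
{
    if (rows.empty()) return rows;
    size_t dims = rows[0].size();
    vector<double> lo(dims, numeric_limits<double>::max());
    vector<double> hi(dims, numeric_limits<double>::lowest());
    for (const auto& r : rows)
        for (size_t d = 0; d < dims; d++) {
            lo[d] = min(lo[d], r[d]);
            hi[d] = max(hi[d], r[d]);
        }
    vector<vector<double>> scaled = rows;
    for (auto& r : scaled)
        for (size_t d = 0; d < dims; d++)
            if (hi[d] > lo[d]) r[d] = (r[d] - lo[d]) / (hi[d] - lo[d]);
    return scaled;
}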

Nevertheless, we may have a complex dataset that does not yield to a linear classifier. For example, if the data were divided by a circle forming two concentric regions, it is not easy to apply a linear classifier because there is no dividing line. Instead we use kernel methods. For the same example of concentric regions in a scatter plot of X and Y, we could take the squares of the corresponding variables; a scatter plot of the squares shows the inner region's data points moving to a corner and all the others moving away from the corner. In such a case, it is easy to draw a dividing line separating the corner from the rest. Transforming the data for a couple of variables is easy, but transforming vectors with a large number of variables or dimensions is not. Instead we apply what is called the kernel trick, which involves replacing the dot-product function with a new function that returns what the dot-product would have been if the data had first been transformed to a higher-dimensional space by some mapping function. One such function is the radial-basis function, which is like a dot-product in that it takes two vectors and returns a value, but it also takes an additional parameter that can be adjusted to get the best linear separation. And instead of calculating the dot-product of the new point against the average point, we now compute the radial-basis function between the point and every other point in the class and average them. The weight adjustment that helps with categorization, called the offset, is also different in the transformed space. It is usually pre-calculated for the dataset and passed in when classifying new points. This tremendously improves the linear classifier for, say, the impact of age difference in matchmaking.
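A minimal sketch of that idea, assuming gamma is the adjustable parameter and the offset b has been pre-calculated for the dataset:

#include <cmath>
#include <vector>
using namespace std;

// Radial-basis function: falls off with the squared distance between
// the two vectors; gamma is the adjustable parameter mentioned above.
double rbf(const vector<double>& a, const vector<double>& b, double gamma)
{
    double dist2 = 0.0;
    for (size_t i = 0; i < a.size(); i++)
        dist2 += (a[i] - b[i]) * (a[i] - b[i]);
    return exp(-gamma * dist2);
}

// Average the RBF value between the new point and every training point
// in each class, rather than taking a dot product with the averages.
int rbfClassify(const vector<double>& p,
                const vector<vector<double>>& class0,
                const vector<vector<double>>& class1,
                double gamma, double b)
{
    double s0 = 0.0, s1 = 0.0;
    for (size_t i = 0; i < class0.size(); i++) s0 += rbf(p, class0[i], gamma);
    for (size_t i = 0; i < class1.size(); i++) s1 += rbf(p, class1[i], gamma);
    double y = s0 / class0.size() - s1 / class1.size() + b;
    return (y > 0) ? 0 : 1;
}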


Wednesday, October 15, 2014

#codingexercise
size_t strlen (const char* pc)
{
    if (!pc) return 0; // size_t is unsigned, so return 0 rather than -1 for null
    size_t count = 0;
    while (*pc) {
        pc++;
        count++;
    }
    return count;
}


#codingexercise
int strcmp (const char* a, const char* b)
{
    if (a == NULL || b == NULL) return -1;
    while (*a && *b && *a == *b)
    {
        a++;
        b++;
    }
    return *a - *b;
}

Tuesday, October 14, 2014

#codingexercise
A linked list is given such that each node  additionally  points to a random other node in the list. Perform a deep copy.
Node* deepCopy(Node* source)
{
    if (source == NULL) return NULL;
    Node* clone = shallowCopy(source); // copy the next-chain first, then fix up the other pointers
    Node* curSrc = source;
    Node* curDest = clone;
    while (curSrc)
    {
        if (curSrc->other == NULL)
        {
            curDest->other = NULL;
            curSrc = curSrc->next;   // advance before continuing to avoid an infinite loop
            curDest = curDest->next;
            continue;
        }
        int offset = relativePos(curSrc, curSrc->other, source);
        if (offset < 0)
        {
            // target precedes curSrc: an offset of -(k+1) means the k-th node from the head
            Node* dest = clone;
            for (int i = offset + 1; i < 0 && dest; i++)
                dest = dest->next;
            curDest->other = dest;
        }
        else
        {
            // target is offset nodes ahead of the current node
            Node* dest = curDest;
            for (int i = 0; i < offset && dest; i++)
                dest = dest->next;
            curDest->other = dest;
        }
        curSrc = curSrc->next;
        curDest = curDest->next;
    }
    return clone;
}

Node* shallowCopy (Node* source)
{
    if (!source) return NULL;
    Node* ret = new Node(); // new throws std::bad_alloc on failure
    ret->next = shallowCopy(source->next);
    return ret;
}

// Returns the offset of target relative to source: 0 if they are the same
// node, a positive count if target lies ahead of source, and -(k+1) if
// target is instead the k-th node from the head of the list.
int relativePos(Node* source, Node* target, Node* head)
{
    if (source == target) return 0;
    int i = 1;
    while (source && source->next && source->next != target)
    {
        source = source->next;
        i++;
    }
    if (source && source->next == target) return i;
    // target was not found ahead of source; count backwards from the head
    i = -1;
    while (head && head != target)
    {
        head = head->next;
        i--;
    }
    return i;
}
Today we look at inverted lists. In an Apache Lucene index, a document consists of multiple fields and a field consists of multiple terms. The indexing of the document is controlled by labeling the fields as to whether they need to be analyzed (preprocessed), indexed, or stored (ready to be retrieved). The inverted index has a list of fields, their labels, and offsets into the postings list. The entry in the postings list corresponding to a field has a term, its frequency, and positions in the frequency list. The position in the frequency list gives the documents, sorted by their id, that contain the term, each with offsets, in terms of say word count, of where the term appears. The whole index, also called the vocabulary, is sorted first by document, then by fields, and then by terms. Thus it is a hierarchy of lists.
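To visualize that hierarchy of lists, here is a minimal sketch in C++ structs; the names are illustrative, not Lucene's actual types.

#include <string>
#include <vector>
using namespace std;

struct Posting {
    int docId;                  // documents are kept sorted by id
    vector<int> positions;      // word offsets where the term appears
};

struct TermEntry {
    string term;
    int frequency;
    vector<Posting> postings;
};

struct FieldEntry {
    string field;
    bool analyzed, indexed, stored;  // the three labels mentioned above
    vector<TermEntry> terms;
};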

We now look at the document retrieval step. This is the second of the two steps listed earlier. Here we are interested in finding the top-ranking matching documents. Each document is checked for a match and retrieved one at a time. The documents are found from the postings lists of the terms in the query, and the postings lists can be traversed concurrently. The TF-IDF score of the document is calculated to determine a match. Apache Lucene calculates a static and a dynamic score, based on the document alone and on the dot product of the query and document vectors respectively. It also has a boost factor for the term frequency. The reason we bring up the static score is that the documents could additionally be indexed based on it. However, we will just use the notion of a score to rank the documents by inserting them into a priority queue data structure.
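A minimal sketch of that ranking loop, where score() is a stand-in for whatever TF-IDF computation is used:

#include <queue>
#include <utility>
#include <vector>
using namespace std;

double score(int docId);   // assumed: TF-IDF of the query against this document

// Rank the candidate documents and pop the k highest-scoring ones.
vector<int> topK(const vector<int>& candidateDocs, size_t k)
{
    priority_queue<pair<double, int> > ranked;   // max-heap of (score, docId)
    for (size_t i = 0; i < candidateDocs.size(); i++)
        ranked.push(make_pair(score(candidateDocs[i]), candidateDocs[i]));
    vector<int> result;
    while (!ranked.empty() && result.size() < k)
    {
        result.push_back(ranked.top().second);
        ranked.pop();
    }
    return result;
}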

For large collections, the index can be distributed across multiple machines. The indexes are partitioned based on term partitioning or document partitioning. Usually these are determined outside the system, because the caller will know how to partition. In ElasticSearch, for example, the machines are organized in rows and columns. Each column represents a partition of documents while each row represents a replica of the corpus.
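As a rough sketch of document partitioning in that grid, assuming documents are assigned to columns by hashing their id (the helper names are hypothetical, not ElasticSearch's API):

// A document lands in one column (shard); any row (replica) of that
// column can serve a read. numColumns and numRows describe the grid.
int columnFor(int docId, int numColumns)
{
    return docId % numColumns;       // simple hash-by-id partitioning
}

int rowFor(int requestId, int numRows)
{
    return requestId % numRows;      // spread reads across replicas
}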

Note that for the score we use, the IDF is calculated over the whole corpus, not just a partition. In this case, either we treat the partitions as uniform fractions of the corpus or we poll and collect the scores from each.
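For instance, a corpus-wide IDF can be assembled from per-partition document frequencies; this sketch assumes each partition reports its own count:

#include <cmath>
#include <vector>
using namespace std;

// Sum the per-partition document frequencies before taking the log, so
// the IDF reflects the whole corpus rather than any single partition.
double globalIdf(const vector<long>& dfPerPartition, long totalDocs)
{
    long df = 0;
    for (size_t i = 0; i < dfPerPartition.size(); i++)
        df += dfPerPartition[i];
    return log((double)totalDocs / (double)(1 + df)); // +1 guards against df == 0
}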


#codingexercise
Implement strtok
char* strtok (char* s, const char* delim)
{
    static char* last;
    char* spanp;
    char* tok;
    int c, sc;

    if (s == NULL && (s = last) == NULL)
        return NULL;

    /* skip leading delimiters */
cont:
    c = *s++;
    for (spanp = (char*) delim; (sc = *spanp++) != 0;)
        if (c == sc) goto cont;

    if (c == 0) {       /* no non-delimiter characters left */
        last = NULL;
        return NULL;
    }
    tok = s - 1;

    /* scan the token; the NUL that terminates delim stops us too */
    for (;;) {
        c = *s++;
        spanp = (char*) delim;
        do {
            if ((sc = *spanp++) == c) {
                if (c == 0)
                    s = NULL;
                else
                    s[-1] = 0;
                last = s;
                return tok;
            }
        } while (sc != 0);
    }
    /* unreached */
}

Courtesy: Apple


My back has been hurting for some time now.

Monday, October 13, 2014

#codingexercise
int memcmp (const char* src, const char* dest, size_t n)
{
    if (!src || !dest) return 0;
    if (src == dest) return 0;   // the same block compares equal
    while (n)
    {
        if (*src != *dest) return *src - *dest;
        src++;                   // advance both pointers, which the original omitted
        dest++;
        n--;
    }
    return 0;                    // standard memcmp semantics: 0 means equal
}