Monday, April 10, 2017

We were discussing the choice of learning algorithms, the metrics they use, and their fine-tuning with feedback from the sample set. A few steps are followed in sequence to make the right choices.
The first step is collecting the dataset. If an expert is available, then we may already know which fields (attributes, features) are the most informative; otherwise we measure everything available in a brute-force fashion and then shortlist the informative attributes.
The second step is data preparation and preprocessing. Depending on the circumstances, researchers can choose from a number of methods to handle missing data. Preprocessing also handles noise and copes with large data by working on smaller, representative subsets.
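One common way to handle missing data, sketched here as plain Python with mean imputation (the function name is illustrative, not from any particular library):

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]
```

Other choices in the same spirit include median imputation (more robust to outliers) or simply dropping incomplete rows when data is plentiful.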
Then the features are pruned: feature subset selection is used to identify and remove as many irrelevant and redundant features as possible. Sometimes features depend on one another and unduly influence the classifier; this is addressed by constructing new features from the basic feature set.
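As a minimal sketch of pruning irrelevant features, assuming a simple variance filter (near-constant features carry little information for a classifier; names here are illustrative):

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def select_features(columns, threshold=0.0):
    """Keep only feature columns whose variance exceeds the threshold."""
    return {name: col for name, col in columns.items() if variance(col) > threshold}
```

Redundancy between features would need a pairwise measure such as correlation; this sketch only removes uninformative features.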
Then the training set is defined. One rule of thumb is to split the available data by using two thirds for training and the other third for estimating performance. Another technique is cross-validation: the data is divided into mutually exclusive, equal-sized subsets, and the classifier is trained on the union of all but one subset, which is held out for testing; this is repeated for each subset. Even the choice of distance metric within a classifier is fine-tuned with feedback from the training set, although some algorithms try not to be sensitive to the choice of similarity function. We discussed the Euclidean, Minkowski, Manhattan, Chebyshev, and Canberra distances.
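The distance metrics mentioned above can be sketched in a few lines (plain Python, no library assumed):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def minkowski(a, b, p):
    # Generalizes Manhattan (p = 1) and Euclidean (p = 2).
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # The limit of Minkowski as p grows: the largest coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))

def canberra(a, b):
    # Weighted Manhattan distance; terms where both coordinates are 0
    # are conventionally treated as contributing 0.
    return sum(abs(x - y) / (abs(x) + abs(y))
               for x, y in zip(a, b) if abs(x) + abs(y) > 0)
```

For example, between (0, 0) and (3, 4) the Euclidean distance is 5, Manhattan is 7, and Chebyshev is 4, which illustrates how each metric weighs coordinate differences differently.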
#codingexercise
We were looking at an exercise involving the sum of all numbers formed by subsequences of the digits representing a number; a similar technique is also suitable for counting the number of repetitions of each digit.
We use the fact that the number of repetitions is given by combinations. A digit appears as many times as it can be chosen into a subsequence, and therefore the count of repetitions is a sum of binomial coefficients. Let's take an example and see the pattern:
a0, a1, a2
can be combined as a0, a1, a2, a0a1, a1a2, a0a2, and a0a1a2, with each digit repeated 1 + 2 + 1 = 4 times.
For a0, a1, a2, a3 each digit is repeated 1 + 3 + 3 + 1 = 8 times,
and these are in fact sums of binomial coefficients: C(n-1, 0) + C(n-1, 1) + ... + C(n-1, n-1) = 2^(n-1), since a digit appears alongside every possible choice of the remaining n-1 digits.
int GetRepetitionCount(List<int> A)
{
    if (A == null || A.Count == 0) return 0;
    if (A.Count == 1) return 1; // a single digit appears exactly once
    // Each digit appears in C(n-1, 0) + C(n-1, 1) + ... + C(n-1, n-1)
    // subsequences, where n is the number of digits.
    int count = 0;
    for (int n = 1; n <= A.Count; n++)
        count += GetNChooseKDP((uint)(A.Count - 1), (uint)(n - 1));
    return count;
}

int GetNChooseKDP(uint n, uint k)
{
    // Pascal's rule; for larger n the results should be memoized
    // to avoid recomputing overlapping subproblems.
    if (k == 0 || k == n)
        return 1;
    return GetNChooseKDP(n - 1, k - 1) + GetNChooseKDP(n - 1, k);
}

If we were to count the number of subsequences themselves, it would be the sum of the binomial coefficients for choosing k from n, where k is the length of the subsequence and can vary from 1 to n, the length of the string; this sum equals 2^n - 1.
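That count can be sanity-checked with a short sketch (Python here for brevity; the function name is illustrative):

```python
from math import comb

def count_subsequences(n):
    # Sum of C(n, k) for k = 1..n: every non-empty choice of digit positions.
    return sum(comb(n, k) for k in range(1, n + 1))
```

For n = 3 this gives 3 + 3 + 1 = 7, which agrees with 2^3 - 1.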
