We were discussing the content based techniques in document filtering for indexing. The technique that is chosen for the learning is only one of the considerations in the overall design. In order to come up with the right classifier, we need to consider the following steps in sequence:
The first step is collecting the dataset. If an expert is available, then we may know the fields (attributes, features) which are most informative otherwise we try to measure everything in a brute-force method in order to shortlist the attribute, feature pair
The second step is the data preparation and the data preprocessing. Depending on the circumstances, researchers have a number of methods to choose from to handle missing data. These methods not only handle noise but also cope with large data by working on smaller datasets.
Then the features are pruned by selecting a subset of features and is used to identify and remove as many irrelevant and redundant features as possible. Sometimes features depend on one another and unduly influence the classifier. This is solved by constructing new features from the basic feature set.
Then the training set is defined. One rule of thumb is to split the training set by using two thirds for training and the other third for estimating performance. Another technique is to do cross validation where the training set is mutually exclusive and equal sized and the classifier is trained on the union. Generally we have only one test dataset of size N and all estimates must be obtained from this sole dataset. A common method for comparing supervised ML algorithms involves comparing the accuracies of the classifiers statistically. Algorithm selection depends on the feedback from this evaluation. The feedback from the evaluation of the test set is also used to improve each of the steps described above.
Even the choice of distance metric within a classifier is fine tuned with feedback from the training set although some algorithms try not to be sensitive to the choice of the similarity functions. Euclidean distance is based on the sum of squares of the difference between the two instances Minkowsky replaces the power of two with the power of r where r is the dimensions of the normed vector. If the power was 1, the distance is Manhattan. Chebyshev goes by the maximum distance between the two instances. Camberra takes the ratio of the difference to the sum of the distances without the signs. Kendall's rank correlation takes the difference between the number of concordant pairs and the number of discordant pairs as found from the signs on the distance differences and as a fraction of the total number of pair combinations. If the instances are distributed with high cohesion, something simpler will suffice. If the marginals need to be included, we may need to make it more sophisticated.
#codingexercise
We were looking at an exercise involving sum of all numbers formed by subsequences in the digits representing a number
We did this by choosing different numbers but we can also put to use the fact the number of such subsequences is represented by repetitions. A digit will appear as many times as it is chosen in this combination. Therefore the sum is the accumulation of repetitions times the digit value.
Let's take an example and see the pattern:
a0, a1, a2
can be combined as a0 + a1 + a2 + a0a1 + a1a2 + a0a2 + (a0a1a2 ) = 4(a0+a1+a2) where 4 = 1 + 2 + 1
For a0,a1,a2,a3 we have 8(a0+a1+a2+a3) and 8 = 1 +3 + 3 + 1
and these are infact the sum of the binomial coefficients.
Int GetSumOfAllSubsequences(List<int>A, int start, int end)
{
If (start>end) return 0;
If (start==end)return A[start];
Int sum = A.Sum();
int sumofallsubsequences= 0
For (int n = 1; n <=A.Count; n++)
sumofallsubsequences += sum * GetNChoosekDP(A.Count-1, n-1)
Return sumofallsubsequences;
}
Alternatively, we could also calculate is A.Sum() x Math.Pow(2, A.Count-1)
int GetNChooseKDP(uint n, uint k)
{
if (k == 0 || k == n)
return 1;
return GetNChooseKDP(n-1, k-1) + GetNChooseKDP(n-1,k);
}
The first step is collecting the dataset. If an expert is available, then we may know the fields (attributes, features) which are most informative otherwise we try to measure everything in a brute-force method in order to shortlist the attribute, feature pair
The second step is the data preparation and the data preprocessing. Depending on the circumstances, researchers have a number of methods to choose from to handle missing data. These methods not only handle noise but also cope with large data by working on smaller datasets.
Then the features are pruned by selecting a subset of features and is used to identify and remove as many irrelevant and redundant features as possible. Sometimes features depend on one another and unduly influence the classifier. This is solved by constructing new features from the basic feature set.
Then the training set is defined. One rule of thumb is to split the training set by using two thirds for training and the other third for estimating performance. Another technique is to do cross validation where the training set is mutually exclusive and equal sized and the classifier is trained on the union. Generally we have only one test dataset of size N and all estimates must be obtained from this sole dataset. A common method for comparing supervised ML algorithms involves comparing the accuracies of the classifiers statistically. Algorithm selection depends on the feedback from this evaluation. The feedback from the evaluation of the test set is also used to improve each of the steps described above.
Even the choice of distance metric within a classifier is fine tuned with feedback from the training set although some algorithms try not to be sensitive to the choice of the similarity functions. Euclidean distance is based on the sum of squares of the difference between the two instances Minkowsky replaces the power of two with the power of r where r is the dimensions of the normed vector. If the power was 1, the distance is Manhattan. Chebyshev goes by the maximum distance between the two instances. Camberra takes the ratio of the difference to the sum of the distances without the signs. Kendall's rank correlation takes the difference between the number of concordant pairs and the number of discordant pairs as found from the signs on the distance differences and as a fraction of the total number of pair combinations. If the instances are distributed with high cohesion, something simpler will suffice. If the marginals need to be included, we may need to make it more sophisticated.
#codingexercise
We were looking at an exercise involving sum of all numbers formed by subsequences in the digits representing a number
We did this by choosing different numbers but we can also put to use the fact the number of such subsequences is represented by repetitions. A digit will appear as many times as it is chosen in this combination. Therefore the sum is the accumulation of repetitions times the digit value.
Let's take an example and see the pattern:
a0, a1, a2
can be combined as a0 + a1 + a2 + a0a1 + a1a2 + a0a2 + (a0a1a2 ) = 4(a0+a1+a2) where 4 = 1 + 2 + 1
For a0,a1,a2,a3 we have 8(a0+a1+a2+a3) and 8 = 1 +3 + 3 + 1
and these are infact the sum of the binomial coefficients.
Int GetSumOfAllSubsequences(List<int>A, int start, int end)
{
If (start>end) return 0;
If (start==end)return A[start];
Int sum = A.Sum();
int sumofallsubsequences= 0
For (int n = 1; n <=A.Count; n++)
sumofallsubsequences += sum * GetNChoosekDP(A.Count-1, n-1)
Return sumofallsubsequences;
}
Alternatively, we could also calculate is A.Sum() x Math.Pow(2, A.Count-1)
int GetNChooseKDP(uint n, uint k)
{
if (k == 0 || k == n)
return 1;
return GetNChooseKDP(n-1, k-1) + GetNChooseKDP(n-1,k);
}
No comments:
Post a Comment