Wednesday, March 29, 2017

We continue with the enumeration of analytical techniques using tags on datasets.
Cluster mining

There are efficient algorithms to mine cluster centers. Given a table R(A1, A2, ..., An), find K center-points that minimize the average distance from each row of R to its nearest center-point, subject to the constraint on the number of clusters. The center-points belong to R.

A naive algorithm is exhaustive enumeration. If the domains of the attributes are small, use the following steps: 1) enumerate the cross-products of the domains of the attributes, 2) enumerate the size-K subsets of tuples, and 3) choose the size-K subset with the smallest average distance to R.

An alternative idea is the K-means algorithm, which uses a greedy strategy - each iteration reduces the average distance to R:
1. Select K points at random.
2. Assign each tuple of R to the nearest point.
3. Compute the center of each cluster.
4. Repeat steps 2 and 3 until a local minimum is reached.
The advantage of K-means is that it is faster than exhaustive enumeration, but it is a heuristic, i.e. there is no guarantee of solution quality (a C# sketch follows at the end of this section).

Outliers

Outliers are the rows that are most dissimilar to the rest. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find the rows of R that are dissimilar to most other rows of R. The objective is to maximize the aggregate dissimilarity, subject to a constraint on the number of outliers, or a significance threshold if one is given.

The choices for similarity measures between rows include distance functions such as Euclidean and Manhattan (Lp metrics), string edit distance, graph distance, etc. The choices for aggregate dissimilarity measures include the distance to the K nearest neighbors, a neighborhood density outside the expected range, and the attribute differences with nearby neighbors.

Outlier mining: A naive algorithm computes the dissimilarity measure for each row of R against the rest of R, and then chooses the top-K rows, or those with scores outside the expected range. An alternative idea is clustering-based, where we use the clusters as a summary of the data to reduce computational cost. It involves the following steps: 1. cluster R, e.g. via K-means, 2. compute the distance of each tuple in R to its nearest cluster center, and 3. choose the top-K rows, or those with scores outside the expected range (see the second sketch below). The advantage of cluster-based outlier detection is that it is faster, but the disadvantages are the same as with clustering, i.e. it is heuristic (there is no guarantee of solution quality) and it may miss outliers.
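Here is a minimal C# sketch of the K-means loop described above. Modeling the rows of R as double[] vectors, the Euclidean distance, the random initialization, and the iteration cap are all illustrative assumptions:

using System;
using System.Collections.Generic;
using System.Linq;

class KMeansSketch
{
    public static double Distance(double[] a, double[] b)
    {
        double sum = 0;
        for (int d = 0; d < a.Length; d++)
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.Sqrt(sum);
    }

    // Returns K cluster centers for the rows of R.
    public static double[][] Cluster(double[][] R, int K, int maxIter = 100)
    {
        var rand = new Random();
        // 1. Select K points at random from R.
        var centers = R.OrderBy(_ => rand.Next()).Take(K)
                       .Select(r => (double[])r.Clone()).ToArray();
        var assignment = Enumerable.Repeat(-1, R.Length).ToArray();
        for (int iter = 0; iter < maxIter; iter++)
        {
            // 2. Assign each tuple of R to the nearest center.
            bool changed = false;
            for (int i = 0; i < R.Length; i++)
            {
                int best = 0;
                for (int k = 1; k < K; k++)
                    if (Distance(R[i], centers[k]) < Distance(R[i], centers[best]))
                        best = k;
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed) break; // 4. no reassignments: a local minimum is reached

            // 3. Compute the center of each cluster as the mean of its members.
            for (int k = 0; k < K; k++)
            {
                var members = Enumerable.Range(0, R.Length)
                                        .Where(i => assignment[i] == k).ToList();
                if (members.Count == 0) continue; // keep an empty cluster's old center
                for (int d = 0; d < centers[k].Length; d++)
                    centers[k][d] = members.Average(i => R[i][d]);
            }
        }
        return centers;
    }
}

Note that the greedy loop only converges to a local minimum, which is the heuristic caveat mentioned above.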
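And here is a cluster-based outlier sketch that reuses the KMeansSketch routine above; scoring each row by its distance to the nearest center and cutting off at the top-K are assumptions for illustration:

using System.Collections.Generic;
using System.Linq;

class OutlierSketch
{
    // Returns the indices of the topK rows of R farthest from any cluster center.
    public static List<int> TopKOutliers(double[][] R, int numClusters, int topK)
    {
        // 1. Cluster R, e.g. via K-means, to summarize the data.
        var centers = KMeansSketch.Cluster(R, numClusters);
        // 2. Score each tuple by its distance to the nearest cluster center,
        // 3. and choose the top-K rows by that score.
        return Enumerable.Range(0, R.Length)
                         .OrderByDescending(i => centers.Min(c => KMeansSketch.Distance(R[i], c)))
                         .Take(topK)
                         .ToList();
    }
}

A threshold on the score, instead of a fixed top-K, would implement the "outside the expected range" variant.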

#codingexercise
Yesterday we were looking at the count of distinct subsequences, which we computed by summing over the dp table. We can also do it in linear time as shown below:
using System.Collections.Generic;

int GetCountAlt(List<int> A)
{
    int n = A.Count;
    var last = new Dictionary<int, int>(); // most recent index at which each value occurred
    var dp = new int[n + 1]; // dp[i] = count of distinct subsequences of the first i elements
    dp[0] = 1; // the empty subsequence

    for (int i = 1; i <= n; i++)
    {
        dp[i] = 2 * dp[i - 1]; // each earlier subsequence, with and without A[i-1]

        // subtract the subsequences double-counted since the previous occurrence of A[i-1]
        if (last.ContainsKey(A[i - 1]))
            dp[i] -= dp[last[A[i - 1]]];

        last[A[i - 1]] = i - 1;
    }
    return dp[n];
}
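For example, with A = [1, 2, 1] this returns 7: the empty subsequence, (1), (2), (1,1), (1,2), (2,1) and (1,2,1); the duplicate (1) produced by the repeated element is subtracted exactly once by the last-occurrence correction.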


We can also write this function recursively, without memoization, as follows:
// recursive variant of the same recurrence; correct but exponential without memoization
int GetCountH(List<int> A, int k) // call with k = A.Count
{
    if (k == 0) return 1; // only the empty subsequence
    int count = 2 * GetCountH(A, k - 1);

    // subtract the count at the previous occurrence of A[k-1], if any
    for (int j = k - 2; j >= 0; j--)
    {
        if (A[j] == A[k - 1])
        {
            count -= GetCountH(A, j);
            break;
        }
    }
    return count;
}
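Without memoization this recursion re-solves the same prefixes repeatedly, so it takes exponential time; the tabulated version above does the same computation in a single linear pass.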
