Saturday, January 26, 2013

Data Mining (Continued)


Decision Tree

A useful data structure for mining the logical data model is the decision tree. Each interior node tests one of the categorical attributes (A1, … An), each leaf holds a class label from domain(C), and each edge carries a value from domain(Ai), where Ai is the attribute tested at the parent node. The tree thus acts as a search tree whose property is that it associates each tuple of R with a leaf, i.e. with a class label. The advantages of using a decision tree are that it can work with heterogeneous data and that its decision boundaries are parallel to the axes.
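
To make the structure concrete, here is a minimal Python sketch of such a tree; the attribute names (outlook, humidity) and class labels are hypothetical, chosen only for illustration.

class Leaf:
    """Leaf node: holds a class label from domain(C)."""
    def __init__(self, label):
        self.label = label

class Node:
    """Interior node: tests one categorical attribute Ai; one child per value."""
    def __init__(self, attribute, children):
        self.attribute = attribute   # e.g. "outlook"
        self.children = children     # dict: attribute value -> subtree

def classify(tree, row):
    """Walk from the root to a leaf, following the edge whose value
    matches the row's value for the node's attribute."""
    while isinstance(tree, Node):
        tree = tree.children[row[tree.attribute]]
    return tree.label

# Example: a two-level tree over hypothetical weather data.
tree = Node("outlook", {
    "sunny":    Node("humidity", {"high": Leaf("no"), "normal": Leaf("yes")}),
    "overcast": Leaf("yes"),
    "rain":     Leaf("yes"),
})
print(classify(tree, {"outlook": "sunny", "humidity": "high"}))  # -> "no"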

Decision Tree Mining
There are efficient algorithms to mine decision trees. Given a table R(A1, A2, ... An, C), find a decision tree that minimizes the classification error. The constraints are to prefer a smaller tree and to avoid splitting on unique columns (such as keys), which would merely memorize the data.
One algorithm is naive exhaustive enumeration. It involves the following steps: 1. Generate candidate trees. 2. Scan R to compute their classification errors. 3. Select a tree with small classification error and small size. The downside is that its complexity is exponential in the number of columns of R.
A better algorithm is ID3, a greedy algorithm in which each node split reduces entropy. It involves the following steps: 1. Select an attribute for the root node, choosing the attribute with the highest information gain, and partition the tuples over the domain of the chosen attribute. 2. Recurse on each child. 3. As the stopping criterion, do not recurse on nodes with few samples. The advantage of ID3 is that it is faster than exhaustive enumeration, but the disadvantage is that it is a heuristic, i.e. there is no guarantee of solution quality.
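
Below is a minimal sketch of ID3's core computations, assuming each row is a dict from attribute names to values and class_attr names the label column; the min_samples cutoff stands in for the "few samples" stopping criterion and is an arbitrary choice.

import math
from collections import Counter

def entropy(rows, class_attr):
    # H(C): -sum p*log2(p) over the class-label distribution of these rows.
    counts = Counter(row[class_attr] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attr, class_attr):
    # Entropy reduction from partitioning the rows on attr's values.
    parts = {}
    for row in rows:
        parts.setdefault(row[attr], []).append(row)
    remainder = sum(len(p) / len(rows) * entropy(p, class_attr)
                    for p in parts.values())
    return entropy(rows, class_attr) - remainder

def id3(rows, attrs, class_attr, min_samples=2):
    # Greedy recursion: split on the highest-gain attribute; stop on pure
    # nodes, exhausted attributes, or nodes with too few samples.
    labels = Counter(row[class_attr] for row in rows)
    if len(labels) == 1 or not attrs or len(rows) < min_samples:
        return labels.most_common(1)[0][0]           # leaf: majority label
    best = max(attrs, key=lambda a: information_gain(rows, a, class_attr))
    parts = {}
    for row in rows:
        parts.setdefault(row[best], []).append(row)
    rest = [a for a in attrs if a != best]
    return (best, {v: id3(p, rest, class_attr, min_samples)
                   for v, p in parts.items()})       # interior node as a pair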

Clustering

Clustering is a technique for categorization and segmentation of tuples. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find a set of groups of rows of R such that the groups are cohesive and not coupled: the tuples within a group are similar to each other, and the tuples across groups are dissimilar. The constraints are that the number of clusters may be given and that the clusters should be significant.

The choices for similarity measures include distance functions such as Euclidean (the L2 metric), Manhattan, string-edit, and graph distance.
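
A sketch of three of these distance functions in Python (the string-edit distance here is the standard Levenshtein dynamic program):

import math

def euclidean(x, y):
    # L2 distance between two numeric tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # L1 distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def edit_distance(s, t):
    # Levenshtein string-edit distance, row-by-row dynamic programming.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

print(euclidean((0, 0), (3, 4)))           # 5.0
print(manhattan((0, 0), (3, 4)))           # 7
print(edit_distance("kitten", "sitting"))  # 3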

The choices for group representations are: a center, found e.g. as a mean, a medoid, or a mean together with a standard deviation; a Boolean expression on the attributes; or a named subset of the rows.
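
The difference between a mean and a medoid center is easy to see in a small sketch; the medoid is always an actual row of R, while the mean need not be. The example group below is hypothetical.

import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_center(rows):
    # Componentwise mean; may not coincide with any row of R.
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def medoid(rows):
    # The row of R minimizing total distance to all other rows.
    return min(rows, key=lambda r: sum(euclidean(r, s) for s in rows))

group = [(1.0, 2.0), (2.0, 2.0), (9.0, 8.0)]
print(mean_center(group))  # (4.0, 4.0) -- not a member of the group
print(medoid(group))       # (2.0, 2.0) -- always a member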

Cluster Mining

There are efficient algorithms to mine cluster centers. Given a table R(A1, A2, ..., An), find K center-points that minimize the average distance, under a constraint on the number of clusters. The average distance is computed from each row of R to its nearest center-point, and the center-points themselves belong to R.

A naive algorithm is exhaustive enumeration. If the domains of the attributes are small, use the following steps: 1) enumerate the cross-products of the domains of the attributes; 2) enumerate the size-K subsets of tuples; and 3) choose the size-K subset with the smallest average distance to R.
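
A sketch of steps 2 and 3, restricting candidate centers to rows of R as stated above (the cross-product enumeration of step 1 is omitted); the number of candidates is C(n, K), hence exhaustive:

import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def avg_distance(rows, centers):
    # Average distance from each row to its nearest center-point.
    return sum(min(euclidean(r, c) for c in centers) for r in rows) / len(rows)

def exhaustive_k_centers(rows, k):
    # Try every size-K subset of rows as centers: C(n, K) candidates.
    return min(combinations(rows, k), key=lambda cs: avg_distance(rows, cs))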

An alternative idea is the K-means algorithm, which uses a greedy strategy: each iteration reduces the average distance to R. 1. Select K points at random. 2. Assign each tuple of R to the nearest point. 3. Compute the center of each cluster. 4. Repeat steps 2 and 3 until a local minimum is reached. The advantage of K-means is that it is faster than exhaustive enumeration, but it is a heuristic, i.e. there is no guarantee of solution quality.
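
A minimal sketch of these four steps, assuming numeric tuples and Euclidean distance; the iteration cap and seed are arbitrary choices, not part of the algorithm:

import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(rows, k, max_iter=100, seed=0):
    random.seed(seed)
    centers = random.sample(rows, k)  # step 1: K points at random
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for r in rows:  # step 2: assign each tuple to its nearest point
            i = min(range(k), key=lambda i: euclidean(r, centers[i]))
            clusters[i].append(r)
        # step 3: recompute each cluster's mean (keep old center if empty)
        new_centers = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # step 4: stop at a local minimum
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([(1, 1), (1, 2), (8, 8), (9, 8)], k=2)
print(centers)  # one center near (1, 1.5), the other near (8.5, 8)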

Outliers

Outliers are the rows that are most dissimilar from the rest of the relation. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find the rows of R that are dissimilar to most points in R. The objective is to maximize the dissimilarity function, under a constraint on the number of outliers, or on their significance if given.
The choices for similarity measures between rows again include distance functions such as Euclidean (L2), Manhattan, string-edit, and graph distance. The choices for the aggregate dissimilarity measure include the distance to the K nearest neighbours, a neighbourhood density outside the expected range, and the attribute differences with nearby neighbours.

Outlier Mining

A naive algorithm computes the dissimilarity measure for each row of R against the rest of R, then chooses the top-K rows, or those with scores outside the expected range.
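
A sketch of this naive scheme, scoring each row by its average distance to its K nearest neighbours (one of the aggregate measures listed above); note the O(n^2) pairwise comparisons:

import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_scores(rows, k=2):
    # Dissimilarity of each row: average distance to its k nearest
    # neighbours in R (higher = more outlying). O(n^2) comparisons.
    scores = []
    for i, r in enumerate(rows):
        dists = sorted(euclidean(r, s) for j, s in enumerate(rows) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

def top_outliers(rows, num_outliers, k=2):
    ranked = sorted(zip(knn_scores(rows, k), rows), reverse=True)
    return [row for _, row in ranked[:num_outliers]]

print(top_outliers([(1, 1), (1, 2), (2, 1), (9, 9)], num_outliers=1))  # [(9, 9)]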

An alternative idea is clustering-based: use the clusters as a summary of the data to reduce the computational cost. It involves the following steps: 1. cluster R, e.g. via K-means; 2. compute the distance of each tuple in R to its nearest cluster center; and 3. choose the top-K rows, or those with scores outside the expected range.
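
A sketch of these three steps, reusing the kmeans and euclidean helpers from the K-means sketch above; k=3 is an arbitrary choice of summary size.

def cluster_based_outliers(rows, num_outliers, k=3):
    centers, _ = kmeans(rows, k)   # step 1: cluster R as a summary of the data
    scored = sorted(((min(euclidean(r, c) for c in centers), r)  # step 2
                     for r in rows), reverse=True)
    return [r for _, r in scored[:num_outliers]]  # step 3: top-K scores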

The advantage of cluster-based outlier detection is that it is faster, but the disadvantages are the same as with clustering: it is a heuristic (there is no guarantee of solution quality) and it may miss outliers.


 
