Data Mining
Applications use data mining to answer market-analysis questions such as: Which items sell together (the classic diapers-and-beer example)? Who buys iPhones (customer segmentation)? Should this credit card application be approved?
The idea in data mining is to apply a data-driven, inductive, backward technique to identify a model: learn a model from training data, check it against test data, and refine the model if prediction and reality disagree. This differs from forward, deductive methods, which build the model first, deduce conclusions, and only then match those conclusions against the data, refining the model when its predictions and reality mismatch.
Historically, data mining started in statistics and machine learning, where the software is often main-memory based and mines patterns such as classification and clustering. Database researchers contributed scalability to large databases, i.e., datasets too large for main memory. They also introduced new pattern families such as association rules, and integration with SQL, data warehouses, etc.
The conceptual data model involves three tables: items (instance-id, item-type-id, attribute1, attribute2, ...), transactions (xid, subset of item-types, customer-id, timestamp), and item-types (item-type-id, type-name, ...).
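The three tables above can be sketched as plain record types. This is only an illustration of the conceptual model; the class and field names are assumptions mirroring the column lists in the text, not a schema from the original.

```python
from dataclasses import dataclass, field

# Hypothetical record types mirroring the three conceptual tables.
@dataclass(frozen=True)
class ItemType:
    item_type_id: int
    type_name: str

@dataclass
class Item:
    instance_id: int
    item_type_id: int
    attributes: dict = field(default_factory=dict)  # attribute1, attribute2, ...

@dataclass
class Transaction:
    xid: int
    item_type_ids: frozenset  # the subset of item-types bought together
    customer_id: int
    timestamp: float

# One market-basket transaction: customer 7 bought item-types 10 and 20 together.
t = Transaction(xid=1, item_type_ids=frozenset({10, 20}), customer_id=7, timestamp=0.0)
```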
With this data model, a family of patterns called associations can be established. They involve subsets of item-types, for example: a customer buying item-type IT1 also buys item-type IT2. Another example is a sequential association, where customers buying item-type IT1 will later buy item-type IT2.
Another pattern family is categorization/segmentation, which partitions item-types into groups. Unsupervised clustering is the case where the number of groups is not given, as in market segmentation. Supervised classification is the case where an optional category table (instance-id, class-label) supplies the labels; deciding whether to grant or decline a credit card application is an example of supervised classification.
Here are a few more examples and their pattern families:

People buying book b1 at Amazon.com also bought book b2 => association pattern

Smoking causes lung cancer => sequential association pattern

Apparel comprises t-shirts and pants => segmentation

Customers who have a credit rating above 3.0 and a balance over a hundred dollars => supervised classification
Associations are quantified by association rules. Let I be a set of items and T a set of transactions, where each transaction is a subset of I. For an itemset S1 ⊆ I, support(S1) is the fraction of transactions in T that contain S1. For itemsets S1 and S2, the association rule S1 -> S2 has support(S1 -> S2) = support(S1 ∪ S2) and confidence(S1 -> S2) = support(S1 ∪ S2) / support(S1).
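These two definitions are short enough to compute directly. A minimal sketch, assuming transactions are represented as Python frozensets of item names (the diaper/beer basket data is made up for illustration):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(s1, s2, transactions):
    """confidence(S1 -> S2) = support(S1 union S2) / support(S1)."""
    return support(set(s1) | set(s2), transactions) / support(s1, transactions)

transactions = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "milk"}),
    frozenset({"milk"}),
]
print(support({"diaper", "beer"}, transactions))       # 0.5  (2 of 4 baskets)
print(confidence({"diaper"}, {"beer"}, transactions))  # ≈ 0.667 (0.5 / 0.75)
```

Note that confidence({diaper} -> {beer}) is exactly the empirical conditional probability of beer given diaper, which is the statistical comparison the text draws below.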
There are variations on association rules, including hierarchies over items, negative associations, and continuous attributes. Association rules can also be compared with statistical concepts: confidence corresponds to the conditional probability of S2 given S1, and rules can be contrasted with correlation.
Association rules can be viewed at the logical data model level as well as the physical data model level. Logical data models extend the relational model and SQL: patterns are specified over a relational schema, and one can issue a data mining request to extract the patterns that meet certain criteria. An example is the data-cube model for data warehouses. Due to the popularity of this approach, there are common data models for many families of patterns along with syntax for specifying them, and statistical data mining is supported in SQL.
The physical data model matters when efficient algorithms to mine association rules are required. As an example: given an item-set I, a transaction set T, and thresholds, find all association rules S1 -> S2 with support above a given threshold. The naive algorithm is exhaustive enumeration: generate candidate subsets of I, scan T to compute their support, and select the subsets with adequate support. Its downside is complexity exponential in size(I). A better algorithm is the "Apriori" algorithm. It generates candidates in order of size, combining subsets that differ only in the last item, which yields an efficient enumeration. The idea is to prune candidate families as early as possible: support is monotonically non-increasing in candidate size, so any superset of an infrequent itemset can be pruned in each iteration. Sorted candidate sets make the Apriori method efficient, and a hash tree can be used to index candidate sets of different sizes.
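The pruning idea above can be sketched in a few lines. This is a minimal illustration, not a production implementation: it grows frequent itemsets level by level and never extends an itemset that already fell below the support threshold (the join step here unions any two frequent sets into a size-k candidate, a simplification of the strict sorted-prefix join, and it omits the hash-tree indexing the text mentions). The toy basket data is assumed for illustration.

```python
def apriori(transactions, min_support):
    """Return all itemsets whose support is at least min_support.
    Level-wise search: candidates of size k are built only from
    frequent itemsets of size k-1, exploiting support monotonicity."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # Level 1: frequent single items.
    frequent = {frozenset([i]) for i in items
                if sum(1 for t in transactions if i in t) / n >= min_support}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Join step: union pairs of frequent (k-1)-sets into k-sets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: keep only candidates with adequate support.
        frequent = {c for c in candidates
                    if sum(1 for t in transactions if c <= t) / n >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

transactions = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "milk"}),
    frozenset({"milk"}),
]
frequent = apriori(transactions, min_support=0.5)
# {diaper, beer} survives (support 0.5); {beer, milk} is pruned (support 0.25),
# so no size-3 candidate containing {beer, milk} is ever generated.
```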
Note the division of labor: mining algorithms such as association rule mining, clustering, and decision-tree induction operate on the physical data model, while the patterns they produce, association rules, clusters, and decision trees, pertain to the logical data model.