Data Mining
Applications use data mining to answer market-analysis questions such as: Which items sell together (the classic diapers-and-beer example)? Who buys iPhones (customer segmentation)? Should this credit card application be approved?
The idea in data mining is to apply a data-driven, inductive, backward technique to identify a model: learn a model from training data, check it against test data, and refine the model if prediction and reality disagree. This differs from forward, deductive methods, which build the model first, deduce conclusions, and only then match those conclusions against the data, refining the model when its predictions and reality mismatch.
Historically, data mining started in statistics and machine learning, where the software is often main-memory based and mines patterns such as classification and clustering. Database researchers contributed scalability to large databases, i.e., datasets too large for main memory. They also introduced new pattern families such as association rules, and integration with SQL, data warehouses, etc.
The conceptual data model involves three tables: items (instance-id, item-type-id, attribute1, attribute2, ...), transactions (xid, subset of item-types, customer-id, timestamp), and item-types (item-type-id, type-name, ...).
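The three tables above can be sketched as plain record types. This is only an illustration of the conceptual model; the class and field names are assumptions mirroring the column lists in the text, not a schema from the original.

```python
from dataclasses import dataclass, field

# Hypothetical record types mirroring the three conceptual tables.
@dataclass(frozen=True)
class ItemType:
    item_type_id: int
    type_name: str

@dataclass
class Item:
    instance_id: int
    item_type_id: int
    attributes: dict = field(default_factory=dict)  # attribute1, attribute2, ...

@dataclass
class Transaction:
    xid: int
    item_type_ids: frozenset  # the subset of item-types bought together
    customer_id: int
    timestamp: float

# One market-basket transaction: customer 7 bought item-types 10 and 20 together.
t = Transaction(xid=1, item_type_ids=frozenset({10, 20}), customer_id=7, timestamp=0.0)
```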
With this data model, a family of patterns called associations can be established. They involve subsets of item-types, for example: a customer buying item-type IT1 also buys item-type IT2. Another example is a sequential association, where customers buying item-type IT1 will later buy item-type IT2.
Another pattern family is categorization/segmentation, which partitions item-types into groups. Unsupervised clustering is the case where the number of groups is not given, as in market segmentation. Supervised classification is the case where an optional category table (instance-id, class-label) supplies the labels; deciding whether to grant or decline a credit card application is an example of supervised classification.
Here are a few more examples and their pattern families:

People buying book b1 at Amazon.com also bought book b2 => association pattern

Smoking causes lung cancer => sequential association pattern

Apparel comprises t-shirts and pants => segmentation

Customers who have a credit rating above 3.0 and a balance over a hundred dollars => supervised classification
Associations are quantified by association rules. Let I be a set of items and T a set of transactions, where each transaction is a subset of I. For an itemset S1 ⊆ I, support(S1) is the fraction of transactions in T that contain S1. For itemsets S1 and S2, the association rule S1 -> S2 has support(S1 -> S2) = support(S1 ∪ S2) and confidence(S1 -> S2) = support(S1 ∪ S2) / support(S1).
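These two definitions are short enough to compute directly. A minimal sketch, assuming transactions are represented as Python frozensets of item names (the diaper/beer basket data is made up for illustration):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(s1, s2, transactions):
    """confidence(S1 -> S2) = support(S1 union S2) / support(S1)."""
    return support(set(s1) | set(s2), transactions) / support(s1, transactions)

transactions = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "milk"}),
    frozenset({"milk"}),
]
print(support({"diaper", "beer"}, transactions))       # 0.5  (2 of 4 baskets)
print(confidence({"diaper"}, {"beer"}, transactions))  # ≈ 0.667 (0.5 / 0.75)
```

Note that confidence({diaper} -> {beer}) is exactly the empirical conditional probability of beer given diaper, which is the statistical comparison the text draws below.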
There are variations on association rules, including hierarchies over items, negative associations, and continuous attributes. Association rules can also be compared with statistical concepts: confidence corresponds to the conditional probability of S2 given S1, and rules can be contrasted with correlation.
Association rules can be viewed at the logical data model level as well as the physical data model level. Logical data models extend the relational model and SQL: patterns are specified over a relational schema, and one can issue a data mining request to extract the patterns that meet certain criteria. An example is the data-cube model for data warehouses. Due to the popularity of this approach, there are common data models for many families of patterns along with syntax for specifying them, and statistical data mining is supported in SQL.
The physical data model matters when efficient algorithms to mine association rules are required. As an example: given an item-set I, a transaction set T, and thresholds, find all association rules S1 -> S2 with support above a given threshold. The naive algorithm is exhaustive enumeration: generate candidate subsets of I, scan T to compute their support, and select the subsets with adequate support. Its downside is complexity exponential in size(I). A better algorithm is the "Apriori" algorithm. It generates candidates in order of size, combining subsets that differ only in the last item, which yields an efficient enumeration. The idea is to prune candidate families as early as possible: support is monotonically non-increasing in candidate size, so any superset of an infrequent itemset can be pruned in each iteration. Sorted candidate sets make the Apriori method efficient, and a hash tree can be used to index candidate sets of different sizes.
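The pruning idea above can be sketched in a few lines. This is a minimal illustration, not a production implementation: it grows frequent itemsets level by level and never extends an itemset that already fell below the support threshold (the join step here unions any two frequent sets into a size-k candidate, a simplification of the strict sorted-prefix join, and it omits the hash-tree indexing the text mentions). The toy basket data is assumed for illustration.

```python
def apriori(transactions, min_support):
    """Return all itemsets whose support is at least min_support.
    Level-wise search: candidates of size k are built only from
    frequent itemsets of size k-1, exploiting support monotonicity."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # Level 1: frequent single items.
    frequent = {frozenset([i]) for i in items
                if sum(1 for t in transactions if i in t) / n >= min_support}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Join step: union pairs of frequent (k-1)-sets into k-sets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: keep only candidates with adequate support.
        frequent = {c for c in candidates
                    if sum(1 for t in transactions if c <= t) / n >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

transactions = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "milk"}),
    frozenset({"milk"}),
]
frequent = apriori(transactions, min_support=0.5)
# {diaper, beer} survives (support 0.5); {beer, milk} is pruned (support 0.25),
# so no size-3 candidate containing {beer, milk} is ever generated.
```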
Note the division of labor: mining algorithms such as association rule mining, clustering, and decision-tree induction operate on the physical data model, while the patterns they produce, association rules, clusters, and decision trees, pertain to the logical data model.