Sunday, May 26, 2013

SQL Server Analysis Services lets you build mining models to make predictions or analyze your data. Mining model content comprises metadata about the model, statistics about the data, and patterns discovered by the mining algorithm. Depending on the algorithm used, the content may include regression formulas, definitions of rules and itemsets, or weights and other statistics. The structure of the model content can be browsed with the Microsoft Generic Content Tree Viewer provided in SQL Server Data Tools.
The content of each model is presented as a series of nodes. Nodes can contain counts of cases, statistics, coefficients and formulas, definitions of rules, lateral pointers, and XML fragments representing the data. Nodes are arranged in a tree, and what they display depends on the algorithm used. A decision tree model can contain multiple trees, all connected to the model root. A neural network model may contain one or more networks and a statistics node. There are around thirty different mining content node types.
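To make the node tree concrete, here is a minimal Python sketch of a decision tree model whose trees all hang off one model root. The ContentNode class, its field names, and the node type labels are hypothetical and chosen only for illustration; they are not the actual Analysis Services schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContentNode:
    """Hypothetical stand-in for one mining content node."""
    name: str
    node_type: str      # illustrative label, not a real node type code
    support: int = 0    # count of cases behind this node
    children: List["ContentNode"] = field(default_factory=list)


def walk(node, depth=0):
    """Yield (depth, node) pairs depth-first, the order a tree viewer lists them."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


# A decision tree model: several trees, all attached to the single model root.
root = ContentNode("Model root", "Model", 1000, [
    ContentNode("Tree A", "Tree", 1000, [
        ContentNode("Age < 40", "Interior", 600),
        ContentNode("Age >= 40", "Interior", 400),
    ]),
    ContentNode("Tree B", "Tree", 1000),
])

for depth, node in walk(root):
    print("  " * depth + f"{node.name} [{node.node_type}] cases={node.support}")
```

Printing the walk gives an indented outline, roughly what browsing the content tree looks like.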
Mining models are classified by the algorithm they use. These can be association rule models, clustering models, decision tree models, linear regression models, logistic regression models, naïve Bayes models, neural network models, sequence clustering models and time series models.
Queries run on these models can make predictions on new data by applying the model, get a statistical summary of the data used for training, extract patterns and rules, extract regression formulas and other calculations, get the cases that fit a pattern, retrieve details about the individual cases used in the model, and retrain a model by adding new data or perform cross-prediction.
One specific mining model is the clustering model, which is represented by a simple tree structure. It has a single parent node that represents the model and its metadata, and that parent node has a flat list of clusters. Each cluster node carries a count of the cases in the cluster and the distribution of values that distinguish this cluster from other clusters. For example, if we were to describe the distribution of customer demographics, the node distribution table could have attribute names such as age and gender, attribute values such as a number or male or female, support and probability for discrete value types, and variance for continuous data types. Model content also gives the name of the database that holds the model, the number of clusters in the model, the number of cases that support a given node, and more. In clustering, there is no single predictable attribute in the model.
The clustering algorithm that Analysis Services provides is a segmentation algorithm. It iterates over the cases in a data set and separates them into clusters of cases with similar characteristics. After defining the clusters, the algorithm calculates how well they represent the groups of data points and then redefines the clusters to better represent the data. The clustering behavior can be tuned with parameters such as the maximum number of clusters or the amount of support required to create a cluster.
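The define, evaluate and redefine loop described above can be sketched in a few lines of Python. This is plain k-means, used here only as a simple stand-in for the iterative idea and not as the algorithm Analysis Services actually runs; the function names, the cluster_count parameter and the sample customer data are all assumptions made for illustration.

```python
import random


def assign(points, centers):
    """Place each case in the cluster whose center is nearest (squared distance)."""
    clusters = [[] for _ in centers]
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        clusters[dists.index(min(dists))].append(p)
    return clusters


def recenter(clusters, centers):
    """Redefine each cluster as the mean of its cases; keep the old center if empty."""
    updated = []
    for cluster, old in zip(clusters, centers):
        if cluster:
            updated.append(tuple(sum(col) / len(cluster) for col in zip(*cluster)))
        else:
            updated.append(old)
    return updated


def kmeans(points, cluster_count, max_iter=100, seed=0):
    """Alternate assignment and recentering until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, cluster_count)  # initial guess: random cases
    clusters = assign(points, centers)
    for _ in range(max_iter):
        updated = recenter(clusters, centers)
        if updated == centers:
            break
        centers = updated
        clusters = assign(points, centers)
    return centers, clusters


# Hypothetical cases: (age, income in thousands).
customers = [(23, 30), (25, 32), (27, 35), (52, 80), (55, 85), (58, 90)]
centers, clusters = kmeans(customers, cluster_count=2)
for center, cluster in zip(centers, clusters):
    print(center, len(cluster))
```

Here cluster_count plays the role of the "number of clusters" tuning parameter; a minimum-support rule could be added by discarding clusters with too few cases.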
Data for clustering usually has a single key column, one or more input columns, and optionally some predictable columns. Analysis Services also ships a Microsoft Cluster Viewer that shows the clusters in a diagram.
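As an illustration of how cases with a key column and input columns turn into the node distribution of a cluster (support and probability for discrete attributes, variance for continuous ones), here is a small Python sketch. The sample cases are invented, and using the population variance is an assumption; the exact formulas Analysis Services applies may differ.

```python
from collections import Counter
from statistics import mean, pvariance

# Hypothetical cases already assigned to one cluster: (key, age, gender).
cases = [
    (1, 34, "female"),
    (2, 29, "male"),
    (3, 41, "female"),
    (4, 36, "female"),
]


def discrete_distribution(values):
    """Return {value: (support, probability)} for a discrete attribute."""
    counts = Counter(values)
    total = len(values)
    return {value: (n, n / total) for value, n in counts.items()}


def continuous_distribution(values):
    """Return (mean, variance) for a continuous attribute (population variance assumed)."""
    return mean(values), pvariance(values)


gender = discrete_distribution([g for _, _, g in cases])
age = continuous_distribution([a for _, a, _ in cases])

print(gender)
print(age)
```

The key column only identifies a case; it is the input columns (age and gender here) that feed the distribution.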
The model is generally trained on a set of data before it can be used to make predictions. Queries help to make predictions and to get descriptive information on the clusters.
Courtesy: MSDN
