Cluster computing

Sunday, September 7, 2014

Let us first quickly revisit the data-mining algorithms we mentioned in the previous post:
The Microsoft Decision Tree algorithm can be used to predict both discrete and continuous attributes. For discrete attributes, the algorithm makes predictions based on the input column or columns in a dataset. For example, if eight out of ten customers who bought chips were kids, the algorithm decides that age is a factor for the split. Here the chip buyers are tallied against their age, and the mining model makes a decision split for high or low age. Keeping the predictable column, the pattern, and statistics such as counts of the data helps with subsequent queries. For continuous attributes, the algorithm uses linear regression to determine where a decision tree splits.
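To make the split decision concrete, here is a minimal sketch that scores candidate input columns by information gain and picks the best one. Information gain is an assumption chosen for illustration and is not necessarily the scoring method the Microsoft implementation uses; the customer rows and the column names (`age`, `region`, `bought_chips`) are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical data: 8 of the 10 chip buyers are kids, and region is irrelevant.
customers = (
    [{"age": "kid", "region": "north", "bought_chips": True}] * 4
    + [{"age": "kid", "region": "south", "bought_chips": True}] * 4
    + [{"age": "adult", "region": "north", "bought_chips": True}] * 1
    + [{"age": "adult", "region": "south", "bought_chips": True}] * 1
    + [{"age": "kid", "region": "north", "bought_chips": False}] * 1
    + [{"age": "kid", "region": "south", "bought_chips": False}] * 1
    + [{"age": "adult", "region": "north", "bought_chips": False}] * 4
    + [{"age": "adult", "region": "south", "bought_chips": False}] * 4
)

gains = {a: information_gain(customers, a, "bought_chips") for a in ("age", "region")}
best_attribute = max(gains, key=gains.get)
print(best_attribute, gains)
```

With this data, age carries all of the gain and region carries none, so age is the column chosen for the split.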
The Microsoft Naive Bayes algorithm uses Bayes' theorem, assuming the factors involved are independent. The algorithm calculates the probability of every state of each input column, given each possible state of the predictable column. For example, the age of the chip buyers is broken down into age groups. Then, for each possible outcome of the predictable column, it calculates the probability distribution of those age groups. The patterns of the data and the probability, score, and support are used for subsequent queries.
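The conditional probabilities described above can be sketched with plain counting. This is only an illustration under assumed data: the rows, the column names (`age_group`, `bought_chips`), and the function name are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def conditional_probabilities(rows, input_column, predictable_column):
    """Estimate P(input state | predictable state) from raw counts."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[predictable_column]][row[input_column]] += 1
    return {
        outcome: {state: n / sum(states.values()) for state, n in states.items()}
        for outcome, states in counts.items()
    }

# Hypothetical training cases.
rows = (
    [{"age_group": "under 18", "bought_chips": "yes"}] * 8
    + [{"age_group": "18 and over", "bought_chips": "yes"}] * 2
    + [{"age_group": "under 18", "bought_chips": "no"}] * 3
    + [{"age_group": "18 and over", "bought_chips": "no"}] * 7
)

table = conditional_probabilities(rows, "age_group", "bought_chips")
print(table["yes"])  # distribution of age groups among chip buyers
print(table["no"])   # distribution of age groups among non-buyers
```

Each inner dictionary is the probability distribution of the input column for one state of the predictable column, which is the table a Naive Bayes model consults at query time.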
The Microsoft Clustering algorithm identifies the relationships in a dataset and then generates a series of clusters based on them. The model specified by the user identifies the relationships, so no predictable column is required. Taking the example of chip buyers, we can see that the kids form a cluster separate from the others. Further splitting the clusters into age groups yields smaller clusters.
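As a rough illustration of how cases fall into clusters without a predictable column, the sketch below runs plain k-means on a single numeric attribute. Plain k-means is used here as a stand-in and is not claimed to be the exact method of the Microsoft implementation; the ages and the choice of starting centers are assumptions.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one numeric attribute; returns (centers, assignments)."""
    centers = list(centers)
    assignments = []
    for _ in range(iterations):
        # Assign each value to the index of its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            for v in values
        ]
        # Move each center to the mean of the values assigned to it.
        for i in range(len(centers)):
            members = [v for v, a in zip(values, assignments) if a == i]
            if members:
                centers[i] = sum(members) / len(members)
    return centers, assignments

ages = [7, 8, 9, 10, 11, 12, 34, 38, 41, 45]  # hypothetical chip buyers
centers, assignments = kmeans_1d(ages, centers=[min(ages), max(ages)])
print(centers)      # one center for the kids, one for the adults
print(assignments)  # cluster index for each buyer
```

Starting the centers at the smallest and largest age is a simple choice that works for this toy data; a poor starting point can leave a cluster empty.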
The Microsoft Sequence Clustering algorithm is similar to the clustering algorithm mentioned above, but instead of finding groups based on similar attributes, it finds groups based on similar paths in a sequence. A sequence is a series of events or transitions between states in a dataset, as in a Markov chain. Think of the sequences as IDs of any sortable data maintained in a separate table. The sequence for each case is analyzed to form groups.
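The Markov-chain view of a sequence can be sketched by estimating transition probabilities between states. This covers only the transition-counting part, not the grouping of sequences into clusters, and the state names and sequences are invented for the example.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate first-order Markov transition probabilities from sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each (current state, next state) pair along the sequence.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for state, following in counts.items()
    }

# Hypothetical shopping paths, one per customer.
sequences = [
    ["home", "chips", "dips", "checkout"],
    ["home", "chips", "checkout"],
    ["home", "soda", "chips", "dips", "checkout"],
]

probs = transition_probabilities(sequences)
print(probs["chips"])  # where customers go after viewing chips
```

Sequences whose transitions look alike under such a table are the ones a sequence clustering model would tend to place in the same group.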
The Microsoft Neural Network algorithm combines each possible state of the input attribute with each possible state of the predictable attribute. The input attribute values and their probabilities are used to assign weights, which then affect the outcome or predictable value. Generally this needs a large amount of training data.
The Microsoft Association algorithm provides recommendations, as in a market basket analysis. For example, if customers bought chips, then they also bought dips. The support parameter here is the number of cases that contain both chips and dips. The association rules are the output of the algorithm.
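Support for the chips-and-dips rule can be computed directly from the cases. The sketch below also computes confidence, which the post does not mention but which is the usual companion measure for a rule; the baskets and function names are hypothetical.

```python
def support(transactions, items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with `antecedent` that also contain `consequent`."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

# Hypothetical market baskets, one set of items per case.
baskets = [
    {"chips", "dips"},
    {"chips", "dips", "soda"},
    {"chips"},
    {"soda"},
    {"chips", "dips"},
]

rule_support = support(baskets, {"chips", "dips"})
rule_confidence = confidence(baskets, {"chips"}, {"dips"})
print(rule_support, rule_confidence)  # 3 cases; 3 of the 4 chip baskets have dips
```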
The Microsoft Time Series algorithm uses historical information in the data to make predictions for the future. It can also be used for cross prediction: if we train the algorithm with two separate but related series, the resulting model can be used to predict one series based on the other. To predict the time series, the method applies a windowing transformation that turns the dataset into a series suitable for regression analysis, where the past values are the predictor variables and the present value is the target variable.
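The windowing transformation can be sketched in a few lines: each row pairs a fixed number of past values (the predictors) with the value that follows them (the target). The window size, the function name, and the sales figures are assumptions for illustration.

```python
def window_transform(series, window):
    """Turn a series into (predictors, target) rows for regression."""
    return [
        (series[i:i + window], series[i + window])
        for i in range(len(series) - window)
    ]

sales = [10, 12, 13, 15, 18, 21]  # hypothetical historical values
rows = window_transform(sales, window=3)
for predictors, target in rows:
    print(predictors, "->", target)
# [10, 12, 13] -> 15
# [12, 13, 15] -> 18
# [13, 15, 18] -> 21
```

Any regression method can then be trained on these rows, which is what makes the transformed series suitable for regression analysis.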
Let us look at the implementation of the time series algorithm next.
We will specifically look at the ARTXP and ARIMA algorithms.
These are for short-term and long-term predictions respectively. ARTXP is built on autoregressive decision trees, while ARIMA is an autoregressive integrated moving average model. In mixed mode, both algorithms are used, and a parameter (PREDICTION_SMOOTHING) biases the result: 0 uses only ARTXP, 1 uses only ARIMA, and intermediate values blend the two.
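The effect of the bias parameter can be pictured as a weighted blend of two forecasts. The sketch below uses a simple linear blend, assuming 0 selects the short-term (ARTXP) forecast and 1 the long-term (ARIMA) one. The actual blending in the Microsoft implementation is more involved, so treat this only as an illustration of the 0-to-1 bias; the function name and forecast values are hypothetical.

```python
def blend_forecasts(short_term, long_term, smoothing):
    """Linear blend: 0 uses only short_term, 1 uses only long_term."""
    if not 0.0 <= smoothing <= 1.0:
        raise ValueError("smoothing must be between 0 and 1")
    return [
        (1 - smoothing) * s + smoothing * l
        for s, l in zip(short_term, long_term)
    ]

artxp_forecast = [10.0, 20.0]  # hypothetical short-term predictions
arima_forecast = [20.0, 40.0]  # hypothetical long-term predictions

print(blend_forecasts(artxp_forecast, arima_forecast, 0))    # short-term only
print(blend_forecasts(artxp_forecast, arima_forecast, 1))    # long-term only
print(blend_forecasts(artxp_forecast, arima_forecast, 0.5))  # even mix
```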
