Cluster computing

Saturday, October 19, 2013

This post is about the Microsoft basic data mining tutorial on MSDN. The goal is to create a mining database, set up several data mining models, and use the models to make predictions. This feature is provided as part of Analysis Services. I wanted to try out the Microsoft SQL Server 2012 Data Mining Add-Ins together with Excel, but it turns out that I'm running into what looks like an installation bug: the add-in keeps asking for database permissions for my login even when they are all there. It doesn't occur with any other database connections. Nevertheless, I will first lay out my interpretations of the tutorial text and then attempt to work around the bug. Meanwhile, back to the data mining tutorial.
The tutorial teaches us how to create and work with different types of data mining models. It also teaches us how to create a copy of a mining model and apply a filter to it. After the model is complete, we can use it to drill through the results.
At the time of creating the model, we can split the data into training and test sets. Holding out a test set lets us measure the model's accuracy on cases it was not trained on.
Model filters can be applied to both the training and the test sets.
The drill-through results follow from the patterns identified by the mining model, which translate to actions on the data source.
The first step is to use the Analysis Services Multidimensional and Data Mining Project template. We can change the instance where the data mining objects are stored.
The data source can be specified using the connection manager by selecting the Native OLE DB\SQL Server Native Client provider. The tables and views can then be selected from the source database.
The second step is to build a targeted mining model structure. This can be done by choosing "from existing relational database or data warehouse" as the definition method and then choosing, say, Microsoft Decision Trees as the data mining technique. Table types and training data then need to be specified, followed by the data type and content type of each column. The wizard can detect these automatically, but the results may need to be reviewed.
The test data set is then carved out from the sample. The default value is thirty percent and this could be retained as-is.
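To make the holdout idea concrete, here is a minimal sketch of a random 70/30 split in plain Python. It is not how Analysis Services partitions the data internally; the function name `split_holdout`, the seed, and the sample of 1000 integer "cases" are all assumptions made up for illustration.

```python
import random


def split_holdout(rows, test_fraction=0.30, seed=42):
    """Shuffle a copy of rows and return (training, test) lists.

    Hypothetical helper: test_fraction mirrors the thirty percent default
    mentioned in the tutorial; the seed only makes the split repeatable.
    """
    if not 0.0 <= test_fraction < 1.0:
        raise ValueError("test_fraction must be in [0, 1)")
    shuffled = list(rows)                      # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)      # private RNG, no global state
    test_size = int(round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


# Made-up stand-in for the case table: 1000 case ids.
cases = list(range(1000))
training, test = split_holdout(cases)
print(len(training), len(test))
```

Every case lands in exactly one of the two sets, so the model never sees the test cases while it is being trained.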
Adding new models is easy and can be done by switching to the Mining Models tab in SSDT, right-clicking the Structure column, and selecting New Mining Model. Naive Bayes and clustering models can be added similarly.
Each of the models can then be explored. For example, in the decision tree, we can view all the nodes.
The models can then be tested using lift charts. A lift chart helps compare how well each of the models makes predictions and compares the results of each model directly against those of the other models.
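The number behind a lift chart can be sketched by hand. The snippet below is only an illustration of one common definition of lift (the response rate among the top-scored cases divided by the overall response rate), not the exact calculation SSDT performs. The function name `lift_at` and the ten scored cases are made up for the example.

```python
def lift_at(scored_cases, fraction):
    """Lift in the top `fraction` of cases ranked by predicted probability.

    scored_cases: list of (predicted_probability, actual_label) pairs,
    where actual_label is 1 for the target outcome and 0 otherwise.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scored_cases, key=lambda case: case[0], reverse=True)
    top_n = max(1, int(round(len(ranked) * fraction)))
    total_positives = sum(label for _, label in ranked)
    if total_positives == 0:
        raise ValueError("no positive cases, so lift is undefined")
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = total_positives / len(ranked)
    return top_rate / base_rate


# Hypothetical test-set scores from one model: (probability, actual outcome).
scored = [
    (0.90, 1), (0.80, 1), (0.70, 0), (0.60, 1), (0.50, 0),
    (0.40, 0), (0.30, 0), (0.20, 1), (0.10, 0), (0.05, 0),
]
print(lift_at(scored, 0.2))   # top 20% of cases
print(lift_at(scored, 1.0))   # whole population, lift is 1 by definition
```

A lift above 1 in the top slice means the model ranks likely responders ahead of a random ordering. Computing the same figure for each model on the same test set is what allows the side-by-side comparison.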
After the accuracy has been found satisfactory, the prediction query builder can then be used to design and run prediction queries.
