Saturday, May 1, 2021

Data mining techniques

 

Introduction: Data mining techniques add insights to a database that are typically not available from standard query operators. IT operations rely largely on some form of data store or CMDB for their inventory and associated operations. OLE DB for Data Mining (OLE DB for DM) standardized data mining language primitives and became an industry standard. Before it, integrating data mining products was difficult. If one product was written using decision tree classifiers and another with support vector machines, and the two did not share a common interface, then the application had to be rebuilt from scratch. Furthermore, the data that these products analyzed was not always in a relational database, which required data porting and transformation operations.
OLE DB for DM consolidates all of these. It was designed to allow data mining client applications to consume data mining services from a wide variety of data mining software packages. Clients communicate with data mining providers via SQL-style statements.
The OLE DB for Data Mining stack uses Data Mining Extensions (DMX), a SQL-like data mining query language, to talk to different DM providers. DMX statements can be used to create, modify and work with different data mining models. DMX also contains several functions that can be used to retrieve and display statistical information. Furthermore, the data, and not just the interface, is unified: OLE DB connects the data mining providers to data stores such as a cube, a relational database, or a miscellaneous other data source.
The three main operations performed are model creation, model training and model prediction and browsing.
Model creation: A data mining model object is created just like a relational table. The model definition has a few input columns, one or more predictable columns, and the name of the data mining algorithm to be used when the model is later trained by the data mining provider.
Model training: The data are loaded into the model and used to train it. The data mining provider uses the algorithm specified during the creation to search for patterns. These patterns are the model content.
Model prediction and browsing: A select statement is used to consult the data mining model content in order to make model predictions and browse statistics obtained by the model.
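The three operations can be pictured with a small Python analogy. This is only a sketch of the lifecycle, not DMX and not any provider's API: the MiningModel class, its column names, and the trivial "most frequent outcome per input value" algorithm are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


class MiningModel:
    """Toy stand-in for a mining model (hypothetical, not a real provider API)."""

    def __init__(self, input_column, predict_column):
        # Model creation: declare the columns, much like defining a table.
        self.input_column = input_column
        self.predict_column = predict_column
        self.content = defaultdict(Counter)  # learned patterns live here

    def train(self, rows):
        # Model training: scan the cases and count co-occurrences.
        for row in rows:
            self.content[row[self.input_column]][row[self.predict_column]] += 1

    def predict(self, value):
        # Model prediction: return the most frequent outcome seen for this input.
        counts = self.content.get(value)
        if not counts:
            return None
        return counts.most_common(1)[0][0]

    def browse(self):
        # Model browsing: expose the learned statistics as plain dicts.
        return {key: dict(counter) for key, counter in self.content.items()}


# Made-up training cases, for illustration only.
cases = [
    {"gender": "F", "age_band": "30-39"},
    {"gender": "F", "age_band": "30-39"},
    {"gender": "F", "age_band": "20-29"},
    {"gender": "M", "age_band": "40-49"},
]

model = MiningModel(input_column="gender", predict_column="age_band")
model.train(cases)
print(model.predict("F"))
print(model.browse())
```

In the real stack the provider owns the algorithm and the model content; the client only issues statements. The sketch keeps everything in one class purely to show the order of the steps.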
An example of a model can be seen with a case table for customer id, gender and age that contains a nested table of purchases. Each purchase associates an item_name with an item_quantity, and a customer can have more than one purchase. Model columns can be declared with attribute types such as ordered, cyclical, sequence_time, probability, variance, stdev and support.
Model training involves loading the data into the model. The OPENROWSET statement supports querying data from a data source through an OLE DB provider. The SHAPE command enables loading of nested data.
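To make the nested-table idea concrete, here is a minimal Python sketch of what nesting does to flat rows: child purchase rows are attached to their parent customer through a shared key. It mimics the result of a nested load, not the syntax of the shape command. The function shape_cases and the sample rows are hypothetical.

```python
def shape_cases(customers, purchases, key="customer_id"):
    """Attach each customer's purchase rows as a nested 'purchases' list."""
    nested = {}
    for customer in customers:
        case = dict(customer)      # copy so the input rows are not mutated
        case["purchases"] = []     # the nested table starts empty
        nested[customer[key]] = case
    for purchase in purchases:
        parent = nested.get(purchase[key])
        if parent is None:
            continue               # skip purchases with no matching customer
        # Keep only the child columns; the key is implied by the parent.
        child = {k: v for k, v in purchase.items() if k != key}
        parent["purchases"].append(child)
    return list(nested.values())


# Made-up rows, for illustration only.
customers = [
    {"customer_id": 1, "gender": "F", "age": 34},
    {"customer_id": 2, "gender": "M", "age": 45},
]
purchases = [
    {"customer_id": 1, "item_name": "milk", "item_quantity": 2},
    {"customer_id": 1, "item_name": "bread", "item_quantity": 1},
    {"customer_id": 2, "item_name": "coffee", "item_quantity": 3},
]

for case in shape_cases(customers, purchases):
    print(case)
```

Each printed case is one training row for the model: the scalar columns describe the customer and the nested list carries a variable number of purchases.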

When data mining alone is not sufficient in terms of grouping, ranking and sorting, AI techniques are used. The MicrosoftML package provides:
fast linear for binary classification or linear regression
one class SVM for anomaly detection
fast trees for regression
fast forests for churn detection and building multiple trees
neural net for binary and multi-class classification
logistic regression for classifying sentiments from feedback
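As a rough illustration of the last entry, the sketch below fits a logistic regression classifier with plain batch gradient descent in standard-library Python. It does not use the MicrosoftML package or reproduce its interface. The feature choice (counts of positive and negative words per comment), the sample data, and the function names are assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and a bias with plain batch gradient descent."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            # Prediction error for this sample under the current parameters.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        # Step against the average gradient of the log loss.
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the estimated probability is at least 0.5, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical features: [positive word count, negative word count].
samples = [[3, 0], [2, 1], [4, 1], [0, 3], [1, 2], [0, 4]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive feedback, 0 = negative

weights, bias = train_logistic(samples, labels)
print(predict(weights, bias, [5, 0]))
print(predict(weights, bias, [0, 5]))
```

A production package would add regularization, feature extraction from raw text, and faster optimizers; the point here is only the shape of the technique.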

#codingexercise: Canonball.docx 
