Sunday, March 7, 2021

The choice of data mining algorithms:

Several data mining algorithms can be applied to a given dataset, and the right choice is not always obvious. Some exploration of the data is usually necessary before deciding.

If the use case is well articulated, the choice of data mining algorithm is often immediately clear. A use case becomes clear only when both the data and the business objective are well understood. Usually, only the latter is stated, such as the prediction of an attribute associated with the data.


The dataset may also not be suitable for supervised learning unless labels are already given for some of the training data. Otherwise, some technique is required to determine the rules with which to assign labels to the raw data. If such rules already exist for business purposes, then assigning labels is merely an automation task and helps prepare the training set, as sketched below.
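A minimal sketch of that automation step, assuming a hypothetical customer table; the column names, thresholds, and label values here are made up purely for illustration:

```python
import pandas as pd

# Hypothetical raw data; the columns and values are illustrative only.
raw = pd.DataFrame({
    "purchases_last_90d": [0, 4, 12, 1],
    "support_tickets":    [5, 1, 0, 3],
})

# An assumed business rule: few recent purchases plus several support tickets
# marks a row as "at_risk". Applying the rule produces the label column that
# a supervised learner would need.
raw["label"] = ((raw["purchases_last_90d"] < 2) &
                (raw["support_tickets"] >= 3)).map({True: "at_risk", False: "ok"})

print(raw)
```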


In the absence of business rules to assign labels, the dataset for data mining is usually too large to be understood by mere inspection, and some visualization tools become necessary. In this regard, two algorithms stand out for making the task easier. First, a decision tree can be used to find relationships between the attributes and the outcome, and the resulting tree visualizes which attributes are significant. The tree can be pruned to see which attributes matter and which do not. The splits at each level help visualize the relative strength of those attributes across the rows. This is very helpful when the tree is built purely for exploration, before a supervised objective has been settled on.
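As a rough illustration, a shallow tree fitted with scikit-learn can be printed and its attribute importances ranked. The breast-cancer dataset here simply stands in for the reader's own data, and the depth limit acts as a crude form of pruning:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# A small, well-known dataset stands in for the reader's own data.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Limiting the depth keeps only the strongest splits, so the significant
# attributes are easy to read off the printed tree.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Text rendering of the tree, plus the five most important attributes.
print(export_text(tree, feature_names=list(X.columns)))
print(sorted(zip(tree.feature_importances_, X.columns), reverse=True)[:5])
```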


The other algorithm is the Naive Bayes classifier, used to assign classes to the data. This classifier is helpful for exploring the data, finding relationships between the input columns and the predictable column, and then using that initial exploration to drive the choice of further algorithms. Since it treats each column independently for a given row, it evaluates the probabilities of the outcome with and without the attribute in each column.
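A small sketch of that idea, using scikit-learn's BernoulliNB on made-up binary attributes; the per-class conditional probabilities it exposes show how strongly each column leans toward one outcome or the other:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary attributes: each column is a yes/no flag, matching the
# "with and without that attribute" view of Naive Bayes described above.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1],
              [0, 0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

nb = BernoulliNB().fit(X, y)

# Per-class probabilities of each attribute being present; large gaps
# between the two rows flag columns worth a closer look.
print(np.exp(nb.feature_log_prob_))
```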


Together, these algorithms can help with the initial exploration of the data and with choosing the right algorithm for a given purpose. Usually, the split between training data and test data for prediction is 70% for training and 30% for testing. Preprocessing and initial exploration, even after extract-transform-load, help prepare the training data. The better the training, the better the result.
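A brief example of that 70/30 split using scikit-learn's train_test_split, again on a stand-in dataset; the model and metric are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the rows for testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```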
