Wednesday, April 7, 2021

Applications of Data Mining to Reward points collection service

Continuation of use cases:   

Monitoring and pipelines contribute significantly towards streamlining the process and answering questions such as: Why did the model predict this? When was it trained? Who deployed it? Which release was it deployed in? At what time was the production system updated? What were the changes in the predictions? What did the key performance indicators show after the update? Public cloud services have enabled both ML pipelines and their monitoring. Creating a pipeline usually involves configuring a workspace and creating a datastore, downloading and storing sample data, registering and using objects for transferring intermediate data between pipeline steps, downloading and registering the model, creating and attaching the remote compute target, writing a processing script, building the pipeline by setting up the environment and stack necessary to execute the script, creating the configuration to wrap the script, creating the pipeline step with the above-mentioned environment, resources, input and output data, and a reference to the script, and finally submitting the pipeline. Many of these steps are easily automated with the help of built-in objects published by the public cloud services to build and run such a pipeline. A pipeline is a reusable object that can be invoked over the wire with a web request.
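
The steps above can be sketched with a minimal, self-contained stand-in for the pipeline objects that cloud SDKs provide; all class and function names here are invented for illustration and are not a real SDK API:

```python
# Illustrative sketch only: a minimal stand-in for cloud ML pipeline objects.
# Names here are invented for illustration, not a real SDK API.

class PipelineStep:
    """Wraps a processing script (here, a plain function) under a step name."""
    def __init__(self, name, func):
        self.name = name
        self.func = func

class Pipeline:
    """Chains steps so each step's output becomes the next step's input."""
    def __init__(self, steps):
        self.steps = steps

    def submit(self, data):
        # Intermediate data is passed between steps, mimicking the
        # datastore-backed transfer objects described above.
        for step in self.steps:
            data = step.func(data)
        return data

# Example: prepare data, then score it with a stand-in "model".
prepare = PipelineStep("prepare", lambda rows: [r / 100.0 for r in rows])
score = PipelineStep("score", lambda rows: [1 if r > 0.5 else 0 for r in rows])
pipeline = Pipeline([prepare, score])
predictions = pipeline.submit([30, 80, 55])
```

A real cloud pipeline adds the workspace, datastore, and compute-target configuration around this same step-chaining idea, and exposes `submit` behind a web request.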

Machine learning services collect the same kinds of monitoring data as the other public cloud resources. These logs, metrics, and events can then be collected, routed, and analyzed to tune the machine learning model.

Other than the platform metrics that help monitor and troubleshoot issues with the production deployment of machine learning systems, the model itself may have performance and quality metrics that can be used to evaluate and tune it. These metrics and key performance indicators can be domain-specific, such as accuracy, which is the ratio of the number of correct predictions to the number of total predictions; the confusion matrix of positive and negative predictions for all the class labels in classification; the Receiver Operating Characteristic (ROC) curve and the area under it (AUC); the F1 score, which combines precision and recall; cross-entropy loss; and mean squared error and mean absolute error. These post-processing steps for evaluating predictions are just as important as the data preparation steps for a well-performing model.
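
As a sketch, the accuracy, precision/recall/F1, and error metrics named above can be computed in a few lines of plain Python; the sample labels are invented:

```python
# Minimal sketch of the evaluation metrics named above, in plain Python.

def accuracy(y_true, y_pred):
    # Ratio of correct predictions to total predictions.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
acc = accuracy(y_true, y_pred)                  # 3 of 5 correct -> 0.6
prec, rec, f1 = precision_recall_f1(y_true, y_pred)
```

In practice a library such as scikit-learn provides these, but the arithmetic is exactly this.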

The following chart makes a comparison of all the data mining algorithms including the neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e


Thank you.


 

Tuesday, April 6, 2021

 

Applications of Data Mining to Reward points collection service

Continuation of use cases:   

In the previous section, we discussed the training data and the deployment of the trained model. This does not complete the production system. On the contrary, it is just the beginning of the lifecycle for that model. Over the time that the model is used for prediction, its accuracy or predictive power may deteriorate. This occurs due to one of the following three categories of changes: changes in the concept, changes to the data, and changes in the upstream systems. The first reflects changes to the assumptions made when building the model. As with all business requirements, they may change over time, and the assumptions made earlier may not hold true or may need to be improved. For example, a fraud detection model may have encapsulated a set of policies that need to be changed, or a statistical model may have made assumptions about the prediction variable that need to be redefined. The second type of deterioration comes from differences in training and test data. Usually, a 70/30 percentage split allows us to find and overcome the eccentricities in the data, but the so-called test data is real-world data that arrives continuously, unlike the frozen training data. It might change over time or show preferences and variations that were not known earlier. Such change requires the model to be tuned. Lastly, the upstream data changes can be operational changes that alter the data quality and consequently impact the model. These changes and the deterioration they cause to the model are collectively called drift, and the ways to overcome drift include ways to measure and actively improve the model. The metrics are called model performance metrics and model quality metrics.
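
A minimal sketch of measuring such drift: compare a live feature's distribution against the training baseline and flag when the shift exceeds a threshold. The z-score-style test and the threshold are illustrative choices, and the data is invented:

```python
# Sketch of a simple drift check: flag when the live mean of a feature
# moves more than z_threshold training-set standard deviations away
# from the training mean. Real systems use richer statistical tests.

import statistics

def detect_drift(train_values, live_values, z_threshold=3.0):
    mean = statistics.mean(train_values)
    stdev = statistics.stdev(train_values)
    live_mean = statistics.mean(live_values)
    z = abs(live_mean - mean) / (stdev or 1.0)  # guard against zero spread
    return z > z_threshold

train = [10, 12, 11, 13, 12, 11, 10, 12]   # frozen training feature values
stable = [11, 12, 10, 13]                   # live data, no drift
shifted = [25, 27, 26, 24]                  # live data after an upstream change
```

When the check fires, the model is a candidate for retraining or tuning as described above.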

Monitoring and pipelines contribute significantly towards streamlining the process and answering questions such as: Why did the model predict this? When was it trained? Who deployed it? Which release was it deployed in? At what time was the production system updated? What were the changes in the predictions? What did the key performance indicators show after the update? Public cloud services have enabled both ML pipelines and their monitoring. Creating a pipeline usually involves configuring a workspace and creating a datastore, downloading and storing sample data, registering and using objects for transferring intermediate data between pipeline steps, downloading and registering the model, creating and attaching the remote compute target, writing a processing script, building the pipeline by setting up the environment and stack necessary to execute the script, creating the configuration to wrap the script, creating the pipeline step with the above-mentioned environment, resources, input and output data, and a reference to the script, and finally submitting the pipeline. Many of these steps are easily automated with the help of built-in objects published by the public cloud services to build and run such a pipeline. A pipeline is a reusable object that can be invoked over the wire with a web request.

Machine learning services collect the same kinds of monitoring data as the other public cloud resources. These logs, metrics and events can then be collected, routed, and analyzed to tune the machine learning model.

The following chart makes a comparison of all the data mining algorithms including the neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e

Thank you.

 

 

Monday, April 5, 2021

 Applications of Data Mining to Reward points collection service  

Continuation of use cases:    


One of the highlights of machine learning deployments, as opposed to the deployment of data mining models, is that the model can be built and tuned in one place and run anywhere else. The client-friendly version of TensorFlow allows the model to run on clients as resource-constrained as mobile devices. The environment for model building usually supports GPUs. This works well to create a production pipeline where the data can be sent to the model independent of where the training data was kept and used to train the model. Since the training data flows into the training environment, its pipeline is internal. The test data can be sent to the model wherever it is hosted, over the wire as web requests. The model can be run in containers on the server side or even in the browser on the client side. Another highlight of the difference between the ML pipeline and the data mining pipeline is the heterogeneous mix of technologies and products on the ML side as opposed to the homogeneous relational database-based stack on the data mining side. For example, logs, streams, and events may be streamed into the production pipeline via Apache Kafka, processed using Apache Flink, the kernels built with SciKit, Keras, or Spark-ML, and the trained models run in containers taking and responding to web requests.
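
The web-request path can be sketched as follows: features are serialized into a JSON request body, and the hosted model's reply is parsed back. The payload shape here is an assumption for illustration, and no actual HTTP call is made:

```python
# Sketch of sending test data to a hosted model over the wire.
# The {"instances": ...} / {"predictions": ...} payload shape is an
# illustrative assumption, not a specific service's contract.

import json

def build_request(features):
    # The hosted model (in a container, or even in a browser) receives this body.
    return json.dumps({"instances": [features]})

def parse_response(body):
    return json.loads(body)["predictions"][0]

request_body = build_request({"points": 120, "grants": 4})
# A container-hosted model might respond with a body like this:
response_body = '{"predictions": [0.87]}'
score = parse_response(response_body)
```

The same request body works regardless of where the model runs, which is the portability point made above.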

In the previous section, we discussed the training data and the deployment of the trained model. This does not complete the production system. On the contrary, it is just the beginning of the lifecycle for that model. Over the time that the model is used for prediction, its accuracy or predictive power may deteriorate. This occurs due to one of the following three categories of changes: changes in the concept, changes to the data, and changes in the upstream systems. The first reflects changes to the assumptions made when building the model. As with all business requirements, they may change over time, and the assumptions made earlier may not hold true or may need to be improved. For example, a fraud detection model may have encapsulated a set of policies that need to be changed, or a statistical model may have made assumptions about the prediction variable that need to be redefined. The second type of deterioration comes from differences in training and test data. Usually, a 70/30 percentage split allows us to find and overcome the eccentricities in the data, but the so-called test data is real-world data that arrives continuously, unlike the frozen training data. It might change over time or show preferences and variations that were not known earlier. Such change requires the model to be tuned. Lastly, the upstream data changes can be operational changes that alter the data quality and consequently impact the model. These changes and the deterioration they cause to the model are collectively called drift, and the ways to overcome drift include ways to measure and actively improve the model. The metrics are called model performance metrics and model quality metrics.


The following chart makes a comparison of all the data mining algorithms including the neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e 

Thank you. 

 

Sunday, April 4, 2021

 

Applications of Data Mining to Reward points collection service

Continuation of use cases:   

Neural networks can be applied in layers, and they can be combined with regressors, so the technique can be used for a variety of use cases. There are four different types of neural network layers. The fully connected layer connects every neuron in one layer to every neuron in another layer. This is great for rigorous encoding, but it becomes expensive for large inputs and does not scale well. The convolutional layer is mostly used as a filter that brings out salient features from the input set. The filter, sometimes called a kernel, is represented by a set of n-dimensional weights and describes the probabilities that a given pattern of input values represents a feature. A deconvolutional layer comes from a transposed convolutional process where the data is enhanced to increase resolution or to transform it. A recurrent layer includes a looping capability such that its input consists of both the data to analyze and the output from a previous calculation performed by that layer. This is helpful for maintaining state across iterations and for transforming one sequence to another.

The choice of machine learning technique depends both on the applicability of the algorithm and on the data. For example, we use a Convolutional Neural Network when we want to perform only classification. We use a Recurrent Neural Network when we want to retain state between encodings, such as with sequences. We use a classifier and a regressor together when we want to detect objects and their bounding boxes. The choices also vary with the data. CNNs work great with tensors that are distinct and independent from one another. The output using tensors for a K-Nearest Neighbors model consists of the label with the most confidence, which is a statistical parameter based on the support for the label, a class index, and a score set for the confidence associated with each label. Scalar data works very well for matrices and matrix operations. RNNs work well with sequences of inputs.

One of the highlights of machine learning deployments, as opposed to the deployment of data mining models, is that the model can be built and tuned in one place and run anywhere else. The client-friendly version of TensorFlow allows the model to run on clients as resource-constrained as mobile devices. The environment for model building usually supports GPUs. This works well to create a production pipeline where the data can be sent to the model independent of where the training data was kept and used to train the model. Since the training data flows into the training environment, its pipeline is internal. The test data can be sent to the model wherever it is hosted, over the wire as web requests. The model can be run in containers on the server side or even in the browser on the client side. Another highlight of the difference between the ML pipeline and the data mining pipeline is the heterogeneous mix of technologies and products on the ML side as opposed to the homogeneous relational database-based stack on the data mining side. For example, logs, streams, and events may be streamed into the production pipeline via Apache Kafka, processed using Apache Flink, the kernels built with SciKit, Keras, or Spark-ML, and the trained models run in containers taking and responding to web requests.

The following chart makes a comparison of all the data mining algorithms including the neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e

Thank you.

Saturday, April 3, 2021

 

Applications of Data Mining to Reward points collection service

Continuation of use cases:   

Machine learning techniques form an altogether separate category of their own. Traditional data mining methods used clustering and statistics, which are also relevant to machine learning, but we did not include neural networks with data mining, so we call them out with other techniques in this category. Machine learning is very helpful for informing users about the activities that generate the most appreciation and how these activities change depending on the audience. It can also detect fraud in employee appreciations, which may be of interest to employers. For example, Feedzai uses real-time behavioral profiling as well as historical profiling, which has been shown to detect 61% more fraud than earlier approaches. Discovering groups, searching, and ranking are a few more examples.

Regions of interest are used to focus on appreciation activity in space and time. This is helpful for detecting events that would otherwise have gone unnoticed as a flurry of activity on the reward points table. Together with a classifier and a regressor, latent events can be detected, eliminating the need to hold formal events and determine winners.

One of the benefits of using neural networks with employee appreciation data is that management can gain insights that would not otherwise have been possible through formal interactions. By classifying reward points based on vector features and using softmax classification, neural networks can detect the hidden appreciation. Each neuron assigns a weight, usually based on probability, to each feature, and the weights are normalized across neurons, resulting in a weight matrix that articulates the underlying model in the training dataset. The model can then be used with a test dataset to predict outcome probabilities. Neurons are organized in layers, and each layer is independent of the others and can be stacked so that the output of one becomes the input of the next. This is a technique that has found applications in a variety of domains, starting with natural language processing.
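
The softmax step mentioned above can be sketched in plain Python: raw per-class scores are normalized into probabilities that sum to 1, and the class with the highest probability is the predicted label. The scores here are invented:

```python
# Minimal sketch of softmax classification over per-class scores.

import math

def softmax(scores):
    # Subtracting the max keeps exp() numerically stable.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]              # e.g. per-category appreciation scores
probs = softmax(scores)               # probabilities summing to 1
predicted = probs.index(max(probs))   # class with the highest probability
```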

Neural networks can be applied in layers, and they can be combined with regressors, so the technique can be used for a variety of use cases. There are four different types of neural network layers. The fully connected layer connects every neuron in one layer to every neuron in another layer. This is great for rigorous encoding, but it becomes expensive for large inputs and does not scale well. The convolutional layer is mostly used as a filter that brings out salient features from the input set. The filter, sometimes called a kernel, is represented by a set of n-dimensional weights and describes the probabilities that a given pattern of input values represents a feature. A deconvolutional layer comes from a transposed convolutional process where the data is enhanced to increase resolution or to transform it. A recurrent layer includes a looping capability such that its input consists of both the data to analyze and the output from a previous calculation performed by that layer. This is helpful for maintaining state across iterations and for transforming one sequence to another.
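
The fully connected layer above can be sketched as a plain matrix-vector computation, which also shows why it grows expensive for large inputs (one weight per input-output pair); the tiny weights and inputs here are invented:

```python
# Minimal sketch of a fully connected layer: every input neuron feeds
# every output neuron through a weight matrix, plus a per-output bias.

def fully_connected(inputs, weights, biases):
    # One output per row of weights: dot(inputs, row) + bias.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

inputs = [1.0, 2.0]
weights = [[0.5, -1.0],   # weights into output neuron 0
           [2.0, 0.25]]   # weights into output neuron 1
biases = [0.0, 1.0]
outputs = fully_connected(inputs, weights, biases)
```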

The choice of machine learning technique depends both on the applicability of the algorithm and on the data.

Friday, April 2, 2021

Applications of Data Mining to Reward points collection service  

Continuation of use cases:    

Collaborative filtering can be applied via item-based filtering as well. This is a different use case from the one cited earlier for user-based filtering, in that item-based filtering avoids divulging the users in the participant group and instead focuses on item similarity from a lookup table, which makes it fast albeit storage-expensive. In both cases, similarity scores are computed, but this approach allows us to answer the question of whether a set of grants is similar to others, which helps us rank them. This is useful for sparse datasets, which are typical of the matrix of appreciation across users.


Sequence clustering provides insights into the activities that generated the appreciation because it determines patterns across users and grants by finding paths in sequences. A sequence is a series of events, such as a set of appreciations in the form of reward point grants. This kind of sequence analysis helps us understand the activities that were most popular for appreciation purposes between employees and target those actively on other forums. Sequence clustering is a data-driven approach. It helps with determining sequences from existing appreciation activities.


Machine learning techniques form an altogether separate category of their own. Traditional data mining methods used clustering and statistics, which are also relevant to machine learning, but we did not include neural networks with data mining, so we call them out with other techniques in this category. Machine learning is very helpful for informing users about the activities that generate the most appreciation and how these activities change depending on the audience. It can also detect fraud in employee appreciations, which may be of interest to employers. For example, Feedzai uses real-time behavioral profiling as well as historical profiling, which has been shown to detect 61% more fraud than earlier approaches. Discovering groups, searching, and ranking are a few more examples.


Regions of interest are used to focus on appreciation activity in space and time. This is helpful for detecting events that would otherwise have gone unnoticed as a flurry of activity on the reward points table. Together with a classifier and a regressor, latent events and awardees can be detected, eliminating the need to hold formal events and determine winners.


Conclusion: There are several algorithms in data mining that are applicable to the Reward points repository.  

Thursday, April 1, 2021

 

Applications of Data Mining to Reward points collection service

Continuation of use cases: 

Collaborative filtering is another use case where the binary conditions apply. This is particularly useful when there are multiple participants in a group whose opinions determine the best grant of reward points. In the earlier approaches, the algorithms articulated conditions. In this algorithm, we avoid the use of conditions and replace them with ratings. The participants in the group can be selected such that they form a diverse set or a cohesive set, depending on the purpose. The calculation of grants based on existing reward points can be determined with the help of this opinion group, and it helps to avoid many of the pitfalls of the logic associated with conditions, including the disclosure of rules, taking advantage of the rules, and circumventing them.
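
A minimal sketch of this rating-based approach, assuming user-based collaborative filtering with cosine similarity: a candidate grant's score is the similarity-weighted average of the opinion group's ratings. All ratings here are invented:

```python
# Sketch of user-based collaborative filtering: ratings replace hand-written
# conditions, and similar users' opinions weigh more in the prediction.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def predict_rating(target_ratings, others):
    # others: list of (ratings_vector, rating_for_candidate_grant)
    num = sum(cosine(target_ratings, r) * rating for r, rating in others)
    den = sum(abs(cosine(target_ratings, r)) for r, _ in others)
    return num / den if den else 0.0

target = [5, 3, 4]                    # target user's ratings on shared items
others = [([5, 3, 4], 4.0),           # very similar user rated the grant 4
          ([1, 5, 2], 2.0)]           # dissimilar user rated it 2
score = predict_rating(target, others)
```

Because the similar user dominates the weighting, the predicted score lands much closer to 4 than to 2.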

Hierarchical clustering is helpful when we want to cluster the reward points to match the organizational hierarchy, to give credit to the manager when their reporting employees do well. This is a standard practice in many companies. It may not be evident from the flat, independent grants assigned to individuals that the reward points can be grouped based on the hierarchy to which the user belongs. The distance between members in the organizational hierarchy can also be used as a metric to determine the hierarchical clustering of reward point grants.
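
A minimal sketch of such a hierarchy-based distance metric, assuming distance is the number of hops to the lowest common manager in the reporting tree; the org chart is invented:

```python
# Sketch of an organizational-hierarchy distance usable as the metric
# for hierarchical clustering of reward point grants.

managers = {"alice": "carol", "bob": "carol", "carol": "erin",
            "dave": "erin", "erin": None}

def chain(emp):
    # Path from an employee up to the root of the org chart.
    path = [emp]
    while managers[emp] is not None:
        emp = managers[emp]
        path.append(emp)
    return path

def org_distance(a, b):
    pa, pb = chain(a), chain(b)
    common = next(x for x in pa if x in pb)   # lowest common manager
    return pa.index(common) + pb.index(common)

d_same_team = org_distance("alice", "bob")   # siblings under carol
d_far = org_distance("alice", "dave")        # only meet at erin
```

Grants whose owners have a small `org_distance` merge into the same cluster first, so appreciation rolls up naturally along reporting lines.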

Collaborative filtering can be applied via item-based filtering as well. This is a different use case from the one cited earlier for user-based filtering, in that item-based filtering avoids divulging the users in the participant group and instead focuses on item similarity from a lookup table, which makes it fast albeit storage-expensive. In both cases, similarity scores are computed, but this approach allows us to answer the question of whether a set of grants is similar to others, which helps us rank them. This is useful for sparse datasets, which are typical of the matrix of appreciation across users.
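
The item-item lookup table can be sketched as follows: each grant is a column of the sparse user-by-grant appreciation matrix, and pairwise cosine similarities are precomputed into a table (fast at query time, storage-expensive). The matrix here is invented:

```python
# Sketch of item-based filtering: precompute an item-item similarity
# lookup table over the columns of the user-by-grant appreciation matrix.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Rows are users, columns are grants; 0 means no appreciation recorded.
matrix = [[5, 0, 5],
          [4, 1, 4],
          [0, 5, 1]]
grants = list(zip(*matrix))   # transpose: one vector per grant

# Precomputed item-item similarity lookup table, keyed by grant pair.
similarity = {(i, j): cosine(grants[i], grants[j])
              for i in range(len(grants)) for j in range(len(grants))}
```

Ranking a grant then reduces to lookups in `similarity` instead of scanning user data, which is why no user identities need to be divulged at query time.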

Sequence clustering provides insights into the activities that generated the appreciation because it determines patterns across users and grants by finding paths in sequences. A sequence is a series of events, such as a set of appreciations in the form of reward point grants. This kind of sequence analysis helps us understand the activities that were most popular for appreciation purposes between employees and target those actively on other forums. Sequence clustering is a data-driven approach. It helps with determining sequences from existing appreciation activities.

Conclusion: There are several algorithms in data mining that are applicable to the Reward points repository.