Saturday, April 10, 2021

 

Applications of Data Mining to Reward points collection service 

Continuation of use cases:    

Drift is monitored by joining the actual product quality labels with the predicted quality labels and summarizing them over a time window to trend model quality. Multiple such KPIs can be used to cover the scoring criteria. For example, a lagging indicator could determine whether the actual labels arrive delayed compared to the predicted labels. Thresholds can be set on that delay to capture the acceptable business requirements and to trigger a notification or alert.
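As a concrete illustration, here is a minimal sketch of that join-and-window computation using pandas. The column names (item_id, actual_quality, predicted_quality, timestamp), the one-hour window, and the alert threshold are assumptions made for the example, not part of the original pipeline.

```python
import pandas as pd

def windowed_quality(actuals: pd.DataFrame, predictions: pd.DataFrame,
                     window: str = "1h", alert_threshold: float = 0.85) -> pd.DataFrame:
    """Join actual and predicted quality labels and trend accuracy per time window."""
    joined = actuals.merge(predictions, on="item_id", suffixes=("_actual", "_predicted"))
    joined["correct"] = joined["actual_quality"] == joined["predicted_quality"]
    # Summarize over a time window to trend model quality.
    trend = (joined.set_index("timestamp")["correct"]
                   .resample(window)
                   .mean()
                   .rename("windowed_accuracy")
                   .to_frame())
    # Flag windows that fall below the acceptable business threshold.
    trend["alert"] = trend["windowed_accuracy"] < alert_threshold
    return trend
```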

  

Ranking and scoring are central and critically important to this pipeline, and the choice of algorithms for them can make a huge difference in how the model makes predictions for new data. If the scoring were sensitive to drift in concept, data, and upstream systems, it would be more accurate, more consistent, and less prone to deterioration.

  

Real-world data arrives continuously; hence the performance and the quality of the model need to be evaluated continuously. Performance and quality provide independent considerations for tuning the model, so both are needed. The former can be reported by the platform on which the model runs, but the latter is more domain- and model-specific, so it must be decided before the model is deployed and made part of the scoring pipeline.
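To make the distinction concrete, a minimal sketch of a scoring step that records both signals is shown below: the latency measurement stands in for a platform performance metric, and a weighted F1 score stands in for a domain-specific quality metric. The model.predict interface and the choice of F1 are assumptions for illustration.

```python
import time
from sklearn.metrics import f1_score

def score_batch(model, features, actual_labels=None):
    """Score a batch and record both a performance signal and a quality signal."""
    start = time.perf_counter()
    predicted = model.predict(features)
    latency_ms = (time.perf_counter() - start) * 1000.0  # performance: platform-level signal

    quality = None
    if actual_labels is not None:
        # Quality: a domain-specific metric chosen before deployment.
        quality = f1_score(actual_labels, predicted, average="weighted")

    return predicted, {"latency_ms": latency_ms, "f1_weighted": quality}
```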

  

The pipeline itself becomes a reusable asset that can be used in automation and remote invocations. This makes it easy to compare and deploy model variations, so that we are not just tuning the same model but can also evaluate variations side by side. All models deteriorate over time, but a refreshed model tends to pull up the quality over a longer duration.
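A minimal sketch of such a remote invocation is shown below. The endpoint URL, request body, and response fields are placeholders that depend on the platform hosting the published pipeline; they are assumptions for illustration only.

```python
import requests

# Hypothetical REST endpoint for a published scoring pipeline; the real URL,
# authentication scheme, and payload shape depend on the hosting platform.
PIPELINE_ENDPOINT = "https://example.com/pipelines/reward-quality/runs"

def trigger_pipeline(auth_token: str, experiment_name: str, model_version: str) -> str:
    """Invoke the published pipeline over the wire and return its run identifier."""
    response = requests.post(
        PIPELINE_ENDPOINT,
        headers={"Authorization": f"Bearer {auth_token}"},
        json={"ExperimentName": experiment_name,
              "ParameterAssignments": {"model_version": model_version}},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["Id"]

# Running the same pipeline against two model versions allows side-by-side evaluation:
# run_a = trigger_pipeline(token, "reward-quality", "champion")
# run_b = trigger_pipeline(token, "reward-quality", "challenger")
```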

 

When the machine learning pipeline is executed well, the following goals are achieved. First, the data remains immutable. Second, the business logic is written once. Finally, the data is available in real time.

Model deterioration and drift are avoided without requiring updates to the data or the logic.


The following chart makes a comparison of all the data mining algorithms including the neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e 

Thank you. 


A coding exercise: https://tinyurl.com/nnj5vd8v 

 

Friday, April 9, 2021

  Applications of Data Mining to Reward points collection service

Continuation of use cases:   

Features and labels are helpful for the evaluation of the model as well. When the data comes from IoT sensors, it is typically streaming in nature. This makes it suitable for streaming stacks such as Apache Kafka and Apache Flink. The production models are loaded into a scoring pipeline to get the predicted product quality.
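A minimal sketch of such a scoring consumer is shown below, using the kafka-python client. The topic name, broker address, message fields, and model artifact name are assumptions; the loaded model stands in for whatever production model is being served.

```python
import json
import joblib
from kafka import KafkaConsumer  # pip install kafka-python

# Assumed topic, broker, and model artifact names, for illustration only.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
model = joblib.load("product_quality_model.joblib")  # production model loaded into the scoring pipeline

for message in consumer:
    reading = message.value
    features = [[reading["temperature"], reading["pressure"], reading["vibration"]]]
    predicted_quality = model.predict(features)[0]
    # Downstream, these predicted labels are joined with actual labels to monitor drift.
    print(reading["item_id"], predicted_quality)
```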

Drift is monitored by joining the actual product quality labels with the predicted quality labels and summarizing them over a time window to trend model quality. Multiple such KPIs can be used to cover the scoring criteria. For example, a lagging indicator could determine whether the actual labels arrive delayed compared to the predicted labels. Thresholds can be set on that delay to capture the acceptable business requirements and to trigger a notification or alert.

Ranking and scoring are central and critically important to this pipeline, and the choice of algorithms for them can make a huge difference in how the model makes predictions for new data. If the scoring were sensitive to drift in concept, data, and upstream systems, it would be more accurate, more consistent, and less prone to deterioration.

Real-world data arrives continuously; hence the performance and the quality of the model need to be evaluated continuously. Performance and quality provide independent considerations for tuning the model, so both are needed. The former can be reported by the platform on which the model runs, but the latter is more domain- and model-specific, so it must be decided before the model is deployed and made part of the scoring pipeline.

The pipeline itself becomes a reusable asset that can be used in automation and remote invocations. This makes it easy to compare and deploy model variations, so that we are not just tuning the same model but can also evaluate variations side by side. All models deteriorate over time, but a refreshed model tends to pull up the quality over a longer duration.

The following chart makes a comparison of all the data mining algorithms including the neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e


Thank you.

Thursday, April 8, 2021

 Applications of Data Mining to Reward points collection service

Continuation of use cases:   

Other than the platform metrics that help monitor and troubleshoot issues with the production deployment of machine learning systems, the model itself may have performance and quality metrics that can be used to evaluate and tune it. These metrics and key performance indicators can be domain-specific, such as accuracy, which is the ratio of the number of correct predictions to the total number of predictions; the confusion matrix of positive and negative predictions for all the class labels in classification; the Receiver Operating Characteristic (ROC) curve and the area under it (AUC); the F1 score, which combines precision and recall; cross-entropy loss; mean squared error; and mean absolute error. These steps for the post-processing of predictions are just as important as the data preparation steps for a well-performing model.
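A minimal sketch of computing these KPIs with scikit-learn is shown below; it assumes a binary classifier with predicted probabilities available for the ROC/AUC and log-loss calculations, and a separate regressor for the error metrics.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, roc_auc_score,
                             f1_score, log_loss, mean_squared_error,
                             mean_absolute_error)

def classification_kpis(y_true, y_pred, y_score):
    """Quality KPIs for a binary classifier; y_score holds predicted probabilities."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
        "roc_auc": roc_auc_score(y_true, y_score),
        "f1": f1_score(y_true, y_pred),
        "log_loss": log_loss(y_true, y_score),
    }

def regression_kpis(y_true, y_pred):
    """Quality KPIs for a regressor."""
    return {
        "mse": mean_squared_error(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
    }
```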

One of the often-overlooked aspects of deploying machine learning models in production is that good infrastructure alleviates the concerns of model deployment, tuning, and performance improvement. A platform built on a containerization framework such as Docker and a container orchestration framework such as Kubernetes removes the concerns about load balancing, scalability, HTTP proxying, ingress control, and monitoring, because the model hosted in a container can be scaled out to as many running instances as needed while the infrastructure is better positioned to meet the demands of peak load with no semantic changes to the model.
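For instance, a minimal sketch of a scoring service that could be packaged into such a container is shown below, using Flask. The model file name and request schema are assumptions for illustration; load balancing, ingress, and scale-out are left entirely to the orchestration layer.

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("product_quality_model.joblib")  # assumed model artifact baked into the image

@app.route("/predict", methods=["POST"])
def predict():
    # The container exposes a single scoring endpoint; replicas are added by the orchestrator.
    payload = request.get_json()
    features = [payload["features"]]
    prediction = model.predict(features)[0]
    return jsonify({"predicted_quality": str(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```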

Features and labels are helpful for the evaluation of the model as well. When the data comes from IoT sensors, it is typically streaming in nature. This makes it suitable for streaming stacks such as Apache Kafka and Apache Flink. The production models are loaded into a scoring pipeline to get the predicted product quality.

Drift is monitored by joining the actual product quality labels with the predicted quality labels and summarizing them over a time window to trend model quality. Multiple such KPIs can be used to cover the scoring criteria. For example, a lagging indicator could determine whether the actual labels arrive delayed compared to the predicted labels. Thresholds can be set on that delay to capture the acceptable business requirements and to trigger a notification or alert.


The following chart makes a comparison of all the data mining algorithms including the neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e


Thank you.

Wednesday, April 7, 2021

Applications of Data Mining to Reward points collection service

Continuation of use cases:   

Monitoring and the pipeline contribute significantly towards streamlining the process and answering questions such as: Why did the model predict this? When was it trained? Who deployed it? In which release was it deployed? At what time was the production system updated? What were the changes in the predictions? What did the key performance indicators show after the update? Public cloud services have enabled both ML pipelines and their monitoring. The steps involved in creating a pipeline usually include configuring a workspace and creating a datastore; downloading and storing sample data; registering and using objects for transferring intermediate data between pipeline steps; downloading and registering the model; creating and attaching the remote compute target; writing a processing script; building the pipeline by setting up the environment and stack necessary to execute the script; creating the configuration to wrap the script; creating the pipeline step with the above-mentioned environment, resources, input and output data, and a reference to the script; and submitting the pipeline. Many of these steps are easily automated with the help of built-in objects published by the public cloud services to build and run such a pipeline. A pipeline is a reusable object, one that can be invoked over the wire with a web request.
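As one illustration, a minimal sketch of these steps with the Azure Machine Learning (v1) Python SDK is shown below. The choice of Azure ML, and the workspace configuration, compute target, and script names, are assumptions; any public cloud ML service with a pipeline API would serve equally well.

```python
from azureml.core import Workspace, Experiment
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()                          # configure the workspace
datastore = ws.get_default_datastore()                # datastore for sample and intermediate data
scored = PipelineData("scored", datastore=datastore)  # object for transferring intermediate data

run_config = RunConfiguration()                       # configuration wrapping the script's environment
score_step = PythonScriptStep(
    name="score-products",
    script_name="score.py",                           # the processing script, assumed to exist
    source_directory=".",
    compute_target="cpu-cluster",                     # previously created and attached compute target
    outputs=[scored],
    runconfig=run_config,
)

pipeline = Pipeline(workspace=ws, steps=[score_step])
run = Experiment(ws, "reward-quality-pipeline").submit(pipeline)  # submit the pipeline
```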

Machine learning services collect the same kinds of monitoring data as the other public cloud resources. These logs, metrics, and events can then be collected, routed, and analyzed to tune the machine learning model.

Other than the platform metrics that help monitor and troubleshoot issues with the production deployment of machine learning systems, the model itself may have performance and quality metrics that can be used to evaluate and tune it. These metrics and key performance indicators can be domain-specific, such as accuracy, which is the ratio of the number of correct predictions to the total number of predictions; the confusion matrix of positive and negative predictions for all the class labels in classification; the Receiver Operating Characteristic (ROC) curve and the area under it (AUC); the F1 score, which combines precision and recall; cross-entropy loss; mean squared error; and mean absolute error. These steps for the post-processing of predictions are just as important as the data preparation steps for a well-performing model.

The following chart makes a comparison of all the data mining algorithms including the neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e


Thank you.


 

Tuesday, April 6, 2021

 

Applications of Data Mining to Reward points collection service

Continuation of use cases:   

In the previous section, we discussed the training data and the deployment of the trained model. This does not complete the production system. On the contrary, it is just the beginning of the lifecycle for that model. Over the time that the model is used for prediction, its accuracy or predictive power may deteriorate. This occurs due to one of the following three categories of changes: changes in the concept, changes to the data, and changes in the upstream systems. The first reflects changes to the assumptions made when building the model. As with all business requirements, they may change over time, and the assumptions made earlier may no longer hold true or might need to be improved. For example, a fraud detection model may have encapsulated a set of policies that need to be changed, or a statistical model may have made assumptions about the prediction variable that need to be redefined. The second type of deterioration comes from differences between training and test data. Usually, a 70/30 percentage split allows us to find and overcome the eccentricities in the data, but the so-called test data is real-world data that arrives continuously, unlike the frozen training data. It might change over time or show preferences and variations that were not known earlier. Such change requires the model to be tuned. Lastly, the upstream data changes can be operational changes that alter the data quality and consequently impact the model. These changes and the deterioration they cause to the model are collectively called drift, and the ways to overcome drift include ways to measure and actively improve the model. The relevant metrics are called model performance metrics and model quality metrics.
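Data drift in particular can be measured by comparing the distribution of incoming features against the training distribution. A minimal sketch using a two-sample Kolmogorov-Smirnov test from SciPy is shown below; it assumes the features are held in pandas DataFrames, and the significance threshold is an assumption chosen for illustration.

```python
from scipy.stats import ks_2samp

def detect_data_drift(training_features, live_features, p_threshold=0.01):
    """Flag features whose live distribution has drifted from the training distribution."""
    drifted = {}
    for column in training_features.columns:
        statistic, p_value = ks_2samp(training_features[column], live_features[column])
        if p_value < p_threshold:
            # A low p-value suggests the two samples come from different distributions.
            drifted[column] = {"ks_statistic": statistic, "p_value": p_value}
    return drifted
```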

Monitoring and the pipeline contribute significantly towards streamlining the process and answering questions such as: Why did the model predict this? When was it trained? Who deployed it? In which release was it deployed? At what time was the production system updated? What were the changes in the predictions? What did the key performance indicators show after the update? Public cloud services have enabled both ML pipelines and their monitoring. The steps involved in creating a pipeline usually include configuring a workspace and creating a datastore; downloading and storing sample data; registering and using objects for transferring intermediate data between pipeline steps; downloading and registering the model; creating and attaching the remote compute target; writing a processing script; building the pipeline by setting up the environment and stack necessary to execute the script; creating the configuration to wrap the script; creating the pipeline step with the above-mentioned environment, resources, input and output data, and a reference to the script; and submitting the pipeline. Many of these steps are easily automated with the help of built-in objects published by the public cloud services to build and run such a pipeline. A pipeline is a reusable object, one that can be invoked over the wire with a web request.

Machine learning services collect the same kinds of monitoring data as the other public cloud resources. These logs, metrics and events can then be collected, routed, and analyzed to tune the machine learning model.

The following chart makes a comparison of all the data mining algorithms including the neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e

Thank you.

 

 

Monday, April 5, 2021

 Applications of Data Mining to Reward points collection service  

Continuation of use cases:    


One of the highlights of machine learning deployments, as opposed to the deployment of data mining models, is that the model can be built and tuned in one place and run anywhere else. The client-friendly version of TensorFlow allows the model to run on clients with as little resource as mobile devices. The environment for model building usually supports GPUs. This works well to create a production pipeline where the data can be sent to the model independent of where the training data was kept and used to train the model. Since the training data flows into the training environment, its pipeline is internal. The test data can be sent to the model wherever it is hosted, over the wire as web requests. The model can be run in containers on the server side or even in the browser on the client side. Another highlight of the difference between the ML pipeline and the data mining pipeline is the heterogeneous mix of technologies and products on the ML side as opposed to the homogeneous relational-database-based stack on the data mining side. For example, logs, streams, and events may be streamed into the production pipeline via Apache Kafka, processed using Apache Flink, the kernels built with scikit-learn, Keras, or Spark ML, and the trained models run in containers taking and responding to web requests.
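A minimal sketch of preparing a model for such a client is shown below, converting a saved TensorFlow model to TensorFlow Lite so it can run on a mobile device; the saved-model path and output file name are assumptions.

```python
import tensorflow as tf

# Convert a trained, saved model into a mobile-friendly TensorFlow Lite artifact.
converter = tf.lite.TFLiteConverter.from_saved_model("models/product_quality")  # assumed path
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional size/latency optimization
tflite_model = converter.convert()

with open("product_quality.tflite", "wb") as artifact:
    artifact.write(tflite_model)
```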

In the previous section, we discussed the training data and the deployment of the trained model. This does not complete the production system. On the contrary, it is just the beginning of the lifecycle for that model. Over the time that the model is used for prediction, its accuracy or predictive power may deteriorate. This occurs due to one of the following three categories of changes: changes in the concept, changes to the data, and changes in the upstream systems. The first reflects changes to the assumptions made when building the model. As with all business requirements, they may change over time, and the assumptions made earlier may no longer hold true or might need to be improved. For example, a fraud detection model may have encapsulated a set of policies that need to be changed, or a statistical model may have made assumptions about the prediction variable that need to be redefined. The second type of deterioration comes from differences between training and test data. Usually, a 70/30 percentage split allows us to find and overcome the eccentricities in the data, but the so-called test data is real-world data that arrives continuously, unlike the frozen training data. It might change over time or show preferences and variations that were not known earlier. Such change requires the model to be tuned. Lastly, the upstream data changes can be operational changes that alter the data quality and consequently impact the model. These changes and the deterioration they cause to the model are collectively called drift, and the ways to overcome drift include ways to measure and actively improve the model. The relevant metrics are called model performance metrics and model quality metrics.


The following chart makes a comparison of all the data mining algorithms including the neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e 

Thank you. 

 

Sunday, April 4, 2021

 

Applications of Data Mining to Reward points collection service

Continuation of use cases:   

Neural networks can be applied in layers, and they can be combined with regressors, so the technique can be used for a variety of use cases. There are four common types of neural network layers. The fully connected layer connects every neuron in one layer to every neuron in the next layer. This is great for rigorous encoding, but it becomes expensive for large inputs and does not scale well. The convolutional layer is mostly used as a filter that brings out salient features from the input set. The filter, sometimes called a kernel, is represented by a set of n-dimensional weights and describes the probability that a given pattern of input values represents a feature. A deconvolutional layer comes from a transposed convolution process where the data is enhanced to increase resolution or to transform it. A recurrent layer includes a looping capability such that its input consists of both the data to analyze and the output from a previous calculation performed by that layer. This is helpful for maintaining state across iterations and for transforming one sequence into another.
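A minimal sketch of these four layer types using the Keras API in TensorFlow is shown below; the layer sizes and input shapes are arbitrary assumptions chosen only to make the snippet run.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Fully connected: every input unit connects to every output unit.
dense = layers.Dense(units=64, activation="relu")

# Convolutional: a learned filter (kernel) that brings out salient features.
conv = layers.Conv2D(filters=16, kernel_size=(3, 3), activation="relu")

# Deconvolutional (transposed convolution): upsamples to increase resolution.
deconv = layers.Conv2DTranspose(filters=16, kernel_size=(3, 3), strides=2)

# Recurrent: feeds its previous output back in, maintaining state across a sequence.
recurrent = layers.SimpleRNN(units=32)

# Arbitrary shapes, just to show each layer producing an output.
image_batch = tf.random.normal((1, 28, 28, 3))
sequence_batch = tf.random.normal((1, 10, 8))
print(dense(tf.random.normal((1, 100))).shape)
print(conv(image_batch).shape)
print(deconv(conv(image_batch)).shape)
print(recurrent(sequence_batch).shape)
```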

The choice of machine learning technique depends both on the applicability of the algorithm and on the data. For example, we use a Convolutional Neural Network when we want to perform only classification. We use a Recurrent Neural Network when we want to retain state between encodings, such as with sequences. We use a classifier together with a regressor when we want to detect objects and their bounding boxes. The choices also vary with the data. A CNN works great with tensors that are distinct and independent from one another. The output using tensors for a K-Nearest Neighbors model consists of the label with the most confidence, which is a statistical parameter based on the support for the label, a class index, and a score set for the confidence associated with each label. Scalar data works very well for matrices and matrix operations. An RNN works well with sequences of inputs.
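A minimal sketch of that K-Nearest Neighbors output using scikit-learn is shown below; the toy data is an assumption, and the per-label confidence comes from the share of neighbors supporting each label.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two features, two quality classes (assumed for illustration).
X_train = np.array([[1.0, 1.1], [0.9, 1.0], [3.0, 3.2], [3.1, 2.9]])
y_train = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

sample = np.array([[2.8, 3.0]])
scores = knn.predict_proba(sample)[0]   # confidence score for each class label
class_index = int(np.argmax(scores))    # index of the most confident label
label = knn.classes_[class_index]       # the predicted label itself

print(label, class_index, scores)
```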

One of the highlights of machine learning deployments, as opposed to the deployment of data mining models, is that the model can be built and tuned in one place and run anywhere else. The client-friendly version of TensorFlow allows the model to run on clients with as little resource as mobile devices. The environment for model building usually supports GPUs. This works well to create a production pipeline where the data can be sent to the model independent of where the training data was kept and used to train the model. Since the training data flows into the training environment, its pipeline is internal. The test data can be sent to the model wherever it is hosted, over the wire as web requests. The model can be run in containers on the server side or even in the browser on the client side. Another highlight of the difference between the ML pipeline and the data mining pipeline is the heterogeneous mix of technologies and products on the ML side as opposed to the homogeneous relational-database-based stack on the data mining side. For example, logs, streams, and events may be streamed into the production pipeline via Apache Kafka, processed using Apache Flink, the kernels built with scikit-learn, Keras, or Spark ML, and the trained models run in containers taking and responding to web requests.

The following chart makes a comparison of all the data mining algorithms including the neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e

Thank you.