Applications of Data Mining to Reward points
collection service
Continuation of use cases:
In the previous
section, we discussed the training data and the deployment of the trained
model. That does not complete the production system; on the contrary, it is
just the beginning of the lifecycle for that model. Over the time that the
model is used for prediction, its accuracy or predictive power may deteriorate.
This occurs due to one of three categories of change: changes in the concept,
changes in the data, and changes in the upstream systems. The first
reflects changes to the assumptions made when building the model. As with all
business requirements, they may change over time, and the assumptions made
earlier may no longer hold or may need to be refined. For example, a
fraud detection model may have encapsulated a set of policies that need to be
changed, or the statistical model may have made assumptions about the
prediction variable that need to be redefined. The second type of
deterioration comes from differences between the training data and the data
seen in production. A 70/30 train/test split usually lets us find and overcome
the eccentricities in the data, but production data is real-world data that
arrives continuously, unlike the frozen training data. It can change over time
or exhibit preferences and variations that were not known earlier, and such
change requires the model to be tuned. Lastly,
upstream data changes can be operational changes that alter data quality
and consequently impact the model. These changes, and the deterioration they
cause in the model, are collectively called drift. Overcoming drift requires
ways to measure it and to actively improve the model; the relevant metrics are
model performance metrics and model quality metrics.
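To make the data-drift case concrete, here is a minimal sketch of comparing a feature's distribution in the frozen training data against newly arriving production data. It assumes scipy is available; the feature values, the significance threshold, and the reward-points interpretation are illustrative assumptions, not part of the system described above.

import numpy as np
from scipy.stats import ks_2samp

def detect_drift(train_col: np.ndarray, live_col: np.ndarray, alpha: float = 0.05) -> bool:
    """Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
    live distribution has shifted away from the training distribution."""
    statistic, p_value = ks_2samp(train_col, live_col)
    return p_value < alpha

# Synthetic data standing in for a reward-points feature (illustrative only).
rng = np.random.default_rng(0)
train_points = rng.normal(loc=100, scale=15, size=5_000)   # frozen training snapshot
live_points = rng.normal(loc=120, scale=15, size=5_000)    # recent production data
if detect_drift(train_points, live_points):
    print("Drift detected: consider tuning or retraining the model.")

A check like this can run on a schedule; a positive result is the signal to revisit the assumptions or retrain, as discussed above.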
Monitoring and
pipelines contribute significantly towards streamlining the process and
answering questions such as: Why did the model predict this? When was it
trained? Who deployed it? Which release was it deployed in? At what time was
the production system updated? What were the changes in the predictions? What
did the key performance indicators show after the update? Public cloud services
have enabled both ML pipelines and their monitoring. Creating a pipeline
usually involves the following steps, sketched in the code after this list:
1. Configure a workspace and create a datastore.
2. Download and store sample data.
3. Register and use objects for transferring intermediate data between pipeline steps.
4. Download and register the model.
5. Create and attach the remote compute target.
6. Write a processing script.
7. Build the pipeline by setting up the environment and stack necessary to execute the script that runs in this pipeline.
8. Create the configuration to wrap the script.
9. Create the pipeline step with the above-mentioned environment, resources, input and output data, and a reference to the script.
10. Submit the pipeline.
Many of these steps are easily automated with the help of built-in objects
published by the public cloud services to build and run such a pipeline. A
pipeline is a reusable object that can be invoked over the wire with a web request.
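A minimal sketch of the steps above, assuming the v1 Azure Machine Learning Python SDK (azureml) as one public cloud example; the compute target name, script name, source directory, and conda file are placeholder assumptions.

from azureml.core import Workspace, Experiment, Environment, RunConfiguration
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()                      # configure the workspace
datastore = ws.get_default_datastore()            # datastore for sample data

# Object for transferring intermediate data between pipeline steps.
processed_data = PipelineData("processed_data", datastore=datastore)

# Environment and run configuration that wrap the processing script.
env = Environment.from_conda_specification("pipeline-env", "environment.yml")
run_config = RunConfiguration()
run_config.environment = env

# Pipeline step tying together the compute target, configuration, data, and script.
step = PythonScriptStep(
    name="process-and-train",
    script_name="train.py",                       # the processing script
    source_directory="./scripts",
    compute_target="cpu-cluster",                 # previously created and attached compute
    runconfig=run_config,
    outputs=[processed_data],
)

pipeline = Pipeline(workspace=ws, steps=[step])
run = Experiment(ws, "reward-points-pipeline").submit(pipeline)
run.wait_for_completion()

Once published, such a pipeline becomes the reusable object mentioned above and can be triggered with a web request instead of re-running these steps by hand.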
Machine learning
services emit the same kinds of monitoring data as other public cloud
resources. These logs, metrics, and events can be collected, routed, and
analyzed to tune the machine learning model.
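As one way to turn those collected logs into the model performance metrics mentioned earlier, here is a minimal sketch of a rolling accuracy monitor. The log schema (predicted and actual labels arriving one at a time), the window size, and the alert threshold are illustrative assumptions.

from collections import deque

class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent predictions so that a drop in
    the metric can trigger an alert or a retraining pipeline run."""
    def __init__(self, window: int = 1000, alert_threshold: float = 0.9):
        self.window = deque(maxlen=window)
        self.alert_threshold = alert_threshold

    def record(self, predicted, actual) -> None:
        # Each log entry contributes a hit (True) or miss (False).
        self.window.append(predicted == actual)

    def accuracy(self) -> float:
        return sum(self.window) / len(self.window) if self.window else 1.0

    def needs_attention(self) -> bool:
        return self.accuracy() < self.alert_threshold

# Example: feed logged predictions as they arrive from the production system.
monitor = RollingAccuracyMonitor(window=500, alert_threshold=0.85)
monitor.record(predicted="redeem", actual="redeem")
monitor.record(predicted="redeem", actual="hold")
print(monitor.accuracy(), monitor.needs_attention())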
The following
chart compares the data mining algorithms discussed, including neural
networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e
Thank you.