Monday, April 5, 2021

Applications of Data Mining to a Reward Points Collection Service

Continuation of use cases:    


One of the highlights of machine learning deployments, as opposed to the deployment of data mining models, is that the model can be built and tuned in one place and run anywhere else. The client-friendly version of TensorFlow (TensorFlow Lite) allows the model to run on resource-constrained clients such as mobile devices, while the environment for model building usually supports GPUs. This works well for creating a production pipeline in which data can be sent to the model independent of where the training data was kept and where the model was trained. Since the training data flows into the training environment, its pipeline is internal. Test data, on the other hand, can be sent over the wire as web requests to the model wherever it is hosted. The model can run in containers on the server side or even in the browser on the client side. Another highlight of the difference between the ML pipeline and the data mining pipeline is the heterogeneous mix of technologies and products on the ML side, as opposed to the homogeneous relational database-based stack on the data mining side. For example, logs, streams, and events may be streamed into the production pipeline via Apache Kafka and processed with Apache Flink, the models built with scikit-learn, Keras, or Spark MLlib, and the trained models run in containers that receive and respond to web requests.
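To make the "build in one place, run anywhere else" idea concrete, the sketch below trains a small Keras model server-side and converts it with TensorFlow Lite so it can be shipped to a resource-constrained client. It is a minimal sketch assuming TensorFlow 2.x is installed; the toy data, layer sizes, and output file name are placeholders, not part of any actual reward-points pipeline.

# Minimal sketch: build and train a model server-side, then convert it with
# TensorFlow Lite so it can run on resource-constrained clients such as phones.
# The data, layer sizes, and file name below are illustrative placeholders.
import numpy as np
import tensorflow as tf

# Toy training data standing in for the reward-points feature pipeline.
x_train = np.random.rand(1000, 8).astype("float32")
y_train = (x_train.sum(axis=1) > 4.0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, verbose=0)

# Convert the trained Keras model to the lightweight TFLite format for the client.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open("reward_model.tflite", "wb") as f:
    f.write(tflite_model)

The same trained model could instead be packaged into a container and exposed as a web endpoint, so the choice of where it runs does not affect how it was built.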

In the previous section, we discussed the training data and the deployment of the trained model. This does not complete the production system. On the contrary, it is just the beginning of the lifecycle for that model. Over the time that the model is used for prediction, its accuracy or predictive power may deteriorate. This occurs due to one of three categories of change: changes in the concept, changes in the data, and changes in upstream systems. The first reflects changes to the assumptions made when the model was built. As with all business requirements, these may change over time, and assumptions made earlier may no longer hold or may need to be refined. For example, a fraud detection model may have encapsulated a set of policies that need to change, or the statistical model may have made assumptions about the prediction variable that need to be redefined. The second type of deterioration comes from differences between training and test data. Usually a 70/30 split lets us find and overcome the eccentricities in the data, but the so-called test data in production is real-world data that arrives continuously, unlike the frozen training data. It may change over time or show preferences and variations that were not known earlier, and such change requires the model to be retuned. Lastly, upstream data changes can be operational changes that alter data quality and consequently affect the model. These changes and the deterioration they cause are collectively called drift, and the ways to overcome drift include measuring and actively improving the model. The relevant metrics are model performance metrics and model quality metrics.
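As an illustration of such a quality check, the sketch below computes the Population Stability Index (PSI) between a training-time sample of a feature and a recent production sample; a large PSI is a common signal of data drift. The samples, bin count, and 0.2 alert threshold are illustrative assumptions rather than values from this post, and performance metrics such as rolling accuracy against delayed labels would sit alongside a check like this one.

# Minimal sketch of a data-drift check: compare the live feature distribution
# against the training-time distribution with the Population Stability Index.
# The 0.2 threshold is a common rule of thumb, not a value from this post.
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time sample and a live sample of one feature."""
    cuts = np.percentile(expected, np.linspace(0, 100, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf          # cover the full value range
    exp_pct = np.histogram(expected, cuts)[0] / len(expected)
    act_pct = np.histogram(actual, cuts)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 1e-6, None)       # avoid log(0)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Hypothetical feature samples: training data vs. a recent production window.
train_sample = np.random.normal(0.0, 1.0, 10_000)
live_sample = np.random.normal(0.3, 1.2, 10_000)   # the distribution has shifted

psi = population_stability_index(train_sample, live_sample)
if psi > 0.2:
    print(f"PSI={psi:.3f}: significant drift, consider retraining the model")
else:
    print(f"PSI={psi:.3f}: distribution looks stable")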


The following chart compares the data mining algorithms discussed so far, including neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e 

Thank you. 

 
