Tuesday, April 13, 2021

 Applications of Data Mining to Reward points collection service 

Continuation of discussion in terms of Machine Learning deployments 

Machine learning algorithms are a tiny fraction of the overall code that is used to realize prediction systems in production. As noted in the paper “Hidden Technical Debt in Machine Learning Systems” by Sculley, Holt, and others, the machine learning code consists mainly of the model, while all the other components, such as configuration, data collection, feature extraction, data verification, process management tools, machine resource management, serving infrastructure, and monitoring, make up the rest of the stack. These components usually form hybrid stacks, especially when the model is hosted on-premises. Public clouds do provide a pipeline and relevant automation with better management and monitoring programmability than on-premises systems, but it is usually easier for startups to embrace public clouds than for established large companies that have significant investments in their inventory, DevOps, and datacenters. 

Some of the other advantages of deploying machine learning models to the public cloud include the following: 

1) Ready-made automation for machine learning pipelines that can be monitored 24x7. 

2) Ability to span on-premises and the public cloud with a virtual hybrid cloud. 

3) Elasticity of computing resources for machine learning workloads, including support for GPUs. 

4) Consistency built into machine learning deployments. 

5) Machine learning deployments can have variable workloads during the lifetime of the model, and cloud resources are better able to scale up and down as needed. 

6) ML solutions can take advantage of all the data at once in the cloud without waiting for the Extract-Transform-Load (ETL) processes that became a necessity with data warehouses. Even virtual data warehouses are available in the cloud if they must be used. 

7) Cloud security is robust and secures data at rest as well as in transit, reducing the onus of maintaining data in the cloud.  

8) Cost is transparent with the pay-as-you-go mode of billing, and various tools are available to monitor usage and costs. 

9) Rate-limiting technologies are numerous, in addition to native techniques in the cloud, and these can prevent cost overruns during experimentation (a minimal sketch follows this list). 

10) A free tier is available for quick-and-dirty prototyping in the public cloud, which helps surface hidden costs before production deployment. 
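To illustrate point 9, here is a minimal sketch (in Python) of a token-bucket rate limiter that could cap how many training or scoring requests an experiment issues per unit of time. It is not tied to any particular cloud SDK; the class and method names are illustrative assumptions.

import time

class TokenBucket:
    """Minimal token-bucket rate limiter for experiment submissions."""

    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec        # tokens added per second
        self.capacity = capacity        # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow_request(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Usage: allow at most 5 experiment submissions per second, with bursts of 10.
limiter = TokenBucket(rate_per_sec=5, capacity=10)
if limiter.allow_request():
    pass  # submit the training job or scoring call here
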

The following chart compares the data mining algorithms, including neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e 

Thank you. 

 

 

 

Monday, April 12, 2021

 Applications of Data Mining to Reward points collection service 

Continuation of discussion in terms of Machine Learning deployments 

Machine learning algorithms are a tiny fraction of the overall code that is used to realize prediction systems in production. As noted in the paper “Hidden Technical Debt in Machine Learning Systems” by Sculley, Holt, and others, the machine learning code consists mainly of the model, while all the other components, such as configuration, data collection, feature extraction, data verification, process management tools, machine resource management, serving infrastructure, and monitoring, make up the rest of the stack. These components are hybrid in nature when they are deployed on-premises. Public clouds lead the way in standardizing deployment, monitoring, and operations for machine learning deployments. Not all development teams are empowered to transition to the public cloud, because the costs of usage are difficult to articulate upfront to management and the billing is based on the pay-as-you-go model. A Continuous Integration / Continuous Deployment (CI/CD) pipeline, ML tests, and model tuning become a responsibility of the development team even though they are folded into the business service team for faster turnaround time to deploy artificial intelligence models in production. In-house automation for machine learning pipelines and monitoring does not compare to that of the public clouds, which make automation and programmability easier. Yet the transition from in-house solutions to public cloud ML pipelines lags. We review some of the arguments against this migration: 

First, the ML pipeline is a newer technology compared to traditional software development stacks, and management advises that developers have more freedom to explore options on-premises at lower cost. Even high-technology large companies with significant investments in hybrid cloud and their own datacenters argue against the use of public cloud technologies. This is not merely a business point of view; it is also founded on the technical reason that in-house solutions can be better customized to the ML model developments those companies are pursuing. Also, experimentation can easily exceed the limits allowed by the free tier. The cost is not always clear, and it usually comes down to an argument over the justification of numbers for both options, but the cost is generally considered to favor the hybrid cloud. 

Second, event processing systems such as Apache Spark and Kafka make it easier to replace the Extract-Transform-Load solutions that proliferated with data warehouses. It is true that much of the training data for ML pipelines comes from a data warehouse, and ETL worsened data duplication and drift, making it necessary to add workarounds in business logic. With a cleaner event-driven system, it becomes easier to migrate to immutable data, write-once business logic, and real-time data processing. Event processing systems are also easier to develop on-premises, even as a staging step, before they are deployed to the cloud. 
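As a hedged illustration of this event-processing point, the following Python sketch reads reward-point events straight from Kafka with Spark Structured Streaming and lands them as an immutable event store, instead of a batch ETL hop. The broker address, topic name, schema, and paths are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("reward-events").getOrCreate()

schema = (StructType()
          .add("member_id", StringType())
          .add("points", DoubleType())
          .add("event_time", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "reward-events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Write the parsed, immutable event stream out for downstream training and scoring.
query = (events.writeStream
         .format("parquet")
         .option("path", "/data/reward_events")
         .option("checkpointLocation", "/data/checkpoints/reward_events")
         .start())
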

Third, machine learning models are end-products. They can be hosted in a variety of environments, not just the cloud. Some ML users would like to load the model in client applications, including those on mobile devices. The model-as-a-service option is rather narrow and does not have to be made available over the internet in all cases, especially when the network hop is costly to real-time processing systems. Much IoT traffic is heavy, and experts agree that the streaming data from edge devices can be voluminous enough that an online on-premises system will outperform any public-cloud option. Internet TCP round trips are on the order of 250-350 milliseconds, whereas the ingestion rate for real-time analysis can be upwards of thousands of events per second. 
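A minimal sketch of this third argument: the trained model is loaded directly into the client process, so every prediction is a local function call rather than a network round trip. The model file and feature names are illustrative assumptions.

import joblib

# Load once at startup; after this, predictions avoid the 250-350 ms internet hop.
model = joblib.load("reward_points_model.joblib")

def score_event(event: dict) -> float:
    """Score a single streaming event locally with the embedded model."""
    features = [[event["points_last_30d"], event["redemptions_last_30d"]]]
    return float(model.predict(features)[0])
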

  

The following chart compares the data mining algorithms, including neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e 

Thank you. 

 

 

 

Sunday, April 11, 2021

 

Applications of Data Mining to Reward points collection service

Continuation of discussion in terms of Machine Learning deployments

Machine learning algorithms are a tiny fraction of the overall code that is used to realize prediction systems in production. As noted in the paper “Hidden Technical Debt in Machine Learning Systems” by Sculley, Holt, and others, the machine learning code consists mainly of the model, while all the other components, such as configuration, data collection, feature extraction, data verification, process management tools, machine resource management, serving infrastructure, and monitoring, make up the rest of the stack. These components usually form hybrid stacks, especially when the model is hosted on-premises. Public clouds do provide a pipeline and relevant automation with better management and monitoring programmability than on-premises systems, but it is usually easier for startups to embrace public clouds than for established large companies that have significant investments in their inventory, DevOps, and datacenters.

Hybrid stacks are not the only concern. There are a few other concerns as well. Architectural patterns are harder to enforce with machine learning deployments. Traditional web application deployments have a significant and growing ecosystem of infrastructure, tools, and processes to benefit from, but machine learning systems are not always equivalent to a predictive web service. Many models are trained and tested with little or no requirement for outside-world connectivity or programmability. Again, the public clouds lead the way in standardizing deployment, monitoring, and operations for machine learning deployments.

Lastly, the machine learning field is emerging, and development teams continuously try and experiment with algorithms, data, and technology stacks before establishing a process that lets them switch between use cases and production deployments. A Continuous Integration / Continuous Deployment (CI/CD) pipeline, ML tests, and model tuning become a responsibility of the development team even though they are folded into the business service team for faster turnaround time to deploy artificial intelligence models in production. Public clouds make it easy to monitor, troubleshoot, and update models in production deployments, but the development team continues to be responsible for the number and scale of such deployments.
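As a hedged illustration of the kind of ML test such a CI/CD pipeline might run before promoting a model, here is a minimal pytest-style quality gate in Python. The artifact path, dataset file, and the 0.80 threshold are illustrative assumptions, not part of any specific pipeline.

import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

def test_model_meets_accuracy_floor():
    """Fail the build if the candidate model drops below the release threshold."""
    model = joblib.load("artifacts/reward_model.joblib")    # candidate model artifact
    holdout = pd.read_csv("data/holdout.csv")                # frozen evaluation set
    X, y = holdout.drop(columns=["label"]), holdout["label"]
    accuracy = accuracy_score(y, model.predict(X))
    assert accuracy >= 0.80, f"accuracy {accuracy:.3f} below release threshold"
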

 

 

The following chart compares the data mining algorithms, including neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e

Thank you.

 

 

Saturday, April 10, 2021

 

Applications of Data Mining to Reward points collection service 

Continuation of use cases:    

The drift is monitored by joining product quality labels and predicted quality labels, summarized over a time window to trend model quality. Multiple such KPIs can be used to cover the scoring criteria. For example, a lagging indicator could determine whether the actual labels arrive delayed compared to the predicted labels. Thresholds can be set for the delay to specify the acceptable business requirements and to trigger a notification or alert. 
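A minimal sketch of this drift KPI in Python with pandas: join actual and predicted labels, summarize accuracy per time window, and flag windows that fall below a threshold. The column names, window size, and the 0.9 threshold are illustrative assumptions.

import pandas as pd

def drift_report(actuals: pd.DataFrame, predictions: pd.DataFrame,
                 window: str = "1H", threshold: float = 0.9) -> pd.DataFrame:
    """Trend windowed accuracy of predicted vs. actual quality labels.

    actuals: columns [record_id, label]; predictions: columns [record_id, event_time, label],
    with event_time as a datetime column.
    """
    joined = predictions.merge(actuals, on="record_id", suffixes=("_pred", "_actual"))
    joined["correct"] = joined["label_pred"] == joined["label_actual"]
    # Summarize model quality over fixed time windows.
    trend = (joined.set_index("event_time")["correct"]
             .resample(window).mean()
             .rename("windowed_accuracy")
             .to_frame())
    # Windows below the threshold would trigger a notification or alert.
    trend["alert"] = trend["windowed_accuracy"] < threshold
    return trend
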

  

Ranking and scoring are central and critically important to this pipeline, and the choice of their algorithms can make a huge difference in how the model makes predictions for new data. If the scoring is made sensitive to drift in concept, data, and upstream systems, it will be more accurate and consistent and will avoid deterioration.  

  

The real-world data arrives continuously; hence the performance and the quality of the model need to be evaluated continuously. Performance and quality provide independent considerations for tuning the model, so both are needed. The former can be indicated by the platform on which the model runs, but the latter is more domain- and model-specific, so it must be decided before the model is deployed and as part of the scoring pipeline.  

  

The pipeline itself becomes a reusable asset that can be used in automation and remote invocations. This makes it easy to compare and deploy model variations, so that we are not just tuning the same model but can also evaluate variations side by side. All models deteriorate over time, but a refreshed model tends to pull up the quality over a longer duration. 
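A minimal sketch of that side-by-side evaluation, assuming a shared held-out set and a single KPI; the model names and the choice of weighted F1 are illustrative.

from sklearn.metrics import f1_score

def compare_models(models: dict, X_holdout, y_holdout) -> dict:
    """Run every candidate through the same scoring step and report one KPI each."""
    return {name: f1_score(y_holdout, model.predict(X_holdout), average="weighted")
            for name, model in models.items()}

# Usage: scores = compare_models({"baseline": old_model, "refreshed": new_model}, X, y)
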

 

When the machine learning pipeline is executed well, the following goals are achieved. First, the data remains immutable. Second, the business logic is written once. Finally, the data is available in real-time.  

Model deterioration and drift are avoided without requiring updates to the data and logic.  


The following chart compares the data mining algorithms, including neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e 

Thank you. 


A coding exercise: https://tinyurl.com/nnj5vd8v 

 

Friday, April 9, 2021

  Applications of Data Mining to Reward points collection service

Continuation of use cases:   

Features and labels are helpful for the evaluation of the model as well. When the data comes from IoT sensors, it is typically streaming in nature. This makes it suitable for streaming stacks such as Apache Kafka and Flink. The production models are loaded in a scoring pipeline to get predicted product quality.  
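A minimal sketch of such a scoring pipeline in Python: consume streaming feature records from a Kafka topic and emit predicted quality labels with the production model. The topic, broker address, model file, and feature fields are illustrative assumptions.

import json
import joblib
from kafka import KafkaConsumer

model = joblib.load("production_model.joblib")    # assumed production artifact

consumer = KafkaConsumer(
    "sensor-features",
    bootstrap_servers="broker:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    record = message.value
    features = [[record["temperature"], record["pressure"], record["vibration"]]]
    predicted_quality = model.predict(features)[0]
    # Downstream, this prediction is joined with the actual label to trend drift.
    print(record["record_id"], predicted_quality)
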

The drift is monitored by joining product quality labels and predicted quality labels, summarized over a time window to trend model quality. Multiple such KPIs can be used to cover the scoring criteria. For example, a lagging indicator could determine whether the actual labels arrive delayed compared to the predicted labels. Thresholds can be set for the delay to specify the acceptable business requirements and to trigger a notification or alert.

Ranking and scoring are central and critically important to this pipeline, and the choice of their algorithms can make a huge difference in how the model makes predictions for new data. If the scoring is made sensitive to drift in concept, data, and upstream systems, it will be more accurate and consistent and will avoid deterioration. 

The real-world data arrives continuously; hence the performance and the quality of the model need to be evaluated continuously. Performance and quality provide independent considerations for tuning the model, so both are needed. The former can be indicated by the platform on which the model runs, but the latter is more domain- and model-specific, so it must be decided before the model is deployed and as part of the scoring pipeline. 

The pipeline itself becomes a reusable asset that can be used in automation and remote invocations. This makes it easy to compare and deploy model variations, so that we are not just tuning the same model but can also evaluate variations side by side. All models deteriorate over time, but a refreshed model tends to pull up the quality over a longer duration.

The following chart compares the data mining algorithms, including neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e


Thank you.

Thursday, April 8, 2021

 Applications of Data Mining to Reward points collection service

Continuation of use cases:   

Other than the platform metrics that help monitor and troubleshoot issues with the production deployment of machine learning systems, the model itself may have performance and quality metrics that can be used to evaluate and tune it. These metrics and key performance indicators can be domain-specific, such as accuracy, which is the ratio of the number of correct predictions to the number of total predictions; the confusion matrix of positive and negative predictions for all the class labels in classification; the Receiver Operating Characteristic (ROC) curve and the area under it (AUC); the F1 score, computed from precision and recall; cross-entropy loss; mean squared error; and mean absolute error. These steps for the post-processing of predictions are just as important as the data preparation steps for a good performing model.
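A minimal sketch of these post-processing metrics computed with scikit-learn; the small label and score arrays are illustrative placeholders for actual labels, predicted labels, and predicted probabilities.

from sklearn.metrics import (accuracy_score, confusion_matrix, roc_auc_score,
                             f1_score, log_loss, mean_squared_error,
                             mean_absolute_error)

# Illustrative placeholder data for a binary classifier.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7]    # predicted probability of class 1

print("accuracy:", accuracy_score(y_true, y_pred))          # correct / total predictions
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("ROC AUC:", roc_auc_score(y_true, y_score))            # area under the ROC curve
print("F1 score:", f1_score(y_true, y_pred))                 # from precision and recall
print("log loss:", log_loss(y_true, y_score))                # cross-entropy loss

# For regression-style outputs, the error metrics apply instead.
print("MSE:", mean_squared_error(y_true, y_score))
print("MAE:", mean_absolute_error(y_true, y_score))
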

One of the often-overlooked aspects of deploying machine learning models in production is that a good infrastructure alleviates the concerns from model deployment, tuning, and performance improvements. A platform built with a containerization framework such as Docker and a container orchestration framework such as Kubernetes removes the concerns about load-balancing, scalability, HTTP proxying, ingress control, and monitoring, since the model hosted in a container can be scaled out to as many running instances as needed while the infrastructure meets the demands of peak load with no semantic changes to the model.
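As a hedged sketch of what gets packaged into such a container, here is a minimal Python model-serving endpoint (Flask) that the orchestrator can replicate horizontally; the route, model file, and payload shape are illustrative assumptions.

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("production_model.joblib")    # loaded once per container replica

@app.route("/predict", methods=["POST"])
def predict():
    """Score one feature vector, e.g. payload {"features": [1.2, 0.4, 3.1]}."""
    payload = request.get_json()
    features = [payload["features"]]
    return jsonify({"prediction": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)            # each replica serves the same model
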

Features and labels are helpful for the evaluation of the model as well. When the data comes from IoT sensors, it is typically streaming in nature. This makes it suitable for streaming stacks such as Apache Kafka and Flink. The production models are loaded in a scoring pipeline to get predicted product quality.  

The drift is monitored by joining product quality labels and predicted quality labels, summarized over a time window to trend model quality. Multiple such KPIs can be used to cover the scoring criteria. For example, a lagging indicator could determine whether the actual labels arrive delayed compared to the predicted labels. Thresholds can be set for the delay to specify the acceptable business requirements and to trigger a notification or alert.


The following chart compares the data mining algorithms, including neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e


Thank you.