Monday, April 12, 2021

Applications of Data Mining to a Reward Points Collection Service

This post continues the discussion on Machine Learning deployments.

Machine learning algorithms are a tiny fraction of the overall code required to realize prediction systems in production. As noted in the paper "Hidden Technical Debt in Machine Learning Systems" by Sculley, Holt, and others, the machine learning code consists mainly of the model, while all the other components, such as configuration, data collection, feature extraction, data verification, process management tools, machine resource management, serving infrastructure, and monitoring, make up the rest of the stack. These components tend to be hybrid in nature when deployed on-premises. Public clouds lead the way in standardizing deployment, monitoring, and operations for machine learning deployments. Not all development teams are empowered to transition to the public cloud, however, because billing follows the pay-as-you-go model and the costs of usage are difficult to articulate to management upfront. The Continuous Integration / Continuous Deployment (CI/CD) pipeline, ML tests, and model tuning remain the development team's responsibility even when the team is folded into the business service team for faster turnaround in deploying artificial intelligence models to production. In-house automation of machine learning pipelines and monitoring does not compare to the public clouds' offerings, which make automation and programmability easier. Yet the transition from in-house solutions to public cloud ML pipelines lags. We review some of the arguments against this migration:
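To make that proportion concrete, here is a minimal sketch in Python where the model itself is two lines and the rest is the data verification, configuration, and monitoring scaffolding the paper describes. All names and the dataset here are hypothetical placeholders, not from any real deployment:

```python
# A minimal sketch: the LogisticRegression "ML code" is two lines;
# everything around it is the supporting stack described in the
# hidden-technical-debt paper. All names here are hypothetical.
import logging

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("reward-points-model")


def verify_data(df: pd.DataFrame) -> pd.DataFrame:
    """Data verification: reject rows with missing values."""
    clean = df.dropna()
    log.info("data verification dropped %d rows", len(df) - len(clean))
    return clean


def train(df: pd.DataFrame) -> Pipeline:
    df = verify_data(df)
    X, y = df.drop(columns=["label"]), df["label"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    # The model itself: a two-step pipeline. Everything else is scaffolding.
    model = Pipeline([("scale", StandardScaler()),
                      ("clf", LogisticRegression(max_iter=1000))])
    model.fit(X_train, y_train)
    # Monitoring hook: log held-out accuracy for a serving dashboard.
    log.info("held-out accuracy: %.3f", model.score(X_test, y_test))
    return model


if __name__ == "__main__":
    # Hypothetical reward-points training data.
    df = pd.DataFrame({"points": [10, 250, 40, 900] * 10,
                       "visits": [1, 12, 3, 30] * 10,
                       "label": [0, 1, 0, 1] * 10})
    train(df)
```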

First, the ML pipeline is a newer technology compared to traditional software development stacks, and management often gives developers more freedom to explore options on-premises at lower cost. Even large high-technology companies with significant investments in hybrid cloud and their own datacenters argue against the use of public cloud technologies. This is not merely a business argument; it is also founded on the technical reason that in-house solutions can be better customized to the ML model development those companies pursue. Experimentation can also outgrow the limits of the free tier. The cost is not always clear, and the decision usually comes down to justifying the numbers for both options, but the cost comparison is generally held to favor the hybrid cloud.

Second, event processing systems such as Apache Spark and Apache Kafka make it easier to replace the Extract-Transform-Load (ETL) solutions that proliferated alongside data warehouses. It is true that much of the training data for ML pipelines comes from a data warehouse, and ETL worsened data duplication and drift, making workarounds in business logic necessary. With a cleaner event-driven system, it becomes easier to migrate to immutable data, write-once business logic, and real-time data processing, as the sketch below illustrates. An event processing system is also easier to develop on-premises, even as a staging environment, before deployment to the cloud is attempted.
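As a sketch of what that migration looks like, the following PySpark fragment replaces a nightly ETL batch with a continuous, append-only stream read from Kafka. The broker address, topic name, and schema are hypothetical placeholders:

```python
# A hedged sketch: a Spark Structured Streaming job consuming reward
# point events from Kafka. The Kafka topic serves as the immutable log;
# the aggregation below is the write-once business logic.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("reward-points-events").getOrCreate()

# Hypothetical event schema for a reward points collection service.
schema = StructType([
    StructField("member_id", StringType()),
    StructField("points", LongType()),
    StructField("event_time", StringType()),
])

# Read events from Kafka as a stream instead of a batch ETL extract.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
          .option("subscribe", "reward-points")                 # placeholder
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Real-time processing: running total of points per member.
totals = events.groupBy("member_id").sum("points")

query = (totals.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()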

Third, machine learning models are end-products. They can be hosted in a variety of environments, not just the cloud. Some ML users would like to load the model into client applications, including those on mobile devices. The model-as-a-service option is rather narrow, and the model does not have to be made available over the internet in all cases, especially when the network hop is costly to real-time processing systems. Many IoT practitioners agree that streaming traffic from edge devices can be heavy enough that an online on-premises system will outperform any public cloud option. Internet TCP round trips are on the order of 250-350 milliseconds, whereas the ingestion rate for real-time analysis can be upwards of thousands of events per second.
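When the model ships with the client instead of being served over the network, it is exported as an artifact. A minimal sketch using TensorFlow Lite, where the tiny Keras model below is a hypothetical stand-in for a real trained model:

```python
# A minimal sketch, assuming a trained Keras model is available, of
# exporting it for on-device inference so no network hop is needed.
import tensorflow as tf

# Stand-in model; in practice this would be the trained model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(4,)),
])

# Convert to a TensorFlow Lite flatbuffer for mobile/edge deployment.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("reward_points_model.tflite", "wb") as f:
    f.write(tflite_model)
```

The resulting .tflite file is bundled with the client application, so predictions run locally at the ingestion rate of the device rather than at internet round-trip latency.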

  

The following chart compares the data mining algorithms discussed, including the neural networks: https://1drv.ms/w/s!Ashlm-Nw-wnWxBFlhCtfFkoVDRDa?e=aVT37e

Thank you. 

 

 

 
