Monday, May 8, 2017

Building a machine learning platform - some inspirations and rules of thumb
Enterprises are waking up to the need to drive machine learning into their systems. This is made clear by the steep increase in investments year after year.
Platforms and tools are continuously being enhanced to serve a wide variety of applications. Many organizations now expect dedicated ML engineers or teams, just as they exist for other aspects of data science.
However, the applications and use cases should continue to drive these investments in any organization. For example, traditional applications improve their capabilities with ML-based scoring and recommendations. ML may even be used where automation is the only way to sift through large data sets.
It's true that most of the ML platforms come from companies that run public clouds or operate at global social networking scale. They make their platforms open source, widely available and well maintained, which is enticing to application developers. Moreover, these platforms serve accepted versions of popular algorithms. But this has also been a limitation, because developers first try to fit the platform to their application and then enter into consultations with different teams to satisfy their requirements. If the work were instead driven on a use-case-by-use-case basis, the platform would have evolved from the ground up with rapid delivery for each use case.
When the development efforts are spearheaded by the use case, it becomes clearer how to separate the content efforts from the machine learning algorithm efforts. Although the latter may be referred to as a platform, the emphasis here is not so much on system design or a production environment versus lab experimentation as on separating the code from the pipeline. There is a universally accepted delay in getting a working solution off the ground with the existing data in the initial phase, but that is arguably the best time to reduce the debt going forward by learning what works.

Developers generally like to use the term service for an area of computation such as predictive modeling, and build an integration layer to handle the interaction of consumers with the service. The integration layer is not just about marshaling data from a data source to the service but also about the end-to-end experience. This is served by a pipeline of data, so called because it is a forward-only movement of data from the originating source through candidate generation, feature extraction, scoring and post-processing. The pipeline brings considerations such as managing the data in batches or as streams where necessary, performing local calculations and finishing gracefully. Most people prefer big data sources or graph databases as the repository of choice for ML processing, but this is not a restriction; relational databases and data mining can be just as helpful as the use cases may require. Each stage of the pipeline may have its own considerations or systems. For example, feature extraction may require vectors to be created from the data so that the models can then be built on top of the features. A minimal sketch of such a pipeline follows below.

The machine learning platform can have two different paths for data. The first is the model-building path, which usually starts from the staging data store, whether that is a graph database or an S3 store. The second is the prediction and analysis path, which can work either directly from the staged data or from the models built from it. Once the data is classified or predicted, it makes its way through an upload and merge with the originating data for the next wave of download to staging, model building and prediction, repeated in cycles on a scheduled basis. Data increases with time, and some of the stages in the pipeline depend on others. And this is just for one model, which means the data may increase by orders of magnitude for other models. Consequently the data pipeline is often visualized as a complicated directed graph. Depending on the size of the data, many engineers prefer distributed computing such as clusters and containers, along with their choice of debugging tools, monitoring and alerting, and data transfer technologies, as well as tools like Spark, sklearn and Turi.
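To make the separation of the code from the pipeline concrete, here is a minimal sketch in C#. The stage names follow the pipeline described above, but the IStage interface, the DataRecord fields and the placeholder logic inside each stage are assumptions made for illustration, not any particular platform's API.

using System;
using System.Collections.Generic;
using System.Linq;

// A record flowing through the pipeline; the fields are illustrative assumptions.
class DataRecord { public int Id; public double[] Features; public double Score; }

// Each stage is a forward-only transformation over a batch of records.
interface IStage { IEnumerable<DataRecord> Run(IEnumerable<DataRecord> input); }

class CandidateGeneration : IStage
{
    // Keep only the records worth scoring; the filter here is a placeholder.
    public IEnumerable<DataRecord> Run(IEnumerable<DataRecord> input) => input.Where(r => r.Id >= 0);
}

class FeatureExtraction : IStage
{
    // Turn raw fields into a feature vector that the model can consume.
    public IEnumerable<DataRecord> Run(IEnumerable<DataRecord> input) =>
        input.Select(r => { r.Features = new double[] { r.Id, r.Id % 7 }; return r; });
}

class Scoring : IStage
{
    // Apply the model; a dot product with fixed weights stands in for a real model.
    public IEnumerable<DataRecord> Run(IEnumerable<DataRecord> input) =>
        input.Select(r => { r.Score = r.Features.Zip(new[] { 0.3, 0.7 }, (f, w) => f * w).Sum(); return r; });
}

class PostProcessing : IStage
{
    // Rank the scored records for downstream consumers.
    public IEnumerable<DataRecord> Run(IEnumerable<DataRecord> input) => input.OrderByDescending(r => r.Score);
}

static class Pipeline
{
    // The stages compose in order; data only moves forward through them.
    static readonly IStage[] Stages = { new CandidateGeneration(), new FeatureExtraction(), new Scoring(), new PostProcessing() };

    public static IEnumerable<DataRecord> Run(IEnumerable<DataRecord> batch) =>
        Stages.Aggregate(batch, (data, stage) => stage.Run(data));
}

The point of the sketch is only that each stage owns its own considerations while the pipeline composes them, which keeps the model code separate from the plumbing.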
Testing the pipeline and the integration system is generally focused on relevance, quality and demand. Each of these focus areas comes not only with its own metrics but also with instruments for tuning the system for satisfactory performance.
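As one concrete illustration of a relevance metric, the fragment below computes precision at k over a ranked result list; the class and method names, and the use of integer ids for items, are assumptions made for the sake of the example.

using System;
using System.Collections.Generic;
using System.Linq;

static class RelevanceMetrics
{
    // Precision@k: the fraction of the top-k ranked items that appear in the labelled relevant set.
    public static double PrecisionAtK(IList<int> rankedIds, ISet<int> relevantIds, int k)
    {
        if (k <= 0 || rankedIds.Count == 0) return 0.0;
        int hits = rankedIds.Take(k).Count(id => relevantIds.Contains(id));
        return (double)hits / Math.Min(k, rankedIds.Count);
    }
}

Tuning then means tracking such metrics over time and adjusting the candidate generation, features or model until they reach a satisfactory level.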
Courtesy: Preethi Rathi for the business impact, and the book Programming Collective Intelligence.
#codingexercise
There are n trees in a circle. Each tree has a fruit value associated with it. A bird can sit on a tree for 0.5 seconds and then it has to move to a neighbouring tree. It takes the bird 0.5 seconds to move from one tree to another. The bird gets the fruit value when it sits on a tree. We are given n, m (the number of seconds the bird has), the current position, and V, the fruit values of the trees. We have to maximise the total fruit value that the bird can gather. The bird can start from any tree.
// n is the number of seconds remaining, m is the index of the current tree.
int GetFruitValue(List<int> V, double n, int m)
{
    if (n <= 0) return 0;
    if (m == V.Count) return 0; // stop once the walk has gone past the last tree
    // Either sit on this tree (0.5s) and then move (0.5s), collecting its fruit,
    // or skip it and move straight on to the next tree (0.5s).
    return Math.Max(V[m % V.Count] + GetFruitValue(V, n - 1, m + 1),
                    GetFruitValue(V, n - 0.5, m + 1));
}
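A minimal, hypothetical driver for the function above; the tree values and the three-second budget are made-up inputs, and it assumes the function and the usual usings are in scope.

// Assumed inputs: five trees and a budget of 3 seconds, starting at tree 0.
var values = new List<int> { 2, 1, 3, 5, 0 };
Console.WriteLine(GetFruitValue(values, 3, 0));
// Since the bird can start from any tree, the answer to the exercise is the
// maximum of GetFruitValue(values, 3, i) over every starting index i.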
