Thursday, October 26, 2023

Potential applications of machine learning

 The MicrosoftML package provides fast and scalable machine learning algorithms for classification, regression and anomaly detection. 

The rxFastLinear algorithm is a fast linear model trainer based on the Stochastic Dual Coordinate Ascent (SDCA) method. It combines the capabilities of logistic regression and SVM algorithms. SDCA works on the dual of the regularized loss-minimization problem, performing coordinate ascent on the dual variables associated with the scalar convex loss functions while respecting the regularization of the weight vector. It supports three loss functions: log loss, hinge loss, and smoothed hinge loss. Typical applications include payment default prediction and email spam filtering. 
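
As a rough sketch of how these loss functions differ in practice, the snippet below uses scikit-learn's SGDClassifier as a stand-in (it is not the rxFastLinear API, and SGD is not SDCA, but the log, hinge, and smoothed-hinge losses carry over); the data is synthetic.

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in for a payment-default or spam dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# log_loss ~ logistic regression, hinge ~ linear SVM, modified_huber ~ smoothed hinge
for loss in ("log_loss", "hinge", "modified_huber"):
    clf = SGDClassifier(loss=loss, alpha=1e-4, random_state=0).fit(X_train, y_train)
    print(loss, "test accuracy:", clf.score(X_test, y_test))
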
The rxOneClassSVM algorithm is used for anomaly detection, such as credit card fraud detection. It is a simple one-class support vector machine that helps detect outliers that do not belong to a target class, because the training set contains only examples from that target class. 
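
A minimal sketch of the one-class idea, using scikit-learn's OneClassSVM rather than rxOneClassSVM, and synthetic points in place of card transactions:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # training set: target class only
suspect = np.array([[4.0, 4.0], [0.1, -0.2]])            # one far-off point, one typical point

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal)
print(ocsvm.predict(suspect))   # -1 marks an outlier, +1 an inlier
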
The rxFastTrees algorithm is a fast tree algorithm used for binary classification or regression; it can be used for bankruptcy prediction. It is an implementation of FastRank, a form of the MART gradient boosting algorithm. It builds each regression tree in a stepwise fashion using a predefined loss function. The loss function measures the error at the current step so that the next tree can correct it. 
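
The stepwise, loss-driven construction described here is the standard gradient-boosting recipe; a compact analogue with scikit-learn's GradientBoostingClassifier on synthetic data (not the rxFastTrees API itself):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=15, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# each of the 100 shallow regression trees fits the errors left by the previous ones
gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbt.fit(X_train, y_train)
print("test accuracy:", gbt.score(X_test, y_test))
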
The rxFastForest algorithm is a fast forest algorithm also used for binary classification or regression; it can be used for churn prediction. It builds several decision trees using the regression tree learner in rxFastTrees. An aggregation over the resulting trees then finds a Gaussian distribution closest to the combined distribution of all trees in the model. 
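
A comparable sketch with scikit-learn's RandomForestClassifier on synthetic data (note that scikit-learn simply averages tree predictions rather than fitting a Gaussian as rxFastForest does):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=15, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# many independently grown trees, aggregated into one prediction per example
forest = RandomForestClassifier(n_estimators=200, random_state=2).fit(X_train, y_train)
print("churn-style test accuracy:", forest.score(X_test, y_test))
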
The rxNeuralNet algorithm is a neural network implementation that supports multi-class classification and regression. It is helpful for applications such as signature prediction, OCR, and click prediction. A neural network is a weighted directed graph arranged in layers, where the nodes in one layer are connected by weighted edges to the nodes in the next layer. The algorithm adjusts the weights on the graph edges based on the training data. 
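
A small multi-class network built with scikit-learn's MLPClassifier illustrates the same idea of adjusting edge weights between layers; the layer sizes and data are arbitrary choices, not rxNeuralNet defaults:

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           n_classes=4, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# two hidden layers; training adjusts the edge weights via backpropagation
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=3)
net.fit(X_train, y_train)
print("multi-class test accuracy:", net.score(X_test, y_test))
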
The rxLogisticRegression algorithm performs binary and multiclass classification, for example classifying sentiment from customer feedback. It is a regression model in which the categorical dependent variable is related to one or more independent variables through the logistic function. 
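
A correspondingly small logistic regression sketch with scikit-learn, where synthetic features stand in for featurized feedback text:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

# the weighted sum of features is passed through the logistic function to yield a probability
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("P(positive sentiment) for first test row:", logit.predict_proba(X_test[:1])[0, 1])
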

Wednesday, October 25, 2023

 

MLOps:

Most machine learning deployments follow one of two patterns – online inference and batch inference. Both demonstrate MLOps principles and best practices for developing, deploying, and monitoring machine learning models at scale. Development and deployment are distinct from one another, and although the model may be containerized and retrieved for execution during deployment, it can be developed independently of how it is deployed. This separates the concerns of developing the model from the requirements of addressing online and batch workloads. Regardless of the technology stack and the underlying resources used during these two phases (typically they are created in the public cloud), this distinction serves the needs of the model as well.
For example, developing and training a model might require significant compute, but its requirements differ from those of executing the model for predictions and outlier detection, activities that are hallmarks of production environments. The workloads that make use of the model might vary not just between batch and online processing but even from one batch-processing stack to another, yet the common operations of collecting MELT data (named after metrics, events, logs, and traces) and the associated resources stay the same. These include a GitHub repository, Azure Active Directory, cost management dashboards, Key Vaults, and, in this case, Azure Monitor. The resources and practices associated with security and performance are left out of this discussion; the standard DevOps guides from the public cloud providers call them out.

Online workloads targeting the model via API calls will usually require the model to be hosted in a container and exposed via API management services. Batch workloads, on the other hand, require an orchestration tool to coordinate the jobs consuming the model. Within the deployment phase, it is usual practice to host more than one environment, such as stage and production – both served by CI/CD pipelines that flow the model from development to its usage. A manual approval is required to advance the model from the stage to the production environment. A well-developed model is usually a composite handling three distinct activities – making predictions, determining data drift in the features, and determining outliers in the features; a sketch of such a composite scoring service follows. Mature MLOps also includes processes for explainability, performance profiling, versioning, pipeline automation, and the like. Depending on the resources used for DevOps and the environment, typical artifacts include dockerfiles, templates, and manifests.
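
As an illustration of such a composite, the sketch below shows a minimal Flask scoring service that answers all three questions in one call; the artifact names, route, and detectors are hypothetical placeholders rather than a prescribed layout.

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

# hypothetical artifacts baked into the container image
model = joblib.load("model.pkl")
drift_detector = joblib.load("drift_detector.pkl")
outlier_detector = joblib.load("outlier_detector.pkl")

@app.route("/score", methods=["POST"])
def score():
    features = request.get_json()["features"]
    return jsonify({
        "prediction": model.predict([features]).tolist(),
        "drift": bool(drift_detector.predict([features])[0]),
        "outlier": bool(outlier_detector.predict([features])[0] == -1),
    })

if __name__ == "__main__":
    # in production this would run behind gunicorn or a similar server
    app.run(host="0.0.0.0", port=8080)
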

While parts of the solution for MLOps can be internalized by studios and launch platforms, organizations like to invest in specific compute, storage, and networking for their needs. Databricks/Kubernetes and Azure ML workspaces are used for compute, storage accounts and datastores are used for storage, and diversified subnets are used for networking. Outbound internet connectivity from the code hosted and executed in MLOps is usually not required, but it can be provisioned with the addition of a NAT gateway within the subnet where it is hosted.

Tuesday, October 24, 2023

The application of data mining and machine learning techniques to Reverse Engineering of IaC.

 



An earlier article introduced the notion and purpose of reverse engineering. This article explains why and how IaC and application code can become quite complex and require reverse engineering.

Software organization seldom appears simple and straightforward, even for a microservice architecture. With IaC becoming the de facto standard for describing infrastructure and deployments driven by cross-cutting business objectives, IaC assets can become quite complex, multi-layered, and differing in their physical and logical organizations, requiring due diligence in their reverse engineering.

The premise for doing this is like what a compiler does in creating a symbol table and maintaining dependencies. The symbols as nodes and their dependencies as edges present a rich graph on which relationships can be superimposed and queried for different insights. These insights help with a better representation of the knowledge model. Well-known data mining algorithms can assist with this reverse engineering. Even a basic linear or non-linear ranking of the symbols, with thresholding, can be very useful for representing the architecture, as the sketch below suggests. 
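
For instance, a minimal sketch with networkx shows how a symbol-and-dependency graph can be ranked and thresholded; the symbols, edges, and threshold here are illustrative, not drawn from any particular IaC repository.

import networkx as nx

g = nx.DiGraph()
# an edge A -> B means "symbol A depends on symbol B"
g.add_edges_from([
    ("web_app", "app_service_plan"),
    ("web_app", "key_vault"),
    ("function_app", "app_service_plan"),
    ("app_service_plan", "resource_group"),
    ("key_vault", "resource_group"),
])

scores = nx.pagerank(g, alpha=0.85)           # one possible linear ranking of symbols
threshold = 0.15                               # illustrative cut-off
core_symbols = [s for s, score in scores.items() if score >= threshold]
print(sorted(scores.items(), key=lambda kv: -kv[1]))
print("architecturally central symbols:", core_symbols)
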

We cover just a few of the data mining algorithms to begin with and close with a discussion of machine learning methods, including SoftMax classification, that can make excellent use of co-occurrence data. Finally, we suggest that this does not need to be a one-pass KDM builder and that the use of pipelines and metrics can help incrementally or continually enhance the KDM. The symbol and dependency graph is merely the persistence of the information learned, which can be leveraged for analysis and reporting, such as rendering a KDM. 
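
As a toy illustration of softmax classification over co-occurrence data, the following numpy sketch scores one symbol's co-occurrence counts against per-category weights; the indicator terms, weights, and categories are made up and would normally be learned from labeled examples.

import numpy as np

def softmax(z):
    z = z - z.max()           # numerical stability
    e = np.exp(z)
    return e / e.sum()

# co-occurrence counts of one symbol with hand-picked indicator terms
# columns: ["subnet", "vm", "storage_account", "sql"]
counts = np.array([0.0, 1.0, 6.0, 3.0])

# hypothetical per-category weights (rows: network, compute, data)
W = np.array([
    [2.0, 0.1, 0.0, 0.0],
    [0.1, 2.0, 0.2, 0.0],
    [0.0, 0.1, 1.5, 1.8],
])

probs = softmax(W @ counts)
print(dict(zip(["network", "compute", "data"], probs.round(3))))
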

Types of analysis:

Classification algorithms  

Regression algorithms  

Segmentation algorithms  

Association algorithms  

Sequence Analysis Algorithms  

Outliers Mining Algorithm  

Decision tree  

Logistic Regression  

Neural Network  

Naïve Bayes algorithm  

Plugin Algorithms  

Simultaneous classifiers and regions-of-interest regressors  

Collaborative filtering  

Collaborative Filtering via Item-based filtering  

Hierarchical clustering  

NLP algorithms  

Where Lucene search indexes and symbol stores fall short, data mining insights into code organization make up for it with a more elaborate knowledge model.

Monday, October 23, 2023

This is a continuation of previous articles on Azure Databricks instances. Azure Databricks is a fast and powerful Apache Spark-based analytics service that makes it easy to rapidly develop and deploy Big Data analytics and artificial intelligence solutions. One can author models in Azure Databricks and register them with mlflow during experimentation for prediction, drift detection, and outlier detection purposes. Databricks monitoring is as complicated as it is important, partly because there are so many destinations for charts and dashboards, some with overlaps. The Azure Monitor dashboards provide job latency, stage latency, task latency, sum of task execution per host, task metrics, cluster throughput, streaming throughput/latency, resource consumption per executor, executor compute time metrics, and shuffle metrics. 
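
A minimal sketch of registering a model with mlflow during experimentation might look like the following; the metric and registered model name are hypothetical, and on Databricks the tracking URI and model registry are preconfigured for the workspace.

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # registers the artifact under a hypothetical registry name
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="drift-and-outlier-demo",
    )
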

Azure’s cost management dashboard for Databricks helps track costs on a per-workspace and per-cluster basis and provides a useful set of utilities for administrative purposes. Tags can be added to workspaces as well as clusters. 

The Apache Spark UI on a Databricks instance also provides information on what’s happening in your application. The Spark UI provides information on the input rate for streaming events, processing time, completed batches, batch details, job details, task details, and so on. The driver log is helpful for exceptions and print statements, while the executor log helps detect errant and runaway tasks. 

But the most comprehensive set of charts and graphs is provided by Overwatch. 

Overwatch can be considered an analytics project over Databricks. It collects data from multiple data sources such as APIs and cluster logs, enriches and aggregates the data, and comes at little or no cost.  

Overwatch provides categorized dashboard pages for workspaces, clusters, jobs, and notebooks. Metrics available on the workspace dashboard include the daily cluster spend chart, cluster spend on each workspace, DBU cost vs compute cost, cluster spend by type on each workspace, cluster count by type on each workspace, and count of scheduled jobs on each workspace. Metrics available on the cluster dashboard include DBU spend by cluster category, DBU spend by the most expensive cluster per day per workspace, top spending clusters per day, DBU spend by the top 3 expensive clusters, cluster count in each category, cluster node-type breakdown, cluster node-type breakdown by potential, and cluster node potential breakdown by cluster category. Metrics available on the job dashboard include daily cost of jobs, job count by workspace, and jobs running on interactive clusters. Metrics available on the notebook dashboard include data throughput per notebook path, longest running notebooks, top notebooks returning a lot of data to the UI, spark actions count, notebooks with the largest records, task count by task type, large task count, and count of jobs executing on notebooks. 

While these can be an overwhelming number of charts from various sources, a curated list of charts must show:
1. Databricks workload types, in terms of job compute for data engineers, jobs light compute for data analysis, and all-purpose compute.
2. Consumption-based charts, in terms of DBU, virtual machines, public IP addresses, blob storage, managed disks, bandwidth, etc.
3. Pricing plans, in terms of pay-as-you-go, reservations of DBU or DBCU for 1-3 years, and/or on a region/duration basis.
4. Tag-based charts, in terms of cluster tags, pool tags, and workspace tags; tags can propagate to clusters created from pools and clusters not created from pools.
5. Cost calculation, in terms of quantity, effective price, and effective cost. 

Some of this information can come from the Overwatch database schema with custom queries such as: 
select sku, isActive, any_value(contract_price) * count(*) as cost
from overwatch.`DBUcostdetails`
where isActive = true
group by sku, isActive;

These are only some of the ways in which dashboard charts are available. 

 

Sunday, October 22, 2023

 

This is a summary of the book “How to Stay Smart in a Smart World – Why Human Intelligence Still Beats Algorithms,” written by Gerd Gigerenzer and published by MIT Press in 2022. He is a psychologist known for his work on bounded rationality and directs the Harding Center for Risk Literacy at the University of Potsdam. He is also a partner at Simply Rational – The Decision Institute.

Recent advances in artificial intelligence have juxtaposed a different form of intelligence with ours and pose a question about the role of each. With reactions ranging from embracing it openly to being apprehensive about its prevalence or dominance, the author charts a cautious approach, playing to the strengths and avoiding the weaknesses of each. With several examples and case studies, he argues that one form of intelligence works well in stable environments with well-defined rules, while the other will never lose its relevance outside that world.

The salient points from this book include the assertions that AI excels in stable environments and follows rules dictated by humans, that AI systems do not perform well in dynamic environments filled with uncertainty, and that humans must test AI to get the best results. In unexplored territory, simple and transparent algorithms perform better than complex ones. Among the negative impacts, the ad-based model of social media platforms can be cited. It is possible to separate human interaction from machine operation with clear demarcation; for example, self-driving cars could be given their own dedicated lanes where possible. Market hype and profit incentives can lead companies to overpromise and underdeliver on digital technologies.

AI wins hands down in many games, such as chess and Go, because it learns game rules that are fixed, is tuned by human experts, and uses brute-force calculation to determine the best possible move. The better defined and more stable the premise, the better the performance. The flip side is self-evident with facial recognition, for instance, which works 99.6% of the time; in dynamic environments, the number drops significantly. When UK police scanned the faces of 170,000 soccer fans in a stadium for matches against a criminal database, 93% of the matches were false.

AI is good at finding correlations in huge amounts of data, even some that would have escaped humans, but it cannot recognize scenarios or deal with ambiguity. For example, Maine’s divorce rate and the United States’ per capita consumption of margarine are significantly correlated, but the correlation makes no sense. It is these spurious findings that make AI results harder to replicate, leading to waste and error in areas such as health science and biotechnology, to the tune of hundreds of billions of dollars. Assertions made today, such as eating blueberries to prevent memory loss, eating bananas to get a higher verbal SAT score, or eating kiwis late at night to sleep better, may be reversed in due time.

Whenever the effectiveness of AI decreases, human intervention can significantly boost its performance. The human brain has a remarkable ability to adapt to constantly changing cues, contexts, and situations, in what is termed vicarious functioning. Staying smart means leveraging AI’s singular capabilities while staying in charge. AI lacks four components of common sense: a capacity to think causally, an awareness of others’ intentions and feelings, a basic understanding of space, time, and objects, and a desire to follow group norms. Some tasks, like recommending the nearest restaurant, do not need common sense, but detecting whether a person crossing the road in a war zone is a threat requires it.

Complex problems do not justify complex solutions. Google Flu Trends tried to predict the spread of flu with approximately 160 search terms, but it still overpredicted doctors’ visits. In comparison, an algorithm from the Max Planck Institute for Human Development simply used one data point, recent visits to the doctor from the CDC website, and performed much better at predicting the flu’s spread.

Information served subliminally or unknowingly has the potential to alter our behavior. This is why the ad-based model of social media can be harmful, by creating distractions. With attention-control technology, the user is held captive by these algorithms. Texting while driving caused about 3,000 deaths per year in the United States between 2010 and 2020. In areas other than driving, smartphones have also proven to be very distracting.

Finally, the business aspect of artificial intelligence must be understood in the context of historical trends with killer technologies and the commerce behind them. The author says we should be able to profit from AI without being easily misled by expectations and predictions.

Earlier book summaries: BookSummary10.docx

Saturday, October 21, 2023

 

This is a continuation of articles on Infrastructure-as-Code, aka IaC for short. There is no denying that IaC can help to create and manage infrastructure, and that it can be versioned, reused, and shared – all of which helps to provision resources quickly and consistently and to manage them consistently throughout their lifecycle. Unlike software product code, which must be general purpose, provide a strong foundation for system architecture, and aspire to be a platform for many use cases, IaC often varies widely and must be manifested in different combinations depending on environment, purpose, and scale, encompassing the complete development process. It can even include the CI/CD platform, DevOps, and testing tools. The DevOps-based approach is critical to rapid software development cycles. This spreads IaC over a variety of forms. The more articulated the IaC, the more predictable and cleaner the deployments.  

One of the greatest advantages of using IaC is its elaborate description of the customizations to a resource type for widespread use within an organization. Among the reasons for customization, company policy enforcement, security and consistency can be called out as the main ones. For example, an Azure ML workspace might require some features to be allowed and others to be disallowed before other members of the organization can start making use of it.

There are several ways to do this. Both ARM templates and Azure CLI commands come with directives to turn off features. In fact, the ‘az feature’ command line options are available on a provider-by-provider basis to register or unregister specific features. This ability helps separate the experimental from the production feature set and allows them to be made available independently. Plenty of documentation around the commands, and the listing of all such features, makes it easy to work on one or more of them directly.

Another option is to display all the configuration corresponding to a resource once it has been selectively turned on from the portal. Since each resource type comes with its own ARM templates as well as command set, it is easy to leverage the built-in ‘show’ command to list all properties of the resource type and then edit some of those properties for a different deployment by virtue of the ‘update’ command. Even if all the properties are not listed for a resource, it is possible to find them in the documentation or by querying many instances of the same resource type.

A third option is to list the operations available on the resource and then set up role-based access control limiting some of those operations. This approach is favored because users can be switched between roles without affecting the resource or its deployment. It also works for groups, and users can be added to or removed from both groups and roles. Listing the operations enumerates the associated permissions, all of which begin with the provider as the prefix. This listing is thorough and covers all aspects of working with the resources. The custom role is described in terms of permitted ‘actions’, ‘dataActions’, and ‘notActions’, where the first two correspond to control-plane and data-plane permissions and the last subtracts specific permissions from what the role grants. By appropriately selecting the necessary action privileges and listing them under a specific category without the categories overlapping, we create a custom role with just the minimum number of privileges needed to complete a set of selected tasks.
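
A sketch of the shape such a custom role definition takes is shown below as a plain Python data structure; the role name, permissions, and scope are illustrative only, not a recommended role.

# illustrative custom role: everything on ML workspaces except delete
custom_role = {
    "Name": "ML Workspace Operator (example)",
    "IsCustom": True,
    "Description": "Example role granting day-to-day workspace tasks.",
    "Actions": [
        "Microsoft.MachineLearningServices/workspaces/*",
    ],
    "NotActions": [
        # removed from the grant above; it is not an explicit deny
        "Microsoft.MachineLearningServices/workspaces/delete",
    ],
    "DataActions": [],
    "AssignableScopes": ["/subscriptions/<subscription-id>"],
}
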

The last, but not least, approach is to supply an init script with the associated resource, so that as other users start using it, the init script sets the pre-decided configuration they must work with. This allows some degree of control over sub-resources and associated containers necessary for an action to complete, so that by removing those resources, an action that a role permits on a resource type may still not be possible on a specific resource.

IaC is not an immutable asset once it is properly authored. It must be maintained just like any other source code asset. Some of the improvements come from defect fixes and design changes, but in the case of IaC specifically, other changes come from drift detection and cloud security posture management, aka CSPM.

Reference: Earlier articles on IaC shortcomings and resolutions: IacResolutionsPart35.docx