Thursday, October 26, 2023

Potential applications of machine learning

The MicrosoftML package provides fast and scalable machine learning algorithms for classification, regression, and anomaly detection.

The rxFastLinear algorithm is a fast linear model trainer based on the Stochastic Dual Coordinate Ascent method. It combines the capabilities of logistic regression and SVM algorithms. It solves the dual of the regularized loss-minimization problem, ascending the dual objective one coordinate at a time. It supports three loss functions: log loss, hinge loss, and smoothed hinge loss. Typical applications include payment default prediction and email spam filtering. 
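rxFastLinear itself is an R function; purely to illustrate the hinge-loss training it supports, here is a minimal pure-Python sketch of a regularized linear classifier trained by subgradient descent. This is an assumption for illustration, not MicrosoftML's actual SDCA solver, and the data is made up.

```python
# Illustration only: a regularized linear classifier trained with the
# hinge loss by subgradient descent. rxFastLinear's real solver is
# Stochastic Dual Coordinate Ascent; this primal sketch just shows the
# hinge loss it supports. Labels are in {-1, +1}.

def hinge_loss(w, b, X, y, lam):
    """Average hinge loss plus L2 regularization."""
    total = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total += max(0.0, 1.0 - margin)
    return total / len(X) + lam * sum(wj * wj for wj in w)

def train(X, y, lam=0.01, lr=0.1, epochs=200):
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:      # hinge subgradient is active
                w = [wj + lr * (yi * xj - 2 * lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:                 # only the regularizer shrinks w
                w = [wj - lr * 2 * lam * wj for wj in w]
    return w, b

# Toy payment-default style data: two well-separated clusters.
X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
y = [-1, -1, 1, 1]
w, b = train(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1 for xi in X]
```

Swapping the hinge-loss update for a log-loss gradient would give the logistic-regression flavor of the same trainer.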
The rxOneClassSVM is used for anomaly detection, such as credit card fraud detection. It is a simple one-class support vector machine that helps detect outliers that do not belong to a target class; because the training set contains only examples from the target class, anything falling outside the learned boundary is flagged. 
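As a hedged sketch of the one-class idea (fit a region around target-class examples only, then flag points outside it), the following uses a simple centroid-and-radius rule. It is not an actual support vector machine, just an illustration of the train-on-normal-only workflow with made-up data.

```python
# Illustration only: one-class anomaly detection trained on "normal"
# transactions alone. A real one-class SVM learns a kernelized boundary;
# this sketch uses a centroid-and-radius rule to show the same workflow.

def fit_one_class(train):
    """Learn a center and radius from target-class examples only."""
    n, d = len(train), len(train[0])
    center = [sum(x[j] for x in train) / n for j in range(d)]
    radius = max(sum((xj - cj) ** 2 for xj, cj in zip(x, center)) ** 0.5
                 for x in train)
    return center, radius

def is_outlier(x, center, radius, slack=1.5):
    """Flag points that fall well outside the learned region."""
    dist = sum((xj - cj) ** 2 for xj, cj in zip(x, center)) ** 0.5
    return dist > slack * radius

# Normal card transactions cluster together; the fraud point sits far away.
normal = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [1.05, 1.0]]
center, radius = fit_one_class(normal)
flags = [is_outlier(x, center, radius) for x in normal + [[9.0, 9.0]]]
```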
The rxFastTrees is a fast tree algorithm used for binary classification or regression; it can be used for bankruptcy prediction. It is an implementation of FastRank, a form of the MART gradient boosting algorithm. It builds each regression tree in a stepwise fashion using a predefined loss function: the loss function measures the error at the current step, and the next tree corrects it. 
The rxFastForest is a fast forest algorithm, also used for binary classification or regression; it can be used for churn prediction. It builds an ensemble of decision trees using the regression tree learner from rxFastTrees. An aggregation over the resulting trees then finds a Gaussian distribution closest to the combined distribution for all trees in the model. 
The rxNeuralNet is a neural network implementation that supports multiclass classification and regression. It is helpful for applications such as signature prediction, OCR, and click prediction. A neural network is a weighted directed graph arranged in layers, where the nodes in one layer are connected by weighted edges to the nodes in the next layer. Training adjusts the weights on the graph edges based on the training data. 
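Purely to illustrate the weight-adjustment idea described above, here is a hand-rolled two-layer network trained by backpropagation. The layer sizes, initial weights, and the single training example are assumptions for illustration, not anything rxNeuralNet prescribes.

```python
import math

# A tiny feed-forward network (2 inputs -> 2 hidden -> 1 output, sigmoid
# activations) trained by backpropagation on one made-up example, showing
# how training "adjusts the weights on the graph edges".

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

W1 = [[0.5, -0.4], [0.3, 0.8]]   # hidden-layer weights
b1 = [0.0, 0.0]
W2 = [0.7, -0.2]                 # output-layer weights
b2 = 0.0

def forward(x):
    h = [sigmoid(W1[i][0] * x[0] + W1[i][1] * x[1] + b1[i]) for i in range(2)]
    o = sigmoid(W2[0] * h[0] + W2[1] * h[1] + b2)
    return h, o

def step(x, target, lr=0.5):
    """One gradient-descent step on squared error."""
    global b2
    h, o = forward(x)
    delta_o = (o - target) * o * (1 - o)          # output-layer gradient
    for i in range(2):
        delta_h = delta_o * W2[i] * h[i] * (1 - h[i])
        W2[i] -= lr * delta_o * h[i]              # adjust edge weights
        for j in range(2):
            W1[i][j] -= lr * delta_h * x[j]
        b1[i] -= lr * delta_h
    b2 -= lr * delta_o

x, target = [1.0, 0.0], 1.0
loss_before = (forward(x)[1] - target) ** 2
for _ in range(100):
    step(x, target)
loss_after = (forward(x)[1] - target) ** 2
```

After repeated steps the output moves toward the target, which is the whole mechanism behind fitting the weighted graph to training data.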
The rxLogisticRegression is a binary and multiclass classification algorithm that can, for example, classify sentiment in feedback. It is a regular regression model in which the variable that determines the category is dependent on one or more independent variables that have a logistic distribution. 

Wednesday, October 25, 2023

 

MLOps:

Most machine learning deployments follow one of two patterns: online inference and batch inference. Both demonstrate MLOps principles and best practices for developing, deploying, and monitoring machine learning models at scale. Development and deployment are distinct from one another, and although the model may be containerized and retrieved for execution during deployment, it can be developed independently of how it is deployed. This separates the concerns of model development from the requirements of the online and batch workloads. Regardless of the technology stack and the underlying resources used during these two phases (typically, they are created in the public cloud), this distinction serves the needs of the model as well.
For example, developing and training a model might require significant compute, but far less is needed when executing it for predictions and outlier detection, activities that are hallmarks of production environments. The workloads that use the model might vary, not just between batch and online processing but from one batch processing stack to another, yet the common operations of collecting MELT data (metrics, events, logs, and traces) and the associated resources stay the same. These include a GitHub repository, Azure Active Directory, cost management dashboards, Key Vaults, and, in this case, Azure Monitor. Resources, and the practices associated with them for security and performance, are left out of this discussion; the standard DevOps guides from the public cloud providers call them out.

Online workloads targeting the model via API calls will usually require the model to be hosted in a container and exposed via API management services. Batch workloads, on the other hand, require an orchestration tool to coordinate the jobs consuming the model. Within the deployment phase, it is usual practice to host more than one environment, such as stage and production, both served by CI/CD pipelines that flow the model from development to its usage. A manual approval is required to advance the model from the stage environment to production. A well-developed model is usually a composite handling three distinct activities: making the prediction, determining data drift in the features, and determining outliers in the features. Mature MLOps also includes processes for explainability, performance profiling, versioning, pipeline automation, and the like. Depending on the resources used for DevOps and the environment, typical artifacts include dockerfiles, templates, and manifests.
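The composite of prediction, drift detection, and outlier detection can be sketched as a single scoring entry point. The 3-sigma thresholds and the simple mean/std statistics below are illustrative assumptions, not a prescribed design, and the data is invented.

```python
# Hypothetical sketch of a "composite" scorer: one entry point returning
# the prediction alongside drift and outlier signals for the features.

class CompositeScorer:
    def __init__(self, predict_fn, train_mean, train_std):
        self.predict_fn = predict_fn
        self.train_mean = train_mean   # per-feature means from training data
        self.train_std = train_std     # per-feature stds from training data
        self.seen = []                 # recent feature vectors, for drift

    def score(self, features):
        self.seen.append(features)
        # outlier: any feature more than 3 sigma from the training mean
        z = [(f - m) / s for f, m, s in
             zip(features, self.train_mean, self.train_std)]
        outlier = any(abs(v) > 3.0 for v in z)
        # drift: recent mean shifted more than one std from training mean
        recent = [sum(col) / len(self.seen) for col in zip(*self.seen)]
        drift = any(abs(r - m) > s for r, m, s in
                    zip(recent, self.train_mean, self.train_std))
        return {"prediction": self.predict_fn(features),
                "outlier": outlier, "drift": drift}

scorer = CompositeScorer(lambda f: int(sum(f) > 1.0),
                         train_mean=[0.5, 0.5], train_std=[0.1, 0.1])
ok = scorer.score([0.55, 0.5])    # near the training distribution
bad = scorer.score([2.0, 2.0])    # far outside it
```

In a real deployment each of the three activities would typically be a separately registered model behind the same endpoint.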

While parts of the solution for MLOps can be internalized by studios and launch platforms, organizations like to invest in specific compute, storage, and networking for their needs. Databricks, Kubernetes, or Azure ML workspaces are used for compute; storage accounts and datastores are used for storage; and separate subnets are used for networking. Outbound internet connectivity from the code hosted and executed in MLOps is usually not required, but it can be provisioned by adding a NAT gateway to the subnet where the code is hosted.

Tuesday, October 24, 2023

The application of data mining and machine learning techniques to Reverse Engineering of IaC.

 



An earlier article introduced the notion and purpose of reverse engineering. This article explains why and how IaC and application code can become quite complex and require reverse engineering.

Software organization seldom appears simple and straightforward, even for microservice architectures. With IaC becoming the de facto standard for describing infrastructure and deployments driven by cross-cutting business objectives, these descriptions can become quite complex, multi-layered, and different in their physical and logical organizations, requiring due diligence in their reverse engineering.

The premise for doing this is similar to what a compiler does in creating a symbol table and maintaining dependencies. Treating the symbols as nodes and their dependencies as edges presents a rich graph on which relationships can be superimposed and queried for different insights. These insights help with a better representation of the knowledge model. Well-known data mining algorithms can assist with this reverse engineering. Even a basic linear or non-linear ranking of the symbols, with thresholding, can be very useful for representing the architecture. 
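The symbols-as-nodes, dependencies-as-edges ranking can be sketched directly. The IaC module names below are invented for illustration, and the ranking is a small hand-rolled power iteration in the PageRank style rather than any particular library's implementation.

```python
# Hypothetical IaC dependency graph: each symbol maps to the symbols it
# depends on. A PageRank-style power iteration then surfaces the most
# depended-upon symbols, which can be thresholded to outline the
# architecture.

deps = {
    "web_app":          ["app_service_plan", "vnet"],
    "function":         ["app_service_plan", "storage"],
    "storage":          ["vnet"],
    "app_service_plan": [],
    "vnet":             [],
}

def rank_symbols(deps, damping=0.85, iters=50):
    nodes = list(deps)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in deps.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share        # pass rank along dependency edges
            else:
                for n in nodes:            # dangling node: spread evenly
                    new[n] += damping * rank[src] / len(nodes)
        rank = new
    return rank

rank = rank_symbols(deps)
top = [n for n, r in sorted(rank.items(), key=lambda kv: -kv[1])]
```

Here the shared networking and compute symbols float to the top, which is exactly the kind of insight a thresholded ranking contributes to the knowledge model.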

We cover just a few of the data mining algorithms to begin with and close with a discussion of machine learning methods, including softmax classification, that can make excellent use of co-occurrence data. Finally, we suggest that this does not need to be a one-pass KDM builder and that the use of pipelines and metrics can help incrementally or continually enhance the KDM. The symbol and dependency graph is merely the persistence of the information learned, which can be leveraged for analysis and reporting, such as rendering a KDM. 

Types of analysis:

Classification algorithms  

Regression algorithms  

Segmentation algorithms  

Association algorithms  

Sequence Analysis Algorithms  

Outliers Mining Algorithm  

Decision tree  

Logistic Regression  

Neural Network  

Naïve Bayes algorithm  

Plugin Algorithms  

Simultaneous classifiers and regions-of-interest regressors  

Collaborative filtering  

Collaborative Filtering via Item-based filtering  

Hierarchical clustering  

NLP algorithms  

Where Lucene search indexes and symbol stores fall short, data mining insights into code organization make up for it with an elaborate knowledge model.

Monday, October 23, 2023

This is a continuation of previous articles on Azure Databricks instances. Azure Databricks is a fast and powerful Apache Spark-based analytics service that makes it easy to rapidly develop and deploy Big Data analytics and artificial intelligence solutions. One can author models in Azure Databricks and register them with mlflow during experimentation for prediction, drift detection, and outlier detection purposes. Databricks monitoring is as complicated as it is important, partly because there are so many destinations for charts and dashboards, some with overlaps. The Azure monitoring service and its associated dashboard provide job latency, stage latency, task latency, sum of task execution per host, task metrics, cluster throughput, streaming throughput/latency, resource consumption per executor, executor compute time metrics, and shuffle metrics. 

Azure’s cost management dashboard for databricks is also helpful to track costs on workspaces and cluster basis which provides a helpful set of utilities for administrative purposes. Tags can be added to workspaces as well as clusters. 

The Apache Spark UI on a Databricks instance also provides information on what's happening in your application. The Spark UI reports the input rate for streaming events, processing time, completed batches, batch details, job details, task details, and so on. The driver log is helpful for exceptions and print statements, while the executor log is helpful for detecting errant and runaway tasks. 

But the most comprehensive set of charts and graphs are provided by Overwatch. 

Overwatch can be considered an analytics project over Databricks. It collects data from multiple data sources such as APIs and cluster logs, enriches and aggregates the data and comes with little or no cost.  

Overwatch provides categorized dashboard pages for workspaces, clusters, jobs, and notebooks. Metrics available on the workspace dashboard include the daily cluster spend chart, cluster spend on each workspace, DBU cost vs compute cost, cluster spend by type on each workspace, cluster count by type on each workspace, and the count of scheduled jobs on each workspace. Metrics available on the cluster dashboard include DBU spend by cluster category, DBU spend by the most expensive cluster per day per workspace, top spending clusters per day, DBU spend by the top 3 most expensive clusters, cluster count in each category, cluster node-type breakdown, cluster node-type breakdown by potential, and cluster node potential breakdown by cluster category. Metrics available on the job dashboard include daily cost of jobs, job count by workspace, and jobs running on interactive clusters. Metrics available on the notebook dashboard include data throughput per notebook path, longest running notebooks, top notebooks returning a lot of data to the UI, spark actions count, notebooks with the largest records, task count by task type, large task count, and the count of jobs executing on notebooks. 

While these can be an overwhelming number of charts from various sources, a curated list must show: 1. Databricks workload types, in terms of job compute for data engineers, jobs light compute for data analysis, and all-purpose compute. 2. Consumption-based charts, in terms of DBU, virtual machines, public IP addresses, blob storage, managed disks, bandwidth, and so on. 3. Pricing plans, in terms of pay-as-you-go and reservations of DBU/DBCU for 1-3 years and/or on a region/duration basis. 4. Tags, in terms of cluster tags, pool tags, and workspace tags; tags can propagate with clusters created from pools as well as clusters not created from pools. 5. Cost calculation, in terms of quantity, effective price, and effective cost. 
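The cost calculation in the last item is simply quantity times effective price, rolled up per meter. The meter names and prices below are made-up sample values, not real Azure or Databricks rates.

```python
# Illustrative cost rollup in terms of quantity, effective price, and
# effective cost. All meters and prices are invented sample data.

usage = [
    {"meter": "DBU - All-purpose Compute", "quantity": 120.0, "effective_price": 0.55},
    {"meter": "DBU - Jobs Compute",        "quantity": 300.0, "effective_price": 0.30},
    {"meter": "Blob Storage (GB-month)",   "quantity": 512.0, "effective_price": 0.02},
]

for row in usage:
    # effective cost = quantity x effective price
    row["effective_cost"] = row["quantity"] * row["effective_price"]

total_cost = sum(row["effective_cost"] for row in usage)
```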

Some of this information can come from the database schema for Overwatch with custom queries such as: 
select sku, isActive, any_value(contract_price) * count(*) as cost 
from overwatch.`DBUcostdetails` 
where isActive = true 
group by sku, isActive; 

These are only some of the ways in which dashboard charts are available.