Tuesday, October 31, 2023

 

The essence of copywriting:

Copywriting can be considered a content production strategy whose goal is to persuade the reader to take a specific action, using triggers that arouse readers’ interest to generate conversions and sales. Copywriting is also an essential part of a digital marketing strategy, with the potential to increase brand awareness, generate higher-quality leads, and acquire new customers. Good copywriting articulates the brand’s messaging and image while tuning into the target audience.

Kristen Fischer, in her book “When Talent Isn’t Enough: Business Basics for the Creatively Inclined,” argues that most creative professionals and scholars can succeed in business on their talent if they know how to create a business blueprint that spells out all their goals. A writer and freelance expert herself, she recognizes that many creative people are unable to sell their talents as well as they write, paint, draft, design, or program. This matters even more when the venture involves entrepreneurship or the advancement of a career. Artistic or scholarly capability does not suffice; business know-how is about delivering quality work and superior customer service. A business might be known by a name, a resume, and a website, and dossiers, emails, correspondence, and newsletters are valuable marketing tools. While her book covers setting a baseline hourly rate and moonlighting as an excellent way to test the freelance market, she is emphatic about articulating what one can and cannot accept so as to draw lines with clients. It is this deep understanding that also promotes the business.

Being good at business is being creative too. Connecting with a prospective client and dotting all the i’s and crossing all the t’s will promote one’s work and make the engagement rewarding. Keeping records builds understanding, both for oneself and for prospective clients. It’s this difference that sets a profession apart from a hobby for creative people. One example that illustrates the difference: a well-written website or business brochure does not equal good leads and targeted marketing. One can also stand apart from the competition through speaking, listening, writing, coaching, analyzing, meditating, and networking. Staying in business means always playing these strengths to the full.

A business blueprint is about strategy and the right business model. It outlines the business objectives, the marketing strategies, the legal needs, a profile of ideal clients, the locations to target, and how to go about transactions. Knowing whom to impress is half the battle in a marketing campaign. Always keep contact information handy in any material about the work, such as a portfolio. A bubbly personality and the ability to carry a conversation can help even with cold-calling. Copywriter Julie Cortes says a client could either love or hate your work; that’s their opinion and they are entitled to it. Contracts and customer service often trump talent.

These are some of the ways in which creative people can improve their business skills.

 

 

Sunday, October 29, 2023

 

This is a summary of the book “The Cold Start Problem” by Andrew Chen, published by Harper Business in 2021. The author, a general partner at Andreessen Horowitz, explores the network effects behind the growth and success of companies such as Reddit, Microsoft, Uber, YouTube, and Craigslist, presenting a detailed study of network effects with examples and insights. A startup in the planning stage will find many useful tips in this book, and once a startup reaches “escape velocity,” Chen provides guidance for sustaining strong results.

Network effects stand in contrast to product feature developments. Traditionally, makers of technology goods have focused on building more and better features based on how users use their products. Instead, networked products focus on user interactions and grow by attracting more users. There are three types of network effects that can drive a product’s success:

1. The acquisition effect, where the user population increases through viral growth, building the company’s economic base.

2. The engagement effect, where users increase their involvement as the network expands. When products scale, re-engaging lapsed users becomes a powerful driver.

3. The economic effect, where growth kicks in as monetization and revenue per user increase.

Ecology may provide a framework for understanding network growth. The critical mass, or “Allee threshold,” is the tipping point: the critical number in a social animal group below which survival prospects wane.

Growing a user base is a different focus than building software. An established competitor can duplicate the features of a startup, but capturing its network is another matter altogether. Network effects impel growth and provide competitive advantage.

The difficulty of attracting initial customers when a new product still lacks customers is called the cold start problem. For example, the number of rideshare drivers in a city is critical: if riders must wait half an hour for a ride, the rideshare company is not providing value. Adding more drivers attracts more riders.

Networked products focus on the experiences that users have with each other, while traditional products focus on how users interact with the software itself. The cold start problem can be overcome by building an “atomic network” before launching a new product: the smallest possible self-sustaining network. Building that first network can be hard, but its mainstream relevance, even if not apparent, is significant. For example, Tiny Speck was building a game whose remote workers communicated using an archaic Internet Relay Chat technology. Although the game was not successful, the company enhanced and adapted its chat tool and named it Slack. The CEO asked friends at other companies to try Slack, and although most of them were startups themselves, Slack’s client network expanded and the product gained more features. When it made its debut, the company signed up 8,000 companies; within a year, it had 135,000 paid subscribers and up to ten thousand daily signups.

Assets also fuel networks. Gaining drivers from competing rideshare companies helped one company gain advantage over another. Even dating apps court attractive people as assets for matchmaking. When a new product succeeds, it is important to consider who uses it and how those users differ from category to category. Marketplaces such as eBay and Airbnb must have sellers: the supply of goods being sold must precede and sustain buyer demand.

Network effects happen only at scale. Zoom, for example, improved on incumbents by letting people join with a link and providing high-quality video; once a few people adopted it, it expanded virally. Strategies for building networks include “invite only,” “come for the tool,” and financial incentives. The invite-only approach fueled LinkedIn’s explosive growth, and financial incentives date back to 1888, when Coca-Cola offered a coupon for a free Coke. The author says hustle and creativity help tip over markets because no two atomic networks are the same. When a product reaches scale, negative forces may impede further expansion: churn, market saturation, regulatory measures, trolling, spamming, and fraud all undermine growth. Networks also suffer from overcrowding; algorithms that optimize for engagement may surface controversial content, and new competitors may try cherry-picking from an incumbent. Finally, growth does not continue forever, but network effects are powerful and undeniable drivers of growth and success.

 

Saturday, October 28, 2023

 

This is a continuation of articles on Infrastructure-as-Code (IaC). There’s no denying that IaC helps to create and manage infrastructure, and that its definitions can be versioned, reused, and shared, all of which helps to provision resources quickly and manage them consistently throughout their lifecycle. Unlike software product code, which must be general-purpose, provide a strong foundation for system architecture, and aspire to be a platform for many use cases, IaC often varies a lot and must be manifested in different combinations depending on environment, purpose, and scale, encompassing the complete development process. It can even include the CI/CD platform, DevOps, and testing tools. The DevOps-based approach is critical to rapid software development cycles, and this makes IaC spread over a variety of forms. The more articulated the IaC, the more predictable and cleaner the deployments.

One challenge somewhat unique to IaC is that authors frequently encounter errors in the ‘apply’ stage that were not detected in the ‘plan’ stage. This leads to write-once-and-fix-many-times and appears unavoidable. The compiler catches only a limited set of errors, such as when a key is specified instead of an id, but often it is only at runtime that an id can be tried and found to be correct or not. A GUID for a principal id is common for role assignments, but whether the GUID is appropriate for a particular role assignment depends on the principal to which it belongs as well as the intended target. One way to overcome this limitation is to have a pre-production environment where the code can be applied in a similar way. Because the non-production environment maintains a separate set of resources from production, sometimes even this is difficult to do. In such cases, some experimentation may be involved, where the IaC is applied once to add and again to remove, leaving behind a clean slate. Both non-production and production environments are secured with DevOps pipelines, so pushing IaC to these environments means raising a request and following through each time. Fortunately, there is a better way: scope down problematic or suspicious IaC snippets and try them out in a personal Azure subscription. This approach removes most doubts and works without the touch points required by pipelines. And since the sandbox is of no concern to the business, organizations often facilitate it for all employees, and public clouds offer free accounts.
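Some of these apply-stage failures can be surfaced earlier with a lightweight pre-flight check. A minimal sketch in Python, using only the standard library: it flags principal ids that are not even well-formed GUIDs before an apply is attempted (the assignment names and values below are illustrative; only the control plane can confirm a well-formed GUID actually belongs to the intended principal).

```python
import uuid

def validate_principal_ids(role_assignments):
    """Pre-flight check: flag principal ids that are not well-formed GUIDs.

    This catches malformed values before 'apply', though whether a valid
    GUID belongs to the intended principal is still only known at runtime.
    """
    errors = []
    for name, principal_id in role_assignments.items():
        try:
            uuid.UUID(principal_id)
        except ValueError:
            errors.append(f"{name}: '{principal_id}' is not a valid GUID")
    return errors

# Illustrative input: a mapping of role assignment names to principal ids.
assignments = {
    "storage-reader": "9b1f5c8e-0000-0000-0000-000000000000",
    "kv-secrets-user": "not-a-guid",
}
print(validate_principal_ids(assignments))
```

A check like this belongs in the same pre-commit or plan-stage tooling that lints the IaC, so that the cheapest class of apply failures never reaches a pipeline.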

Another challenge that routinely requires experimentation is applying permissions to managed identities. Every resource can have its own system-assigned managed identity, but deployments comprising resources and their dependencies can have a common user-assigned managed identity to govern them. In this case, the identity must be granted permission on all those resources. Several built-in roles, varying per resource, are applicable to the environment, but the principle of least privilege can only be honored by increasing privileges step by step. This calls for a gradation of built-in roles to be tried out for successful application deployments.
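The step-by-step gradation can be sketched as a simple search over a role ladder. The role names below are real Azure built-in roles for storage, but the ladder ordering and the `try_deployment` hook are illustrative assumptions standing in for an actual test-environment deployment run.

```python
# Honor least privilege by stepping through built-in roles from least to
# most privileged, stopping at the first one that lets the deployment
# succeed. The ladder uses real Azure built-in role names; try_deployment
# is a hypothetical hook standing in for a real trial deployment.
ROLE_LADDER = [
    "Reader",
    "Storage Blob Data Reader",
    "Storage Blob Data Contributor",
    "Contributor",
]

def find_minimal_role(try_deployment, ladder=ROLE_LADDER):
    """Return the least-privileged role in the ladder that succeeds."""
    for role in ladder:
        if try_deployment(role):
            return role
    return None  # even the broadest role in the ladder was insufficient

# Illustrative: a deployment that needs blob write access.
needs_blob_write = lambda role: role in ("Storage Blob Data Contributor", "Contributor")
print(find_minimal_role(needs_blob_write))  # picks the narrower of the two
```

The value of the sketch is the ordering: starting broad and narrowing down tends to leave excess privileges behind, whereas starting narrow and widening stops exactly at the minimum.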

Similarly, access is also about connectivity, and it might be surprising that an HTTP 404 status code can also imply network failure when the error is translated from an upstream resource. Some resources have mutual exclusivity between public access and private access. Granting public access with restrictions to some IP addresses is a hybrid approach that can be sufficient to secure resources. It is also important to note that trusted Azure services can bypass general deny rules.
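The hybrid posture can be expressed declaratively. A sketch of such a rule set and its evaluation, with property names modeled loosely on Azure storage account network ACLs (the IP addresses and the toy evaluation function are illustrative, not a real enforcement engine):

```python
# A simplified sketch of the hybrid network posture: default-deny with
# selected public ip exceptions, while trusted Azure services bypass the
# deny rules. Property names loosely follow Azure storage network ACLs.
network_acls = {
    "default_action": "Deny",                         # private by default
    "bypass": "AzureServices",                        # trusted services skip deny
    "ip_rules": ["203.0.113.10", "203.0.113.0/24"],   # allowed public sources
}

def is_allowed(matches_ip_rule, is_trusted_azure_service, acls=network_acls):
    """Toy evaluation: trusted services bypass; otherwise ip rules decide."""
    if is_trusted_azure_service and acls["bypass"] == "AzureServices":
        return True
    if matches_ip_rule:
        return True
    return acls["default_action"] == "Allow"

print(is_allowed(matches_ip_rule=False, is_trusted_azure_service=True))
```

The bypass branch is the point to remember when debugging: a request that seems to violate the deny rules may simply be arriving via a trusted service.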

These are some resolutions that can be categorized as miscellaneous under the IaC.

 

Friday, October 27, 2023

Securing compute for azure machine learning workspace:

 

An Azure Machine Learning compute instance is a managed, cloud-based workstation dedicated to a single owner, usually for data analysis. It serves as a fully configured and managed development environment or as a compute target for model training and inference. Models can be built and deployed using integrated notebooks and tools. A compute instance differs from a compute cluster in that it has a single node.

IT administrators prefer this compute for its enterprise-readiness capabilities. They leverage IaC or resource manager templates to create instances for users. Using advanced or security settings, they can further lock down an instance, such as enabling or disabling SSH or specifying the subnet for the compute instance. They might also need to prevent users from creating compute themselves. In all these cases, some control is necessary.

One option is to list the operations available on the resource and then set up role-based access control limiting some of those. This approach is favored because users can be switched between roles without affecting the resource or its deployment. It also works for groups, and users can be added to or removed from both groups and roles. Listing the operations enumerates the associated permissions, all of which begin with the provider as the prefix; this listing is thorough and covers all aspects of working with the resource. A custom role is described in terms of permitted ‘actions’, ‘dataActions’, and ‘notActions’, where the first two correspond to control-plane and data-plane actions and the last subtracts permissions from those otherwise granted. By appropriately selecting the necessary action privileges and listing them under a specific category without the categories overlapping, we create a custom role with just the minimum number of privileges needed to complete a set of selected tasks.
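A custom role of this shape can be sketched as a JSON document. The action strings below follow the Microsoft.MachineLearningServices provider prefix convention, but the exact selection is illustrative rather than a vetted production role, and the subscription id is a placeholder.

```python
import json

# A sketch of a custom role that allows day-to-day use of an Azure ML
# workspace while preventing compute creation. Illustrative, not vetted.
custom_role = {
    "Name": "ML Workspace User (no compute create)",
    "IsCustom": True,
    "Description": "Use the workspace and existing compute; cannot create compute.",
    "Actions": [
        "Microsoft.MachineLearningServices/workspaces/read",
        "Microsoft.MachineLearningServices/workspaces/computes/read",
    ],
    "DataActions": [],
    "NotActions": [
        # subtracted from Actions: blocks creating or deleting compute
        "Microsoft.MachineLearningServices/workspaces/computes/write",
        "Microsoft.MachineLearningServices/workspaces/computes/delete",
    ],
    "AssignableScopes": ["/subscriptions/<subscription-id>"],
}
print(json.dumps(custom_role, indent=2))
```

Keeping the categories disjoint, as the paragraph above recommends, makes the effective permission set easy to reason about: it is simply Actions minus NotActions.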

Another option is to supply an init script with the associated resource, so that as other users start using it, the init script sets the predefined configuration they must work with. This allows some degree of control over sub-resources and associated containers necessary for an action to complete, so that by removing those resources, an action that a role permits on a resource type may still not be possible on a specific resource.

These are some techniques to secure the compute instance for azure machine learning workspace.

 

Thursday, October 26, 2023

Potential applications of machine learning

 The MicrosoftML package provides fast and scalable machine learning algorithms for classification, regression and anomaly detection. 

The rxFastLinear algorithm is a fast linear model trainer based on the Stochastic Dual Coordinate Ascent method. It combines the capabilities of logistic regression and SVM algorithms, solving the dual problem by ascent over scalar convex loss functions adjusted by the regularization of the weight vectors. It supports three types of loss functions: log loss, hinge loss, and smoothed hinge loss. It is used for applications such as payment default prediction and email spam filtering. 
The rxOneClassSVM is used for anomaly detection, such as credit card fraud detection. It is a simple one-class support vector machine that helps detect outliers that do not belong to some target class, because the training set contains only examples from the target class. 
The rxFastTrees is a fast tree algorithm used for binary classification or regression; it can be used for bankruptcy prediction. It is an implementation of FastRank, a form of the MART gradient boosting algorithm. It builds each regression tree in a stepwise fashion using a predefined loss function, which measures the error in the current step and fixes it in the next. 
The rxFastForest is a fast forest algorithm, also used for binary classification or regression; it can be used for churn prediction. It builds several decision trees using the regression tree learner in rxFastTrees. An aggregation over the resulting trees then finds a Gaussian distribution closest to the combined distribution for all trees in the model. 
The rxNeuralNet is a neural network implementation that helps with multiclass classification and regression. It is helpful for applications such as signature prediction, OCR, and click prediction. A neural network is a weighted directed graph arranged in layers, where the nodes in one layer are connected by weighted edges to the nodes in the next layer. The algorithm adjusts the weights on the graph edges based on the training data. 
The rxLogisticRegression is a binary and multiclass classifier that can, for example, classify sentiment in feedback. It is a regular regression model where the variable that determines the category depends on one or more independent variables that have a logistic distribution.
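The same algorithm families exist outside MicrosoftML. As a hedged illustration of the rxOneClassSVM use case above, here is the equivalent anomaly-detection pattern with scikit-learn’s OneClassSVM: train only on examples from the target class, then flag points outside the learned region (the data and parameters are synthetic and illustrative).

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Train a one-class SVM on "normal" points only, mirroring the fraud
# detection scenario: the training set contains no fraud examples.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # target class only

model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal)

# predict returns +1 for inliers and -1 for outliers: score an obvious
# inlier near the training distribution and an obvious outlier far away.
print(model.predict([[0.0, 0.0], [8.0, 8.0]]))
```

The `nu` parameter bounds the fraction of training points treated as outliers, which is the knob that trades false alarms against missed anomalies.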

Wednesday, October 25, 2023

 

MLOps:

Most machine learning deployments comprise two patterns: online inference and batch inference. Both demonstrate MLOps principles and best practices for developing, deploying, and monitoring machine learning models at scale. Development and deployment are distinct from one another, and although the model may be containerized and retrieved for execution during deployment, it can be developed independently of how it is deployed. This separates the concerns of developing the model from the requirements of addressing online and batch workloads. Regardless of the technology stack and the underlying resources used during these two phases (typically they are created in the public cloud), this distinction serves the needs of the model as well.
For example, developing and training a model might require significant compute, but not as much as executing it for predictions and outlier detection, activities that are hallmarks of production environments. The workloads that use the model might vary, not just between batch and online processing but even from one batch processing stack to another, yet the common operations of collecting MELT data (metrics, events, logs, and traces) and the associated resources stay the same. These include a GitHub repository, Azure Active Directory, cost management dashboards, Key Vaults, and, in this case, Azure Monitor. The resources and practices associated with security and performance are left out of this discussion; the standard DevOps guides from the public cloud providers call them out.

Online workloads targeting the model via API calls usually require the model to be hosted in a container and exposed via API management services. Batch workloads, on the other hand, require an orchestration tool to coordinate the jobs consuming the model. Within the deployment phase, it is usual practice to host more than one environment, such as stage and production, both served by CI/CD pipelines that flow the model from development to usage. A manual approval is required to advance the model from the stage to the production environment. A well-developed model is usually a composite handling three distinct activities: making the prediction, determining data drift in the features, and detecting outliers in the features. Mature MLOps also includes processes for explainability, performance profiling, versioning, pipeline automation, and the like. Depending on the resources used for DevOps and the environment, typical artifacts would include dockerfiles, templates, and manifests.
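The composite of prediction, drift detection, and outlier detection can be sketched as a single scoring entry point. In this minimal sketch, the 3-sigma outlier test and mean-shift drift test are illustrative stand-ins for real detectors, and the training statistics would come from the actual training data.

```python
# A sketch of the composite model: one entry point returning the
# prediction plus drift and outlier signals. Thresholds are illustrative.
class CompositeModel:
    def __init__(self, predict_fn, train_mean, train_std):
        self.predict_fn = predict_fn
        self.train_mean = train_mean
        self.train_std = train_std

    def score(self, feature, recent_batch_mean):
        z = abs(feature - self.train_mean) / self.train_std
        drift = abs(recent_batch_mean - self.train_mean) / self.train_std
        return {
            "prediction": self.predict_fn(feature),
            "is_outlier": z > 3.0,          # simple 3-sigma rule
            "drift_detected": drift > 0.5,  # batch mean shifted half a sigma
        }

# Illustrative model: a threshold classifier with made-up training stats.
model = CompositeModel(lambda x: x > 10, train_mean=10.0, train_std=2.0)
print(model.score(feature=25.0, recent_batch_mean=10.2))
```

Packaging all three signals behind one interface is what lets the same container serve both online calls and batch jobs without duplicating the monitoring logic.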

While parts of the solution for MLOps can be internalized by studios and launch platforms, organizations like to invest in specific compute, storage, and networking for their needs. Databricks, Kubernetes, and Azure ML workspaces are used for compute; storage accounts and datastores for storage; and diversified subnets for networking. Outbound internet connectivity from the code hosted and executed in MLOps is usually not required, but it can be provisioned with the addition of a NAT gateway within the subnet where it is hosted.

Tuesday, October 24, 2023

The application of data mining and machine learning techniques to Reverse Engineering of IaC.

 



An earlier article introduced the notion and purpose of reverse engineering. This article explains why and how IaC and application code can become quite complex and require reverse engineering.

Software organization seldom appears simple and straightforward, even for a microservice architecture. With IaC becoming the de facto standard for describing infrastructure and deployments driven by cross-cutting business objectives, these descriptions can become quite complex, multi-layered, differing in their physical and logical organizations, and requiring due diligence in their reverse engineering.

The premise for doing this is like what a compiler does in creating a symbol table and maintaining dependencies. Representing symbols as nodes and their dependencies as edges yields a rich graph on which relationships can be superimposed and queried for different insights. These insights help with a better representation of the knowledge model. Well-known data mining algorithms can assist with this reverse engineering. Even a basic linear or non-linear ranking of the symbols, with thresholding, can be very useful for representing the architecture.
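The symbol-table-as-graph idea with a basic ranking and threshold can be sketched in a few lines. The module names below are made up for illustration; real input would come from parsing the IaC or application code base.

```python
from collections import Counter

# Symbols are nodes; "uses" dependencies are directed edges
# (dependent, dependency). A simple in-degree ranking surfaces the most
# depended-upon symbols. Names here are illustrative.
edges = [
    ("web_app", "network"), ("web_app", "storage"),
    ("function_app", "network"), ("function_app", "storage"),
    ("ml_workspace", "storage"), ("ml_workspace", "key_vault"),
]

in_degree = Counter(dep for _, dep in edges)

# Threshold the ranking: symbols referenced at least twice are "hubs"
# worth surfacing prominently in the knowledge model.
hubs = [sym for sym, deg in in_degree.most_common() if deg >= 2]
print(hubs)  # storage first (3 uses), then network (2 uses)
```

In-degree is the crudest possible ranking; the same graph supports PageRank-style scores or clustering once the edges are persisted.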

We cover just a few of the data mining algorithms to begin with and close with a discussion of machine learning methods, including softmax classification, that can make excellent use of co-occurrence data. Finally, we suggest that this does not need to be a one-pass KDM builder: a pipeline with metrics can help incrementally and continually enhance the KDM. The symbol and dependency graph is merely the persistence of information learned, which can be leveraged for analysis and reporting, such as rendering a KDM.
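The softmax step over co-occurrence data can be shown concretely: raw counts of a symbol co-occurring with each candidate component are converted into a probability distribution over components. The counts below are illustrative.

```python
import math

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)                        # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative co-occurrence counts of one symbol against three
# candidate components of the architecture.
cooccurrence_counts = [12.0, 3.0, 1.0]
probs = softmax(cooccurrence_counts)
print(probs)  # heavily favors the first component
```

In practice the scores would be learned weights over co-occurrence features rather than raw counts, but the normalization step is the same.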

Types of analysis:

Classification algorithms  

Regression algorithms  

Segmentation algorithms  

Association algorithms  

Sequence Analysis Algorithms  

Outliers Mining Algorithm  

Decision tree  

Logistic Regression  

Neural Network  

Naïve Bayes algorithm  

Plugin Algorithms  

Simultaneous classifiers and regions-of-interest regressors  

Collaborative filtering  

Collaborative Filtering via Item-based filtering  

Hierarchical clustering  

NLP algorithms  

Where Lucene search indexes and symbol stores fall short, data mining insights into code organization make up for them with an elaborate knowledge model.

Monday, October 23, 2023

This is a continuation of previous articles on Azure Databricks instances. Azure Databricks is a fast and powerful Apache Spark-based analytics service that makes it easy to rapidly develop and deploy big data analytics and artificial intelligence solutions. One can author models in Azure Databricks and register them with MLflow during experimentation for prediction, drift detection, and outlier detection purposes. Databricks monitoring is as complicated as it is important, partly because there are so many destinations for charts and dashboards, some with overlaps. The Azure monitoring dashboard provides job latency, stage latency, task latency, sum of task execution per host, task metrics, cluster throughput, streaming throughput/latency, resource consumption per executor, executor compute time metrics, and shuffle metrics.

Azure’s cost management dashboard for Databricks is also helpful to track costs on a workspace and cluster basis, which provides a helpful set of utilities for administrative purposes. Tags can be added to workspaces as well as clusters.

The Apache Spark UI on a Databricks instance also provides information on what’s happening in your application: input rate for streaming events, processing time, completed batches, batch details, job details, task details, and so on. The driver log is helpful for exceptions and print statements, while the executor log helps detect errant and runaway tasks.

But the most comprehensive set of charts and graphs are provided by Overwatch. 

Overwatch can be considered an analytics project over Databricks. It collects data from multiple data sources such as APIs and cluster logs, enriches and aggregates the data and comes with little or no cost.  

Overwatch provides categorized dashboard pages for workspaces, clusters, jobs, and notebooks. Metrics available on the workspace dashboard include the daily cluster spend chart, cluster spend on each workspace, DBU cost vs. compute cost, cluster spend by type on each workspace, cluster count by type on each workspace, and count of scheduled jobs on each workspace. Metrics available on the cluster dashboard include DBU spend by cluster category, DBU spend by the most expensive cluster per day per workspace, top spending clusters per day, DBU spend by the top 3 most expensive clusters, cluster count in each category, cluster node-type breakdown, cluster node-type breakdown by potential, and cluster node potential breakdown by cluster category. Metrics available on the job dashboard include daily cost of jobs, job count by workspace, and jobs running on interactive clusters. Metrics available on the notebook dashboard include data throughput per notebook path, longest-running notebooks, top notebooks returning a lot of data to the UI, spark action counts, notebooks with the largest records, task count by task type, large task counts, and count of jobs executing on notebooks.

While these can be an overwhelming number of charts from various sources, a curated list of charts must show:

1. Databricks workload types in terms of job compute for data engineers, jobs light compute for data analysis, and all-purpose compute.

2. Consumption-based charts in terms of DBU, virtual machines, public IP addresses, blob storage, managed disks, bandwidth, etc.

3. Pricing plans in terms of pay-as-you-go and reservations of DBU/DBCU for 1-3 years and/or on a region/duration basis.

4. Tags in terms of cluster tags, pool tags, and workspace tags; tags can propagate with clusters created from pools as well as clusters not created from pools.

5. Cost calculation in terms of quantity, effective price, and effective cost.

Some of this information can come from the database schema for Overwatch with custom queries such as:

select sku, isActive, any_value(contract_price) * count(*) as cost
from overwatch.`DBUcostdetails`
group by sku, isActive
having isActive = true;

These are only some of the ways in which dashboard charts are available.