Wednesday, October 25, 2023

 

MLOps:

Most machine learning deployments follow one of two patterns: online inference and batch inference. Both demonstrate MLOps principles and best practices for developing, deploying, and monitoring machine learning models at scale. Development and deployment are distinct phases: although the model may be containerized and retrieved for execution during deployment, it can be developed independently of how it is deployed. This separates the concerns of model development from the requirements of serving online and batch workloads. Regardless of the technology stack and the underlying resources used during these two phases, which are typically provisioned in the public cloud, this distinction also serves the needs of the model itself.
For example, developing and training a model might require significant compute, though not as much as when it is executed for predictions and outlier detection, activities that are hallmarks of production environments. The workloads that consume the model may also vary, not just between batch and online processing but even from one batch processing stack to another. What stays the same are the common operations of collecting MELT data (metrics, events, logs, and traces) and the associated resources. These include a GitHub repository, Azure Active Directory, cost management dashboards, Key Vaults, and, in this case, Azure Monitor. Resources and the practices associated with them for security and performance are left out of this discussion; the standard DevOps guides from the public cloud providers call them out.

Online workloads that target the model via API calls usually require the model to be hosted in a container and exposed through API management services. Batch workloads, on the other hand, require an orchestration tool to coordinate the jobs consuming the model. Within the deployment phase, it is usual practice to host more than one environment, such as stage and production, both served by CI/CD pipelines that flow the model from development to its usage. A manual approval is required to advance the model from the stage environment to production. A well-developed model is usually a composite handling three distinct activities: making predictions, detecting data drift in the features, and detecting outliers in the features. Mature MLOps also includes processes for explainability, performance profiling, versioning, pipeline automation, and the like. Depending on the resources used for DevOps and the environment, typical artifacts include Dockerfiles, templates, and manifests.
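
A minimal sketch of such a composite scoring entry point is shown below. The helper names (detect_drift, detect_outliers, score) and the z-score thresholds are illustrative assumptions, not part of any particular framework; a real deployment would typically delegate drift and outlier detection to purpose-built libraries.

import numpy as np

def detect_drift(features, baseline_mean, baseline_std, z_threshold=3.0):
    # Flag features whose batch mean deviates strongly from the training baseline.
    batch_mean = features.mean(axis=0)
    z = np.abs(batch_mean - baseline_mean) / (baseline_std + 1e-9)
    return z > z_threshold

def detect_outliers(features, baseline_mean, baseline_std, z_threshold=4.0):
    # Flag individual rows with any feature far outside the training distribution.
    z = np.abs(features - baseline_mean) / (baseline_std + 1e-9)
    return (z > z_threshold).any(axis=1)

def score(model, features, baseline_mean, baseline_std):
    # Composite response: predictions plus drift and outlier signals for monitoring.
    return {
        "predictions": model.predict(features).tolist(),
        "drifted_features": detect_drift(features, baseline_mean, baseline_std).tolist(),
        "outlier_rows": detect_outliers(features, baseline_mean, baseline_std).tolist(),
    }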

While parts of the MLOps solution can be internalized by studios and launch platforms, organizations like to invest in specific compute, storage, and networking for their needs. Databricks, Kubernetes, Azure ML workspaces, and the like are used for compute; storage accounts and datastores are used for storage; and diversified subnets are used for networking. Outbound internet connectivity from the code hosted and executed in MLOps is usually not required, but it can be provisioned by adding a NAT gateway to the subnet where the code is hosted.

Tuesday, October 24, 2023

The application of data mining and machine learning techniques to Reverse Engineering of IaC.

 



An earlier article introduced the notion and purpose of reverse engineering. This article explains why and how IaC and application code can become quite complex and require reverse engineering.

Software organization is seldom simple and straightforward, even for microservice architectures. With IaC becoming the de facto standard for describing infrastructure and deployments that serve cross-cutting business objectives, these assets can become quite complex and multi-layered, differing in their physical and logical organization and requiring due diligence in their reverse engineering.

The premise is similar to what a compiler does when it creates a symbol table and maintains dependencies. Treating the symbols as nodes and their dependencies as edges yields a rich graph on which relationships can be superimposed and queried for different insights. These insights help build a better representation of the knowledge model. Well-known data mining algorithms can assist with this reverse engineering. Even a basic linear or non-linear ranking of the symbols, followed by thresholding, can be very useful for representing the architecture.
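
As a sketch of that basic ranking, the snippet below builds a symbol-and-dependency graph from (dependent, dependee) pairs, ranks symbols by how many other symbols depend on them, and applies a threshold to surface the architecturally significant nodes. The edge list and the threshold value are illustrative assumptions and are not produced by any particular IaC parser.

from collections import defaultdict

# Hypothetical dependency edges extracted from IaC and application code.
edges = [
    ("web_app", "app_service_plan"),
    ("web_app", "key_vault"),
    ("function_app", "app_service_plan"),
    ("function_app", "storage_account"),
    ("databricks_workspace", "storage_account"),
]

# In-degree ranking: symbols that many others depend on rank higher.
in_degree = defaultdict(int)
for dependent, dependee in edges:
    in_degree[dependee] += 1
    in_degree.setdefault(dependent, 0)  # keep leaf dependents in the table too

threshold = 2  # keep only symbols with at least this many dependents
significant = {symbol: count for symbol, count in in_degree.items() if count >= threshold}
print(sorted(significant.items(), key=lambda item: -item[1]))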

We cover just a few of the data mining algorithms to begin with and close with a discussion of machine learning methods, including softmax classification, which can make excellent use of co-occurrence data. Finally, we suggest that this does not need to be a one-pass KDM builder and that the use of a pipeline and metrics can help enhance the KDM incrementally or continually. The symbol and dependency graph is merely the persisted form of what has been learned, which can be leveraged for analysis and reporting, such as rendering a KDM.
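
To make the softmax idea concrete, here is a minimal sketch: each symbol is represented by a vector of co-occurrence counts with a few context terms, and a softmax turns raw affinity scores against candidate categories into probabilities. The symbols, context terms, counts, and category weights are all made up for illustration and do not come from a real codebase.

import numpy as np

# Rows are symbols; columns are co-occurrence counts with three context terms.
cooccurrence = np.array([
    [12.0, 1.0, 0.0],   # "storage_account"
    [2.0, 10.0, 3.0],   # "subnet"
    [0.0, 2.0, 9.0],    # "role_assignment"
])

# Hypothetical category weight vectors (storage, networking, security), learned elsewhere.
weights = np.array([
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.2],
    [0.0, 0.2, 1.0],
])

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

scores = cooccurrence @ weights.T   # raw affinity of each symbol to each category
probabilities = softmax(scores)     # per-symbol distribution over categories
print(probabilities.round(3))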

Types of analysis:

Classification algorithms  

Regression algorithms  

Segmentation algorithms  

Association algorithms  

Sequence Analysis Algorithms  

Outliers Mining Algorithm  

Decision tree  

Logistic Regression  

Neural Network  

Naïve Bayes algorithm  

Plugin Algorithms  

Simultaneous classifiers and regions-of-interest regressors  

Collaborative filtering  

Collaborative Filtering via Item-based filtering  

Hierarchical clustering  

NLP algorithms  

Where Lucene search indexes and the symbol store fall short, data mining insights into code organization make up for it with a more elaborate knowledge model.

Monday, October 23, 2023

This is a continuation of previous articles on Azure Databricks instances. Azure Databricks is a fast and powerful Apache Spark-based analytics service that makes it easy to rapidly develop and deploy big data analytics and artificial intelligence solutions. One can author models in Azure Databricks and register them with MLflow during experimentation for prediction, drift detection, and outlier detection purposes. Databricks monitoring is as complicated as it is important, partly because there are so many destinations for charts and dashboards, some of which overlap. Azure Monitor and its associated dashboard provide job latency, stage latency, task latency, sum of task execution per host, task metrics, cluster throughput, streaming throughput/latency, resource consumption per executor, executor compute time metrics, and shuffle metrics.
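
As a minimal sketch of that registration step from a Databricks notebook, the snippet below trains a small scikit-learn model and logs it to MLflow while registering it in the model registry; the toy model, the features, and the registered model name are illustrative assumptions.

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy training data and model standing in for a real prediction/drift/outlier model.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Log the model as a run artifact and register it in the MLflow Model Registry in one step.
with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="composite-scoring-model",  # hypothetical registry name
    )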

Azure’s cost management dashboard for Databricks is also helpful for tracking costs on a per-workspace and per-cluster basis and provides a useful set of utilities for administrative purposes. Tags can be added to workspaces as well as clusters.

The Apache Spark UI on a Databricks instance also provides information on what is happening in the application: input rate for streaming events, processing time, completed batches, batch details, job details, task details, and so on. The driver log is helpful for exceptions and print statements, while the executor log helps detect errant and runaway tasks.

But the most comprehensive set of charts and graphs are provided by Overwatch. 

Overwatch can be considered an analytics project built on top of Databricks. It collects data from multiple sources such as APIs and cluster logs, enriches and aggregates that data, and comes at little or no cost.

Overwatch provides categorized dashboard pages for workspaces, clusters, jobs, and notebooks.

Workspace dashboard: daily cluster spend chart, cluster spend on each workspace, DBU cost vs. compute cost, cluster spend by type on each workspace, cluster count by type on each workspace, and count of scheduled jobs on each workspace.

Cluster dashboard: DBU spend by cluster category, DBU spend by the most expensive cluster per day per workspace, top spending clusters per day, DBU spend by the top three most expensive clusters, cluster count in each category, cluster node-type breakdown, cluster node-type breakdown by potential, and cluster node potential breakdown by cluster category.

Job dashboard: daily cost of jobs, job count by workspace, and jobs running on interactive clusters.

Notebook dashboard: data throughput per notebook path, longest running notebooks, top notebooks returning a lot of data to the UI, Spark action counts, notebooks with the largest records, task count by task type, large task counts, and count of jobs executing on notebooks.

While the number of charts from these various sources can be overwhelming, a curated list of charts should show:

1. Databricks workload types, in terms of job compute for data engineers, jobs light compute for data analysis, and all-purpose compute.

2. Consumption-based charts, in terms of DBU, virtual machine, public IP address, blob storage, managed disk, bandwidth, and so on.

3. Pricing plans, in terms of pay-as-you-go and reservations of DBU or DBCU for one to three years and/or on a region/duration basis.

4. Tag-based charts, in terms of cluster tags, pool tags, and workspace tags. Tags can propagate to clusters created from pools as well as clusters not created from pools.

5. Cost calculation, in terms of quantity, effective price, and effective cost.

Some of this information can come from the Overwatch database schema with custom queries such as:

-- Approximate cost per SKU: per-unit contract price times the number of matching rows.
SELECT sku,
       isActive,
       any_value(contract_price) * count(*) AS cost
FROM overwatch.`DBUcostdetails`
WHERE isActive = true
GROUP BY sku, isActive;

These are only some of the ways in which dashboard charts are available. 

 

Sunday, October 22, 2023

 

This is a summary of the book “How to stay smart in a smart world – why human intelligence still beats algorithms” written by Gerd Gigerenzer and published by MIT Press in 2022. He is a psychologist known for his work on bounded rationality and directs the Harding Center for Risk Literacy at the University of Potsdam. He is also a partner at Simply Rational – The Decision Institute.

Recent advances in artificial intelligence have juxtaposed a different form of intelligence with ours and pose a question about the role of each. With reactions ranging from openly embracing it to being apprehensive about its prevalence or dominance, the author takes a cautious approach that plays to its strengths and avoids its weaknesses. With several examples and case studies, he argues that one form of intelligence works well in stable environments with well-defined rules, while the other will never lose its relevance outside that world.

The salient points of the book include these assertions: AI excels in stable environments and follows rules dictated by humans; AI systems don’t perform well in dynamic environments filled with uncertainty; humans must try out AI to get the best results; and in unexplored territory, simple and transparent algorithms perform better than complex ones. Among the negative impacts, the ad-based model of social media platforms can be cited. It is possible to keep human interaction and machine operation clearly demarcated; for example, self-driving cars could be given their own dedicated lanes where possible. Market hype and profit incentives can lead companies to overpromise and underdeliver on digital technologies.

AI wins hands down in many games such as chess and Go because the game rules are fixed, it is tuned by human experts, and it uses brute-force calculation to determine the best possible move. The better defined and more stable the premise, the better the performance. The flip side is evident with facial recognition, for instance, which works 99.6% of the time; in dynamic environments, the number drops significantly. When UK police scanned the faces of 170,000 soccer fans in a stadium for matches against a criminal database, 93% of the matches were false.

AI is good at finding correlations in huge amounts of data, even some that would have escaped humans, but it cannot recognize scenarios or deal with ambiguity. For example, Maine’s divorce rate and the United States’ per capita consumption of margarine are significantly correlated, but the correlation makes no sense. It is these false findings by AI that make them even harder to replicate, leading to waste and error in areas such as health science and biotechnology, to the tune of hundreds of billions of dollars. Assertions made today, such as eating blueberries to prevent memory loss, eating bananas to get a higher verbal SAT score, or eating kiwis late at night to sleep better, may be reversed in due time.

Wherever the effectiveness of AI decreases, human intervention can significantly boost its performance. The human brain has a remarkable ability to adapt to constantly changing cues, contexts, and situations in what is termed vicarious functioning. Staying smart means leveraging AI’s unique capabilities while staying in charge. AI lacks four components of common sense: a capacity to think causally, an awareness of others’ intentions and feelings, a basic understanding of space, time, and objects, and a desire to join in group norms. Some tasks, like recommending the nearest restaurant, do not need common sense, but detecting that a person crossing the road in a war zone is a threat requires it.

Complex problems do not always require complex solutions. Google Flu Trends tried to predict the spread of flu with approximately 160 search terms but still overpredicted doctors’ visits. In comparison, an algorithm from the Max Planck Institute for Human Development simply used one data point, recent visits to the doctor from the CDC website, and performed much better in predicting the flu’s spread.

Information served subliminally or without our knowledge has the potential to alter our behavior. This is why the ad-based model for social media can be harmful by creating distractions. With attention-control technology, the user is held captive by these algorithms. Texting while driving caused about 3,000 deaths per year in the United States between 2010 and 2020. In areas other than driving, smartphones have also proven to be very distracting.

Finally, the business aspect of artificial intelligence must be understood in the context of historical trends around killer technologies and the commerce behind them. The author says we should be able to profit from AI without being easily misled by expectations and predictions.

Earlier book summaries: BookSummary10.docx

Saturday, October 21, 2023

 

This is a continuation of articles on Infrastructure-as-Code, aka IaC for short. There’s no denying that IaC helps to create and manage infrastructure and that it can be versioned, reused, and shared, all of which helps to provision resources quickly and manage them consistently throughout their lifecycle. Unlike software product code, which must be general purpose, provide a strong foundation for system architecture, and aspire to be a platform for many use cases, IaC varies widely and must be manifested in different combinations depending on environment, purpose, and scale, and it encompasses the complete development process. It can even include the CI/CD platform, DevOps practices, and testing tools. The DevOps-based approach is critical to rapid software development cycles. This makes IaC spread across a variety of forms. The more articulated the IaC, the more predictable and cleaner the deployments.

One of the greatest advantages of IaC is its elaborate description of the customizations to a resource type for widespread use within an organization. Among the reasons for customization, company policy enforcement, security, and consistency stand out as the main ones. For example, an Azure ML workspace might require some features to be allowed and others disallowed before other members of the organization can start making use of it.

There are several ways to do this. Both ARM templates and Azure CLI commands come with directives to turn features off. In fact, the ‘az feature’ command-line options are available on a provider-by-provider basis to register or unregister specific features. This ability helps separate the experimental feature set from the production one and allows them to be made available independently. Plenty of documentation around these commands, along with the listing of all such features, makes it easy to work with one or more of them directly.

Another option is to display all the configuration corresponding to a resource once it has been selectively turned on from the portal. Since each resource type comes with its own ARM templates as well as a command set, it is easy to leverage the built-in ‘show’ command to list all the properties of the resource type and then edit some of those properties for a different deployment via the ‘update’ command. Even if not all properties are listed for a resource, it is possible to find them in the documentation or by querying many instances of the same resource type.

A third option is to list the operations available on the resource and then set up role-based access control that limits some of them. This approach is favored because users can be switched between roles without affecting the resource or its deployment. It also works for groups, and users can be added to or removed from both groups and roles. Listing the operations enumerates the associated permissions, all of which begin with the provider as the prefix. This listing is thorough and covers all aspects of working with the resource. A custom role is described in terms of permitted ‘actions’, ‘data-actions’, and ‘not-actions’, where the first two correspond to control-plane and data-plane actions and the last subtracts operations from those grants, so the effective permissions exclude them. By appropriately selecting the necessary action privileges and listing them under the right category, without the categories overlapping, we create a custom role with just the minimum privileges needed to complete a set of selected tasks.
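
A minimal sketch of such a custom role is shown below as a Python dictionary serialized to JSON; the role name, the chosen Azure ML operations, and the subscription ID are illustrative assumptions. The resulting file could then be supplied to a command such as az role definition create --role-definition @ml-workspace-operator.json.

import json

# Hypothetical custom role: broad workspace operations, with delete subtracted.
custom_role = {
    "Name": "ML Workspace Operator (example)",
    "IsCustom": True,
    "Description": "Manage Azure ML workspaces without delete rights.",
    "Actions": [
        "Microsoft.MachineLearningServices/workspaces/*",
    ],
    "NotActions": [
        # Subtracted from the Actions above, so the effective permissions exclude delete.
        "Microsoft.MachineLearningServices/workspaces/delete",
    ],
    "DataActions": [],
    "NotDataActions": [],
    "AssignableScopes": ["/subscriptions/00000000-0000-0000-0000-000000000000"],
}

with open("ml-workspace-operator.json", "w") as f:
    json.dump(custom_role, f, indent=2)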

Last but not least is the approach of supplying an init script with the associated resource, so that as other users start using it, the init script sets the predetermined configuration they must work with. This allows some degree of control over sub-resources and the associated containers necessary for an action to complete, so that by removing those resources, an action that is permitted by a role on a resource type may still not be possible on a specific resource.

IaC is not an immutable asset once it has been properly authored. It must be maintained just like any other source code asset. Some of the improvements come from defect fixes and design changes, but in the case of IaC specifically, other changes come from drift detection and cloud security posture management, aka CSPM.

Reference: Earlier articles on IaC shortcomings and resolutions: IacResolutionsPart35.docx

Friday, October 20, 2023

 

This is a summary of the book “Chip War – The Fight for the World’s Most Critical Technology” by Chris Miller, published by Scribner in 2022. He teaches international history at Tufts University’s Fletcher School of Law and Diplomacy.

Tiny as they are, silicon chips, or integrated circuits, are critical to strong national economies and modern militaries. Their history can be traced to the 1960s, when US firms started making chips in Taiwan. The island’s future, and the possibility of conflict with China, may now depend on whether China can reduce its reliance on imported chips and other tech. The miniaturization and fabrication of silicon chips from semiconductors is a modern-day engineering marvel. The industry has a rich tradition of innovation, and while the innovations have been incubated in the West, they have often been produced in the East. Taiwan Semiconductor Manufacturing Company, aka TSMC, handles much of the production of chips and is a subsidized success story. Notably, China still depends on tech products designed in Silicon Valley.

The author says China was disadvantaged by its government’s desire not to build connections with Silicon Valley but to break free of it. Silicon Valley remains the epicenter of the chip industry. Since 1963, when Fairchild began outsourcing fabrication to East Asia, innovations have fueled the chip industry. Gordon Moore, the co-founder of Fairchild, predicted the growth rate of chip power in terms of the maximum number of transistors on a single chip, forecasting that it would double every year between 1965 and 1975. This “Moore’s Law” has proven true for more than 50 years now.

After Bob Noyce and Gordon Moore founded the semiconductor company Intel in Silicon Valley, the company launched the microprocessor, a computer-on-a-chip that included both logic and memory units. Initially, the US government encouraged chip production to move to Japan as part of redevelopment initiatives, but when Japan surpassed US chip production in 1986, that posture was reversed. Cheaper alternatives were provided by Samsung, whose leader Lee Byung-Chul consolidated demand from Silicon Valley and made chips under their brand names.

While Silicon Valley businessman William Perry introduced microprocessors to the US military through DARPA when he joined as Undersecretary of Defense, Lynn Conway and Carver Mead developed a rules-based foundation for software that automates the task of designing chips. DARPA financed a program that helped researchers design chips for production, giving the US a military advantage and keeping Moore’s Law alive. The Russians stepped up espionage to copy designs from Silicon Valley, but mass production and adoption failed, and the copy-it strategy backfired, handing the technological lead back to the US.

TSMC has a strong tradition as a government-backed success story. Globalization of chip fabrication had not occurred, but Taiwanization had. Morris Chang left Texas Instruments to take charge of the chip industry in Taiwan and envisioned a company that would fabricate chips its customers designed: a “foundry” serving “fabless” chip firms.

John Carruthers, an Intel R&D leader, realized that new lithography tools using extreme ultraviolet (EUV) light would be required to drive the next wave of Moore’s Law. Intel never came up with its own EUV lithography tools, but neither did the Japanese rivals Nikon and Canon; the Dutch company ASML became the sole producer of EUV lithography tools. Today, building an advanced logic fab costs twenty billion dollars, a high barrier to entry for many US and global firms. “Fabless” chip firms that design semiconductors and outsource production to TSMC and other foundries have proliferated since the late 1980s, and Apple has gained more than any other company from this outsourcing trend. Today, Samsung and TSMC produce most of the sophisticated processors. The Chinese government devised a plan, Made in China 2025, with the goal of increasing domestic chip production and reducing reliance on imported chips over the ten years ending in 2025. The Chinese state owns and finances many entities that appear to be private equity investment firms but constitute a collective effort to “seize foreign chip firms”. In this way, China has come to play a bigger role in producing non-cutting-edge logic chips.

As the Chinese military stepped up the development and display of technologically sophisticated weapons, the US administration banned the export of US chips to Huawei, a move that devastated the company. TSMC has agreed to open a fab in the US, but national security officials would prefer that TSMC match its capital expenditures across all geographies. Until that happens, the world’s dependency on Taiwan deepens.

Thursday, October 19, 2023

This is a continuation of previous posts on Infrastructure-as-Code, aka IaC for short. There’s no denying that IaC helps to create and manage infrastructure and that it can be versioned, reused, and shared, all of which helps to provision resources quickly and manage them consistently throughout their lifecycle. Unlike software product code, which must be general purpose, provide a strong foundation for system architecture, and aspire to be a platform for many use cases, IaC varies widely and must be manifested in different combinations depending on environment, purpose, and scale, and it encompasses the complete development process. It can even include the CI/CD platform, DevOps practices, and testing tools. The DevOps-based approach is critical to rapid software development cycles. This makes IaC spread across a variety of forms. The more articulated the IaC, the more predictable and cleaner the deployments.

IaC is not an immutable asset once it has been properly authored. It must be maintained just like any other source code asset. Some of the improvements come from defect fixes and design changes, but in the case of IaC specifically, other changes come from drift detection and cloud security posture management, aka CSPM. This article talks about that.

CSPM provides hardening guidance that helps improve the security of a cloud deployment. It is surfaced through the user interface of Microsoft Defender for Cloud, a cloud-native application protection platform with a set of security measures and practices designed to protect cloud-based applications from various cyber threats and vulnerabilities efficiently and effectively. In addition to CSPM, Defender provides DevSecOps solutions for security management at the code level across cloud subscriptions and multiple pipeline environments. Defender also provides a cloud workload protection platform, aka CWPP, with specific protections for servers, containers, storage, databases, and other workloads.

The foundational CSPM capabilities are free. They include asset discovery, continuous assessment and security recommendations for posture hardening, compliance with the Microsoft Cloud Security Benchmark (MCSB), and a secure score that measures the status of the organization’s posture. Optional CSPM plan features include attack path analysis, cloud security explorer, advanced threat hunting, security governance capabilities, and tools to assess security compliance with a wide range of benchmarks, regulatory standards, and any custom security policies required by the organization, industry, or region.

Azure Policy is routinely deployed by organizations as a catch-all when IaC does not enforce consistency or when changes beyond IaC introduce security risks. CSPM allows us to define security conditions that customize a security policy. The policy translates into recommendations that identify resource configurations violating it. Summarizing all the security postures based on these recommendations yields a secure score, and a dashboard shows where the security posture is weak.
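
As a rough illustration of that roll-up (a simplified sketch, not the exact Defender for Cloud formula), each security control can be weighted by its maximum points and scaled by the fraction of resources that satisfy its recommendations; the control names, points, and resource counts below are made up.

# Simplified secure-score style roll-up.
controls = [
    # (control name, max points, healthy resources, total resources)
    ("Enable MFA", 10, 4, 5),
    ("Secure management ports", 8, 2, 4),
    ("Encrypt data in transit", 4, 6, 6),
]

earned = sum(points * healthy / total for _, points, healthy, total in controls)
maximum = sum(points for _, points, _, _ in controls)
print(f"Secure score: {100 * earned / maximum:.1f}%")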

Fixes for the recommendations that come from CSPM can work their way into the IaC at a pace that suits the organization. Care must be taken to rank the recommendations and their fixes by priority and severity. A recommendation is not a hard-and-fast rule, and organizations may achieve the same result as what it prescribes by other means. For example, many resources may be asked to turn on dedicated private networking, but an organization may find it simpler to retain public networking while restricting callers to a set of IP ranges specified as CIDRs. Another approach that works is to identify the set of fixes that most boost the secure score so that those can be made before others.

While going through all the recommendations exhaustively may be time consuming, it is preferable to address those with high priority or severity first. Many factors can determine these rankings, and buy-in from all the stakeholders is helpful in this regard.