Monday, June 3, 2024

 


Problem:

Make Array Zero by Subtracting Equal Amounts

You are given a non-negative integer array nums. In one operation, you must:

Choose a positive integer x such that x is less than or equal to the smallest non-zero element in nums.

Subtract x from every positive element in nums.

Return the minimum number of operations to make every element in nums equal to 0.

 

Example 1:

Input: nums = [1,5,0,3,5]

Output: 3

Explanation:

In the first operation, choose x = 1. Now, nums = [0,4,0,2,4].

In the second operation, choose x = 2. Now, nums = [0,2,0,0,2].

In the third operation, choose x = 2. Now, nums = [0,0,0,0,0].

Example 2:

Input: nums = [0]

Output: 0

Explanation: Each element in nums is already 0 so no operations are needed.

 

Constraints:

1 <= nums.length <= 100

0 <= nums[i] <= 100


import java.util.*;

import java.util.stream.*;

class Solution {

    public int minimumOperations(int[] nums) {
        // Keep only the non-zero values; zeros never need an operation.
        List<Integer> nonZero = Arrays.stream(nums)
                .boxed()
                .filter(x -> x > 0)
                .collect(Collectors.toList());

        int count = 0;

        // Repeatedly subtract the current minimum from every remaining value,
        // dropping anything that reaches zero, until nothing is left.
        while (!nonZero.isEmpty()) {
            int min = nonZero.stream().mapToInt(x -> x).min().getAsInt();
            nonZero = nonZero.stream()
                    .map(x -> x - min)
                    .filter(x -> x > 0)
                    .collect(Collectors.toList());
            count++;
        }

        return count;
    }
}


Input

nums =

[1,5,0,3,5]

Output

3

Expected

3


Input

nums =

[0]

Output

0

Expected

0
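As a cross-check on the simulation above: every operation zeroes out exactly one distinct positive value, so the minimum number of operations equals the count of distinct non-zero values. A minimal sketch of that observation in Python (an illustrative helper, not the submitted solution):

def minimum_operations(nums):
    # Each operation removes exactly one distinct positive value,
    # so the count of distinct non-zero values is the answer.
    return len({x for x in nums if x > 0})

print(minimum_operations([1, 5, 0, 3, 5]))  # 3
print(minimum_operations([0]))              # 0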




SQL Schema

 

Table: Books

+----------------+---------+

| Column Name    | Type    |

+----------------+---------+

| book_id        | int     |

| name           | varchar |

| available_from | date    |

+----------------+---------+

book_id is the primary key of this table.

 

Table: Orders

+----------------+---------+

| Column Name    | Type    |

+----------------+---------+

| order_id       | int     |

| book_id        | int     |

| quantity       | int     |

| dispatch_date  | date    |

+----------------+---------+

order_id is the primary key of this table.

book_id is a foreign key to the Books table.

 

Write an SQL query that reports the books that have sold less than 10 copies in the last year, excluding books that have been available for less than one month from today. Assume today is 2019-06-23.

Return the result table in any order.

The query result format is in the following example.

 

Example 1:

Input: 

Books table:

+---------+--------------------+----------------+

| book_id | name               | available_from |

+---------+--------------------+----------------+

| 1       | "Kalila And Demna" | 2010-01-01     |

| 2       | "28 Letters"       | 2012-05-12     |

| 3       | "The Hobbit"       | 2019-06-10     |

| 4       | "13 Reasons Why"   | 2019-06-01     |

| 5       | "The Hunger Games" | 2008-09-21     |

+---------+--------------------+----------------+

Orders table:

+----------+---------+----------+---------------+

| order_id | book_id | quantity | dispatch_date |

+----------+---------+----------+---------------+

| 1        | 1       | 2        | 2018-07-26    |

| 2        | 1       | 1        | 2018-11-05    |

| 3        | 3       | 8        | 2019-06-11    |

| 4        | 4       | 6        | 2019-06-05    |

| 5        | 4       | 5        | 2019-06-20    |

| 6        | 5       | 9        | 2009-02-02    |

| 7        | 5       | 8        | 2010-04-13    |

+----------+---------+----------+---------------+

Output: 

+-----------+--------------------+

| book_id   | name               |

+-----------+--------------------+

| 1         | "Kalila And Demna" |

| 2         | "28 Letters"       |

| 5         | "The Hunger Games" |

+-----------+--------------------+



SELECT b.book_id, b.name
FROM Books b
LEFT JOIN Orders o
       ON o.book_id = b.book_id
      AND o.dispatch_date >= DATEADD(year, -1, '2019-06-23')
WHERE b.available_from < DATEADD(month, -1, '2019-06-23')
GROUP BY b.book_id, b.name
HAVING COALESCE(SUM(o.quantity), 0) < 10;



Case 1

Input

Books =

| book_id | name | available_from |

| ------- | ---------------- | -------------- |

| 1 | Kalila And Demna | 2010-01-01 |

| 2 | 28 Letters | 2012-05-12 |

| 3 | The Hobbit | 2019-06-10 |

| 4 | 13 Reasons Why | 2019-06-01 |

| 5 | The Hunger Games | 2008-09-21 |

Orders =

| order_id | book_id | quantity | dispatch_date |

| -------- | ------- | -------- | ------------- |

| 1 | 1 | 2 | 2018-07-26 |

| 2 | 1 | 1 | 2018-11-05 |

| 3 | 3 | 8 | 2019-06-11 |

| 4 | 4 | 6 | 2019-06-05 |

| 5 | 4 | 5 | 2019-06-20 |

| 6 | 5 | 9 | 2009-02-02 |

| 7 | 5 | 8 | 2010-04-13 |

Output

| book_id | name |

| ------- | ---------------- |

| 2 | 28 Letters |

| 1 | Kalila And Demna |

| 5 | The Hunger Games |

Expected

| book_id | name |

| ------- | ---------------- |

| 1 | Kalila And Demna |

| 2 | 28 Letters |

| 5 | The Hunger Games |




Sunday, June 2, 2024

This is a continuation of several articles on semantic search for drone formation organization, using elements as reference locations and nodes as predicted positions for drones. The elements can be stored in any non-proprietary vector database; a sample implementation would look something like the following and is also called out in: https://github.com/ravibeta/semantic_search

The first step would be to install all the required packages and libraries. We use Python in this sample:

import warnings
warnings.filterwarnings('ignore')

from datasets import load_dataset
from pinecone import Pinecone, ServerlessSpec
from DLAIUtils import Utils
import DLAIUtils
import os
import time
import torch
from tqdm.auto import tqdm

We assume the elements are mapped as embeddings in a 384-dimensional dense vector space. 
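The snippets below assume a sentence-transformer model is available as model; all-MiniLM-L6-v2 is one model that produces 384-dimensional embeddings, so it is used here purely as an illustrative assumption:

from sentence_transformers import SentenceTransformer

# Assumed embedding model; all-MiniLM-L6-v2 yields 384-dimensional vectors,
# which matches the index dimension used later.
model = SentenceTransformer('all-MiniLM-L6-v2')
print(model.get_sentence_embedding_dimension())  # 384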

A sample query would appear like this: 

query = 'what is node nearest this element?'
xq = model.encode(query)
xq.shape   # (384,)

The next step is to set up the Pinecone vector database and upsert embeddings into it. The database indexes these vectors, which makes search and retrieval easy by comparing values and finding those that are most similar to one another.

utils = Utils()
PINECONE_API_KEY = utils.get_pinecone_api_key()
pinecone = Pinecone(api_key=PINECONE_API_KEY)
INDEX_NAME = 'drone-elements'  # assumed index name for this sample

if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
    pinecone.delete_index(INDEX_NAME)
print(INDEX_NAME)

pinecone.create_index(name=INDEX_NAME, dimension=model.get_sentence_embedding_dimension(), metric='cosine', spec=ServerlessSpec(cloud='aws', region='us-west-2'))
index = pinecone.Index(INDEX_NAME)
print(index)

Then, the next step is to create embeddings for all the elements in the sample space and upsert them to Pinecone. 

batch_size = 200
vector_limit = 10000
elements = element[:vector_limit]  # 'element' is the full list of element descriptions, assumed defined earlier

for i in tqdm(range(0, len(elements), batch_size)):
    i_end = min(i + batch_size, len(elements))
    ids = [str(x) for x in range(i, i_end)]
    metadata = [{'text': text} for text in elements[i:i_end]]
    xc = model.encode(elements[i:i_end])
    records = list(zip(ids, xc, metadata))
    index.upsert(vectors=records)

index.describe_index_stats()

Then the query can be run on the embeddings and the top matches can be returned. 

def run_query(query):
    embedding = model.encode(query).tolist()
    results = index.query(top_k=10, vector=embedding, include_metadata=True, include_values=False)
    for result in results['matches']:
        print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

run_query('what is node nearest this element?')

With this, the embeddings-based search over elements is ready. In Azure, Cosmos DB offers a similar semantic search capability and can act as a comparable vector database.

The following code outlines the steps using Azure AI Search:

# configure the vector store settings; the vector name is used as the search index name
# (the imports below are assumed for this sample)
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_openai import AzureOpenAIEmbeddings

endpoint: str = "<AzureSearchEndpoint>"
key: str = "<AzureSearchKey>"
index_name: str = "<VectorName>"
credential = AzureKeyCredential(key)
client = SearchClient(endpoint=endpoint,
                      index_name=index_name,
                      credential=credential)


# create embeddings 

embeddings: AzureOpenAIEmbeddings = AzureOpenAIEmbeddings(

    azure_deployment=azure_deployment,

    openai_api_version=azure_openai_api_version,

    azure_endpoint=azure_endpoint,

    api_key=azure_openai_api_key,

)

# create vector store

vector_store = AzureSearch(

    azure_search_endpoint=endpoint,

    azure_search_key=key,

    index_name=index_name,

    embedding_function=embeddings.embed_query,

)

# create a query

docs = vector_store.similarity_search(

    query=userQuery,

    k=3,

    search_type="similarity",

)

# persist the matching documents; 'collections' is assumed to be a Cosmos DB
# (MongoDB API) collection handle defined elsewhere
collections.insert_many(docs)


Saturday, June 1, 2024

Automation can also be achieved with Azure Data Factory (ADF), a self-hosted integration runtime comprising a VM hosted on-premises, and a Script activity. While typically associated with data transformation activities, a self-hosted integration runtime can run arbitrary scripts, and invoking it from ADF guarantees human and programmatic access from anywhere with cloud connectivity. A self-hosted integration runtime is a component that connects on-premises or Azure VM data sources with cloud services in a secure and managed way.

The JSON syntax for defining a Script activity looks something like this:

   "name": "<activity name>", 

   "type": "Script", 

   "linkedServiceName": { 

      "referenceName": "<name>", 

      "type": "LinkedServiceReference" 

    }, 

   "typeProperties": { 

      "scripts" : [ 

         { 

            "text": "<Script Block>", 

            "type": "<Query> or <NonQuery>", 

            "parameters":[ 

               { 

                  "name": "<name>", 

                  "value": "<value>", 

                  "type": "<type>", 

                  "direction": "<Input> or <Output> or <InputOutput>", 

                  "size": 256 

               }, 

               ... 

            ] 

         }, 

         ... 

      ],     

         ... 

         ] 

      }, 

      "scriptBlockExecutionTimeout": "<time>",  

      "logSettings": { 

         "logDestination": "<ActivityOutput> or <ExternalStore>", 

         "logLocationSettings":{ 

            "linkedServiceName":{ 

               "referenceName": "<name>", 

               "type": "<LinkedServiceReference>" 

            }, 

            "path": "<folder path>" 

         } 

      } 

    } 

}

The output can be collected every time a script block is executed. There is a 5,000-row / 4 MB size limit, but this is sufficient for most purposes.


A sample invocation of the corresponding REST API from Python would be something like this:

#!/usr/bin/python
import requests

# Set your ADF details
subscription_id = '<subscription_id>'
resource_group = '<resource_group>'
factory_name = '<factory_name>'

# Set the pipeline name you want to trigger
pipeline_name = 'your_pipeline_name'

# Construct the API URL
api_url = f"https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{resource_group}/providers/Microsoft.DataFactory/factories/{factory_name}/pipelines/{pipeline_name}/createRun?api-version=2017-03-01-preview"

# An Azure AD bearer token for https://management.azure.com is required;
# how it is obtained (Azure CLI, managed identity, or a service principal) is left to the caller.
access_token = '<access_token>'
headers = {'Authorization': f'Bearer {access_token}', 'Content-Type': 'application/json'}

# Make the POST request
response = requests.post(api_url, headers=headers)

# Check the response status
if response.status_code == 200:
    print("Pipeline triggered successfully!")
else:
    print(f"Error triggering pipeline. Status code: {response.status_code}")
# EOF


Friday, May 31, 2024

This is a continuation of the articles on IaC shortcomings and resolutions. In this section, we focus on the deployment of Azure Machine Learning workspaces with virtual network peerings. When peerings are established, traffic from any source in one virtual network can flow to any destination in the other. This is very helpful when egress must leave from a single virtual network. Any number of virtual networks can be peered in a hub-and-spoke model or as transit, but each topology has its drawbacks and advantages. The impact this has on the infrastructure for Azure ML deployments is usually not called out, and there can be quite a few surprises in the normal functioning of the workspace. Some of the previous articles explained these from the workspace side; in this section, we describe the network side in more detail, specifically the configuration options for peering.

When a local virtual network is peered with a remote virtual network, four options are presented to the user, of which only the first is selected by default and the rest remain unselected. Unfortunately, the default settings are not always appropriate for every situation and deserve special attention. These four options are:

1. Allow local network to access remote network

2. Allow local network to receive forwarded traffic from remote network

3. Allow gateway or route server in local network to forward traffic to remote network

4. Allow local network to use remote network’s gateway or route server.

Local and remote are interchangeable here, and these options are repeated for the opposite direction as well, with both sections of four choices each appearing on the ‘Add Peering’ page. This gives complete control over treating the local and remote networks asymmetrically rather than as a symmetric, bidirectionally equal configuration.

Now, let’s revisit the options themselves, assuming we have picked one of the networks as local. If the first option is not selected, there is no peering, because traffic does not flow at all for the local network. This option is therefore selected by default in both sections; it can be overridden selectively by the cloud network contributor role, though this is seldom done.

The second option is necessary for Microsoft hosts such as login.microsoftonline.com (Microsoft Entra ID) and management.azure.com (Azure Portal and Azure Resource Manager) to reach the local network. Failing to select it will result in incomplete handshakes during authentication as users begin to use resources in the local network.

The third and fourth options direct egress traffic to a gateway or route server. Often, a designated third virtual network is chained behind the remote and local networks for its firewall. When the firewall is enabled, configuring the gateway or route server helps ensure that all resources use it as their next hop. Setting this option allows the local network to use that single gateway or route server for all chained virtual networks. The difference between the third and fourth options is simply whether the gateway or route server happens to be in the local or the remote network. Both can also be selected, with preference given to the local over the remote appliance because the third option takes effect before the fourth.
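For completeness, the four checkboxes roughly correspond to four booleans on the peering resource. A minimal sketch using the azure-mgmt-network Python SDK, with placeholder names and the assumption that a DefaultAzureCredential can authenticate, would be:

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import SubResource, VirtualNetworkPeering

# Placeholder subscription, resource group, and network names.
client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.virtual_network_peerings.begin_create_or_update(
    "<resource-group>",
    "local-vnet",
    "local-to-remote",
    VirtualNetworkPeering(
        remote_virtual_network=SubResource(id="<remote-vnet-resource-id>"),
        allow_virtual_network_access=True,   # option 1
        allow_forwarded_traffic=True,        # option 2
        allow_gateway_transit=False,         # option 3
        use_remote_gateways=False,           # option 4
    ),
).result()

A second peering, with its own four booleans, would be created on the remote network for the opposite direction.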

In this way, the peering configuration has complete control over the traffic between the participating networks. Traffic can optionally be observed with the help of Network Watcher. This completes the discussion of the network-side and workspace-side configuration options for ensuring full connectivity to the compute and successful code execution on those hosts.


Thursday, May 30, 2024

 

This is a summary of the book titled “Be Data Analytical: How to use analytics to turn data into value,” written by Jordan Morrow and published by Kogan Page in 2023. The author is a data expert who empowers organizations by elevating their data literacy and supporting an ethos of curiosity and experimentation. He argues that decision making must combine human intuition with data analytics. A data-driven culture that supports curiosity and experimentation must be nurtured. Descriptive analytics must capture and communicate meaningful patterns and trends. Outperform your competition with diagnostic analytics to uncover root causes. Explore multiple outcomes with predictive analytics to improve strategic decision making. Build better descriptive, diagnostic, predictive, and prescriptive analytics in six steps. Apply your data and analytics mindset to your life.

Data-driven activities involve leveraging data and analytics to assist in decision-making, allowing individuals and organizations to make better data-informed decisions. To improve decision capabilities, progress through four levels of analytics: descriptive, diagnostic, predictive, and prescriptive. Nurture a data-driven culture that supports curiosity and experimentation, aiming to build a "data and analytics mindset" that encourages experimentation and making mistakes.

Data-driven cultures should align with data ethics, embracing transparency and questioning data rigorously. Descriptive analytics can be used to capture and communicate meaningful patterns and trends, with various roles playing a part in generating them. Data analysts, data scientists, data architects, and leaders can all contribute to generating descriptive analytics.

To create a data-driven culture, embrace the democratization of data, giving everyone access to the information they need. By embracing data ethics and transparency and fostering a culture of data literacy, organizations can problem-solve effectively with data.

Diagnostic analytics is a crucial tool for organizations to uncover root causes and make informed decisions. It helps organizations understand the reasons behind various phenomena, enabling them to make more informed decisions. This can be achieved using tools like Tableau, Microsoft Power BI, and Qlik, as well as coding languages like R and Python. Predictive analytics is another powerful tool for strategic decision-making, allowing organizations to anticipate supply-chain challenges and forecast credit card delinquency rates. Leaders play a significant role in driving better predictive analytics, requiring data literacy and data-driven decision-making. Data science platforms like RapidMiner can be used to perform predictive analytics, allowing users to understand data visually. While not everyone in the organization will build predictive analytics, democratizing predictions can ensure the right parties have access to the necessary information. Prescriptive analytics, which uses machine learning to make recommendations and create action steps, can also be beneficial. However, it's important to remember that predictions are not prophecies and should be communicated clearly.

Prescriptive analytics is a powerful tool that can be used to make decisions based on patterns and trends. However, it is essential to maintain the human element in analytics, since it preserves the freedom to change course, whether that means adjusting your workout regimen or downsizing your company. Everyone at your company plays a role in building these analytics, from C-suite executives to data analysts, engineers, and data scientists. To build better analytics, follow six steps:

1. Awareness: Ensure staff are familiar with the four levels of analytics, their problems, and solutions.

2. Understanding: Understand how each phase of data analytics fits within the bigger picture, helping you achieve broader goals.

3. Assessing: Evaluate personal skills and the organization as a whole, identifying gaps to fill.

4. Questioning: Improve each phase of analytics by asking questions about data quality, purpose, and future implications.

5. Learning: Gain data literacy and improve problem-solving abilities.

6. Implementation: Don't waste valuable insights; execute data-informed decisions.

Applying a data and analytics mindset to your life is crucial, as failures present opportunities to improve and refine your approach to data analytics.

Previous book summary: BookSummary99.docx

My writing: MLOps3.docx

 

Wednesday, May 29, 2024

This is a continuation of articles on IaC shortcomings and resolutions. In this section too, we focus on deploying Azure Machine Learning workspaces with virtual network peering and securing them with proper connectivity. When peerings are established between virtual networks and the Azure ML workspace is secured with a subnet dedicated to compute creation, improper settings of private and service endpoints, firewalls, NSGs, and user-defined routes may cause quite a few surprises in the normal functioning of the workspace. For example, data scientists may encounter an error such as: “Performing interactive authentication. Please follow the instructions on the terminal. To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code XXYYZZAA to authenticate.” Even if they complete the device login, the resulting message will tell them they cannot be authenticated at this time. Proper configuration of the workspace and the traffic is essential to overcome this error.

One of the main deterrents to completing pass-through authentication is the resolution of DNS names to IP addresses for routing the return traffic. Since public-plane connectivity is terminated at the workspace, traffic to and from the compute goes over the private plane. A private DNS lookup is required to resolve the IP address of the private endpoint to the workspace. When the private endpoint is created, DNS zone records for the predetermined domain prefixes and their corresponding private IP addresses, as determined by the private endpoint, must be registered. These records are auto-registered when the endpoint is suitably created; otherwise they must be added manually.
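A quick way to confirm that the private DNS records are in effect is to resolve the workspace FQDN from inside the virtual network and check that a private address comes back. A minimal sketch in Python, with a hypothetical placeholder FQDN:

import ipaddress
import socket

# Placeholder FQDN; the actual name comes from the workspace's private endpoint DNS configuration.
fqdn = "<workspace-guid>.workspace.<region>.api.azureml.ms"

ip = socket.gethostbyname(fqdn)
print(fqdn, "->", ip, "| private:", ipaddress.ip_address(ip).is_private)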

With only the compute and the clusters having private IP connectivity on the subnet, outbound connectivity can be established through the workspace in an unrestricted setting, or through a firewall in a conditional-egress setting. The subnet that the compute and clusters are provisioned from must have connectivity to the subnet hosting the storage account, key vault, and Azure Container Registry that are internal to the workspace. A subnet can even have its own NAT gateway so that all outbound access gets the same IP address prefix, which is very helpful for securing the destination with an IP rule on that prefix for incoming traffic. The storage account and key vault can grant access via their service endpoints to the compute and cluster's private IP addresses, while the container registry must have a private endpoint for private-plane connectivity to the compute. A dedicated image-build compute can be created for designated image-building activities.
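As an illustration of the service-endpoint part, the compute subnet can be updated to carry service endpoints for Storage and Key Vault. A minimal sketch with the azure-mgmt-network SDK, using placeholder names and assuming the subnet's other properties are fetched and preserved:

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import ServiceEndpointPropertiesFormat

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Fetch the existing subnet so its current settings are not overwritten.
subnet = client.subnets.get("<resource-group>", "<vnet-name>", "<compute-subnet>")
subnet.service_endpoints = [
    ServiceEndpointPropertiesFormat(service="Microsoft.Storage"),
    ServiceEndpointPropertiesFormat(service="Microsoft.KeyVault"),
]
client.subnets.begin_create_or_update(
    "<resource-group>", "<vnet-name>", "<compute-subnet>", subnet
).result()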

User-defined routing and the local hosts file become pertinent when a firewall is used to secure outbound traffic. A local hosts file entry with the private IP address of the compute and a name like ‘mycomputeinstance.eastus.instances.azureml.ms’ is one option for connecting to the virtual network containing the workspace. It is also important to set user-defined routes when a firewall is used: the default route must use ‘0.0.0.0/0’ so that all outbound internet traffic reaches the private IP address of the firewall as its next hop. This allows the firewall to inspect all outbound traffic, and security policies can then allow or deny traffic selectively.
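A minimal sketch of that default route with the azure-mgmt-network SDK, again with placeholder names and assuming a route table already exists and is associated with the compute subnet:

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import Route

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Send all outbound internet traffic to the firewall's private IP as the next hop.
client.routes.begin_create_or_update(
    "<resource-group>",
    "<route-table-name>",
    "default-to-firewall",
    Route(
        address_prefix="0.0.0.0/0",
        next_hop_type="VirtualAppliance",
        next_hop_ip_address="<firewall-private-ip>",
    ),
).result()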


Tuesday, May 28, 2024

This is a summary of the book titled “The AI Playbook: Mastering the Art of Machine Learning Deployment,” written by Eric Siegel and published by MIT Press in 2024. Prof. Siegel urges business and tech leaders to come out of their silos and collaborate to harness the full potential of the machine learning models that will transform their organizations and optimize their operations. He provides a step-by-step framework for doing so, which includes establishing a value-driven deployment goal by leveraging “backward planning,” collaborating on a specific prediction goal, finding the right evaluation metrics, preparing the data to achieve the desired outcomes, training the model to detect patterns, deploying the model so that there is full-stack buy-in from stakeholder departments across the organization, and committing to a strong ethical compass for maintaining the models.

Machine Learning (ML) opportunities require collaboration between business and data professionals. Business professionals need a holistic understanding of the ML process, including models, metrics, and data collection. Data professionals must broaden their perspective on ML to understand its potential to transform the entire business. BizML, a six-step business approach, bridges gaps between the business and data ends of an organization. It focuses on organizational execution and complements the Cross Industry Standard Process for Data Mining (CRISP-DM). Successful ML and AI projects require "backward planning" to establish a value-driven deployment goal. ML's applications extend beyond predicting business outcomes, addressing social issues like abuse or neglect. After choosing how to apply ML, stakeholders with decision-making power should approve it, focusing on the gains ML can make rather than fixating on the technology.

Business and tech leaders should collaborate to specify a prediction goal for machine learning (ML) projects. This involves defining the goal in detail, identifying viable prediction goals, and adhering to the "Law of ML Planning." Keep deployment, and the way the predictions will shape business operations, at the forefront of the project. Consider potential ethical issues, such as the potential for predictive policing models to inflate the likelihood of Black parolees being rearrested.

For new ML projects, consider creating a binary model or binary classifier that makes predictions by answering yes/no questions. Other predictive models, such as numerical or continuous models, can also be used.

Evaluating the model's performance is crucial to determining its success. Accuracy is not the best way to measure it: a model with high accuracy may be doing only slightly better than random guessing, so metrics such as "lift" and "cost" should be used to evaluate the model's performance.

To train a machine learning (ML) model, ensure that the data is long, wide, and labeled. This will help the model accurately predict outcomes and identify patterns. The data may be structured or unstructured, and be wary of "noise" or corrupt data that may be causing issues.

Teach the ML model to detect patterns in a sensible way, as ML algorithms learn from your data and use patterns to make predictions. Understanding your model is not always straightforward, but if the patterns your model detects and uses to make predictions are reliable, you don't necessarily need to establish causation.

Familiarize yourself with different modeling methods, such as decision trees, linear regression, and logistic regression. Investigate your models to ensure they don't contain bugs, as some models may combine input variables in problematic ways. For example, a model designed to distinguish huskies from wolves using images may turn out to be labeling every image with snow as "wolf" and every image without snow as "husky."

To deploy an AI model, it's crucial to gain full-stack cooperation and buy-in from all team members within your organization. Building trust in the model is essential, as it can automate decision-making processes. Humans still play a role in some processes, and deploying a "human-in-the-loop" approach allows them to make operational decisions after integrating data from the model. Deployment risk can be mitigated by using a control group or incremental deployment. Maintaining the model is essential to prevent model drift, which can occur when the data used degrades. To avoid discrimination, ensure the model doesn't operate in a discriminatory way, aiming to equally represent different groups and avoid inferring sensitive attributes. Aspire to use data ethically and responsibly, based on empathy.