Tuesday, June 4, 2024

Subarray Sum Equals K

Given an array of integers nums and an integer k, return the total number of subarrays whose sum equals k.

A subarray is a contiguous non-empty sequence of elements within an array. 

Example 1: 

Input: nums = [1,1,1], k = 2 

Output: 2 

Example 2: 

Input: nums = [1,2,3], k = 3 

Output: 2 

Constraints: 

1 <= nums.length <= 2 * 10^4

-1000 <= nums[i] <= 1000 

-10^7 <= k <= 10^7

 

class Solution {
    public int subarraySum(int[] nums, int k) {
        if (nums == null || nums.length == 0) return 0;
        // prefix sums: sums[i] = nums[0] + ... + nums[i]
        int[] sums = new int[nums.length];
        int sum = 0;
        for (int i = 0; i < nums.length; i++) {
            sum += nums[i];
            sums[i] = sum;
        }
        // check every subarray nums[i..j]; its sum is nums[i] + (sums[j] - sums[i])
        int count = 0;
        for (int i = 0; i < nums.length; i++) {
            for (int j = i; j < nums.length; j++) {
                int current = nums[i] + (sums[j] - sums[i]);
                if (current == k) {
                    count += 1;
                }
            }
        }
        return count;
    }
}

 

[1,3], k=1 => 1 

[1,3], k=3 => 1 

[1,3], k=4 => 1 

[2,2], k=4 => 1 

[2,2], k=2 => 2 

[2,0,2], k=2 => 4 

[0,0,1], k=1=> 3 

[0,1,0], k=1=> 4 

[0,1,1], k=1=> 3 

[1,0,0], k=1=> 3 

[1,0,1], k=1=> 4 

[1,1,0], k=1=> 3 

[1,1,1], k=1=> 3 

[-1,0,1], k=0 => 2 

[-1,1,0], k=0 => 3 

[1,0,-1], k=0 => 2 

[1,-1,0], k=0 => 3 

[0,-1,1], k=0 => 3 

[0,1,-1], k=0 => 3 
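The brute force above runs in O(n^2). A standard O(n) alternative counts prefix sums with a hash map: each earlier prefix whose sum equals the current prefix minus k closes one subarray summing to k. A minimal sketch of that idea in Python (an alternative approach, not the Java submission above), checked against a few of the hand-worked cases:

from collections import defaultdict

def subarray_sum(nums, k):
    # counts[p] = number of prefixes seen so far whose sum is p
    counts = defaultdict(int)
    counts[0] = 1  # the empty prefix
    prefix = 0
    total = 0
    for x in nums:
        prefix += x
        # each earlier prefix equal to prefix - k closes a subarray summing to k
        total += counts[prefix - k]
        counts[prefix] += 1
    return total

assert subarray_sum([1, 1, 1], 2) == 2
assert subarray_sum([2, 0, 2], 2) == 4
assert subarray_sum([-1, 1, 0], 0) == 3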

 

 

 



Monday, June 3, 2024

 Problem::

Make Array Zero by Subtracting Equal Amounts

You are given a non-negative integer array nums. In one operation, you must:

Choose a positive integer x such that x is less than or equal to the smallest non-zero element in nums.

Subtract x from every positive element in nums.

Return the minimum number of operations to make every element in nums equal to 0.

 

Example 1:

Input: nums = [1,5,0,3,5]

Output: 3

Explanation:

In the first operation, choose x = 1. Now, nums = [0,4,0,2,4].

In the second operation, choose x = 2. Now, nums = [0,2,0,0,2].

In the third operation, choose x = 2. Now, nums = [0,0,0,0,0].

Example 2:

Input: nums = [0]

Output: 0

Explanation: Each element in nums is already 0 so no operations are needed.

 

Constraints:

1 <= nums.length <= 100

0 <= nums[i] <= 100


import java.util.*;
import java.util.stream.*;

class Solution {
    public int minimumOperations(int[] nums) {
        // keep only the positive values; zeros never need an operation
        List<Integer> list = Arrays.stream(nums).boxed().collect(Collectors.toList());
        var nonZero = list.stream().filter(x -> x > 0).collect(Collectors.toList());
        int count = 0;
        while (nonZero.size() > 0) {
            // subtract the current minimum from every remaining element and drop the zeros
            var min = nonZero.stream().mapToInt(x -> x).min().getAsInt();
            nonZero = nonZero.stream().map(x -> x - min).filter(x -> x > 0).collect(Collectors.toList());
            count++;
        }
        return count;
    }
}
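Each operation zeroes every occurrence of the current minimum, so the answer is simply the number of distinct non-zero values in nums. A minimal sketch of that observation in Python (an alternative to the stream-based Java above):

def minimum_operations(nums):
    # every operation eliminates exactly one distinct positive value
    return len({x for x in nums if x > 0})

assert minimum_operations([1, 5, 0, 3, 5]) == 3
assert minimum_operations([0]) == 0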


Input

nums =

[1,5,0,3,5]

Output

3

Expected

3


Input

nums =

[0]

Output

0

Expected

0


 






SQL Schema

 

Table: Books

+----------------+---------+

| Column Name    | Type    |

+----------------+---------+

| book_id        | int     |

| name           | varchar |

| available_from | date    |

+----------------+---------+

book_id is the primary key of this table.

 

Table: Orders

+----------------+---------+

| Column Name    | Type    |

+----------------+---------+

| order_id       | int     |

| book_id        | int     |

| quantity       | int     |

| dispatch_date  | date    |

+----------------+---------+

order_id is the primary key of this table.

book_id is a foreign key to the Books table.

 

Write an SQL query that reports the books that have sold less than 10 copies in the last year, excluding books that have been available for less than one month from today. Assume today is 2019-06-23.

Return the result table in any order.

The query result format is in the following example.

 

Example 1:

Input: 

Books table:

+---------+--------------------+----------------+

| book_id | name               | available_from |

+---------+--------------------+----------------+

| 1       | "Kalila And Demna" | 2010-01-01     |

| 2       | "28 Letters"       | 2012-05-12     |

| 3       | "The Hobbit"       | 2019-06-10     |

| 4       | "13 Reasons Why"   | 2019-06-01     |

| 5       | "The Hunger Games" | 2008-09-21     |

+---------+--------------------+----------------+

Orders table:

+----------+---------+----------+---------------+

| order_id | book_id | quantity | dispatch_date |

+----------+---------+----------+---------------+

| 1        | 1       | 2        | 2018-07-26    |

| 2        | 1       | 1        | 2018-11-05    |

| 3        | 3       | 8        | 2019-06-11    |

| 4        | 4       | 6        | 2019-06-05    |

| 5        | 4       | 5        | 2019-06-20    |

| 6        | 5       | 9        | 2009-02-02    |

| 7        | 5       | 8        | 2010-04-13    |

+----------+---------+----------+---------------+

Output: 

+-----------+--------------------+

| book_id   | name               |

+-----------+--------------------+

| 1         | "Kalila And Demna" |

| 2         | "28 Letters"       |

| 5         | "The Hunger Games" |

+-----------+--------------------+



SELECT DISTINCT b.book_id, b.name
FROM Books b
LEFT JOIN Orders o ON b.book_id = o.book_id
GROUP BY b.book_id, b.name,
         DATEDIFF(day, DATEADD(year, -1, '2019-06-23'), o.dispatch_date),
         DATEDIFF(day, b.available_from, DATEADD(month, -1, '2019-06-23'))
HAVING SUM(o.quantity) IS NULL
    OR DATEDIFF(day, DATEADD(year, -1, '2019-06-23'), o.dispatch_date) < 0
    OR (DATEDIFF(day, DATEADD(year, -1, '2019-06-23'), o.dispatch_date) > 0
        AND DATEDIFF(day, b.available_from, DATEADD(month, -1, '2019-06-23')) > 0
        AND SUM(o.quantity) < 10);



Case 1

Input

Books =

| book_id | name | available_from |

| ------- | ---------------- | -------------- |

| 1 | Kalila And Demna | 2010-01-01 |

| 2 | 28 Letters | 2012-05-12 |

| 3 | The Hobbit | 2019-06-10 |

| 4 | 13 Reasons Why | 2019-06-01 |

| 5 | The Hunger Games | 2008-09-21 |

Orders =

| order_id | book_id | quantity | dispatch_date |

| -------- | ------- | -------- | ------------- |

| 1 | 1 | 2 | 2018-07-26 |

| 2 | 1 | 1 | 2018-11-05 |

| 3 | 3 | 8 | 2019-06-11 |

| 4 | 4 | 6 | 2019-06-05 |

| 5 | 4 | 5 | 2019-06-20 |

| 6 | 5 | 9 | 2009-02-02 |

| 7 | 5 | 8 | 2010-04-13 |

Output

| book_id | name |

| ------- | ---------------- |

| 2 | 28 Letters |

| 1 | Kalila And Demna |

| 5 | The Hunger Games |

Expected

| book_id | name |

| ------- | ---------------- |

| 1 | Kalila And Demna |

| 2 | 28 Letters |

| 5 | The Hunger Games |




Sunday, June 2, 2024

This is a continuation of several articles on OpenAI-based search for drone formation organization, using elements as reference locations and nodes as predicted positions for drones. The elements can be stored in any non-proprietary vector database; a sample implementation would look something like the following and is also called out in: https://github.com/ravibeta/semantic_search

The first step is to install and import the required packages and libraries. We use Python in this sample:

import warnings
warnings.filterwarnings('ignore')
from datasets import load_dataset
from pinecone import Pinecone, ServerlessSpec
from DLAIUtils import Utils
import DLAIUtils
import os
import time
import torch
from tqdm.auto import tqdm

We assume the elements are mapped as embeddings in a 384-dimensional dense vector space. 
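The snippets below use a model object that the original does not define; as an assumption, a sentence-transformers encoder such as all-MiniLM-L6-v2 produces embeddings of exactly this dimensionality:

from sentence_transformers import SentenceTransformer

# assumed encoder; any model that emits 384-dimensional embeddings would work here
model = SentenceTransformer('all-MiniLM-L6-v2')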

A sample query would appear like this: 

query = "what is node nearest this element?"
xq = model.encode(query)
xq.shape
# (384,)

The next step is to set up the Pinecone vector database and upsert embeddings into it. The database indexes the vectors, which makes search and retrieval easy by comparing values and finding those that are most like one another.

utils = Utils()
PINECONE_API_KEY = utils.get_pinecone_api_key()
pinecone = Pinecone(api_key=PINECONE_API_KEY)  # client handle, missing in the original snippet
INDEX_NAME = 'drone-elements'  # assumed index name; use any valid name
if INDEX_NAME in [index.name for index in pinecone.list_indexes()]:
    pinecone.delete_index(INDEX_NAME)
print(INDEX_NAME)
pinecone.create_index(name=INDEX_NAME,
                      dimension=model.get_sentence_embedding_dimension(),
                      metric='cosine',
                      spec=ServerlessSpec(cloud='aws', region='us-west-2'))
index = pinecone.Index(INDEX_NAME)
print(index)

Then, the next step is to create embeddings for all the elements in the sample space and upsert them to Pinecone. 

batch_size = 200
vector_limit = 10000
# `element` is assumed to be the full list of element descriptions loaded earlier
elements = element[:vector_limit]

import json

for i in tqdm(range(0, len(elements), batch_size)):
    i_end = min(i + batch_size, len(elements))
    ids = [str(x) for x in range(i, i_end)]
    metadata = [{'text': text} for text in elements[i:i_end]]
    xc = model.encode(elements[i:i_end])
    records = zip(ids, xc, metadata)
    index.upsert(vectors=records)

index.describe_index_stats()

Then the query can be run on the embeddings and the top matches can be returned. 

def run_query(query):
    embedding = model.encode(query).tolist()
    results = index.query(top_k=10, vector=embedding, include_metadata=True, include_values=False)
    for result in results['matches']:
        # the metadata key matches the 'text' field stored during upsert
        print(f"{round(result['score'], 2)}: {result['metadata']['text']}")

run_query("what is node nearest this element?")

With this, the embeddings-based search over elements is ready. In Azure, Cosmos DB offers a similar semantic search capability and can serve as a comparable vector database.

The following code outlines the steps using Azure AI Search:

# configure the vector store settings; the vector name is the index of the search
# (imports assume the azure-search-documents and langchain packages)
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores.azuresearch import AzureSearch

endpoint: str = "<AzureSearchEndpoint>"
key: str = "<AzureSearchKey>"
index_name: str = "<VectorName>"
credential = AzureKeyCredential(key)
client = SearchClient(endpoint=endpoint,
                      index_name=index_name,
                      credential=credential)

# create embeddings
embeddings: AzureOpenAIEmbeddings = AzureOpenAIEmbeddings(
    azure_deployment=azure_deployment,
    openai_api_version=azure_openai_api_version,
    azure_endpoint=azure_endpoint,
    api_key=azure_openai_api_key,
)

# create vector store
vector_store = AzureSearch(
    azure_search_endpoint=endpoint,
    azure_search_key=key,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)

# create a query
docs = vector_store.similarity_search(
    query=userQuery,
    k=3,
    search_type="similarity",
)

# `collections` is assumed to be a document-store collection handle (e.g., Cosmos DB for MongoDB)
# created elsewhere; Document objects may need converting to plain dicts before insertion
collections.insert_many(docs)


Saturday, June 1, 2024

Automation can also be achieved with Azure Data Factory (ADF) together with a self-hosted integration runtime, which comprises a VM hosted on-premises, and a Script activity. While typically associated with data transformation activities, a self-hosted integration runtime can run any script, and invoking it from ADF guarantees human and programmatic access from anywhere with cloud connectivity. A self-hosted integration runtime is a component that connects on-premises or Azure VM data sources with cloud services in a secure and managed way.

The JSON syntax for defining a Script activity looks something like this:

   "name": "<activity name>", 

   "type": "Script", 

   "linkedServiceName": { 

      "referenceName": "<name>", 

      "type": "LinkedServiceReference" 

    }, 

   "typeProperties": { 

      "scripts" : [ 

         { 

            "text": "<Script Block>", 

            "type": "<Query> or <NonQuery>", 

            "parameters":[ 

               { 

                  "name": "<name>", 

                  "value": "<value>", 

                  "type": "<type>", 

                  "direction": "<Input> or <Output> or <InputOutput>", 

                  "size": 256 

               }, 

               ... 

            ] 

         }, 

         ... 

      ],     

         ... 

         ] 

      }, 

      "scriptBlockExecutionTimeout": "<time>",  

      "logSettings": { 

         "logDestination": "<ActivityOutput> or <ExternalStore>", 

         "logLocationSettings":{ 

            "linkedServiceName":{ 

               "referenceName": "<name>", 

               "type": "<LinkedServiceReference>" 

            }, 

            "path": "<folder path>" 

         } 

      } 

    } 

}

The output can be collected every time a script block is executed. There is a 5,000-row / 4 MB size limit, but this is sufficient for most purposes.
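As a hedged sketch (assuming the azure-mgmt-datafactory SDK, an identity usable by DefaultAzureCredential, and placeholder resource names), the pipeline can be triggered and each activity's output, including the Script activity result sets, read back afterwards:

from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

# placeholder names; substitute your own
subscription_id = '<subscription_id>'
resource_group = '<resource_group>'
factory_name = '<factory_name>'
pipeline_name = 'your_pipeline_name'

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# trigger the pipeline and remember the run id
run = adf_client.pipelines.create_run(resource_group, factory_name, pipeline_name, parameters={})

# later: query the activity runs for that pipeline run and read each activity's output
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow() + timedelta(days=1))
activity_runs = adf_client.activity_runs.query_by_pipeline_run(
    resource_group, factory_name, run.run_id, filters)
for activity in activity_runs.value:
    print(activity.activity_name, activity.status)
    print(activity.output)  # contains the result sets returned by the script block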


Alternatively, the pipeline can be triggered directly over REST; a sample call (shown here with Python requests rather than curl) would be something like this:

#!/usr/bin/python

import requests


# Set your ADF details

subscription_id = '<subscription_id>'

resource_group = '<resource_group>'

factory_name = '<factory_name>'


# Set the pipeline name you want to trigger

pipeline_name = 'your_pipeline_name'


# Construct the API URL

api_url = f"https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{resource_group}/providers/Microsoft.DataFactory/factories/{factory_name}/pipelines/{pipeline_name}/createRun?api-version=2018-06-01"


# Acquire an Azure AD token for ARM; the createRun call is rejected without authentication
from azure.identity import DefaultAzureCredential
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

# Make the POST request
response = requests.post(api_url,
                         headers={"Authorization": f"Bearer {token}",
                                  "Content-Type": "application/json"},
                         json={})


# Check the response status

if response.status_code == 200:

    print("Pipeline triggered successfully!")

else:

    print(f"Error triggering pipeline. Status code: {response.status_code}")

## EOF


Friday, May 31, 2024

This is a continuation of the series on IaC shortcomings and resolutions. In this section, we focus on the deployment of Azure Machine Learning workspaces with virtual network peerings. When a peering is established, traffic from any source in one virtual network can flow to any destination in the peered network, which is very helpful when egress must leave from a single virtual network. Any number of virtual networks can be peered in a hub-and-spoke model or as transit networks, each with its own drawbacks and advantages. The impact this has on the infrastructure for Azure ML deployments is usually not called out, and there can be quite a few surprises in the normal functioning of the workspace. Some of the previous articles explained these from the workspace side; in this section, we describe the network side in more detail, specifically the configuration options that come with peering.

When a local virtual network is peered with a remote virtual network, then there are four options presented to the user out of which only the first is selected and the rest remain unselected. Unfortunately, the default settings are not always appropriate for every situation and deserve special attention. These four options are:

1. Allow local network to access remote network

2. Allow local network to receive forwarded traffic from remote network

3. Allow gateway or route server in local network to forward traffic to remote network

4. Allow local network to use remote network’s gateway or route server.

Now, local and remote are interchangeable, and these options are repeated for the opposite direction as well, with both sections of four choices each appearing on the ‘Add Peering’ page. This gives complete control over all aspects of treating the local and remote networks asymmetrically rather than as a symmetrical, bidirectionally equal configuration.

Now, let’s revisit the options themselves, assuming we have picked one of the networks as local. If the first option is not selected, there is no peering, because traffic does not flow at all for the local network. This option is therefore selected by default in both sections; it can be overridden selectively by the network contributor role, but this is seldom done.

The second option is necessary for Microsoft hosts such as login.microsoftonline.com (Microsoft Entra ID) and management.azure.com (the Azure portal and Azure Resource Manager) to reach the local network. If it is not selected, handshakes during authentication will be incomplete as users begin to use resources in the local network.

The third and fourth options direct egress traffic through a gateway or route server. Often, a designated third virtual network is chained behind the remote and local networks for its firewall. When the firewall is enabled, configuring the gateway or route server helps ensure that all resources use it as their next hop, and setting this option allows the local network to use that single gateway or route server for all chained virtual networks. The difference between the third and fourth options is only whether the gateway or route server sits in the local or the remote network. Both can be selected, with a preference for the local over the remote appliance because the third option takes effect before the fourth.
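These four choices correspond one-to-one to boolean properties on the peering object; a minimal sketch, assuming the azure-mgmt-network Python SDK and placeholder names, of creating the local-to-remote half of a peering:

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import VirtualNetworkPeering, SubResource

# placeholder identifiers
subscription_id = '<subscription_id>'
resource_group = '<resource_group>'
local_vnet = '<local_vnet_name>'
remote_vnet_id = '/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<remote_vnet_name>'

network_client = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

peering = VirtualNetworkPeering(
    remote_virtual_network=SubResource(id=remote_vnet_id),
    allow_virtual_network_access=True,  # option 1: allow the local network to access the remote network
    allow_forwarded_traffic=True,       # option 2: receive traffic forwarded by the remote network
    allow_gateway_transit=False,        # option 3: let a local gateway/route server forward to the remote network
    use_remote_gateways=True            # option 4: use the remote network's gateway or route server
)

network_client.virtual_network_peerings.begin_create_or_update(
    resource_group, local_vnet, 'local-to-remote', peering).result()

The reverse direction is configured the same way from the remote network's side, which is where the asymmetry described above is expressed.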

In this way, the peering configuration gives complete control over the traffic between the participating networks. Traffic can optionally be observed with the help of a network watcher. This completes the discussion of the network-side and workspace-side configuration options for ensuring full connectivity to the compute and successful code execution on those hosts.


Thursday, May 30, 2024

 

This is a summary of the book titled “Be Data Analytical: How to use analytics to turn data into value” written by Jordan Morrow and published by Kogan Page in 2023. The author is a data expert who empowers organizations by elevating their data literacy levels and supporting an ethos of curiosity and experimentation. He argues that decision making must combine human intuition and data analytics. A data-driven culture that supports curiosity and experimentation must be nurtured. Descriptive analytics must capture and communicate meaningful patterns and trends. Outperform your competition with diagnostic analytics to uncover root causes. Explore multiple outcomes with predictive analytics to improve strategic decision making. Build better descriptive, diagnostic, predictive and prescriptive analytics in six steps. Apply your data and analytics mindset to your life.

Data-driven activities involve leveraging data and analytics to assist in decision-making, allowing individuals and organizations to make better data-informed decisions. To improve decision capabilities, progress through four levels of analytics: descriptive, diagnostic, predictive, and prescriptive. Nurture a data-driven culture that supports curiosity and experimentation, aiming to build a "data and analytics mindset" that encourages experimentation and making mistakes.

Data-driven cultures should align with data ethics, embracing transparency and questioning data rigorously. Descriptive analytics can be used to capture and communicate meaningful patterns and trends, with various roles playing a part in generating the data. Data analysts, data scientists, data architects, and leaders can all contribute to generating descriptive analytics.

To create a data-driven culture, embrace the democratization of data, giving everyone access to the information they need. By embracing data ethics and transparency and fostering a culture of data literacy, organizations can problem-solve effectively with data.

Diagnostic analytics is a crucial tool for organizations to uncover root causes and make informed decisions. It helps organizations understand the reasons behind various phenomena, enabling them to make more informed decisions. This can be achieved using tools like Tableau, Microsoft Power BI, and Qlik, as well as coding languages like R and Python. Predictive analytics is another powerful tool for strategic decision-making, allowing organizations to anticipate supply-chain challenges and forecast credit card delinquency rates. Leaders play a significant role in driving better predictive analytics, requiring data literacy and data-driven decision-making. Data science platforms like RapidMiner can be used to perform predictive analytics, allowing users to understand data visually. While not everyone in the organization will build predictive analytics, democratizing predictions can ensure the right parties have access to the necessary information. Prescriptive analytics, which uses machine learning to make recommendations and create action steps, can also be beneficial. However, it's important to remember that predictions are not prophecies and should be communicated clearly.

Prescriptive analytics is a powerful tool that can be used to make decisions based on patterns and trends. However, it is essential to maintain the human element in analytics, as it allows for the freedom to change your workout regimen and downsize your company. Everyone at your company plays a role in building these analytics, from C-suite executives to data analysts, engineers, and data scientists. To build better analytics, follow six steps:

1. Awareness: Ensure staff are familiar with the four levels of analytics, their problems, and solutions.

2. Understanding: Understand how each phase of data analytics fits within the bigger picture, helping you achieve broader goals.

3. Assessing: Evaluate personal skills and the organization as a whole, identifying gaps to fill.

4. Questioning: Improve each phase of analytics by asking questions about data quality, purpose, and future implications.

5. Learning: Gain data literacy and improve problem-solving abilities.

6. Implementation: Don't waste valuable insights and execute data-informed decisions.

Applying a data and analytics mindset to your life is crucial, as failures present opportunities to improve and refine your approach to data analytics.

Previous book summary: BookSummary99.docx

My writing: MLOps3.docx