Cluster computing

Friday, October 18, 2024

This is a continuation of a series of articles on IaC shortcomings and resolutions. In this section, we discuss ways to transfer data from one Azure Managed Instance for Apache Cassandra cluster in a virtual network to another in a different network. The network separation for the Cassandra resource type only serves to elaborate on the steps needed to generalize the data transfer.

Data in a Cassandra cluster is organized as keyspaces and tables. The first approach is the direct one: use a command-line client such as cqlsh to interact with the clusters. The steps are to export the tables as CSV files from the source server and import them into the destination server.

Example:

Step 1. At source server:

USE <keyspace>;

COPY <keyspace>.<table_name> TO 'path/to/file.csv' WITH HEADER = true;

Step 2. At destination server:

USE <keyspace>;

CREATE TABLE <table_name> (

column1 datatype1,

column2 datatype2,

...

PRIMARY KEY (column1)

);

COPY <keyspace>.<table_name> (column1, column2, ...) FROM 'path/to/file.csv' WITH HEADER = true;
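When there are more than a handful of tables, the COPY statements from the two steps above can be generated rather than typed. The following is a minimal sketch in Python, assuming the tables already exist at the destination. The helper name copy_statements and the export_dir layout are hypothetical and are not part of cqlsh.

```python
from pathlib import PurePosixPath

def copy_statements(keyspace, tables, export_dir="exports"):
    """Return (export, import) lists of cqlsh COPY statements, one pair per table."""
    exports, imports = [], []
    for table in tables:
        # One CSV file per table, e.g. exports/keyspace1/table1.csv
        path = PurePosixPath(export_dir) / keyspace / f"{table}.csv"
        exports.append(f"COPY {keyspace}.{table} TO '{path}' WITH HEADER = true;")
        imports.append(f"COPY {keyspace}.{table} FROM '{path}' WITH HEADER = true;")
    return exports, imports

exports, imports = copy_statements("keyspace1", ["table1", "table2"])
print("\n".join(exports))  # statements to run in cqlsh against the source server
print("\n".join(imports))  # statements to run in cqlsh against the destination server
```

The generated text can be saved to a file and passed to cqlsh, which keeps the export and import lists in step with each other.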

The other option is to read the data from one server and save it to the destination without a local artifact. This option involves running a copy activity in a Databricks notebook using Apache Spark.

Example:

from pyspark.sql import SparkSession

# Initialize the Spark session with the source cluster as the default connection
spark = SparkSession.builder \
    .appName("Copy Cassandra Data") \
    .config("spark.cassandra.connection.host", "<source-cassandra-host>") \
    .config("spark.cassandra.connection.port", "9042") \
    .config("spark.cassandra.auth.username", "<source-username>") \
    .config("spark.cassandra.auth.password", "<source-password>") \
    .getOrCreate()

# List of keyspaces and tables to copy
keyspaces = ["keyspace1", "keyspace2"]
tables = ["table1", "table2"]

for keyspace in keyspaces:
    for table in tables:
        # Read data from the source Cassandra cluster
        df = spark.read \
            .format("org.apache.spark.sql.cassandra") \
            .options(keyspace=keyspace, table=table) \
            .load()

        # Write data to the target Cassandra cluster. The connection settings
        # are passed with .option() because dotted keys are not valid Python
        # keyword arguments.
        df.write \
            .format("org.apache.spark.sql.cassandra") \
            .options(keyspace=keyspace, table=table) \
            .option("spark.cassandra.connection.host", "<target-cassandra-host>") \
            .option("spark.cassandra.connection.port", "9042") \
            .option("spark.cassandra.auth.username", "<target-username>") \
            .option("spark.cassandra.auth.password", "<target-password>") \
            .mode("append") \
            .save()

# Stop the Spark session
spark.stop()

Note, however, that we started out with the source and destination in different networks. So, if the Databricks workspace is also tethered to the same network as one of the servers, it will not be able to reach the other server. One way around that is to peer the networks, but that usually affects other resources and is not always possible. Another option is to add private endpoints, but the source and destination might be connected to delegated subnets, which rules that option out. Consequently, we must add a step that stages the data in a third location that both networks can access, such as a storage account over public IP networking.

An export to such an intermediary would appear as follows:

from pyspark.sql import SparkSession

# Set up the Spark session
spark = SparkSession.builder \
    .appName("Export Cassandra to Azure Storage") \
    .config("spark.cassandra.connection.host", "<cassandra-host>") \
    .config("spark.cassandra.connection.port", "9042") \
    .config("spark.cassandra.auth.username", "<username>") \
    .config("spark.cassandra.auth.password", "<password>") \
    .getOrCreate()

# Define the Azure Storage account details
storage_account_name = "<storage-account-name>"
storage_account_key = "<storage-account-key>"
container_name = "<container-name>"

# Configure the storage account
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net", storage_account_key)

# Define keyspaces and tables to export
keyspaces = ["keyspace1", "keyspace2"]
tables = ["table1", "table2"]

# Export each table to CSV and upload to Azure Storage
for keyspace in keyspaces:
    for table in tables:
        # Read data from Cassandra
        df = spark.read \
            .format("org.apache.spark.sql.cassandra") \
            .options(keyspace=keyspace, table=table) \
            .load()

        # Define the output path (Spark writes a directory of part files at this path)
        output_path = f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/{keyspace}/{table}.csv"

        # Write data to CSV
        df.write \
            .csv(output_path, header=True, mode="overwrite")

# Stop the Spark session
spark.stop()

Lastly, it does not matter whether an agent or an intermediary stash is used for the data transfer, but the size and the number of tables do matter for the reliability of the transfer, especially if the connection or the execution can be interrupted. Whichever option is chosen, the copying logic must be made robust.
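One way to make the copying logic tolerate interruptions is to retry each table a few times and to record the completed tables so that a rerun resumes where the last run stopped. The following is a minimal sketch, not a definitive implementation. The function copy_all, the copy_table callback, and the JSON checkpoint file are all assumptions introduced for illustration. It relies on the fact that Cassandra writes are upserts keyed by the primary key, so repeating a partially copied table overwrites rows rather than duplicating them.

```python
import json
import time
from pathlib import Path

def copy_all(tables, copy_table, checkpoint_path, max_attempts=3, delay_seconds=0.0):
    """Copy each (keyspace, table) pair once, skipping pairs already recorded as done."""
    checkpoint = Path(checkpoint_path)
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    for keyspace, table in tables:
        name = f"{keyspace}.{table}"
        if name in done:
            continue  # finished in an earlier run
        for attempt in range(1, max_attempts + 1):
            try:
                copy_table(keyspace, table)  # hypothetical callback doing the actual copy
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up; the checkpoint keeps the earlier progress
                time.sleep(delay_seconds * attempt)  # simple linear backoff
        done.add(name)
        # Persist progress after every table so an interruption loses at most one table
        checkpoint.write_text(json.dumps(sorted(done)))
    return done
```

In the Spark examples, copy_table would wrap the read and write for a single keyspace and table, and the checkpoint file would live somewhere that survives a notebook restart.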

Thursday, October 17, 2024

Maximum Sum With Exactly K Elements

A 0-indexed integer array nums and an integer k are given. The task is to perform the following operation exactly k times in order to maximize your score:

Select an element m from nums.
Remove the selected element m from the array.
Add a new element with a value of m + 1 to the array.
Increase your score by m.

The maximum score that can be achieved after performing the operation exactly k times must be returned.

Example 1:

Input: nums = [1,2,3,4,5], k = 3

Output: 18

Explanation: We need to choose exactly 3 elements from nums to maximize the sum.

For the first iteration, we choose 5. Then sum is 5 and nums = [1,2,3,4,6]

For the second iteration, we choose 6. Then sum is 5 + 6 and nums = [1,2,3,4,7]

For the third iteration, we choose 7. Then sum is 5 + 6 + 7 = 18 and nums = [1,2,3,4,8]

So, we will return 18.

It can be proven that 18 is the maximum answer that we can achieve.

Example 2:

Input: nums = [5,5,5], k = 2

Output: 11

Explanation: We need to choose exactly 2 elements from nums to maximize the sum.

For the first iteration, we choose 5. Then sum is 5 and nums = [5,5,6]

For the second iteration, we choose 6. Then sum is 5 + 6 = 11 and nums = [5,5,7]

So, we will return 11.

It can be proven that 11 is the maximum answer that we can achieve.

Constraints:

1 <= nums.length <= 100
1 <= nums[i] <= 100
1 <= k <= 100

import java.util.Arrays;

class Solution {
    public int maximizeSum(int[] nums, int k) {
        if (nums == null || nums.length == 0 || k <= 0) return 0;
        // The largest element is always the best pick, and it grows by 1 after each pick
        Arrays.sort(nums);
        int sum = 0;
        int val = nums[nums.length - 1];
        for (int i = 0; i < k; i++) {
            sum += val;
            val += 1;
        }
        return sum;
    }
}

Nums = [3], k = 3 => sum = 12

Nums = [1,2,3], k = 3 => sum = 12

Nums = [-1,-1,-1], k = 3 => sum = 0

Nums = [-1,0,1], k = 1 => sum = 1
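Because the picks are always max, max + 1, ..., max + k - 1, the loop can be replaced by the sum of an arithmetic series, k * max + k * (k - 1) / 2. The following Python sketch shows that closed form; the function name maximize_sum is chosen here for illustration.

```python
def maximize_sum(nums, k):
    """Sum of the k picks max, max + 1, ..., max + k - 1 as an arithmetic series."""
    if not nums or k <= 0:
        return 0
    largest = max(nums)
    return k * largest + k * (k - 1) // 2

print(maximize_sum([1, 2, 3, 4, 5], 3))  # 18
print(maximize_sum([5, 5, 5], 2))        # 11
```

This avoids both the sort and the loop, so the cost is a single pass to find the maximum.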

Wednesday, October 16, 2024

This is a summary of the book “Cloud Ethics” written by Louise Amoore and published by Duke University Press in 2020. Most people in the cloud computing industry recognize that algorithms are mainstream when it comes to decision making and governance of human activity, and those who build algorithms and models know that bias creeps in from the data. The author challenges the notion that these biases are a fixable glitch. She goes on to explore how the self-generating value judgements which develop from ongoing algorithm-human interactions form a locus-point for ethicopolitics. A geographically-located understanding of the cloud does not solve the problem of oversight. Algorithmic reasoning works to bring possible links to light rather than confirm the existence of a link. Machine learning algorithms inextricably connect to human practices. Learning algorithms become self-authoring entities prone to hallucinations as they interact with the world. Seeming errors in output are not deviations but are intrinsic to the algorithms’ adaptive, generative abilities. Before an algorithm makes a decision, doubt and uncertainty flourish in a liminal space in which ethical intervention is possible. Cloud ethics allows individuals to intervene in and take responsibility for an algorithm’s future.

Cloud computing has the potential to analyze complex digital data, but its geographically-located understanding does not solve the problem of oversight. Algorithmic reasoning, which works to bring possible links to light, allows for a more comprehensive understanding of the cloud. By analyzing the threads of power in the present world, algorithms can extract patterns and features from data, determining targets of opportunity, commercial, and governmental interest. These algorithms delineate between the probable and improbable, offering clear actions in response to overwhelming data sets. Algorithmic reasoning is causal, allowing for error and allowing for the creation of new information. For example, algorithms can scrape social media for potential threats, making future events more accessible for law enforcement. All conclusions are malleable and actionable, making cloud computing a valuable tool for addressing privacy concerns and ensuring the protection of users' data.

Machine learning algorithms are closely linked to human practices, as they learn from and with humans and other machines. The ethical issues surrounding machine learning arise from how it shifts the concept of humanness, as it allows robots to perform feats beyond human capabilities. Learning algorithms become self-authoring entities, and while some call for the elimination of biases in algorithms, they require biases to determine what is meaningful. Humans provide initial training data sets and adjust the weighting of certain data inputs, while learning machines adjust parameters and modify their own code in response to data inputs. The output of learning machines is creative and can lead to new inferences, associations, biases, and outcomes. The ethics of the cloud require acknowledging that the output results from infinitely changeable inputs and parameters, and that alternate futures remain possible, regardless of the output.

Algorithms' seemingly crazed outputs are not deviations but intrinsic to their adaptive, generative abilities. They constantly change limits over time in response to new inputs, making the incalculable future seem knowable. When an algorithmic decision causes harm, it results from a system premised upon making calculated decisions in "conditions of nonknowledge." Doubt and uncertainty flourish in a liminal space in which ethical intervention is possible. Algorithms' "truth claims" are based upon their "ground truth" data, which is the training data from which they produce their model of the world. In this sense, the algorithm removes doubt by staying true to its "ground truth data." The ethicopolitical import of bringing doubts inherent in the algorithmic decision-making process to the surface is highlighted by Richard Feynman's investigation of the 1986 Challenger disaster. Cloud ethics stresses the ever-incomplete nature of algorithmic decision-making, pointing to the moments in the decision-making process where a different weighting might have produced a different output. People must identify the moments where future possibilities remain open, allowing the parts that comprise the final output to show their limits.

Cloud ethics allows individuals to take responsibility for an algorithm's future, challenging social scientists and scholars to alter the weights, parameters, and assumptions of algorithms. Cloud ethics emphasizes the infinite, ever-shifting nature of attributes and rejects the notion that individuals, groups, or society can be reduced to their attributes. It calls for the preservation of the irresolvable in the face of algorithmic certainty, highlighting the importance of ethics in shaping future possibilities.

#Codingexercise: https://1drv.ms/w/s!Ashlm-Nw-wnWhNNXH-U-qsNwQq3G2g?e=HQp3cA

Tuesday, October 15, 2024

Subarray Sum equals K

Given an array of integers nums and an integer k, return the total number of subarrays whose sum equals k.

A subarray is a contiguous non-empty sequence of elements within an array.

Example 1:

Input: nums = [1,1,1], k = 2

Output: 2

Example 2:

Input: nums = [1,2,3], k = 3

Output: 2

Constraints:

• 1 <= nums.length <= 2 * 10^4

• -1000 <= nums[i] <= 1000

• -10^7 <= k <= 10^7

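import java.util.HashMap;
import java.util.Map;

class Solution {
    public int subarraySum(int[] numbers, int sum) {
        int result = 0;
        int current = 0;
        // Maps each prefix sum seen so far to the number of times it has occurred
        Map<Integer, Integer> sumMap = new HashMap<>();
        sumMap.put(0, 1); // the empty prefix
        for (int i = 0; i < numbers.length; i++) {
            current += numbers[i];
            // Each earlier prefix equal to current - sum closes a subarray that sums to the target
            if (sumMap.containsKey(current - sum)) {
                result += sumMap.get(current - sum);
            }
            sumMap.put(current, sumMap.getOrDefault(current, 0) + 1);
        }
        return result;
    }
}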

[1,3], k=1 => 1

[1,3], k=3 => 1

[1,3], k=4 => 1

[2,2], k=4 => 1

[2,2], k=2 => 2

[2,0,2], k=2 => 4

[0,0,1], k=1=> 3

[0,1,0], k=1=> 4

[0,1,1], k=1=> 3

[1,0,0], k=1=> 3

[1,0,1], k=1=> 4

[1,1,0], k=1=> 3

[1,1,1], k=1=> 3

[-1,0,1], k=0 => 2

[-1,1,0], k=0 => 3

[1,0,-1], k=0 => 2

[1,-1,0], k=0 => 3

[0,-1,1], k=0 => 3

[0,1,-1], k=0 => 3

Alternative:

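import java.util.ArrayList;
import java.util.List;

class Solution {
    public int subarraySum(int[] numbers, int sum) {
        int result = 0;
        int current = 0;
        List<Integer> prefixSums = new ArrayList<>();
        for (int i = 0; i < numbers.length; i++) {
            current += numbers[i];
            // The subarray that starts at index 0 sums to the target
            if (current == sum) {
                result++;
            }
            // Count every earlier prefix equal to current - sum, not just the first one
            for (int prefix : prefixSums) {
                if (prefix == current - sum) {
                    result++;
                }
            }
            prefixSums.add(current);
        }
        return result;
    }
}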

Sample: targetSum = -3; Answer: 1

Numbers: 2, 2, -4, 1, 1, 2

prefixSum: 2, 4, 0, 1, 2, 4
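The prefix-sum idea can be checked quickly outside Java. The following Python sketch counts, for each running total, how many earlier prefixes equal the running total minus the target; the function name subarray_sum is chosen here for illustration.

```python
def subarray_sum(numbers, target):
    """Count contiguous subarrays whose elements sum to target."""
    seen = {0: 1}  # prefix sum -> number of times it has occurred; 0 is the empty prefix
    current = 0
    result = 0
    for value in numbers:
        current += value
        # Every earlier prefix equal to current - target ends just before a matching subarray
        result += seen.get(current - target, 0)
        seen[current] = seen.get(current, 0) + 1
    return result

print(subarray_sum([2, 2, -4, 1, 1, 2], -3))  # 1, from the subarray [-4, 1]
```

Keeping counts rather than a plain list of prefixes is what makes repeated prefix sums, such as those produced by zeros, contribute once per occurrence.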

Monday, October 14, 2024

There are N points (numbered from 0 to N−1) on a plane. Each point is colored either red ('R') or green ('G'). The K-th point is located at coordinates (X[K], Y[K]) and its color is colors[K]. No point lies on coordinates (0, 0).

We want to draw a circle centered on coordinates (0, 0), such that the number of red points and green points inside the circle is equal. What is the maximum number of points that can lie inside such a circle? Note that it is always possible to draw a circle with no points inside.

Write a function that, given two arrays of integers X, Y and a string colors, returns an integer specifying the maximum number of points inside a circle containing an equal number of red points and green points.

Examples:

1. Given X = [4, 0, 2, −2], Y = [4, 1, 2, −3] and colors = "RGRR", your function should return 2. The circle contains points (0, 1) and (2, 2), but not points (−2, −3) and (4, 4).

class Solution {
    public int solution(int[] X, int[] Y, String colors) {
        int count = 0;
        // Only a radius that passes exactly through a point can change the counts,
        // so each point's squared distance is tried as a candidate radius.
        // This replaces stepping the radius by 0.1, which can skip closely spaced points.
        for (int i = 0; i < X.length; i++) {
            long limit = (long) X[i] * X[i] + (long) Y[i] * Y[i];
            int r = 0;
            int g = 0;
            for (int j = 0; j < colors.length(); j++) {
                long dist = (long) X[j] * X[j] + (long) Y[j] * Y[j];
                if (dist > limit) {
                    continue;
                }
                if (colors.charAt(j) == 'R') {
                    r++;
                } else {
                    g++;
                }
            }
            if (r == g && r + g > count) {
                count = r + g;
            }
        }
        return count;
    }
}


Example test: ([4, 0, 2, -2], [4, 1, 2, -3], 'RGRR')

Example test: ([1, 1, -1, -1], [1, -1, 1, -1], 'RGRG')

Example test: ([1, 0, 0], [0, 1, -1], 'GGR')

Example test: ([5, -5, 5], [1, -1, -3], 'GRG')

Example test: ([3000, -3000, 4100, -4100, -3000], [5000, -5000, 4100, -4100, 5000], 'RRGRG')
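The same problem can also be solved by sorting the points by squared distance from the origin and growing the circle one distance at a time. The following Python sketch takes that approach, which is an alternative to the solution above rather than a restatement of it; the function name max_balanced_points is chosen here for illustration. Points at the same distance are added together because a circle cannot include one without the others.

```python
from itertools import groupby

def max_balanced_points(X, Y, colors):
    """Largest number of points in an origin-centered circle with equal red and green counts."""
    # Squared distances keep the comparison exact for integer coordinates
    points = sorted((x * x + y * y, c) for x, y, c in zip(X, Y, colors))
    red = green = best = 0
    for _, group in groupby(points, key=lambda p: p[0]):
        for _, color in group:
            if color == "R":
                red += 1
            else:
                green += 1
        # The circle now holds every point up to this distance
        if red == green:
            best = red + green
    return best

print(max_balanced_points([4, 0, 2, -2], [4, 1, 2, -3], "RGRR"))  # 2
```

Sorting dominates the cost, so this runs in O(N log N) instead of rescanning all points for every candidate radius.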

Sunday, October 13, 2024

Accelerating Infrastructure Delivery

Successful infrastructure deployments share some common traits that can be leveraged to accelerate infrastructure delivery. One of the most well-known strategies is “value capture” as a project finance or funding model. Stakeholders who want to improve their infrastructure but are not invested in the technical know-how often leverage this approach. Funding can become an issue when management and leadership vie with each other over the budget and the consumers of the infrastructure do not have to pay for it. This is exacerbated by improper or inadequate planning for the allocated budget. When users pay for the infrastructure, even in part, both the infrastructure team and the users benefit. So, direct taxation and value capture both help to recover for public use some part of the rise in business value that infrastructure improvements create.

In terms of the increase in value of cloud workloads, we are measuring scalability, cost savings, continuous availability, and a significant reduction in Total Cost of Ownership (TCO) compared to on-premises. Increased convenience, in terms of the cloud acting as a sponge for aggregating traffic worldwide, and reduced logistics pertaining to on-premises equipment can also be factored into the value capture. Some value is also captured via the increased demand on the current workload from the proposed infrastructure. Such gains can fund infrastructure projects when the budget is inadequate. Value capture provides capital for infrastructure improvement, underpins borrowing, and enables financial flexibility. That said, value capture is not a panacea and must be grounded in the ability to measure and use improvements in workload and business value. It requires boundaries to be defined so that the increase in value can be assessed correctly. Transparency and record-keeping of current investments and value are prerequisites. It works best where infrastructure projects have a high degree of correlation to the business value of workloads. Costs for subsequent funding and maintenance cannot be tied back to value capture after the initial realization of capital, but it can cut development costs and generate incremental tax revenue. Its success depends on long-term organizational support for finance.

There can be several types of value capture, and these are all based on utilizing it once and not repeatedly for the same infrastructure. It provides financing for growing demand where the existing budget fails to meet the costs of the infrastructure from the increased demand. Innovative solutions are available for financing the world’s growing demand for cloud infrastructure, but they demand organizational commitment and collaboration between technical and finance teams.
