Cluster computing

Tuesday, April 9, 2024

Question: How to execute Apache Spark code in Azure Machine Learning Workspace as jobs on a non-interactive cluster?

Answer: Unlike compute instances in an Azure Machine Learning workspace, a non-interactive cluster does not accept initialization or startup scripts at creation time to configure libraries or packages on its nodes. As a result, running sample Spark code such as the following will likely fail with a JAVA_HOME-not-set error or a JAVA_GATEWAY_EXITED error.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# check that it really works by running a job

# example from http://spark.apache.org/docs/latest/rdd-programming-guide.html#parallelized-collections

data = range(10000)

distData = sc.parallelize(data)

result = distData.filter(lambda x: not x&1).take(10)

print(result)

# Out: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
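Both errors mentioned above point to a missing Java runtime on the node. A minimal preflight sketch, assuming only the Python standard library, can surface that condition with a readable message before the Spark gateway is launched. The function names `find_java` and `require_java` are hypothetical and not part of PySpark or Azure Machine Learning.

```python
import os
import shutil


def find_java(env=None, which=shutil.which):
    """Return a path to a java executable, or None if none can be located.

    Checks $JAVA_HOME/bin/java first, then falls back to searching the PATH.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate):
            return candidate
    return which("java")


def require_java(env=None, which=shutil.which):
    """Raise a readable error instead of letting Spark fail at gateway launch."""
    path = find_java(env, which)
    if path is None:
        raise RuntimeError(
            "No Java runtime found: set JAVA_HOME or add java to PATH "
            "before creating a SparkContext."
        )
    return path
```

Calling `require_java()` just before `SparkContext.getOrCreate()` would turn the gateway failure into an explicit message about the environment.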

To execute Apache Spark code in non-interactive jobs in an Azure Machine Learning workspace, we build custom environments. A custom environment lets us specify the dependencies, packages, and configuration required to run the Spark code.

Here's a step-by-step guide on how to build custom environments for executing Apache Spark code in Azure Machine Learning Workspace:

Define the environment: Start by defining the environment dependencies in a conda or pip environment file. Specify the required Python version, Spark version, and any additional packages or libraries needed for your code. For example, in the option to create an environment using an existing curated one, choose mldesigner:23 and customize the Conda specification with:

name: MyCustomEnvironment
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8
  - numpy
  - pyspark
  - pip
  - pip:
      - azureml-core
      - ipython
      - ipykernel
      - pyspark

Specify the environment in the job configuration: When submitting a Spark job in the Azure Machine Learning workspace, we name the custom environment in the job configuration so that the job executes in the desired environment. Jobs must be submitted as part of an experiment, which helps with organizing jobs and locating them in the job listing. Experiments are distinct from environments: an experiment groups job runs, while an environment defines the software they run on.

Execute the job: We submit the Spark job using the Azure Machine Learning SDK or Azure portal. The job will be executed in the specified environment, ensuring that all required dependencies are available.
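The last two steps can be sketched as follows, assuming the Azure Machine Learning Python SDK v2 (the `azure-ai-ml` and `azure-identity` packages, which differ from the `azureml-core` package listed in the Conda specification). The script name, source folder, compute name, and experiment name are hypothetical placeholders, and the helper names `job_settings` and `submit` are illustrative only.

```python
def job_settings(script="spark_job.py",
                 environment="MyCustomEnvironment@latest",
                 compute="cpu-cluster",
                 experiment_name="spark-experiments"):
    """Collect the keyword arguments for a command job.

    Every default here is a hypothetical placeholder.
    """
    return {
        "code": "./src",                     # folder holding the script (assumed layout)
        "command": f"python {script}",
        "environment": environment,          # the registered custom environment
        "compute": compute,                  # name of the non-interactive cluster
        "experiment_name": experiment_name,  # groups related jobs in the listing
    }


def submit(settings, subscription_id, resource_group, workspace_name):
    """Submit the job; requires the azure-ai-ml and azure-identity packages."""
    # Imported lazily so the settings can be inspected without the SDK installed.
    from azure.ai.ml import MLClient, command
    from azure.identity import DefaultAzureCredential

    ml_client = MLClient(
        DefaultAzureCredential(),
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace_name,
    )
    return ml_client.jobs.create_or_update(command(**settings))
```

Keeping the settings in a plain dictionary makes it easy to review which environment and experiment a job will use before anything is sent to the workspace.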

By building custom environments, we ensure that our Spark code runs consistently and reproducibly in the Azure Machine Learning workspace, regardless of the underlying infrastructure or dependencies.

Azure Machine Learning also provides pre-built environments with popular data science and machine learning frameworks like Spark, TensorFlow, and PyTorch. These environments are optimized, ready to use out of the box, and helpful for training models. Where a pre-built environment suffices, preferring it over a custom one means the environment is maintained and updated for us.

Monday, April 8, 2024

There are N points (numbered from 0 to N−1) on a plane. Each point is colored either red ('R') or green ('G'). The K-th point is located at coordinates (X[K], Y[K]) and its color is colors[K]. No point lies on coordinates (0, 0).

We want to draw a circle centered on coordinates (0, 0), such that the number of red points and green points inside the circle is equal. What is the maximum number of points that can lie inside such a circle? Note that it is always possible to draw a circle with no points inside.

Write a function that, given two arrays of integers X, Y and a string colors, returns an integer specifying the maximum number of points inside a circle containing an equal number of red points and green points.

Examples:

1. Given X = [4, 0, 2, −2], Y = [4, 1, 2, −3] and colors = "RGRR", your function should return 2. The circle contains points (0, 1) and (2, 2), but not points (−2, −3) and (4, 4).

class Solution {
    public int solution(int[] X, int[] Y, String colors) {
        int n = X.length;
        // Squared distance of every point from the origin; long avoids int overflow.
        long[] dist = new long[n];
        for (int i = 0; i < n; i++) {
            dist[i] = (long) X[i] * X[i] + (long) Y[i] * Y[i];
        }
        int count = 0;
        // The contents of the circle only change when the radius reaches a point,
        // so each point's distance is tried as a candidate radius instead of
        // stepping the radius by a fixed amount, which can skip over close points.
        for (int i = 0; i < n; i++) {
            int r = 0;
            int g = 0;
            for (int j = 0; j < n; j++) {
                if (dist[j] > dist[i]) {
                    continue;
                }
                if (colors.charAt(j) == 'R') {
                    r++;
                } else {
                    g++;
                }
            }
            // Keep the largest balanced circle seen so far.
            if (r == g && 2 * r > count) {
                count = 2 * r;
            }
        }
        return count;
    }
}

Compilation successful.

Example test: ([4, 0, 2, -2], [4, 1, 2, -3], 'RGRR')

Example test: ([1, 1, -1, -1], [1, -1, 1, -1], 'RGRG')

Example test: ([1, 0, 0], [0, 1, -1], 'GGR')

Example test: ([5, -5, 5], [1, -1, -3], 'GRG')

Example test: ([3000, -3000, 4100, -4100, -3000], [5000, -5000, 4100, -4100, 5000], 'RRGRG')
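The same problem can also be approached by sorting the points once by squared distance and growing the circle one ring at a time. The following is a minimal sketch of that alternative, not the submitted solution; the function name `max_balanced_points` is hypothetical.

```python
from itertools import groupby


def max_balanced_points(X, Y, colors):
    """Largest number of points in an origin-centred circle with equal R and G counts."""
    # Sort points by squared distance so the circle can grow one ring at a time.
    points = sorted((x * x + y * y, c) for x, y, c in zip(X, Y, colors))
    best = red = green = 0
    # Points at the same distance enter the circle together, so handle them as a group.
    for _, ring in groupby(points, key=lambda p: p[0]):
        for _, color in ring:
            if color == "R":
                red += 1
            else:
                green += 1
        # The circle only grows, so the latest balanced ring is also the largest.
        if red == green:
            best = red + green
    return best
```

Sorting costs O(N log N), after which a single pass suffices, and working with squared integer distances avoids floating-point comparisons altogether.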

Sunday, April 7, 2024

Given an array of strings arr, a string s is formed by concatenating a subsequence of arr such that s contains only unique characters. Return the maximum possible length of s.

import java.util.*;

class Solution {
    public int maxLength(List<String> arr) {
        int N = arr.size();
        int max = 0;
        // Enumerate every subsequence of arr as a bitmask over its indices.
        for (int i = 0; i < (1 << N); i++) {
            List<Integer> combination = new ArrayList<>();
            for (int j = 0; j < N; j++) {
                if ((i & (1 << j)) > 0) {
                    combination.add(j);
                }
            }
            int count = getDistinctCount(arr, combination);
            if (count > max) {
                max = count;
            }
        }
        return max;
    }

    // Returns the length of the concatenation, or 0 if any character repeats.
    public int getDistinctCount(List<String> A, List<Integer> combination) {
        Set<Character> seen = new HashSet<>();
        for (int index : combination) {
            String word = A.get(index);
            for (int j = 0; j < word.length(); j++) {
                // add() returns false when the character was already present.
                if (!seen.add(word.charAt(j))) {
                    return 0;
                }
            }
        }
        return seen.size();
    }
}
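The enumeration above rebuilds a character set for every subset. As a hedged alternative sketch, each word can be encoded once as a bitmask of its letters and the set of reachable letter combinations extended word by word. This assumes the words contain only lowercase letters a to z, which the problem statement above does not state; the function name `max_length` is hypothetical.

```python
def max_length(arr):
    """Longest concatenation of a subsequence of arr with all characters unique."""
    # Keep only words without internal repeats, each encoded as a 26-bit letter mask.
    masks = []
    for word in arr:
        mask = 0
        for ch in word:
            bit = 1 << (ord(ch) - ord("a"))
            if mask & bit:
                break  # the word repeats a letter and can never be used
            mask |= bit
        else:
            masks.append(mask)

    # reachable holds the letter set of every valid concatenation found so far.
    reachable = {0}
    for mask in masks:
        reachable |= {seen | mask for seen in reachable if not seen & mask}
    # The number of set bits equals the length of the concatenation.
    return max(bin(seen).count("1") for seen in reachable)
```

Two words can be combined exactly when their masks share no bits, so the compatibility check becomes a single AND operation.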

#codingexercise:

BarChartRectangleStreaming.docx

Saturday, April 6, 2024

This is a summary of the book “The Aisles Have Eyes: How Retailers Track Your Shopping, Strip Your Privacy, and Define Your Power,” written by the University of Pennsylvania’s Joseph Turow and published by Yale University Press in 2017. He writes about a father who first learned of his teenage daughter’s pregnancy when Target mailed maternity-related sales offers to his home. How could the department store know that she was pregnant before her family did? His report covers data collection by both online and brick-and-mortar stores; the ways they collect data not just from their own sites but also from their customers’ smartphones, Wi-Fi, cameras, GPS, and other devices; the combination of tracking with data mined from other sources to mail out coupons; and the indoctrination of individuals to accept growing levels of intrusion.

Surveillance technology yields data that becomes proprietary, but retailers also make sure that customers are comfortable giving data away. For example, a rewards program requires customers to sign up. Privacy advocates are generally concerned about monitoring, but their alarms are largely muted. Smartphones, with their ubiquity, enable even physical retailers to collect data, which helps them identify their most valuable customers. Marketers even hope that wearable technology will provide a constant stream of customer activity. Customers' growing desensitization, ever-increasing data generation, and indiscriminate monitoring together pose insurmountable challenges for governments and privacy advocates trying to draft and enforce regulations.

In the early 21st century, traditional retail stores had to adapt their business models to compete with online retailers like Amazon. To do so, they needed to replicate e-stores' tracking and targeting in the real world, mining databases to discriminate among customers and send personalized advertisements and offers. This required a new level of surveillance and intrusion into customers' personal lives. Retailers had to get consumers comfortable with this kind of tracking and accept that giving confidential information is a normal part of their shopping.

These retailers focused on niche markets to compete with Walmart, whose purchasing clout and efficiency made competing on price an impossible task. To identify valuable niche customers, retailers had to collect data on shoppers. Data companies like ShopperTrak and Euclid offer technology that enables stores to exploit Wi-Fi or Bluetooth to link with shoppers' smartphones. InMarket offers a tactic using Bluetooth Low Energy (BLE) technology, where stores install inexpensive BLE "beacons" on the selling floor to detect inMarket's code in smartphone apps.

In 2010, smartphone manufacturers equipped their products with GPS chips, enabling retailers to track shoppers beyond their stores. InMarket and xAd track consumers' locations and deduce their reasons for being in a store, providing retailers with clues to the most promising advertising targets. Retailers may eventually be able to follow customers into their homes by exploiting "the Internet of things" networks of smart appliances and remote-control devices. Wearable technology like the Apple Watch and facial recognition can help gather continual data on habits, location, buying patterns, and health. Retailers are also implementing a hidden curriculum, teaching customers to give up personal information and accept surveillance and discrimination in exchange for convenience and coupons. This includes redefining customer loyalty by offering discounts and other perks to good customers. Discriminating retailers are building profiles of individual shoppers and using statistical analysis to rate their attractiveness, targeting the most attractive customers with offers and discounts designed to keep them coming back.

Such retailers value customers based on their influence and spending power. They assess influence by identifying customers with the most people in their social networks and cross-referencing this information with data from a tool called Radian6. However, consumers have little input into retailing's transformation, as they have little choice but to accept the retailer's "privacy policy." Surveys show that most consumers want more control over their information and are uncomfortable with tracking, but are unaware of the mechanisms behind data mining and of the extent of the government's ability to protect their privacy. New regulations could slow the progress of retail monitoring, but the best approach would be to require an opt-in for every company that collects consumer data. Students should learn about digital media and marketing from middle school onward, and the public should be educated about marketers' hidden agendas and privacy policies. Paying attention to the privacy policy and opting out of unwanted agreements is a good practice for individual customers.

Summarizing Software: SummarizerCodeSnippets.docx.

Thursday, April 4, 2024

Some methods of organization for large-scale Infrastructure-as-Code (IaC) deployments.

The purpose of IaC is to provide a dynamic, reliable, and repeatable infrastructure suitable for cases where manual approaches and management practices cannot keep up. When automation grows to the point of becoming a cloud-based service that is itself responsible for deploying cloud resources and stamps, which in turn provision other diverse, consumer-facing, generally available public cloud services, some learnings emerge that apply across a large spectrum of industry clouds.

A service that deploys other services must accept IaC deployment logic with templates, intrinsics, and deterministic execution, and it works much like any other workflow management system. This helps to determine the order in which to run tasks and how to retry them. The tasks are self-described. The automation consists of a scheduler that triggers scheduled workflows and submits tasks to the executor, an executor that runs the tasks, a web server for a management interface, a folder for the directed acyclic graphs representing the deployment logic artifacts, and a metadata database to store state. The workflows don’t restrict what can be specified as a task: it can be an Operator, which is a predefined task written in, say, Python; a Sensor, which is entirely about waiting for an external event to happen; or a custom task specified via a Python function decorated with @task.
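To make the ordering and retry behaviour concrete, here is a toy sketch of an executor that runs tasks in dependency order using only the Python standard library (`graphlib`, available from Python 3.9). It is an illustration under simplifying assumptions, with no scheduler, web server, or metadata database, and it is not the API of any particular workflow system. The name `run_workflow` and its parameters are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies, max_retries=2):
    """Run callables in dependency order, retrying each failed task.

    tasks: mapping of task name -> zero-argument callable.
    dependencies: mapping of task name -> set of task names it waits for.
    Returns the list of task names in the order they completed.
    """
    graph = {name: dependencies.get(name, set()) for name in tasks}
    completed = []
    # static_order() yields every task after all of its predecessors
    # and raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                # Give up only after the final allowed attempt.
                if attempt == max_retries:
                    raise
        completed.append(name)
    return completed
```

For example, a hypothetical deployment with tasks named network, cluster, and app, where cluster waits for network and app waits for cluster, always completes in that order, and a task that fails transiently is retried before its dependents start.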

The organization of such artifacts posed two necessities. The first was to leverage the built-in templates and deployment capabilities of the target IaC provider, as well as their packaging in a format suitable for the automation, which demands that certain declarations, phases, and sequences be called out. The second was the coordination of context switches between the automation service and the IaC provider. This involved a preamble and an epilogue to each context switch for bookkeeping and state reconciliation.

This taught us that authors of large IaC deployments are best served by uniform, consistent, and global naming conventions; registries that the system can publish for cross-subscription and cross-region lookups; diligent parametrization at every scope, including hierarchies; dependency declarations; and a reduced need for scripting in favor of system- and user-defined organizational units of templates. Improving supportability via read-only stores and frequently publishing up-to-date information on the rollout helps decouple operations from the design and development of IaC.
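As an illustration of the first two points, a naming convention can be captured as a single function, and a registry as a lookup keyed by subscription and region. This is a minimal in-memory sketch; the name format, the `resource_name` function, and the `Registry` class are all hypothetical and not tied to any IaC provider.

```python
def resource_name(kind, workload, environment, region, instance=1):
    """Compose a name from fixed, ordered parts so it can be parsed and looked up."""
    return f"{kind}-{workload}-{environment}-{region}-{instance:03d}"


class Registry:
    """In-memory stand-in for a published, read-only lookup of deployed resources."""

    def __init__(self):
        self._entries = {}

    def publish(self, subscription, region, name, resource_id):
        # Written by the deployment system after a successful rollout.
        self._entries[(subscription, region, name)] = resource_id

    def lookup(self, subscription, region, name):
        # Read by templates in other subscriptions or regions; None if unknown.
        return self._entries.get((subscription, region, name))
```

Because every template derives names from the same function, a deployment in one region can compute the name of a resource in another region and resolve it through the registry without hard-coded identifiers.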

IaC writers frequently find themselves in positions where the separation between pipeline automation and IaC declarations is not clean or self-contained, or requires extensive customization. One approach that worked on this front is to make multiple passes over the development, with one pass providing initial deployment capability and another consolidating it and applying best practices via refactoring and reuse. Enabling the development pass to be DevOps-based, feature-centric, and agile helps converge on a working solution with learnings that can be carried from iteration to iteration. The refactoring pass is more generational in nature. It provides cross-cutting perspectives and non-functional guarantees.

A library of routines, operators, data types, global parameters, and registries is almost inevitable with large-scale IaC deployments, but unlike programming-language packages, these are in most cases organically curated and self-maintained. By leveraging the tracking and versioning support of source control, it's possible to preserve compatibility as capabilities are made native to the IaC provider or automation service.

Wednesday, April 3, 2024

This is a continuation of articles on IaC shortcomings and resolutions with regard to public cloud deployments.

When securing outbound access with a NAT Gateway in the Azure public cloud, we can choose between two routing options: Microsoft routing and user-defined routing. Let's discuss the benefits and drawbacks of each:

Microsoft Routing:

Benefits:

Simplicity: Microsoft routing is the default option, and it requires minimal configuration. It automatically handles routing between subnets and virtual networks.
Ease of management: As Microsoft handles the routing, we don't need to manage any routing tables or configurations manually.
Automatic failover: Microsoft routing provides built-in redundancy and automatic failover, ensuring high availability.

Drawbacks:

Limited control: With Microsoft routing, we have limited control over the routing decisions. We can't customize the routing paths or add specific routing rules.
Less flexibility: It may not be suitable for complex networking scenarios where more advanced routing options are required.

User-Defined Routing:

Benefits:

Enhanced control: User-defined routing allows us to have granular control over the routing decisions. We can define custom routing tables and specify the desired paths for outbound traffic.
Advanced routing capabilities: With user-defined routing, we can implement complex routing scenarios, such as policy-based routing and route filtering.
Integration with on-premises networks: User-defined routing enables us to establish connectivity between Azure and on-premises networks, using VPN or ExpressRoute.

Drawbacks:

Increased management complexity: User-defined routing requires manual configuration and management of routing tables, which can be more complex and time-consuming.
Potential for misconfiguration: If not properly configured, user-defined routing can lead to connectivity issues or suboptimal routing.
Higher cost: User-defined routing may incur additional costs due to the need for more resources and increased management effort.

Ultimately, the choice between Microsoft routing and user-defined routing depends on our specific requirements and the complexity of our networking setup. If we prefer simplicity and don't require advanced routing capabilities, Microsoft routing can be a suitable option. On the other hand, if we need more control and flexibility over routing decisions, or if we have complex networking requirements, user-defined routing may be more appropriate.
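To show how a custom route table changes the path of outbound traffic, the sketch below models route selection as longest-prefix match, with user-defined routes preferred over BGP and system routes when prefixes are identical. This is a simplified model written with the Python standard library, not a call into any Azure API; the function name `select_route`, the tuple layout, and the sample next-hop labels are assumptions made for illustration.

```python
import ipaddress

# Lower number wins when two routes share the same prefix.
SOURCE_PRIORITY = {"user": 0, "bgp": 1, "system": 2}


def select_route(routes, destination):
    """Pick the next hop for a destination IP address.

    routes: list of (prefix, source, next_hop) tuples, where source is
    "user", "bgp", or "system". Returns None when no route matches.
    """
    address = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), source, next_hop)
        for prefix, source, next_hop in routes
        if address in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    # Longest prefix first, then user-defined over BGP over system routes.
    best = max(matches, key=lambda m: (m[0].prefixlen, -SOURCE_PRIORITY[m[1]]))
    return best[2]
```

With a system default route to the Internet and a user-defined default route to a virtual appliance, the model sends Internet-bound traffic to the appliance while traffic inside the virtual network still follows the more specific system route.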

Monday, April 1, 2024

Problem: Given a weighted bidirectional graph with N nodes and M edges and all the weights as distinct positive numbers, find the maximum number of edges that can be visited on traversing the graph such that the weights are ascending.

Solution: Process the edges in descending order of weight. When an edge between nodes u and v is encountered, it is lighter than every edge seen so far, so it can only be the first edge of an ascending path starting at u or at v. Such a path starts at one endpoint, crosses the edge uv, and continues along the longest ascending path already found from the other endpoint. With e[x] denoting the number of edges on the longest ascending path starting at x, the update is therefore e[u] = max(e[u], e[v] + 1) and e[v] = max(e[v], e[u] + 1), both computed from the values held before the edge was processed. The answer is the largest entry of e.

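public static int solution_unique_weights(int N, int[] src, int[] dest, int[] weight) {
    int M = weight.length;
    // e[x] = number of edges on the longest ascending path starting at node x
    int[] e = new int[N];
    Integer[] index = new Integer[M];
    for (int i = 0; i < M; i++) { index[i] = i; }
    // Visit edges from heaviest to lightest; Integer.compare avoids subtraction overflow.
    Comparator<Integer> comparator = (i, j) -> Integer.compare(weight[j], weight[i]);
    Arrays.sort(index, 0, M, comparator);
    for (int i = 0; i < M; i++) {
        int u = src[index[i]];
        int v = dest[index[i]];
        // Compute both updates from the old values before assigning either;
        // giving both endpoints the same maximum would overcount.
        int fromU = Math.max(e[u], e[v] + 1);
        int fromV = Math.max(e[v], e[u] + 1);
        e[u] = fromU;
        e[v] = fromV;
    }
    return Arrays.stream(e).max().getAsInt();
}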

src[0] = 0 dest[0] = 1 weight[0] = 4

src[1] = 1 dest[1] = 2 weight[1] = 3

src[2] = 1 dest[2] = 3 weight[2] = 2

src[3] = 2 dest[3] = 3 weight[3] = 5

src[4] = 3 dest[4] = 4 weight[4] = 6

src[5] = 4 dest[5] = 5 weight[5] = 7

src[6] = 5 dest[6] = 0 weight[6] = 9

src[7] = 3 dest[7] = 2 weight[7] = 8

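index: 0 1 2 3 4 5 6 7 // before sort
index: 6 7 5 4 3 0 1 2 // after sort, heaviest edge first

Longest ascending path starting at nodes 0..5 after each edge is processed, using max(e[u], e[v] + 1) for u and max(e[v], e[u] + 1) for v on the prior values:

1 0 0 0 0 1 // edge 5-0, weight 9
1 0 1 1 0 1 // edge 3-2, weight 8
1 0 1 1 2 1 // edge 4-5, weight 7
1 0 1 3 2 1 // edge 3-4, weight 6
1 0 4 3 2 1 // edge 2-3, weight 5
1 2 4 3 2 1 // edge 0-1, weight 4
1 5 4 3 2 1 // edge 1-2, weight 3
1 5 4 6 2 1 // edge 1-3, weight 2

The result is 6: the traversal 3-1-2-3-4-5-0 uses the weights 2, 3, 5, 6, 7, 9.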
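For quick experimentation, the same technique can be written compactly in Python: process edges from heaviest to lightest and update both endpoints from their previous values. This is a sketch under the problem's assumption of distinct weights; the function name `longest_ascending_walk` is hypothetical.

```python
def longest_ascending_walk(n, edges):
    """edges: list of (u, v, weight) with distinct weights; returns the edge count."""
    # best[x] = edges on the longest ascending walk that starts at node x
    best = [0] * n
    # Heaviest edge first: each new edge is lighter than all edges already seen.
    for u, v, _ in sorted(edges, key=lambda e: e[2], reverse=True):
        # Tuple assignment evaluates the right-hand side from the old values.
        best[u], best[v] = max(best[u], best[v] + 1), max(best[v], best[u] + 1)
    return max(best, default=0)
```

The simultaneous assignment matters: updating one endpoint first and then reading it for the other would count the same edge twice.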