Cluster computing

Friday, August 2, 2024

When describing Azure Machine Learning workspace deployments via IaC, along with their shortcomings and corresponding resolutions, it was hinted that the workspace and all its infrastructure concerns can be resolved at deployment time so that data scientists are free to focus on business use cases. Part of this setup involves kernel creation, which can be done via scripts during the creation and assignment of compute to the data scientists. Two scripts are required: one that runs at creation time and another that runs at the start of the compute. Some commands require the terminal to be restarted, so splitting the work across two scripts lets each command run at the right stage. For example, to provision a custom kernel based on Python 3.11 and Spark 3.5, the following scripts are useful:

#!/bin/bash
# Creation script: installs Anaconda, then creates a custom conda environment and kernel based on a sample yml file.
set -e

curl https://repo.anaconda.com/archive/Anaconda3-2024.02-1-Linux-x86_64.sh --output Anaconda3-2024.02-1-Linux-x86_64.sh
chmod 755 Anaconda3-2024.02-1-Linux-x86_64.sh
./Anaconda3-2024.02-1-Linux-x86_64.sh -b
echo "installation complete"

cat <<EOF > env.yaml
name: python3.11_spark3.5
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - numpy
  - pyspark
  - pip
  - pip:
    - azureml-core
    - ipython
    - ipykernel
    - pyspark==3.5
EOF
echo "env.yaml written"

/anaconda/condabin/conda env create -f env.yaml
echo "Initializing new conda environment"
/anaconda/condabin/conda init bash

#!/bin/bash
# Startup script: registers the kernel and installs the remaining packages.
set -e

python3 -m pip install ipykernel==v6.29.5
python3 -m ipykernel install --user --name python3.11_spark3.5 --display-name "Python 3.11 - Spark 3.5 (DSS)"

echo "Activating new conda environment"
/anaconda/envs/azureml_py38/bin/conda init bash
/anaconda/envs/azureml_py38/bin/conda activate python3.11_spark3.5
/anaconda/envs/azureml_py38/bin/conda install -y ipykernel anaconda::pyspark

echo "Installing kernel"
sudo -u azureuser -i <<'EOF'
python3 -m pip install pip --upgrade
pip3 install pyopenssl --upgrade
pip3 install pyspark==3.5
pip3 install snowflake-snowpark-python==1.20.0
pip3 install snowflake-connector-python==3.11.0
pip3 install azure-keyvault
pip3 install azure-identity
python3 -m pip install ipykernel==v6.29.5
echo "Conda environment setup successfully."
EOF
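
Once the compute has started, it can help to confirm from a notebook that the selected kernel really is the one provisioned above. The following is a minimal sketch that uses only the Python standard library; the function name describe_kernel and the default package list are illustrative assumptions, not part of the scripts above.

```python
import sys
from importlib import metadata


def describe_kernel(packages=("pyspark", "ipykernel")):
    """Report the interpreter version and the installed version of each package."""
    report = {"python": "%d.%d" % sys.version_info[:2]}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is missing from the environment backing this kernel.
            report[name] = None
    return report


for name, version in describe_kernel().items():
    print(f"{name}: {version}")
```

In a kernel built from the environment above, one would expect the python entry to read 3.11 and the pyspark entry to start with 3.5; a value of None points to a package that did not get installed.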

Previous articles: https://1drv.ms/w/s!Ashlm-Nw-wnWhPIt_-X-iYdnygX-fA?e=ZCKWsR

Thursday, August 1, 2024

This is a summary of the book titled “The Start-up of You”, written by Reid Hoffman and Ben Casnocha and published by Crown in 2012. LinkedIn co-founder Reid Hoffman and venture capitalist Ben Casnocha argue that globalization and technology have changed the workplace, so individuals must draw on their entrepreneurial instincts to grow their careers. This “entrepreneurial mindset” treats each day as “Day One”. Entrepreneurs need personal capital, goals, and an amenable market. No one is self-made. Success requires a strong social network. Individuals must learn to seize the moment, make friends with risk, and solicit information from relationships.

In the late 20th century, building a career was similar to an "escalator" - a traditional American path. However, globalization and the digital revolution have made this traditional American career path obsolete. Companies no longer offer professional career development support or training, and employees are now "free agents" who must adopt an entrepreneurial mindset.

Entrepreneurs need personal capital, goals, and an amenable market. They need a competitive advantage through assets, aspirations, values, and market realities. When developing a career plan, consider financial assets, hard skills, and soft skills. Review immediate and long-term goals to determine the best job for you.

Twenty-first century careers demand flexibility and the capacity to adapt. Companies may face unexpected competition or modern technology, creating fresh pressure on employees and employers. Entrepreneurs must be persistent in fulfilling their vision while adapting to market feedback and customer needs.

To adapt to changing circumstances and pivot in a new direction, entrepreneurs and career strivers can adopt a useful planning framework. They optimize their initial vision, deploy competitive advantages, and reformulate as needed.

Success in business requires a strong social network, as even solo entrepreneurs or start-up leaders need help from others. Building authentic relationships and collaborating with others is crucial for success. Entrepreneurs should seize the moment when opportunities arise, as transformative opportunities rarely crop up. Curious entrepreneurs find inspiration in unexpected places and events, making connections.

To take smart risks, entrepreneurs should weigh possible benefits against likely downsides. Risk tolerance differs from person to person, and risk itself changes over time and with the situation, so assessing it effectively is essential. Whether they are start-up entrepreneurs or employees at any job level, people must decide which risks are worth taking.

In conclusion, entrepreneurs need to be proactive in helping and collaborating with others, seizing opportunities, and making friends with risk. By understanding and addressing risks, entrepreneurs can create a strong professional network and navigate the challenges of their careers.

Relationships provide crucial information for businesses and leaders, as well as entrepreneurs and ambitious professionals. LinkedIn, co-founded by Reid Hoffman, is an online platform that allows people to connect with professionals and share their professional identities. Network literacy is becoming increasingly important, as it allows individuals to find and utilize information from their networks. A person's social network is a unique sensor that provides insights on assorted topics. To make informed decisions, it is essential to ask well-formed questions and be generous with the people in your network.

In the “Age of the Inconceivable”, events like the coronavirus pandemic and climate change bring disastrous consequences. People who thrive during such events harness their entrepreneurial impulses, but even born entrepreneurs need to cultivate those impulses systematically. Amid today's breakneck change and uncertainty, traditional career strategies and paths will not work. Career success requires adopting the mindset of a start-up entrepreneur and adapting as the world changes.

References:

Previous book summary: https://1drv.ms/w/s!Ashlm-Nw-wnWhPIrwpUwG1feNrJbyg?e=nTgNlk

https://1drv.ms/w/s!Ashlm-Nw-wnWhOYMyD1A8aq_fBqraA?e=2CuChd

Wednesday, July 31, 2024

Problem 4

The relationship "friend" is often symmetric, meaning that if I am your friend, you are my friend. Implement a MapReduce algorithm to check whether this property holds. Generate a list of all non-symmetric friend relationships.

Map Input

Each input record is a 2 element list [personA, personB] where personA is a string representing the name of a person and personB is a string representing the name of one of personA's friends. Note that it may or may not be the case that the personA is a friend of personB.

Reduce Output

The output should be all pairs (friend, person) such that (person, friend) appears in the dataset but (friend, person) does not.

You can test your solution to this problem using friends.json:

Answer:

import MapReduce
import sys

# Part 1: create the MapReduce harness
mr = MapReduce.MapReduce()

# Part 2: key each record on the unordered pair so that both directions
# of a friendship, if present, arrive at the same reducer
def mapper(record):
    person, friend = record[0], record[1]
    key = tuple(sorted((person, friend)))
    mr.emit_intermediate(key, (person, friend))

# Part 3: emit the reverse of every relationship whose reverse is missing
def reducer(key, list_of_values):
    seen = set(tuple(pair) for pair in list_of_values)
    for person, friend in sorted(seen):
        if (friend, person) not in seen:
            mr.emit((friend, person))

# Part 4: run the job on the file given on the command line
inputdata = open(sys.argv[1])
mr.execute(inputdata, mapper, reducer)

Sample output:

["MlleBaptistine", "Myriel"]

["MlleBaptistine", "MmeMagloire"]

["MlleBaptistine", "Valjean"]

["Fantine", "Valjean"]

["Cosette", "Valjean"]
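
As a cross-check that does not depend on the MapReduce module, the same property can be tested with plain Python sets. This is a minimal sketch; the function name asymmetric_pairs and the sample records are hypothetical and are not taken from friends.json.

```python
def asymmetric_pairs(records):
    """Return (friend, person) for each (person, friend) whose reverse is absent."""
    seen = {(person, friend) for person, friend in records}
    return sorted(
        (friend, person)
        for person, friend in seen
        if (friend, person) not in seen
    )


# Hypothetical sample records for illustration only.
sample = [
    ["Alice", "Bob"],   # Bob never lists Alice
    ["Bob", "Carol"],
    ["Carol", "Bob"],   # symmetric with the previous record
    ["Dave", "Alice"],  # Alice never lists Dave
]
print(asymmetric_pairs(sample))  # [('Alice', 'Dave'), ('Bob', 'Alice')]
```

Comparing this list with the reducer output on the same input is a quick way to spot a mapper that keys the two directions of a friendship differently.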

#codingexercise: https://1drv.ms/w/s!Ashlm-Nw-wnWhPIxEJaEe9_uKGDHgg?e=mrXrYM

Monday, July 29, 2024

Problem 1

Create an Inverted index. Given a set of documents, an inverted index is a dictionary where each word is associated with a list of the document identifiers in which that word appears.

Mapper Input

The input is a 2-element list: [document_id, text], where document_id is a string representing a document identifier and text is a string representing the text of the document. The document text may have words in upper or lower case and may contain punctuation. You should treat each token as if it was a valid word; that is, you can just use value.split() to tokenize the string.

Reducer Output

The output should be a (word, document ID list) tuple where word is a String and document ID list is a list of Strings.

You can test your solution to this problem using books.json:

python inverted_index.py books.json

You can verify your solution against inverted_index.json.

Answer:

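import MapReduce
import sys

# Part 1: create the MapReduce harness
mr = MapReduce.MapReduce()

# Part 2: emit (word, document_id) for every token in the document text
def mapper(record):
    document_id, text = record[0], record[1]
    for word in text.split():
        mr.emit_intermediate(word, document_id)

# Part 3: emit each word with the documents that contain it, listing each document once
def reducer(key, list_of_values):
    document_ids = []
    for document_id in list_of_values:
        if document_id not in document_ids:
            document_ids.append(document_id)
    mr.emit((key, document_ids))

# Part 4: run the job on the file given on the command line
inputdata = open(sys.argv[1])
mr.execute(inputdata, mapper, reducer)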

Sample Output:

["all", ["milton-paradise.txt", "blake-poems.txt", "melville-moby_dick.txt"]]

["Rossmore", ["edgeworth-parents.txt"]]

["Consumptive", ["melville-moby_dick.txt"]]

["forbidden", ["milton-paradise.txt"]]

["child", ["blake-poems.txt"]]
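
The MapReduce module imported in the answer is not shown here. To try the mapper and reducer without it, a small in-memory stand-in is enough. The sketch below assumes the harness reads one JSON record per line, groups intermediate values by key, and then calls the reducer once per key; the class name MiniMapReduce and the two-document corpus are hypothetical.

```python
import json


class MiniMapReduce:
    """Hypothetical in-memory stand-in for the MapReduce module."""

    def __init__(self):
        self.intermediate = {}
        self.result = []

    def emit_intermediate(self, key, value):
        self.intermediate.setdefault(key, []).append(value)

    def emit(self, value):
        self.result.append(value)

    def execute(self, lines, mapper, reducer):
        for line in lines:
            mapper(json.loads(line))
        for key, values in self.intermediate.items():
            reducer(key, values)
        return self.result


mr = MiniMapReduce()


def mapper(record):
    document_id, text = record
    for word in text.split():
        mr.emit_intermediate(word, document_id)


def reducer(key, list_of_values):
    # dict.fromkeys keeps first-seen order while dropping repeated document IDs.
    mr.emit((key, list(dict.fromkeys(list_of_values))))


# Hypothetical corpus: one JSON-encoded [document_id, text] record per line.
corpus = [
    '["doc1.txt", "the whale and the sea"]',
    '["doc2.txt", "the child"]',
]
index = dict(mr.execute(corpus, mapper, reducer))
print(index["the"])    # ['doc1.txt', 'doc2.txt']
print(index["whale"])  # ['doc1.txt']
```

Note that "the" occurs twice in the first document but is listed once, which is what the definition of an inverted index asks for.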

Sunday, July 28, 2024

Problem statement:

Assume you have two matrices A and B in a sparse matrix format, where each record is of the form i, j, value. Design a MapReduce algorithm to compute the matrix product A x B.

Map Input

The input to the map function will be a row of a matrix represented as a list. Each list will be of the form [matrix, i, j, value] where matrix is a string and i, j, and value are integers.

The first item, matrix, is a string that identifies which matrix the record originates from. This field has two possible values: "a" indicates that the record is from matrix A and "b" indicates that the record is from matrix B.

Reduce Output

The output from the reduce function will also be a row of the result matrix represented as a tuple. Each tuple will be of the form (i, j, value) where each element is an integer.

Answer:

#!/usr/bin/python
import MapReduce
import json
import sys

# Part 1
mr = MapReduce.MapReduce()

# A has dimensions L,M
# B has dimensions M,N
L = 5
M = 5
N = 5

# Part 2
def mapper(record):
    print(f"record={record} \t + {record[0]} + \t + {record[1]} + \t + {record[2]} + \t + {record[3]}")
    matrix_index = record[0]
    row = record[1]
    col = record[2]
    value = record[3]
    if matrix_index == "a":
        for i in range(0, N):
            key = f"{row},{i}"
            mr.emit_intermediate(key, ("a", row, col, value))
    if matrix_index == "b":
        for j in range(0, L):
            key = f"{j},{col}"
            mr.emit_intermediate(key, ("b", row, col, value))

# Part 3
def reducer(key, list_of_values):
    # one reducer per output cell of destination matrix
    # print(f"{key},{list_of_values}")
    total = 0
    line = ""
    for k in range(0,M):
        left = getcolumn(list_of_values, k, "a")
        right = getrow(list_of_values, k, "b")
        total += left*right
        line += f"{left}*{right}={left*right} +"
    line += f"= {total}"
    print(line)
    mr.emit((int(key.split(',')[0]), int(key.split(',')[1]), total))

def getcolumn(values, k, matrix_type):
    result = 0
    for item in values:
        mtype = item[0]
        row = item[1]
        col = item[2]
        value = item[3]
        if mtype == matrix_type and col == k:
            result = value
            break
    return result

def getrow(values, k, matrix_type):
    result = 0
    for item in values:
        mtype = item[0]
        row = item[1]
        col = item[2]
        value = item[3]
        if matrix_type == mtype and row == k:
            result = value
            break
    return result

# Part 4
inputdata = open(sys.argv[1])
mr.execute(inputdata, mapper, reducer)

Output:

python3 multiply1.py matrix.json

record=['a', 0, 0, 63] + a + + 0 + + 0 + + 63

record=['a', 0, 1, 45] + a + + 0 + + 1 + + 45

record=['a', 0, 2, 93] + a + + 0 + + 2 + + 93

record=['a', 0, 3, 32] + a + + 0 + + 3 + + 32

record=['a', 0, 4, 49] + a + + 0 + + 4 + + 49

record=['a', 1, 0, 33] + a + + 1 + + 0 + + 33

record=['a', 1, 3, 26] + a + + 1 + + 3 + + 26

record=['a', 1, 4, 95] + a + + 1 + + 4 + + 95

record=['a', 2, 0, 25] + a + + 2 + + 0 + + 25

record=['a', 2, 1, 11] + a + + 2 + + 1 + + 11

record=['a', 2, 3, 60] + a + + 2 + + 3 + + 60

record=['a', 2, 4, 89] + a + + 2 + + 4 + + 89

record=['a', 3, 0, 24] + a + + 3 + + 0 + + 24

record=['a', 3, 1, 79] + a + + 3 + + 1 + + 79

record=['a', 3, 2, 24] + a + + 3 + + 2 + + 24

record=['a', 3, 3, 47] + a + + 3 + + 3 + + 47

record=['a', 3, 4, 18] + a + + 3 + + 4 + + 18

record=['a', 4, 0, 7] + a + + 4 + + 0 + + 7

record=['a', 4, 1, 98] + a + + 4 + + 1 + + 98

record=['a', 4, 2, 96] + a + + 4 + + 2 + + 96

record=['a', 4, 3, 27] + a + + 4 + + 3 + + 27

record=['b', 0, 0, 63] + b + + 0 + + 0 + + 63

record=['b', 0, 1, 18] + b + + 0 + + 1 + + 18

record=['b', 0, 2, 89] + b + + 0 + + 2 + + 89

record=['b', 0, 3, 28] + b + + 0 + + 3 + + 28

record=['b', 0, 4, 39] + b + + 0 + + 4 + + 39

record=['b', 1, 0, 59] + b + + 1 + + 0 + + 59

record=['b', 1, 1, 76] + b + + 1 + + 1 + + 76

record=['b', 1, 2, 34] + b + + 1 + + 2 + + 34

record=['b', 1, 3, 12] + b + + 1 + + 3 + + 12

record=['b', 1, 4, 6] + b + + 1 + + 4 + + 6

record=['b', 2, 0, 30] + b + + 2 + + 0 + + 30

record=['b', 2, 1, 52] + b + + 2 + + 1 + + 52

record=['b', 2, 2, 49] + b + + 2 + + 2 + + 49

record=['b', 2, 3, 3] + b + + 2 + + 3 + + 3

record=['b', 2, 4, 95] + b + + 2 + + 4 + + 95

record=['b', 3, 0, 77] + b + + 3 + + 0 + + 77

record=['b', 3, 1, 75] + b + + 3 + + 1 + + 75

record=['b', 3, 2, 85] + b + + 3 + + 2 + + 85

record=['b', 4, 1, 46] + b + + 4 + + 1 + + 46

record=['b', 4, 2, 33] + b + + 4 + + 2 + + 33

record=['b', 4, 3, 69] + b + + 4 + + 3 + + 69

record=['b', 4, 4, 88] + b + + 4 + + 4 + + 88

63*63=3969 +45*59=2655 +93*30=2790 +32*77=2464 +49*0=0 += 11878

63*18=1134 +45*76=3420 +93*52=4836 +32*75=2400 +49*46=2254 += 14044

63*89=5607 +45*34=1530 +93*49=4557 +32*85=2720 +49*33=1617 += 16031

63*28=1764 +45*12=540 +93*3=279 +32*0=0 +49*69=3381 += 5964

63*39=2457 +45*6=270 +93*95=8835 +32*0=0 +49*88=4312 += 15874

33*63=2079 +0*59=0 +0*30=0 +26*77=2002 +95*0=0 += 4081

33*18=594 +0*76=0 +0*52=0 +26*75=1950 +95*46=4370 += 6914

33*89=2937 +0*34=0 +0*49=0 +26*85=2210 +95*33=3135 += 8282

33*28=924 +0*12=0 +0*3=0 +26*0=0 +95*69=6555 += 7479

33*39=1287 +0*6=0 +0*95=0 +26*0=0 +95*88=8360 += 9647

25*63=1575 +11*59=649 +0*30=0 +60*77=4620 +89*0=0 += 6844

25*18=450 +11*76=836 +0*52=0 +60*75=4500 +89*46=4094 += 9880

25*89=2225 +11*34=374 +0*49=0 +60*85=5100 +89*33=2937 += 10636

25*28=700 +11*12=132 +0*3=0 +60*0=0 +89*69=6141 += 6973

25*39=975 +11*6=66 +0*95=0 +60*0=0 +89*88=7832 += 8873

24*63=1512 +79*59=4661 +24*30=720 +47*77=3619 +18*0=0 += 10512

24*18=432 +79*76=6004 +24*52=1248 +47*75=3525 +18*46=828 += 12037

24*89=2136 +79*34=2686 +24*49=1176 +47*85=3995 +18*33=594 += 10587

24*28=672 +79*12=948 +24*3=72 +47*0=0 +18*69=1242 += 2934

24*39=936 +79*6=474 +24*95=2280 +47*0=0 +18*88=1584 += 5274

7*63=441 +98*59=5782 +96*30=2880 +27*77=2079 +0*0=0 += 11182

7*18=126 +98*76=7448 +96*52=4992 +27*75=2025 +0*46=0 += 14591

7*89=623 +98*34=3332 +96*49=4704 +27*85=2295 +0*33=0 += 10954

7*28=196 +98*12=1176 +96*3=288 +27*0=0 +0*69=0 += 1660

7*39=273 +98*6=588 +96*95=9120 +27*0=0 +0*88=0 += 9981

[0, 0, 11878]

[0, 1, 14044]

[0, 2, 16031]

[0, 3, 5964]

[0, 4, 15874]

[1, 0, 4081]

[1, 1, 6914]

[1, 2, 8282]

[1, 3, 7479]

[1, 4, 9647]

[2, 0, 6844]

[2, 1, 9880]

[2, 2, 10636]

[2, 3, 6973]

[2, 4, 8873]

[3, 0, 10512]

[3, 1, 12037]

[3, 2, 10587]

[3, 3, 2934]

[3, 4, 5274]

[4, 0, 11182]

[4, 1, 14591]

[4, 2, 10954]

[4, 3, 1660]

[4, 4, 9981]
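
Each output cell is the dot product C[i][j] = sum over k of A[i][k] * B[k][j], with missing sparse entries treated as 0. A plain-Python version of the same computation is a handy way to verify individual cells. This is a minimal sketch; the function name sparse_multiply is an assumption, and the sample records are row 0 of A and column 0 of B as they appear in the log above.

```python
from collections import defaultdict


def sparse_multiply(a_records, b_records):
    """Multiply sparse matrices given as (i, j, value) records; return {(i, j): value}."""
    # Index B by row so each A entry only meets the B entries it can pair with.
    b_rows = defaultdict(list)
    for k, j, value in b_records:
        b_rows[k].append((j, value))

    product = defaultdict(int)
    for i, k, a_value in a_records:
        for j, b_value in b_rows[k]:
            product[(i, j)] += a_value * b_value
    return dict(product)


# Row 0 of A and column 0 of B; B has no (4, 0) record, so that term is 0.
a_row0 = [(0, 0, 63), (0, 1, 45), (0, 2, 93), (0, 3, 32), (0, 4, 49)]
b_col0 = [(0, 0, 63), (1, 0, 59), (2, 0, 30), (3, 0, 77)]
print(sparse_multiply(a_row0, b_col0))  # {(0, 0): 11878}
```

The result agrees with the first emitted tuple, [0, 0, 11878]. Unlike the reducer above, this version never visits the zero entries, so its cost scales with the number of stored records rather than with the full dimensions.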

#codingexercise: https://1drv.ms/w/s!Ashlm-Nw-wnWhM0bmlY_ggTBTNTYxQ?e=s7hP7W

Saturday, July 27, 2024

Given that tweets have location, find the happiest state:

Answer: happiest_state.py:

import json
import sys

def hw():
    # Load the AFINN sentiment lexicon; the file is tab-delimited: term<TAB>score.
    scores = {}  # initialize an empty dictionary
    with open("AFINN-111.txt") as afinnfile:
        for line in afinnfile:
            term, score = line.split("\t")
            scores[term] = int(score)  # Convert the score to an integer.
    print(scores.items())

    # Load the tweets, one JSON object per line, skipping blank lines.
    tweets = []
    with open("output.txt") as outputfile:
        for line in outputfile:
            if line.strip():
                tweets.append(json.loads(line))

    # Score each term that is missing from the lexicon by the sentiment of its neighbors.
    nonsentiment_scores = []
    for item in tweets:
        text = item.get("text")
        if not text:
            continue
        words = text.strip().split()
        for i, word in enumerate(words):
            term = word.strip().lower()
            if term in scores:
                continue
            score = 0
            if i > 0 and is_present(scores, words[i - 1]):
                score += get_score(scores, words[i - 1])
            if i + 1 < len(words) and is_present(scores, words[i + 1]):
                score += get_score(scores, words[i + 1])
            # Divide by the size of the three-word window (previous, current, next).
            nonsentiment_scores.append((term, score / 3))
    for item in nonsentiment_scores:
        print(item)

def is_present(scores, word):
    return word.strip().lower() in scores

def get_score(scores, word):
    # Return +1, -1, or 0 according to the sign of the word's sentiment score.
    value = scores.get(word.strip().lower(), 0)
    if value > 0:
        return 1
    if value < 0:
        return -1
    return 0

def lines(fp):
    print(len(fp.readlines()))

states = {

'AK': 'Alaska',

'AL': 'Alabama',

'AR': 'Arkansas',

'AS': 'American Samoa',

'AZ': 'Arizona',

'CA': 'California',

'CO': 'Colorado',

'CT': 'Connecticut',

'DC': 'District of Columbia',

'DE': 'Delaware',

'FL': 'Florida',

'GA': 'Georgia',

'GU': 'Guam',

'HI': 'Hawaii',

'IA': 'Iowa',

'ID': 'Idaho',

'IL': 'Illinois',

'IN': 'Indiana',

'KS': 'Kansas',

'KY': 'Kentucky',

'LA': 'Louisiana',

'MA': 'Massachusetts',

'MD': 'Maryland',

'ME': 'Maine',

'MI': 'Michigan',

'MN': 'Minnesota',

'MO': 'Missouri',

'MP': 'Northern Mariana Islands',

'MS': 'Mississippi',

'MT': 'Montana',

'NA': 'National',

'NC': 'North Carolina',

'ND': 'North Dakota',

'NE': 'Nebraska',

'NH': 'New Hampshire',

'NJ': 'New Jersey',

'NM': 'New Mexico',

'NV': 'Nevada',

'NY': 'New York',

'OH': 'Ohio',

'OK': 'Oklahoma',

'OR': 'Oregon',

'PA': 'Pennsylvania',

'PR': 'Puerto Rico',

'RI': 'Rhode Island',

'SC': 'South Carolina',

'SD': 'South Dakota',

'TN': 'Tennessee',

'TX': 'Texas',

'UT': 'Utah',

'VA': 'Virginia',

'VI': 'Virgin Islands',

'VT': 'Vermont',

'WA': 'Washington',

'WI': 'Wisconsin',

'WV': 'West Virginia',

'WY': 'Wyoming'

}

def main():
    sent_file = open(sys.argv[1])
    tweet_file = open(sys.argv[2])
    hw()
    lines(sent_file)
    lines(tweet_file)

if __name__ == '__main__':
    main()
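
The listing above scores terms but stops short of ranking states. One way to finish the job is to average tweet sentiment per state and pick the maximum. The following is a minimal sketch under stated assumptions: each tweet carries a place object with country_code and a full_name of the form "City, ST", and the function names, the tiny lexicon, and the sample tweets are all hypothetical.

```python
from collections import defaultdict


def tweet_sentiment(text, scores):
    """Sum the lexicon scores of the words in a tweet; unknown words count as 0."""
    return sum(scores.get(word.strip().lower(), 0) for word in text.split())


def tweet_state(tweet, valid_states):
    """Return the two-letter state code from place.full_name, or None."""
    place = tweet.get("place") or {}
    if place.get("country_code") != "US":
        return None
    code = place.get("full_name", "").rsplit(",", 1)[-1].strip()
    return code if code in valid_states else None


def happiest_state(tweets, scores, valid_states):
    """Return the state with the highest average tweet sentiment, or None."""
    totals, counts = defaultdict(int), defaultdict(int)
    for tweet in tweets:
        state = tweet_state(tweet, valid_states)
        if state and tweet.get("text"):
            totals[state] += tweet_sentiment(tweet["text"], scores)
            counts[state] += 1
    if not counts:
        return None
    return max(counts, key=lambda state: totals[state] / counts[state])


# Hypothetical lexicon and tweets for illustration only.
scores = {"good": 3, "bad": -3, "happy": 3}
tweets = [
    {"text": "good happy day", "place": {"country_code": "US", "full_name": "Austin, TX"}},
    {"text": "bad traffic", "place": {"country_code": "US", "full_name": "Los Angeles, CA"}},
    {"text": "good coffee", "place": {"country_code": "US", "full_name": "San Diego, CA"}},
    {"text": "happy", "place": None},
]
print(happiest_state(tweets, scores, {"TX", "CA"}))  # TX
```

In the script above, the keys of the states dictionary could be passed as valid_states. Averaging rather than summing keeps states with many tweets from winning on volume alone.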