Wednesday, July 31, 2024

 Problem 4

The relationship "friend" is often symmetric, meaning that if I am your friend, you are my friend. Implement a MapReduce algorithm to check whether this property holds. Generate a list of all non-symmetric friend relationships.


Map Input

Each input record is a 2-element list [personA, personB] where personA is a string representing the name of a person and personB is a string representing the name of one of personA's friends. Note that it may or may not be the case that personA is also a friend of personB.


Reduce Output

The output should be all pairs (friend, person) such that (person, friend) appears in the dataset but (friend, person) does not.
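The contract above can be pinned down with a plain-Python reference check before writing the MapReduce version (a sketch; `friends` is a hypothetical in-memory list of [personA, personB] records):

```python
def non_symmetric(friends):
    """Return (friend, person) for each (person, friend) record whose reverse is absent."""
    pairs = {tuple(record) for record in friends}
    return sorted((b, a) for (a, b) in pairs if (b, a) not in pairs)

# Only Cosette -> Valjean lacks a reverse record, so it is reported reversed.
friends = [["Fantine", "Valjean"], ["Valjean", "Fantine"], ["Cosette", "Valjean"]]
# non_symmetric(friends) -> [("Valjean", "Cosette")]
```

Running the MapReduce solution and this check over the same dataset should produce the same set of pairs.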


You can test your solution to this problem using friends.json:


Answer:

import MapReduce

import json

import sys

# Part 1

mr = MapReduce.MapReduce()

# Part 2

def mapper(record):
    # Key on the unordered pair so both directions of a relationship
    # meet at the same reducer; the value preserves the actual direction.
    personA, personB = record
    key = tuple(sorted((personA, personB)))
    mr.emit_intermediate(key, (personA, personB))


# Part 3

def reducer(key, list_of_values):
    # If only one direction was observed, the relationship is non-symmetric:
    # for the (person, friend) record that exists, emit (friend, person).
    directions = set(tuple(v) for v in list_of_values)
    if len(directions) == 1:
        person, friend = directions.pop()
        mr.emit((friend, person))


# Part 4

inputdata = open(sys.argv[1])

mr.execute(inputdata, mapper, reducer)


Sample output:

["MlleBaptistine", "Myriel"]

["MlleBaptistine", "MmeMagloire"]

["MlleBaptistine", "Valjean"]

["Fantine", "Valjean"]

["Cosette", "Valjean"]               


#codingexercise: https://1drv.ms/w/s!Ashlm-Nw-wnWhPIxEJaEe9_uKGDHgg?e=mrXrYM

Monday, July 29, 2024

 Problem 1

Create an Inverted index. Given a set of documents, an inverted index is a dictionary where each word is associated with a list of the document identifiers in which that word appears.


Mapper Input

The input is a 2-element list: [document_id, text], where document_id is a string representing a document identifier and text is a string representing the text of the document. The document text may have words in upper or lower case and may contain punctuation. You should treat each token as if it was a valid word; that is, you can just use value.split() to tokenize the string.


Reducer Output

The output should be a (word, document ID list) tuple where word is a String and document ID list is a list of Strings.


You can test your solution to this problem using books.json:


     python inverted_index.py books.json

You can verify your solution against inverted_index.json.


Answer:

import MapReduce

import json

import sys

# Part 1

mr = MapReduce.MapReduce()


# Part 2

def mapper(record):

    for word in record[1].split():

        mr.emit_intermediate(word, record[0])


# Part 3

def reducer(key, list_of_values):

    mr.emit((key, list_of_values))


# Part 4

inputdata = open(sys.argv[1])

mr.execute(inputdata, mapper, reducer)


Sample Output:

["all", ["milton-paradise.txt", "blake-poems.txt", "melville-moby_dick.txt"]]

["Rossmore", ["edgeworth-parents.txt"]]

["Consumptive", ["melville-moby_dick.txt"]]

["forbidden", ["milton-paradise.txt"]]

["child", ["blake-poems.txt"]]
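One subtlety: the reducer above passes list_of_values straight through, so if a word occurs twice in the same document that document ID would be listed twice. Depending on the dataset, an order-preserving dedup may be needed to keep each list unique (a sketch):

```python
def unique_in_order(doc_ids):
    """Drop repeated document IDs while preserving first-seen order."""
    seen = set()
    result = []
    for doc_id in doc_ids:
        if doc_id not in seen:
            seen.add(doc_id)
            result.append(doc_id)
    return result

# unique_in_order(["a.txt", "b.txt", "a.txt"]) -> ["a.txt", "b.txt"]
```

In the reducer this becomes mr.emit((key, unique_in_order(list_of_values))).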


Sunday, July 28, 2024

 Problem statement:

Assume you have two matrices A and B in a sparse matrix format, where each record is of the form i, j, value. Design a MapReduce algorithm to compute the matrix multiplication A x B


Map Input

The input to the map function will be a row of a matrix represented as a list. Each list will be of the form [matrix, i, j, value] where matrix is a string and i, j, and value are integers.


The first item, matrix, is a string that identifies which matrix the record originates from. This field has two possible values: "a" indicates that the record is from matrix A and "b" indicates that the record is from matrix B.


Reduce Output

The output from the reduce function will also be a row of the result matrix represented as a tuple. Each tuple will be of the form (i, j, value) where each element is an integer.
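The answer below can be sanity-checked against a direct computation over the same [matrix, i, j, value] records (a sketch; missing entries are treated as zero, and every cell of the L x N result is produced, matching the reducer's output):

```python
from collections import defaultdict

def sparse_matmul(records, L, M, N):
    """Reference multiply of A (L x M) and B (M x N) given sparse records."""
    a = defaultdict(int)
    b = defaultdict(int)
    for matrix, i, j, value in records:
        (a if matrix == "a" else b)[(i, j)] = value
    return [(i, j, sum(a[(i, k)] * b[(k, j)] for k in range(M)))
            for i in range(L) for j in range(N)]

# 1x2 times 2x1 example: [[1, 2]] x [[3], [4]] = [[11]]
records = [["a", 0, 0, 1], ["a", 0, 1, 2], ["b", 0, 0, 3], ["b", 1, 0, 4]]
# sparse_matmul(records, 1, 2, 1) -> [(0, 0, 11)]
```

The MapReduce version distributes the same inner sum: each reducer key is one output cell (i, j), and it sums A[i][k] * B[k][j] over k.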


Answer:

#!/usr/bin/python

import MapReduce

import json

import sys

# Part 1

mr = MapReduce.MapReduce()


# A has dimensions L,M

# B has dimensions M,N

L = 5

M = 5

N = 5

# Part 2

def mapper(record):

    print(f"record={record} \t + {record[0]} + \t + {record[1]} + \t + {record[2]} + \t + {record[3]}")

    matrix_index  = record[0]

    row           = record[1]

    col           = record[2]

    value         = record[3]

    if matrix_index == "a":

        for i in range(0, N):

            key = f"{row},{i}"

            mr.emit_intermediate(key, ("a", row, col, value))

    if matrix_index == "b":

        for j in range(0, L):

            key = f"{j},{col}"

            mr.emit_intermediate(key, ("b", row, col, value))


# Part 3

def reducer(key, list_of_values):

    # one reducer per output cell of destination matrix

    # print(f"{key},{list_of_values}")

    total = 0

    line = ""

    for k in range(0,M):

        left = getcolumn(list_of_values, k, "a")

        right = getrow(list_of_values, k, "b")

        total += left*right

        line += f"{left}*{right}={left*right} +"

    line += f"= {total}"

    print(line)

    mr.emit((int(key.split(',')[0]), int(key.split(',')[1]), total))


def getcolumn(values, k, matrix_type):

    result = 0

    for item in values:

        mtype = item[0]

        row = item[1]

        col = item[2]

        value = item[3]

        if mtype == matrix_type and col == k:

           result = value

           break

    return result


def getrow(values, k, matrix_type):

    result = 0

    for item in values:

        mtype = item[0]

        row = item[1]

        col = item[2]

        value = item[3]

        if matrix_type == mtype and row == k:

           result = value

           break

    return result


# Part 4

inputdata = open(sys.argv[1])

mr.execute(inputdata, mapper, reducer)


Output:

python3 multiply1.py matrix.json

record=['a', 0, 0, 63]   + a +   + 0 +   + 0 +   + 63

record=['a', 0, 1, 45]   + a +   + 0 +   + 1 +   + 45

record=['a', 0, 2, 93]   + a +   + 0 +   + 2 +   + 93

record=['a', 0, 3, 32]   + a +   + 0 +   + 3 +   + 32

record=['a', 0, 4, 49]   + a +   + 0 +   + 4 +   + 49

record=['a', 1, 0, 33]   + a +   + 1 +   + 0 +   + 33

record=['a', 1, 3, 26]   + a +   + 1 +   + 3 +   + 26

record=['a', 1, 4, 95]   + a +   + 1 +   + 4 +   + 95

record=['a', 2, 0, 25]   + a +   + 2 +   + 0 +   + 25

record=['a', 2, 1, 11]   + a +   + 2 +   + 1 +   + 11

record=['a', 2, 3, 60]   + a +   + 2 +   + 3 +   + 60

record=['a', 2, 4, 89]   + a +   + 2 +   + 4 +   + 89

record=['a', 3, 0, 24]   + a +   + 3 +   + 0 +   + 24

record=['a', 3, 1, 79]   + a +   + 3 +   + 1 +   + 79

record=['a', 3, 2, 24]   + a +   + 3 +   + 2 +   + 24

record=['a', 3, 3, 47]   + a +   + 3 +   + 3 +   + 47

record=['a', 3, 4, 18]   + a +   + 3 +   + 4 +   + 18

record=['a', 4, 0, 7]    + a +   + 4 +   + 0 +   + 7

record=['a', 4, 1, 98]   + a +   + 4 +   + 1 +   + 98

record=['a', 4, 2, 96]   + a +   + 4 +   + 2 +   + 96

record=['a', 4, 3, 27]   + a +   + 4 +   + 3 +   + 27

record=['b', 0, 0, 63]   + b +   + 0 +   + 0 +   + 63

record=['b', 0, 1, 18]   + b +   + 0 +   + 1 +   + 18

record=['b', 0, 2, 89]   + b +   + 0 +   + 2 +   + 89

record=['b', 0, 3, 28]   + b +   + 0 +   + 3 +   + 28

record=['b', 0, 4, 39]   + b +   + 0 +   + 4 +   + 39

record=['b', 1, 0, 59]   + b +   + 1 +   + 0 +   + 59

record=['b', 1, 1, 76]   + b +   + 1 +   + 1 +   + 76

record=['b', 1, 2, 34]   + b +   + 1 +   + 2 +   + 34

record=['b', 1, 3, 12]   + b +   + 1 +   + 3 +   + 12

record=['b', 1, 4, 6]    + b +   + 1 +   + 4 +   + 6

record=['b', 2, 0, 30]   + b +   + 2 +   + 0 +   + 30

record=['b', 2, 1, 52]   + b +   + 2 +   + 1 +   + 52

record=['b', 2, 2, 49]   + b +   + 2 +   + 2 +   + 49

record=['b', 2, 3, 3]    + b +   + 2 +   + 3 +   + 3

record=['b', 2, 4, 95]   + b +   + 2 +   + 4 +   + 95

record=['b', 3, 0, 77]   + b +   + 3 +   + 0 +   + 77

record=['b', 3, 1, 75]   + b +   + 3 +   + 1 +   + 75

record=['b', 3, 2, 85]   + b +   + 3 +   + 2 +   + 85

record=['b', 4, 1, 46]   + b +   + 4 +   + 1 +   + 46

record=['b', 4, 2, 33]   + b +   + 4 +   + 2 +   + 33

record=['b', 4, 3, 69]   + b +   + 4 +   + 3 +   + 69

record=['b', 4, 4, 88]   + b +   + 4 +   + 4 +   + 88

63*63=3969 +45*59=2655 +93*30=2790 +32*77=2464 +49*0=0 += 11878

63*18=1134 +45*76=3420 +93*52=4836 +32*75=2400 +49*46=2254 += 14044

63*89=5607 +45*34=1530 +93*49=4557 +32*85=2720 +49*33=1617 += 16031

63*28=1764 +45*12=540 +93*3=279 +32*0=0 +49*69=3381 += 5964

63*39=2457 +45*6=270 +93*95=8835 +32*0=0 +49*88=4312 += 15874

33*63=2079 +0*59=0 +0*30=0 +26*77=2002 +95*0=0 += 4081

33*18=594 +0*76=0 +0*52=0 +26*75=1950 +95*46=4370 += 6914

33*89=2937 +0*34=0 +0*49=0 +26*85=2210 +95*33=3135 += 8282

33*28=924 +0*12=0 +0*3=0 +26*0=0 +95*69=6555 += 7479

33*39=1287 +0*6=0 +0*95=0 +26*0=0 +95*88=8360 += 9647

25*63=1575 +11*59=649 +0*30=0 +60*77=4620 +89*0=0 += 6844

25*18=450 +11*76=836 +0*52=0 +60*75=4500 +89*46=4094 += 9880

25*89=2225 +11*34=374 +0*49=0 +60*85=5100 +89*33=2937 += 10636

25*28=700 +11*12=132 +0*3=0 +60*0=0 +89*69=6141 += 6973

25*39=975 +11*6=66 +0*95=0 +60*0=0 +89*88=7832 += 8873

24*63=1512 +79*59=4661 +24*30=720 +47*77=3619 +18*0=0 += 10512

24*18=432 +79*76=6004 +24*52=1248 +47*75=3525 +18*46=828 += 12037

24*89=2136 +79*34=2686 +24*49=1176 +47*85=3995 +18*33=594 += 10587

24*28=672 +79*12=948 +24*3=72 +47*0=0 +18*69=1242 += 2934

24*39=936 +79*6=474 +24*95=2280 +47*0=0 +18*88=1584 += 5274

7*63=441 +98*59=5782 +96*30=2880 +27*77=2079 +0*0=0 += 11182

7*18=126 +98*76=7448 +96*52=4992 +27*75=2025 +0*46=0 += 14591

7*89=623 +98*34=3332 +96*49=4704 +27*85=2295 +0*33=0 += 10954

7*28=196 +98*12=1176 +96*3=288 +27*0=0 +0*69=0 += 1660

7*39=273 +98*6=588 +96*95=9120 +27*0=0 +0*88=0 += 9981

[0, 0, 11878]

[0, 1, 14044]

[0, 2, 16031]

[0, 3, 5964]

[0, 4, 15874]

[1, 0, 4081]

[1, 1, 6914]

[1, 2, 8282]

[1, 3, 7479]

[1, 4, 9647]

[2, 0, 6844]

[2, 1, 9880]

[2, 2, 10636]

[2, 3, 6973]

[2, 4, 8873]

[3, 0, 10512]

[3, 1, 12037]

[3, 2, 10587]

[3, 3, 2934]

[3, 4, 5274]

[4, 0, 11182]

[4, 1, 14591]

[4, 2, 10954]

[4, 3, 1660]

[4, 4, 9981]


#codingexercise: https://1drv.ms/w/s!Ashlm-Nw-wnWhM0bmlY_ggTBTNTYxQ?e=s7hP7W

Saturday, July 27, 2024

 Given that tweets have location, find the happiest state:

Answer: happiest_state.py:

import sys


def hw():

    afinnfile = open("AFINN-111.txt")

    scores = {} # initialize an empty dictionary

    for line in afinnfile:

      term, score = line.split("\t")  # The file is tab-delimited. "\t" means "tab character"

      scores[term] = int(score)  # Convert the score to an integer.

    print(scores.items())


    import json

    outputfile = open("output.txt")

    tweets = []

    for line in outputfile:

      tweets.append(json.loads(line))


    nonsentiment_scores = []

    for item in tweets:

        text = item.get('text')

        if not text:

            continue

        words = text.strip().split()

        for i, word in enumerate(words):

            term = word.strip().lower()

            if term not in scores:

                # Estimate an unknown term's sentiment from its scored neighbors.

                score = 0

                if i > 0 and is_present(scores, words[i - 1]):

                    score += 1 if get_score(scores, words[i - 1]) > 0 else -1

                if i + 1 < len(words) and is_present(scores, words[i + 1]):

                    score += 1 if get_score(scores, words[i + 1]) > 0 else -1

                score = score / 3

                nonsentiment_scores.append((term, score))


    for item in nonsentiment_scores:

        print(item)


def is_present(scores, word):

    term = word.strip().lower()

    return term in scores


def get_score(scores, word):

    score = 0

    term = word.strip().lower()

    if term in scores:

       if scores[term] > 0:

          score += 1

       elif scores[term] < 0:

          score -= 1

    return score


def lines(fp):

    print(len(fp.readlines()))


states = {

        'AK': 'Alaska',

        'AL': 'Alabama',

        'AR': 'Arkansas',

        'AS': 'American Samoa',

        'AZ': 'Arizona',

        'CA': 'California',

        'CO': 'Colorado',

        'CT': 'Connecticut',

        'DC': 'District of Columbia',

        'DE': 'Delaware',

        'FL': 'Florida',

        'GA': 'Georgia',

        'GU': 'Guam',

        'HI': 'Hawaii',

        'IA': 'Iowa',

        'ID': 'Idaho',

        'IL': 'Illinois',

        'IN': 'Indiana',

        'KS': 'Kansas',

        'KY': 'Kentucky',

        'LA': 'Louisiana',

        'MA': 'Massachusetts',

        'MD': 'Maryland',

        'ME': 'Maine',

        'MI': 'Michigan',

        'MN': 'Minnesota',

        'MO': 'Missouri',

        'MP': 'Northern Mariana Islands',

        'MS': 'Mississippi',

        'MT': 'Montana',

        'NA': 'National',

        'NC': 'North Carolina',

        'ND': 'North Dakota',

        'NE': 'Nebraska',

        'NH': 'New Hampshire',

        'NJ': 'New Jersey',

        'NM': 'New Mexico',

        'NV': 'Nevada',

        'NY': 'New York',

        'OH': 'Ohio',

        'OK': 'Oklahoma',

        'OR': 'Oregon',

        'PA': 'Pennsylvania',

        'PR': 'Puerto Rico',

        'RI': 'Rhode Island',

        'SC': 'South Carolina',

        'SD': 'South Dakota',

        'TN': 'Tennessee',

        'TX': 'Texas',

        'UT': 'Utah',

        'VA': 'Virginia',

        'VI': 'Virgin Islands',

        'VT': 'Vermont',

        'WA': 'Washington',

        'WI': 'Wisconsin',

        'WV': 'West Virginia',

        'WY': 'Wyoming'

}


def main():

    sent_file = open(sys.argv[1])

    tweet_file = open(sys.argv[2])

    hw()

    lines(sent_file)

    lines(tweet_file)


if __name__ == '__main__':

    main()
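Note that the script above builds a `states` lookup but never aggregates scores by state, so it does not yet answer the question. A minimal sketch of the missing step, assuming each tweet has already been reduced to a `(state_code, score)` pair (how the state is extracted from the tweet's place or coordinates fields is left open):

```python
from collections import defaultdict

def happiest_state(state_scores):
    """Average the sentiment scores per state; return the state with the highest mean."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for state, score in state_scores:
        totals[state] += score
        counts[state] += 1
    return max(totals, key=lambda s: totals[s] / counts[s])

# happiest_state([("WA", 2), ("WA", 4), ("TX", 1)]) -> "WA"
```

The returned two-letter code can then be mapped to a full name via the `states` dictionary.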


Friday, July 26, 2024

 Tweet sentiment analyzer:

import sys


def hw():

    afinnfile = open("AFINN-111.txt")

    scores = {} # initialize an empty dictionary

    for line in afinnfile:

      term, score = line.split("\t")  # The file is tab-delimited. "\t" means "tab character"

      scores[term] = int(score)  # Convert the score to an integer.

    print(scores.items())


    import json

    outputfile = open("output.txt")

    tweets = []

    for line in outputfile:

      tweets.append(json.loads(line))


    for item in tweets:

        text = item.get('text')

        if text:

           words = text.strip().split()

           score = 0

           for word in words:

               term = word.strip().lower()

               if term in scores:

                   if scores[term] > 0:

                      score += 1

                   elif scores[term] < 0:

                      score -= 1

           if len(words) > 0:

              score = score / len(words)

           print(score)

        else:

           print(0)


def lines(fp):

    print(len(fp.readlines()))


def main():

    sent_file = open(sys.argv[1])

    tweet_file = open(sys.argv[2])

    hw()

    lines(sent_file)

    lines(tweet_file)


if __name__ == '__main__':

    main()
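The per-tweet scoring loop above can be isolated into a testable helper (a sketch; as in the loop, each scored word contributes +1 or -1 and the total is normalized by tweet length):

```python
def tweet_score(text, scores):
    """Sum the sign of each AFINN-scored word, normalized by the number of words."""
    words = text.strip().split()
    if not words:
        return 0
    total = 0
    for word in words:
        sentiment = scores.get(word.strip().lower(), 0)
        if sentiment > 0:
            total += 1
        elif sentiment < 0:
            total -= 1
    return total / len(words)

# tweet_score("good good bad vibes", {"good": 3, "bad": -2}) -> 0.25
```

Keeping the scoring pure (no file I/O) makes it easy to unit-test separately from the JSON parsing.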


Thursday, July 25, 2024

This is a continuation of previous articles on Azure resources, their IaC deployments, and trends in data infrastructure. The previous article discussed how data platforms emphasize that customer data is proprietary and should not be handed over to vendors, or even to the platform itself. This section continues that line of discussion by elaborating on understanding data.

The role of data in modern business operations is changing, with organizations facing the dual challenge of harnessing its potential and safeguarding it with utmost care. Data governance is crucial for businesses to ensure the protection, governance, and effective management of their data assets. Compliance frameworks like the EU's AI Act highlight the importance of maintaining high-quality data for successful AI integration and utilization.

The complex web of data governance presents multifaceted challenges, especially in the realm of data silos and disparate governance mechanisms. Tracking data provenance, ensuring data visibility, and implementing robust protection schemes are crucial for mitigating cybersecurity risks and ensuring data integrity across various platforms and applications.

The evolution of artificial intelligence (AI) introduces new dimensions to data management practices, as organizations explore the transformative potential of AI and machine learning technologies. Leveraging AI for tasks like backup recovery, compliance, and data protection plans offers unprecedented opportunities for enhancing operational efficiencies and driving innovation within businesses.

The future of data management lies at the intersection of compliance, resilience, security, backup, recovery, and AI integration. By embracing these foundational pillars, businesses can navigate the intricate landscape of data governance with agility and foresight, paving the way for sustainable data-driven strategies and robust cybersecurity protocols.

Prioritizing data management practices that align with compliance standards and cybersecurity best practices is key. By embracing the transformative potential of AI while maintaining a steadfast commitment to data protection, businesses can navigate the complexities of the digital landscape with confidence and resilience.

References:

Previous article explaining a catalog: IaCResolutionsPart148.docx

https://docs.databricks.com/en/data-governance/unity-catalog/enable-workspaces.html#enable-workspace 

https://docs.databricks.com/en/data-governance/unity-catalog/create-metastore.html


#codingexercise: https://1drv.ms/w/s!Ashlm-Nw-wnWhPIMgfH3QDAPfwCW6Q?e=dM89NH