Cluster computing

Monday, July 29, 2024

Problem 1

Create an inverted index. Given a set of documents, an inverted index is a dictionary that maps each word to the list of identifiers of the documents in which that word appears.
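To make the definition concrete, here is a minimal sketch that builds an inverted index with an ordinary dictionary, without any MapReduce machinery. The two documents and the function name build_inverted_index are made up for illustration and are not part of the assignment.

```python
from collections import defaultdict

# Hypothetical toy corpus: these ids and texts are invented for illustration.
documents = {
    "doc1.txt": "the cat sat",
    "doc2.txt": "the dog sat",
}

def build_inverted_index(docs):
    """Map each word to the list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for word in text.split():
            # Record each document at most once per word.
            if doc_id not in index[word]:
                index[word].append(doc_id)
    return dict(index)

index = build_inverted_index(documents)
print(index["the"])  # ['doc1.txt', 'doc2.txt']
print(index["cat"])  # ['doc1.txt']
```

The MapReduce version of the problem splits this same loop in two: the mapper produces the (word, document id) pairs and the reducer collects them per word.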

Mapper Input

The input is a 2-element list: [document_id, text], where document_id is a string representing a document identifier and text is a string representing the text of the document. The document text may have words in upper or lower case and may contain punctuation. You should treat each token as if it were a valid word; that is, you can simply call value.split() on the document text to tokenize it, with no case folding or punctuation stripping.
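As a quick illustration of what this tokenization rule implies, the snippet below splits a made-up sentence (not taken from books.json). Because split() only breaks on whitespace, tokens that differ in case or carry punctuation stay distinct keys in the index.

```python
# Hypothetical sample text, chosen to show case and punctuation effects.
text = "The Whale, the whale!  THE whale"

# split() with no arguments breaks on any run of whitespace.
tokens = text.split()
print(tokens)
# ['The', 'Whale,', 'the', 'whale!', 'THE', 'whale']

# All six tokens are different strings, so each would get its own index entry.
print(len(set(tokens)))  # 6
```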

Reducer Output

The output should be a (word, document ID list) tuple, where word is a string and document ID list is a list of strings.

You can test your solution to this problem using books.json:

python inverted_index.py books.json

You can verify your solution against inverted_index.json.

Answer:

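import sys

import MapReduce

# Part 1: create the MapReduce helper object
mr = MapReduce.MapReduce()

# Part 2: emit one (word, document_id) pair per token
def mapper(record):
    document_id = record[0]
    text = record[1]
    for word in text.split():
        mr.emit_intermediate(word, document_id)

# Part 3: emit each word with its document ids, listing each id only once
# (a word that occurs several times in a document is emitted several times)
def reducer(key, list_of_values):
    unique_ids = []
    for document_id in list_of_values:
        if document_id not in unique_ids:
            unique_ids.append(document_id)
    mr.emit((key, unique_ids))

# Part 4: run the job on the file named on the command line
if __name__ == '__main__':
    with open(sys.argv[1]) as inputdata:
        mr.execute(inputdata, mapper, reducer)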

Sample Output:

["all", ["milton-paradise.txt", "blake-poems.txt", "melville-moby_dick.txt"]]

["Rossmore", ["edgeworth-parents.txt"]]

["Consumptive", ["melville-moby_dick.txt"]]

["forbidden", ["milton-paradise.txt"]]

["child", ["blake-poems.txt"]]
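The answer depends on a MapReduce helper module that this post does not show. The sketch below is a hypothetical stand-in, named MiniMapReduce here, that assumes the helper reads one JSON-encoded record per line, groups intermediate values by key in memory, and prints each emitted item as JSON. It is meant only to make the data flow visible, not to reproduce the real module. The two input records are invented, and the reducer in this demo drops repeated ids so that each document is listed once per word, as in the sample output.

```python
import io
import json

class MiniMapReduce:
    """Hypothetical in-memory stand-in for the MapReduce helper module."""

    def __init__(self):
        self.intermediate = {}  # key -> list of values emitted by the mapper
        self.result = []        # items emitted by the reducer

    def emit_intermediate(self, key, value):
        self.intermediate.setdefault(key, []).append(value)

    def emit(self, value):
        self.result.append(value)

    def execute(self, data, mapper, reducer):
        # Assumption: the input holds one JSON-encoded record per line.
        for line in data:
            mapper(json.loads(line))
        # Call the reducer once per distinct key.
        for key in self.intermediate:
            reducer(key, self.intermediate[key])
        for item in self.result:
            print(json.dumps(item))

mr = MiniMapReduce()

def mapper(record):
    for word in record[1].split():
        mr.emit_intermediate(word, record[0])

def reducer(key, list_of_values):
    # dict.fromkeys keeps the first occurrence of each id, in order.
    mr.emit((key, list(dict.fromkeys(list_of_values))))

# Invented input: two small documents, one JSON record per line.
sample = io.StringIO(
    '["a.txt", "to be or not to be"]\n'
    '["b.txt", "to do"]\n'
)
mr.execute(sample, mapper, reducer)
# ["to", ["a.txt", "b.txt"]]
# ["be", ["a.txt"]]
# ["or", ["a.txt"]]
# ["not", ["a.txt"]]
# ["do", ["b.txt"]]
```

Note that json.dumps encodes the emitted tuple as a JSON array, which is why each line of output looks like a two-element list rather than a Python tuple.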
