Problem 1
Create an inverted index. Given a set of documents, an inverted index is a dictionary that maps each word to the list of identifiers of the documents in which that word appears.
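To make the definition concrete, here is a minimal sketch in plain Python, with no MapReduce involved. The two documents and their IDs are made up for illustration, and build_inverted_index is a hypothetical helper name, not part of the assignment.

```python
# Hypothetical two-document corpus, used only for illustration.
documents = {
    "doc1.txt": "the cat sat",
    "doc2.txt": "the dog sat down",
}

def build_inverted_index(docs):
    """Map each word to the list of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            ids = index.setdefault(word, [])
            if doc_id not in ids:  # list each document only once
                ids.append(doc_id)
    return index

index = build_inverted_index(documents)
print(index["the"])  # ['doc1.txt', 'doc2.txt']
print(index["dog"])  # ['doc2.txt']
```

The MapReduce version of the problem splits this same loop in two: the mapper produces the (word, document ID) pairs and the reducer gathers them per word.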
Mapper Input
The input is a 2-element list: [document_id, text], where document_id is a string representing a document identifier and text is a string representing the text of the document. The document text may have words in upper or lower case and may contain punctuation. You should treat each token as if it were a valid word; that is, you can simply tokenize the text with value.split(), where value is the text string.
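As a quick illustration of what this tokenization rule implies, the sketch below splits a made-up record. The record contents are an assumption for the example and are not taken from books.json.

```python
# A made-up record in the [document_id, text] shape described above.
record = ["sample.txt", "Call me Ishmael. Some years ago, never mind how long"]

document_id, text = record
tokens = text.split()  # splits on runs of whitespace only

print(tokens[:3])  # ['Call', 'me', 'Ishmael.']
```

Because split() only looks at whitespace, "Ishmael." keeps its period and "Call" keeps its capital letter, so "Call" and "call" would end up as separate keys in the index. That is acceptable for this problem.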
Reducer Output
The output should be a (word, document ID list) tuple, where word is a string and document ID list is a list of strings.
You can test your solution to this problem using books.json:
python inverted_index.py books.json
You can verify your solution against inverted_index.json.
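The order of the records, and of the IDs within each list, may differ from the reference file, so a line-by-line diff can report differences that do not matter. The sketch below compares two outputs while ignoring order. It assumes that both your output and inverted_index.json hold one JSON-encoded [word, [document IDs]] record per line, as in the sample output; that file layout is an assumption, and load_index is a hypothetical helper name.

```python
import json

def load_index(lines):
    """Parse lines of JSON [word, [doc_id, ...]] into {word: set(doc_ids)}."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, doc_ids = json.loads(line)
        index[word] = set(doc_ids)
    return index

# Two in-memory "files" with the same content in a different order.
mine = ['["all", ["b.txt", "a.txt"]]', '["child", ["a.txt"]]']
expected = ['["child", ["a.txt"]]', '["all", ["a.txt", "b.txt"]]']

print(load_index(mine) == load_index(expected))  # True
```

With real files, an open file object can be passed in place of the lists. Comparing sets ignores order but also hides duplicate IDs, so check for duplicates separately.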
Answer:
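import MapReduce
import sys

# Part 1: create the MapReduce object supplied with the assignment
mr = MapReduce.MapReduce()

# Part 2: emit a (word, document_id) pair for every token in the text
def mapper(record):
    # record is [document_id, text]
    for word in record[1].split():
        mr.emit_intermediate(word, record[0])

# Part 3: collect the document IDs for one word
def reducer(key, list_of_values):
    # A word that occurs several times in a document yields repeated IDs,
    # so keep each ID once (first-seen order) to match the sample output.
    unique_ids = []
    for doc_id in list_of_values:
        if doc_id not in unique_ids:
            unique_ids.append(doc_id)
    mr.emit((key, unique_ids))

# Part 4: run the job on the file named on the command line
inputdata = open(sys.argv[1])
mr.execute(inputdata, mapper, reducer)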
Sample Output:
["all", ["milton-paradise.txt", "blake-poems.txt", "melville-moby_dick.txt"]]
["Rossmore", ["edgeworth-parents.txt"]]
["Consumptive", ["melville-moby_dick.txt"]]
["forbidden", ["milton-paradise.txt"]]
["child", ["blake-poems.txt"]]
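The MapReduce module ships with the assignment and is not shown here. To experiment without it, a small stand-in can mimic the three calls the answer relies on: emit_intermediate, emit, and execute. The class below is only an assumption about how such a helper could behave, not the course's actual implementation; MiniMapReduce and the two input lines are hypothetical, and this execute returns the results instead of printing them.

```python
import json

class MiniMapReduce:
    """Hypothetical stand-in exposing emit_intermediate, emit, and execute."""

    def __init__(self):
        self.intermediate = {}  # key -> list of values from the mappers
        self.result = []        # records emitted by the reducers

    def emit_intermediate(self, key, value):
        self.intermediate.setdefault(key, []).append(value)

    def emit(self, value):
        self.result.append(value)

    def execute(self, data, mapper, reducer):
        # Map phase: one JSON record per input line.
        for line in data:
            mapper(json.loads(line))
        # Reduce phase: one call per distinct key.
        for key, values in self.intermediate.items():
            reducer(key, values)
        return self.result

mr = MiniMapReduce()

def mapper(record):
    for word in record[1].split():
        mr.emit_intermediate(word, record[0])

def reducer(key, list_of_values):
    # Keep each document ID once, in first-seen order.
    unique_ids = []
    for doc_id in list_of_values:
        if doc_id not in unique_ids:
            unique_ids.append(doc_id)
    mr.emit((key, unique_ids))

# Made-up input lines in the [document_id, text] format.
lines = ['["a.txt", "all is all"]', '["b.txt", "all child"]']
for item in mr.execute(lines, mapper, reducer):
    print(json.dumps(item))
```

Here "all" occurs twice in a.txt, so the mapper emits that ID twice for the same key; the reducer is where the repeated ID gets collapsed into a single entry.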