

ECS765P (2019-2020)

ECS765P           Big Data Processing

SOLUTIONS AND MARKING SCHEME


Question 1

You have a dataset from a non-profit organisation listing the full history of member contributions for emergency appeals when e.g. a disaster happens.

For the next emergency appeals campaign, the organisation is aiming to send postal envelopes to the 500 highest contributors up to this point.

The dataset is a collection of rows, each one registering an individual contribution from a member in the following format:


timestamp;memberId;isMultipleDonor;donationAmount;memberAge


The isMultipleDonor field is a boolean field that will be True if the member has made more than one donation. donationAmount contains the value of that individual donation, whereas memberAge includes for how many months they have been a member.

(a)   (i)  Design a combination of *two* MapReduce programs that computes the top 500

members with highest individual average contributions to these urgent appeal campaigns. You should only consider members with more than one individual contribution.

The code flow must be explained, discussing the input and output of each Map and Reduce function that has been defined (without detailing the implementation). You may use a diagram to illustrate the overall data flow.

Note: while a single MapReduce solution is technically possible, it would result in potentially some scalability/performance issues. Single MapReduce solutions should mention these aspects specifically.

(ii)  Write the pseudocode for one of the two MapReduce jobs you designed in the

previous question.  State in your solutions any assumptions that are made as part of the program, as well as the behaviour of any custom function you deem necessary.

[15 marks basic]


Solution:

The first program computes the average contribution of each member who is a multiple-times donor. Its output is fed to the second program, which ranks all of these per-member averages at a single reducer and emits the top 500 (a top-N numerical summarisation). The first mapper emits (memberId, donationAmount); the first reducer emits (memberId, averageAmount). The second mapper emits (None, (memberId, averageAmount)); the second reducer emits 500 x (memberId, ranking, averageAmount).

Either of these two MR programs is sufficient for the 5 marks.




def mapper1(self, _, line): \mk{1}
    fields = line.split(";")
    multipleDonor = fields[2]
    if multipleDonor == "True": \mk{1}  # the field is read as a string, so compare explicitly
        memberId = fields[1]
        amount = float(fields[3])       # convert before averaging
        yield (memberId, amount) \mk{1}

def reducer1(self, memberId, values): \mk{1}
    # computeAverage: custom helper returning the mean of an iterable of numbers
    average = computeAverage(values)
    yield (memberId, average) \mk{1}

def mapper2(self, memberId, average): \mk{1}
    yield (None, (memberId, average)) \mk{1}

def reducer2(self, _, values): \mk{1}
    # sortByValue: custom helper sorting (memberId, average) pairs
    # by average, highest first
    ranked = sortByValue(values)
    top500 = ranked[:500] \mk{1}
    for item in top500:
        yield (item[0], item[1]) \mk{1}
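As a sanity check, the two-job flow can be simulated in plain Python outside Hadoop. The sample rows and the inline grouping below are illustrative only; they mimic the shuffle between mapper and reducer rather than reproduce any framework API.

```python
from collections import defaultdict

# Illustrative rows in the stated format:
# timestamp;memberId;isMultipleDonor;donationAmount;memberAge
rows = [
    "t1;m1;True;10.0;12",
    "t2;m1;True;30.0;12",
    "t3;m2;False;100.0;3",
    "t4;m3;True;5.0;24",
    "t5;m3;True;15.0;24",
]

# Job 1: mapper1 filters to multiple donors and keys donations by memberId;
# the defaultdict stands in for the shuffle-and-sort grouping.
groups = defaultdict(list)
for line in rows:
    fields = line.split(";")
    if fields[2] == "True":
        groups[fields[1]].append(float(fields[3]))

# reducer1: one average per member
averages = {m: sum(v) / len(v) for m, v in groups.items()}

# Job 2: mapper2 sends every pair to a single key (None), so one reducer
# sees them all; reducer2 ranks by average and truncates to the top 500.
ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
top = ranked[:500]
print(top)  # [('m1', 20.0), ('m3', 10.0)] -- m2 is excluded as a single donor
```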

(b)  This question is about the Combiner in the context of MapReduce jobs.

(i)  Explain who runs the Combiner function and at what point in time Combiner functions execute during the shuffle and sort stage of MapReduce jobs.

(ii)  Discuss whether the *second* job from 1a) would benefit from using a Combiner.

To do so, first list what will change when a Combiner runs, and then explain its performance impact.  If you completed 1a) with a single MapReduce job, then answer the question referring to that job.

[10 marks medium]
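For reference, the shuffle saving asked about in (b)(ii) can be sketched as a local top-500 Combiner for the second job. The function name and use of heapq are illustrative, not a required API: because a global top-500 is unchanged by first taking each mapper's local top-500, the Combiner preserves correctness while capping what each mapper shuffles to the single reducer at 500 pairs.

```python
import heapq

def combiner2(key, values):
    # Runs on the mapper's node, over only that mapper's (memberId, average)
    # pairs. Keeps the local top 500 by average, so at most 500 pairs per
    # mapper travel across the network instead of every pair.
    top = heapq.nlargest(500, values, key=lambda pair: pair[1])
    for pair in top:
        yield (key, pair)
```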