

ECS765P (2019-2020)

ECS765P           Big Data Processing

SOLUTIONS AND MARKING SCHEME


Question 1

You have a dataset from a non-profit organisation listing the full history of member contributions for emergency appeals when e.g. a disaster happens.

For the next emergency appeals campaign, the organisation is aiming to send postal envelopes to the 500 highest contributors up to this point.

The dataset is a collection of rows, each one registering an individual contribution from a member in the following format:


timestamp;memberId;isMultipleDonor;donationAmount;memberAge


The isMultipleDonor field is a boolean field that will be True if the member has made more than one donation. donationAmount contains the value of that individual donation, whereas memberAge includes for how many months they have been a member.

(a)   (i)  Design a combination of *two* MapReduce programs that computes the top 500

members with highest individual average contributions to these urgent appeal campaigns. You should only consider members with more than one individual contribution.

The code flow must be explained, discussing the input and output of each Map and Reduce function that has been defined (without detailing the implementation). You may use a diagram to illustrate the overall data flow.

Note: while a single MapReduce solution is technically possible, it would result in potentially some scalability/performance issues. Single MapReduce solutions should mention these aspects specifically.

(ii)  Write the pseudocode for one of the two MapReduce jobs you designed in the

previous question.  State in your solutions any assumptions that are made as part of the program, as well as the behaviour of any custom function you deem necessary.

[15 marks basic]


Solution:

The first program computes the average contribution of each member who is a multiple-times donor. Its output is fed to the second program, which ranks all of these per-member averages at a single reducer and emits the top 500 (a top-N numerical summarisation). The first mapper emits (memberId, donationAmount); the first reducer emits (memberId, averageAmount). The second mapper emits (None, (memberId, averageAmount)); the second reducer emits 500 x (memberId, ranking, averageAmount).

Either of these two MR programs is sufficient for the 5 marks.




def mapper1(self, _, line): \mk{1}
    fields = line.split(";")
    multipleDonor = fields[2]
    if multipleDonor == "True": \mk{1}  # the field is read as a string, so compare explicitly
        memberId = fields[1]
        amount = float(fields[3])       # convert before averaging
        yield (memberId, amount) \mk{1}

def reducer1(self, memberId, values): \mk{1}
    # computeAverage: custom helper returning the mean of an iterable of numbers
    average = computeAverage(values)
    yield (memberId, average) \mk{1}

def mapper2(self, memberId, average): \mk{1}
    yield (None, (memberId, average)) \mk{1}

def reducer2(self, _, values): \mk{1}
    # sortByValue: custom helper sorting (memberId, average) pairs
    # by average, highest first
    ranked = sortByValue(values)
    top500 = ranked[:500] \mk{1}
    for item in top500:
        yield (item[0], item[1]) \mk{1}
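As a sanity check, the two-job flow can be simulated in plain Python outside Hadoop. The sample rows and the inline grouping below are illustrative only; they mimic the shuffle between mapper and reducer rather than reproduce any framework API.

```python
from collections import defaultdict

# Illustrative rows in the stated format:
# timestamp;memberId;isMultipleDonor;donationAmount;memberAge
rows = [
    "t1;m1;True;10.0;12",
    "t2;m1;True;30.0;12",
    "t3;m2;False;100.0;3",
    "t4;m3;True;5.0;24",
    "t5;m3;True;15.0;24",
]

# Job 1: mapper1 filters to multiple donors and keys donations by memberId;
# the defaultdict stands in for the shuffle-and-sort grouping.
groups = defaultdict(list)
for line in rows:
    fields = line.split(";")
    if fields[2] == "True":
        groups[fields[1]].append(float(fields[3]))

# reducer1: one average per member
averages = {m: sum(v) / len(v) for m, v in groups.items()}

# Job 2: mapper2 sends every pair to a single key (None), so one reducer
# sees them all; reducer2 ranks by average and truncates to the top 500.
ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
top = ranked[:500]
print(top)  # [('m1', 20.0), ('m3', 10.0)] -- m2 is excluded as a single donor
```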

(b)  This question is about the Combiner in the context of MapReduce jobs.

(i)  Explain who runs the Combiner function and at what point in time Combiner functions execute during the shuffle and sort stage of MapReduce jobs.

(ii)  Discuss whether the *second* job from 1a) would benefit from using a Combiner.

To do so, first list what will change when a Combiner runs, and then explain its performance impact.  If you completed 1a) with a single MapReduce job, then answer the question referring to that job.

[10 marks medium]
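For reference, the shuffle saving asked about in (b)(ii) can be sketched as a local top-500 Combiner for the second job. The function name and use of heapq are illustrative, not a required API: because a global top-500 is unchanged by first taking each mapper's local top-500, the Combiner preserves correctness while capping what each mapper shuffles to the single reducer at 500 pairs.

```python
import heapq

def combiner2(key, values):
    # Runs on the mapper's node, over only that mapper's (memberId, average)
    # pairs. Keeps the local top 500 by average, so at most 500 pairs per
    # mapper travel across the network instead of every pair.
    top = heapq.nlargest(500, values, key=lambda pair: pair[1])
    for pair in top:
        yield (key, pair)
```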