ECS765P (2017-2018)

ECS765P Big Data Processing

SOLUTIONS AND MARKING SCHEME

Question 1

You have a dataset of tweets annotated with the personality of the author, according to the Myers-Briggs Type Indicator (MBTI for short). MBTI is a personality type system that divides everyone into 16 distinct personality types across 4 axes (introversion/intuition/thinking/perceiving).

The dataset is a collection of rows, each one containing the following information for a single user:

[userId; mbtiType; messages]

The messages field contains the last 50 tweets the user posted, with each tweet separated by ";;;" (3 semicolon characters).
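As an illustration of the row format described above, the following is a minimal sketch of how a single row could be split into its fields. It assumes that the three fields are separated by a single ";" character and that neither userId nor mbtiType contains one; the question does not state this explicitly. The names RowParser, UserRow and parse are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class RowParser {
    // Hypothetical holder for one parsed row; not part of the original question.
    static final class UserRow {
        final String userId;
        final String mbtiType;
        final List<String> tweets;

        UserRow(String userId, String mbtiType, List<String> tweets) {
            this.userId = userId;
            this.mbtiType = mbtiType;
            this.tweets = tweets;
        }
    }

    // Assumption: fields are separated by a single ';' and the first two
    // fields never contain that character.
    static UserRow parse(String line) {
        // A limit of 3 stops splitting after the second ';', so the ";;;"
        // separators inside the messages field are left untouched.
        String[] fields = line.split(";", 3);
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + line);
        }
        // ";;;" contains no regex metacharacters, so it can be used directly.
        List<String> tweets = Arrays.asList(fields[2].split(";;;"));
        return new UserRow(fields[0].trim(), fields[1].trim(), tweets);
    }

    public static void main(String[] args) {
        UserRow row = parse("u1;INTJ;hello;;;big data;;;ok");
        System.out.println(row.mbtiType + " " + row.tweets);
    }
}
```

Under these assumptions a tweet that itself contains ";;;" would be split incorrectly; the question gives no escaping rule, so the sketch ignores that case.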

(a)  Write a Map/Reduce program that computes the average length of the tweet messages from each personality type.

Use pseudocode for the program specification. You must clearly define the input and output of each one of your functions. State in your solutions any assumptions that are made as part of the program, as well as the behaviour of any custom function you deem necessary.

The code flow must be explained, discussing the input and output of each function that has been defined. You may use a diagram to illustrate the overall data flow.

[13 marks basic]

Solution:

    // Input: one row per user; data holds the mbtiType and messages fields.
    // Output: (type, (number of tweets, total length of those tweets)).
    public void Map(String userId, Pair data) {
        String type = data.getType();
        String[] messages = data.getMessages().split(";;;");
        int n = 0;
        int totalLength = 0;
        for (String message : messages) {
            n++;
            totalLength += message.length();
        }
        emit(type, new Pair(n, totalLength));
    }

    // Input: (type, list of (number of tweets, total length) pairs).
    // Output: (type, average tweet length).
    public void Reduce(String type, List<Pair> values) {
        int totalMessages = 0;
        int totalLength = 0;
        for (Pair pair : values) {
            totalMessages += pair.getLeft();
            totalLength += pair.getRight();
        }
        // Cast to avoid integer division truncating the average.
        emit(type, (double) totalLength / totalMessages);
    }

Marking scheme:
Code flow (input data, mapper output, reducer input, reducer output)
Mapper code: marks
Reducer code: marks
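The data flow of the solution above can be checked locally with a small simulation. This is only a sketch under stated assumptions: the class AverageTweetLength and its run method are hypothetical stand-ins for the Map/Reduce framework (run performs the grouping by key that the shuffle phase would do), and each input row is reduced to a {mbtiType, messages} pair.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AverageTweetLength {
    // Mapper: one row in, one (type, {tweetCount, totalLength}) pair out.
    static Map.Entry<String, long[]> map(String mbtiType, String messages) {
        long count = 0;
        long totalLength = 0;
        for (String tweet : messages.split(";;;")) {
            count++;
            totalLength += tweet.length();
        }
        return new AbstractMap.SimpleEntry<>(mbtiType, new long[] {count, totalLength});
    }

    // Reducer: all partial {tweetCount, totalLength} pairs of one type in,
    // the average tweet length of that type out.
    static double reduce(List<long[]> values) {
        long totalMessages = 0;
        long totalLength = 0;
        for (long[] value : values) {
            totalMessages += value[0];
            totalLength += value[1];
        }
        return (double) totalLength / totalMessages;
    }

    // Hypothetical stand-in for the framework: map every row, group the
    // mapper output by key (the shuffle), then reduce each group.
    static Map<String, Double> run(List<String[]> rows) {
        Map<String, List<long[]>> grouped = new TreeMap<>();
        for (String[] row : rows) {
            Map.Entry<String, long[]> out = map(row[0], row[1]);
            grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
        }
        Map<String, Double> averages = new TreeMap<>();
        grouped.forEach((type, values) -> averages.put(type, reduce(values)));
        return averages;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"INTJ", "hello;;;big data"},
            new String[] {"INTJ", "ok"},
            new String[] {"ENFP", "hi;;;hey"});
        System.out.println(run(rows)); // prints {ENFP=2.5, INTJ=5.0}
    }
}
```

The mapper emits a (count, total length) pair rather than a per-user average because averaging per-user averages gives the wrong result when users contribute different numbers of tweets. In the example, the two INTJ users have per-user averages of 6.5 and 2.0, whose mean is 4.25, while the true average over their three tweets is 5.0.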

(b)  Discuss how you would modify the program presented in 1(a) in order to compute, for each personality type, both the number of members and the average tweet length. You should base your explanation on what information can be transferred in the flow of a Map/Reduce job.