ECS765P (2017-2018)

ECS765P Big Data Processing


Question 1

You have a dataset of tweets annotated with the personality of the author, according to the Myers-Briggs Type Indicator (MBTI for short). MBTI is a personality type system that divides everyone into 16 distinct personality types across 4 axes (introversion/intuition/thinking/perceiving).

The dataset is a collection of rows, each one containing the following information for a single user:

[userId ; mbtiType ; messages]

The messages field contains the last 50 tweets the user posted, with each tweet separated by ";;;" (3 semicolon characters).

(a)  Write a Map/Reduce program that computes the average length of the tweet messages from each personality type.

Use pseudocode for the program specification. You must clearly define the input and output of each one of your functions. State in your solutions any assumptions that are made as part of the program, as well as the behaviour of any custom function you deem necessary.

The code flow must be explained, discussing the input and output of each function that has been defined. You may use a diagram to illustrate the overall data flow.

[13 marks basic]


public void Map(String userId, Pair<String, String> data) {
    // data holds (mbtiType, messages) for a single user
    String type = data.getType();
    // tweets are separated by ";;;", as stated in the question
    String[] messages = data.getMessages().split(";;;");
    int n = 0;
    int totalLength = 0;
    for (String message : messages) {
        n++;
        totalLength += message.length();
    }
    // emit (number of tweets, total characters) for this user
    emit(type, new Pair(n, totalLength));
}

public void Reduce(String type, List<Pair> values) {
    int totalMessages = 0;
    int totalLength = 0;
    for (Pair pair : values) {
        totalMessages += pair.getLeft();
        totalLength += pair.getRight();
    }
    // cast so the average is not truncated by integer division
    emit(type, (double) totalLength / totalMessages);
}
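The data flow of the answer can be checked locally with a small in-memory simulation. The following is a minimal sketch, not a real Hadoop job: the names `map_user`, `reduce_type` and `run_job` are hypothetical, the sample rows are made up for illustration, and the tweet separator is taken to be ";;;" as stated in the question. The mapper emits one (tweet count, total length) pair per user, the shuffle groups pairs by personality type, and the reducer divides the summed length by the summed count.

```python
from collections import defaultdict

SEPARATOR = ";;;"  # tweet separator given in the question


def map_user(user_id, mbti_type, messages):
    """Mapper: emit (type, (tweet_count, total_length)) for one user row."""
    tweets = messages.split(SEPARATOR)
    total_length = sum(len(tweet) for tweet in tweets)
    return mbti_type, (len(tweets), total_length)


def reduce_type(mbti_type, values):
    """Reducer: average tweet length over all users of one type."""
    total_messages = sum(count for count, _ in values)
    total_length = sum(length for _, length in values)
    return mbti_type, total_length / total_messages


def run_job(rows):
    """Simulate the shuffle: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for user_id, mbti_type, messages in rows:
        key, value = map_user(user_id, mbti_type, messages)
        grouped[key].append(value)
    return dict(reduce_type(key, values) for key, values in grouped.items())


# Illustrative rows only (not taken from the real dataset).
rows = [
    ("u1", "INTJ", "hello;;;hi"),     # tweet lengths 5 and 2
    ("u2", "INTJ", "abc"),            # tweet length 3
    ("u3", "ENFP", "four;;;sixsix"),  # tweet lengths 4 and 6
]
averages = run_job(rows)
```

Emitting the count together with the total, rather than a per-user average, is what keeps the result correct when users contribute different numbers of tweets.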


Marking scheme:

Code flow (input data, mapper output, reducer input, reducer output)
Mapper code: marks
Reducer code: marks

(b)  Discuss how you would modify the program presented in 1(a) in order to compute, for each personality type, both the number of members and the average length. You should base your explanation on what information can be transferred in the flow of a Map/Reduce job.

[5 marks advanced]
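One possible approach, offered as a sketch rather than as the official model answer, is to widen the value carried through the job from a pair to a triple (members, tweet count, total length). It assumes each user appears in exactly one row, so every mapper call contributes a member count of 1. All function names below are hypothetical. Because the triples are summed element-wise, the same summation could also serve as a combiner.

```python
from collections import defaultdict


def map_user(user_id, mbti_type, messages):
    """Mapper: emit (type, (1, tweet_count, total_length)) for one user."""
    tweets = messages.split(";;;")
    return mbti_type, (1, len(tweets), sum(len(tweet) for tweet in tweets))


def sum_triples(values):
    """Element-wise sum of triples; associative, so usable as a combiner."""
    members = tweet_count = total_length = 0
    for m, n, length in values:
        members += m
        tweet_count += n
        total_length += length
    return members, tweet_count, total_length


def reduce_type(mbti_type, values):
    """Reducer: emit (type, (number of members, average tweet length))."""
    members, tweet_count, total_length = sum_triples(values)
    return mbti_type, (members, total_length / tweet_count)


def run_job(rows):
    """Simulate the shuffle: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for user_id, mbti_type, messages in rows:
        key, value = map_user(user_id, mbti_type, messages)
        grouped[key].append(value)
    return dict(reduce_type(key, values) for key, values in grouped.items())


# Illustrative rows only (not taken from the real dataset).
stats = run_job([
    ("u1", "INTJ", "hello;;;hi"),
    ("u2", "INTJ", "abc"),
    ("u3", "ENFP", "four;;;sixsix"),
])
```

The key point is that the reducer output value can itself be a composite, so both statistics come out of a single job without a second pass over the data.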