CSE3BDC Big Data Tools Task1

Objectives

1. Gain in-depth, hands-on experience with big data tools (MapReduce, Hive and Spark).

2. Solve challenging big data processing tasks by finding highly efficient solutions.

3. Experience processing three different types of real data:

a. Standard multi-attribute data (Bank data)

b. Time series data (Twitter feed data)

c. Bag-of-words data

4. Practice reading API documentation to find the best calls for your problem. The API descriptions for MapReduce, Hive and Spark are listed here. For Spark, look especially under RDD, which has a lot of really useful API calls.

[MapReduce] https://hadoop.apache.org/docs/stable/api/

[Hive] https://cwiki.apache.org/confluence/display/Hive/LanguageManual

[Spark] http://spark.apache.org/docs/latest/api/scala/index.html#package

If you are not sure what a Spark API call does, write a small example and try it in the Spark shell.

Expected quality of solutions

a) In general, more efficient code (fewer reads from and writes to HDFS, and fewer data shuffles) will be rewarded with more marks.

b) All MapReduce code you submit must be able to be compiled using the command

javac -classpath `hadoop classpath` <code_files>

on the Cloudera VM you received from us without requiring the installation of additional components.

c) All MapReduce code you submit should be runnable using

hadoop jar <jar_uri> <hdfs_input_file> <hdfs_output_directory>

For task 2C, you must also allow the user to specify two additional parameters: the x and y months, respectively.

d) Using multiple MapReduce phases may be appropriate for some of the subtasks. However, if you use multiple phases to solve a task, maintain a meaningful and logically consistent naming scheme for your files (e.g. Phase1.java, Phase2.java, …).

e) For Hive and Spark code submissions, ensure that all commands needed to accomplish the sub-task (i.e. ‘create table’ (Hive), loading data AND queries!) are in the same file.

f) Scalability of the code is very important, especially in terms of the memory requirements of the mappers and reducers. For example, a mapper that outputs the same key for every input (say, it takes any string as input and always outputs the key abc) sends all the data to a single reducer, no matter how many reducers you set. You effectively end up writing a sequential program. This is completely unacceptable and will result in zero marks for that subtask.

g) This entire assignment can be done using the Cloudera virtual machines supplied in the labs and the supplied data sets without running out of memory. Note that task 3 is especially hard to do without running out of memory, but it is possible, since we have done it. So it is time to show your skills!

h) Using combiners or local aggregation (inside the mapper) for MapReduce tasks where appropriate will be rewarded with marks. We will look at the total amount of data shuffled and award higher marks to solutions that shuffle less data.

i) Wherever appropriate, use the fact that the data is sorted by intermediate key to reduce the work of the mapper and/or reducer.

j) I am not too fussed about the layout of the output. As long as it looks similar to the example outputs for each task, that will be good enough. The idea is to spend your time solving the problems, not massaging the output into the right format.

k) For Hive queries, we prefer answers that use fewer tables.
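
To see why point f) matters, here is a minimal plain-Java sketch of how intermediate keys get assigned to reducers. It uses no Hadoop classes, so it runs anywhere. It assumes hash partitioning with the same formula as Hadoop's default HashPartitioner, and it uses String.hashCode() as a stand-in for the hash of a real Hadoop key type. The class and method names (PartitionSkewDemo, partitionFor, recordsPerReducer) are made up for illustration and are not part of any task.

```java
import java.util.Arrays;

// Illustration only: simulates which reducer each intermediate key would go to.
class PartitionSkewDemo {

    // Same formula as Hadoop's default HashPartitioner:
    // (hashCode & Integer.MAX_VALUE) % numReduceTasks.
    // Assumption: String.hashCode() stands in for the real key's hashCode().
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Counts how many records each reducer would receive for the given keys.
    static int[] recordsPerReducer(String[] keys, int numReducers) {
        int[] counts = new int[numReducers];
        for (String key : keys) {
            counts[partitionFor(key, numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int numReducers = 4;

        // A mapper that always outputs the key "abc".
        String[] constantKeys = new String[1000];
        Arrays.fill(constantKeys, "abc");

        // A mapper that outputs many different keys.
        String[] distinctKeys = new String[1000];
        for (int i = 0; i < distinctKeys.length; i++) {
            distinctKeys[i] = "key" + i;
        }

        System.out.println("constant key:  "
                + Arrays.toString(recordsPerReducer(constantKeys, numReducers)));
        System.out.println("distinct keys: "
                + Arrays.toString(recordsPerReducer(distinctKeys, numReducers)));
    }
}
```

With the constant key, every record lands in the same slot of the array, which is the single-reducer situation described in point f). Adding more reducers does not help, because the partition is decided by the key alone.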

Do the entire assignment using the Cloudera VM. Do not use AWS.
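
As an illustration of the local aggregation mentioned in point h), here is a minimal plain-Java sketch that compares how many key/value pairs a word-count style mapper would hand to the shuffle with and without aggregating inside the mapper. It deliberately avoids Hadoop classes so it runs anywhere. In a real Mapper, the buffer would typically be a field that is updated in map() and written out in cleanup(). Word counting, the class name InMapperCombiningDemo, and the maxEntries flush threshold are assumptions chosen for the example, not part of any task.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: simulates the pairs a mapper would emit to the shuffle.
class InMapperCombiningDemo {

    // Naive mapper: emits one (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> mapNaive(List<String> lines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<Map.Entry<String, Integer>>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
                }
            }
        }
        return out;
    }

    // Mapper with local aggregation: sums counts in an in-memory buffer and
    // emits (word, partialCount) pairs. The buffer is flushed once it holds
    // maxEntries distinct words, so its memory use stays bounded.
    static List<Map.Entry<String, Integer>> mapWithLocalAggregation(
            List<String> lines, int maxEntries) {
        Map<String, Integer> buffer = new HashMap<String, Integer>();
        List<Map.Entry<String, Integer>> out = new ArrayList<Map.Entry<String, Integer>>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                Integer old = buffer.get(word);
                buffer.put(word, old == null ? 1 : old + 1);
                if (buffer.size() >= maxEntries) {
                    flush(buffer, out);
                }
            }
        }
        flush(buffer, out); // plays the role of cleanup() in a real Mapper
        return out;
    }

    // Emits everything in the buffer, then empties it.
    private static void flush(Map<String, Integer> buffer,
                              List<Map.Entry<String, Integer>> out) {
        for (Map.Entry<String, Integer> e : buffer.entrySet()) {
            out.add(new AbstractMap.SimpleEntry<String, Integer>(e.getKey(), e.getValue()));
        }
        buffer.clear();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b a", "b a c");
        System.out.println("pairs without aggregation: " + mapNaive(lines).size());
        System.out.println("pairs with aggregation:    "
                + mapWithLocalAggregation(lines, 1000).size());
    }
}
```

For the two sample lines, the naive version emits 6 pairs and the aggregating version emits 3, while the counts still add up to the same totals. The flush threshold is there because of point f): an unbounded buffer inside the mapper saves shuffle traffic but can run the mapper out of memory on large inputs.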

Tips:

1. Look at the data files before you begin each task. Try to understand what you are dealing with! You may find the shell commands “cat” and “head” helpful.

2. For each subtask, we give a very small example input and the corresponding output in the assignment specifications below. Create input files that contain the same data as the example input, then check that your solution generates the same output.

3. In addition to testing the correctness of your code with the very small example input, use the large input files that we provide to test the scalability of your solutions.