CSE6232 Hadoop, Spark, Pig and Pandas

Task 1: Analyzing a Large Graph with Hadoop/Java

Your task is to write a MapReduce program in Java to calculate the maximum of the weights of all outgoing edges for each node in the graph.

You should have already loaded two graph files into HDFS. Each file stores a list of edges as tab-separated-values.

Each line represents a single edge consisting of three columns: (source node ID, target node ID, edge weight), each of which is separated by a tab (\t). Node IDs are positive integers, and weights are also positive integers. Edges are ordered randomly.

src	tgt	weight
15	127	2
15	134	3
15	599	3
511	330	51
511	694	79
230	15	11
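
Before writing the Java job, the map and reduce logic can be prototyped on the sample edges. The following is a minimal Python sketch of that logic, not the required Hadoop/Java solution. The names SAMPLE, map_edge, reduce_max and max_outgoing are hypothetical and chosen only for illustration.

```python
# Sample edges from the example above, one tab-separated edge per line.
SAMPLE = (
    "15\t127\t2\n"
    "15\t134\t3\n"
    "15\t599\t3\n"
    "511\t330\t51\n"
    "511\t694\t79\n"
    "230\t15\t11\n"
)


def map_edge(line):
    """Map step: emit (source node ID, edge weight) for one edge line."""
    src, _tgt, weight = line.split("\t")
    return int(src), int(weight)


def reduce_max(pairs):
    """Reduce step: keep the largest weight seen for each source node."""
    best = {}
    for src, weight in pairs:
        if src not in best or weight > best[src]:
            best[src] = weight
    return best


def max_outgoing(text):
    """Run the map step over every non-empty line, then reduce by source."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return reduce_max(map_edge(ln) for ln in lines)


result = max_outgoing(SAMPLE)
for node in sorted(result):
    print(f"{node}\t{result[node]}")
```

In this sketch, a node that only ever appears as a target (127, for example) produces no output line, because the map step keys on the source node only.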

Task 2: Analyzing a Large Graph with Spark/Scala

Your task is to cascade the edge weights in graph1.tsv and graph2.tsv to node weights, and finally determine the accumulated node weights using Spark, in Scala. Assume that 80% of the edge weight comes from the source node and 20% from the target node. When loading the edges, parse the edge weights using the toInt method and, before cascading, filter out (ignore) all edges whose edge weight equals 1. That is, only consider edges whose edge weights do not equal 1.

Consider the following example:

Input:

src	tgt	weight
1	2	40
2	3	100
1	3	60
3	4	1

Output:

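1	80.0	= 0.8*40 + 0.8*60
2	88.0	= 0.2*40 + 0.8*100
3	32.0	= 0.2*100 + 0.2*60

Node 4 does not appear in the output because its only edge (3, 4, 1) has weight 1 and is filtered out.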
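
The arithmetic in this example can be checked with a short script. The following is a minimal Python sketch of the filter-and-accumulate logic only, not the required Spark/Scala solution. The names EDGES and accumulate_node_weights, and the parameters src_share and tgt_share, are hypothetical.

```python
from collections import defaultdict

# Edges from the example above as (src, tgt, weight) tuples.
EDGES = [(1, 2, 40), (2, 3, 100), (1, 3, 60), (3, 4, 1)]


def accumulate_node_weights(edges, src_share=0.8, tgt_share=0.2):
    """Split each edge weight between its endpoints and sum per node."""
    totals = defaultdict(float)
    for src, tgt, weight in edges:
        if weight == 1:
            # Edges whose weight equals 1 are ignored before cascading.
            continue
        totals[src] += src_share * weight  # 80% goes to the source node
        totals[tgt] += tgt_share * weight  # 20% goes to the target node
    return dict(totals)


weights = accumulate_node_weights(EDGES)
for node in sorted(weights):
    print(f"{node}\t{weights[node]}")
```

In Spark the same idea would typically be expressed by emitting two (node, contribution) pairs per edge and summing by key; the loop above is only a local stand-in for that step.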

Task 3: Analyzing a Large Amount of Data with Pig on AWS

For each unique bigram, compute its average number of appearances per book.

For the above example, the results will be:

I am (342 + 211) / (90 + 10) = 5.53

very cool (500 + 3210 + 9994) / (10 + 1000 + 3020) = 3.40049628

Output the 10 bigrams having the highest average number of appearances per book along with their corresponding averages, in tab-separated format, sorted in descending order. If multiple bigrams have the same average, order them alphabetically. For the example above, the output will be:

I am 5.53

very cool 3.40049628
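
The grouping, averaging and ordering rules can be tried out locally before writing the Pig script. The following is a minimal Python sketch, not the required Pig solution. The rows in ROWS are reconstructed from the two calculations shown above, and pairing the i-th occurrence count with the i-th book count is an assumption. The names ROWS and top_bigrams are hypothetical.

```python
from collections import defaultdict

# (bigram, occurrences, books) rows rebuilt from the sums shown above.
# Assumption: occurrence and book counts pair up in the order listed.
ROWS = [
    ("I am", 342, 90),
    ("I am", 211, 10),
    ("very cool", 500, 10),
    ("very cool", 3210, 1000),
    ("very cool", 9994, 3020),
]


def top_bigrams(rows, limit=10):
    """Return up to `limit` (bigram, average) pairs, highest average first."""
    occurrences = defaultdict(int)
    books = defaultdict(int)
    for bigram, occ, book_count in rows:
        occurrences[bigram] += occ
        books[bigram] += book_count
    averages = [(b, occurrences[b] / books[b]) for b in occurrences]
    # Descending by average; ties are broken alphabetically by bigram.
    averages.sort(key=lambda item: (-item[1], item[0]))
    return averages[:limit]


for bigram, average in top_bigrams(ROWS):
    print(f"{bigram}\t{average}")
```

Note that the average is the total number of occurrences divided by the total number of books, not the mean of the per-row ratios.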

You will solve this problem by writing a Pig script on Amazon EC2 and saving the output.

Task 4: Explore and Analyze data with Pandas

Download task4data.zip. The archive includes a readme.txt file that describes how the data is stored. Using pandas, load the data into DataFrames and find the number of unique movies and the number of unique users in the dataset.

Output format:

Number_of_unique_movies

Number_of_unique_users
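
Counting distinct values in pandas is usually done with nunique. The following is a minimal sketch on a small in-memory table, not the actual dataset. The column names user_id, movie_id and rating, the tab separator, and the sample rows are all hypothetical; the real layout is whatever readme.txt describes.

```python
import io

import pandas as pd

# Hypothetical stand-in for a ratings file; the real columns and
# separator are described in readme.txt and may differ.
SAMPLE = (
    "user_id\tmovie_id\trating\n"
    "1\t10\t4\n"
    "1\t11\t5\n"
    "2\t10\t3\n"
    "3\t12\t2\n"
    "4\t10\t5\n"
)


def count_unique(ratings):
    """Return (number of unique movies, number of unique users)."""
    return ratings["movie_id"].nunique(), ratings["user_id"].nunique()


ratings = pd.read_csv(io.StringIO(SAMPLE), sep="\t")
n_movies, n_users = count_unique(ratings)
print(n_movies)  # Number_of_unique_movies
print(n_users)   # Number_of_unique_users
```

If movies and users are stored in separate files, the same nunique call would apply to the ID column of each DataFrame.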