Project - Analyzing IMDB Datasets

For this project, you will provision a Spark cluster on AWS EMR to load and analyze IMDB’s datasets from Kaggle. You will run your analysis in a Jupyter Notebook, and the expected output artifact is a Project_Analysis.ipynb file.

Requirements

This project is very simple: you are to provision a Spark cluster on AWS EMR, connect it to a Jupyter Notebook, and then run a series of queries (in Python, with the DataFrame API or Spark SQL) that answer a few simple questions about the available IMDB data.

In doing so, you are demonstrating your ability to configure and provision infrastructure using the AWS Elastic MapReduce ecosystem. You are also demonstrating your understanding of how to use transformations and actions (in Spark terminology) with PySpark to perform basic data analysis on information sources that are too large to manage in memory.

Artifacts

You are to submit a zip file with your project work inside. Expected zip file structure:

project
+-- Project_Analysis.ipynb
+-- Project_Analysis.pdf
+-- README.md

Note: I’m OK with the README being submitted as a PDF as well.

Notebook File

This is the .ipynb file that contains your analysis and the outputs of the code you wrote to arrive at your results. It is very important because it is the only way to validate that you actually ran an EMR cluster successfully.

README

The README, in Markdown or PDF, should contain a brief blurb describing the project and the technology used to conduct your analysis. It ought to be brief and informational, in case folks in the future want to recreate your results.

PS: if you want to “test” your README, you can download a Markdown viewer to preview how it renders.

S3 Bucket

You must read the IMDB data from my publicly available S3 bucket. Your Project_Analysis.ipynb file must demonstrate that the data is being read from S3. This is largely as simple as loading your DataFrame like so:

name = spark.read.csv('s3://cis9760-lecture9-movieanalysis/name.basics.tsv', sep=r'\t', header=True)

Note: The path to each dataset is given in the Project_Analysis.ipynb file.
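The loading step can be generalized to several files. The following is only a sketch under stated assumptions: only name.basics.tsv appears in this handout, so the other three file names are guesses based on IMDb's usual naming and should be checked against the paths in Project_Analysis.ipynb. The helper names (dataset_path, load_all) are hypothetical, and `spark` is the session that the notebook kernel provides.

```python
BUCKET = "s3://cis9760-lecture9-movieanalysis"

# Only name.basics.tsv is confirmed by the handout. The other three file
# names are ASSUMPTIONS; use the paths given in Project_Analysis.ipynb.
DATASETS = {
    "name": "name.basics.tsv",
    "title": "title.basics.tsv",
    "principals": "title.principals.tsv",
    "ratings": "title.ratings.tsv",
}


def dataset_path(key):
    """Build the full S3 URI for one dataset key."""
    return f"{BUCKET}/{DATASETS[key]}"


def load_all(spark):
    """Read every dataset as a tab-separated file with a header row."""
    return {
        key: spark.read.csv(dataset_path(key), sep=r"\t", header=True)
        for key in DATASETS
    }
```

In a notebook cell, `frames = load_all(spark)` would then give `frames["name"]`, `frames["title"]`, and so on, each read directly from S3.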

Submission

Please submit your zip folder to Blackboard before the deadline.

Assignment

The actual analysis is broken into four parts: three of which are guided and one that is freeform. I have published a Project_Analysis.ipynb demonstrating this project. Note that the output of the code is provided to give you structure as you write your analysis.

Part I: Installation and Initial Setup

In this portion, you will import the necessary dependencies (pandas and matplotlib) and load your dataset as a PySpark DataFrame.

Part II: Analyzing Genres
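As a rough sketch of this setup step: on EMR Notebooks, notebook-scoped libraries can be installed through the SparkContext with `sc.install_pypi_package`, and the size of a DataFrame can be reported as a (rows, columns) pair. The helper names below are hypothetical, and you may need to pin package versions that match your cluster's Python.

```python
def install_packages(sc, packages=("pandas", "matplotlib")):
    """Install notebook-scoped packages on an EMR Notebook.

    install_pypi_package is added to the SparkContext by EMR Notebooks;
    it is not part of a plain local PySpark installation.
    """
    for package in packages:
        sc.install_pypi_package(package)


def shape(df):
    """Return (row count, column count) for a Spark DataFrame.

    count() is an action and triggers a full pass over the data,
    while df.columns is metadata and costs nothing to read.
    """
    return df.count(), len(df.columns)
```

Calling `shape(name)` on the DataFrame loaded from S3 gives the overview figures that the rubric asks for.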

For this part, you will take a stab at denormalizing the genres that are associated with each title (there may be more than one, presented as a string of comma-separated identifiers) and then run some basic analysis on the result.
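One possible way to denormalize the genres is to split the comma-separated string and explode it into one row per genre. This sketch assumes the standard IMDb column names `tconst` and `genres` and that missing values appear as the literal string `\N`; verify both against your loaded data. The function names are hypothetical.

```python
def explode_genres(title_df):
    """Return one (tconst, genre) row per genre attached to a title."""
    exploded = title_df.selectExpr(
        "tconst",
        "explode(split(genres, ',')) AS genre",
    )
    # Assumption: IMDb marks missing values with the literal string \N.
    return exploded.where(exploded["genre"] != r"\N")


def count_unique_genres(association_df):
    """Count the distinct genres in the exploded association table."""
    return association_df.select("genre").distinct().count()
```

The resulting association table can then be joined to the ratings data on `tconst` to compute an average rating per genre.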

Part III: Analyzing Job Categories

For this next part, you will attempt to get the top job categories in the dataset.
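A minimal sketch of one way to rank job categories, assuming the principals dataset has a `category` column (as in the standard IMDb schema). The function name and the default of ten rows are illustrative choices, not requirements from the handout.

```python
def top_job_categories(principals_df, n=10):
    """Return the n most frequent job categories with their row counts."""
    return (
        principals_df.groupBy("category")
        .count()                             # adds a column named "count"
        .orderBy("count", ascending=False)   # most frequent first
        .limit(n)
    )
```

Because the result has at most `n` rows, it is small enough to convert with `.toPandas()` for plotting a bar chart with matplotlib.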

Part IV: Answering Questions

For this final part, you will answer the following questions:

· What are the movies in which both Johnny Depp and Helena Bonham Carter have acted together?

· What are the movies in which Brad Pitt has acted since 2010?

· How many movies has Zendaya acted in each year?

· Which movies, released in 2019, have an average rating exceeding 9.7?

· Among the titles in which Clint Eastwood and Harrison Ford have acted, who has the higher average rating?

· What is the movie(s) with the highest average rating among those in which Chris Evans has acted?

· What is the percentage of adult titles in which actors and actresses have acted?

· What are the top 10 movie genres with the shortest average runtime?

· What are the most common character names for actors and actresses in Romance movies?
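Most of these questions follow the same pattern: find a person in the name data, find the titles they acted in, then filter or aggregate. As an illustration for the first question only, here is a hedged sketch. It assumes the standard IMDb columns (`nconst`, `primaryName`, `category`, `tconst`, `titleType`, `primaryTitle`) and that a name identifies one person, which is not guaranteed since several people can share a name. All function names are hypothetical.

```python
def titles_for_actor(name_df, principals_df, actor):
    """Return the distinct tconst values in which `actor` has an acting credit."""
    person = name_df.where(name_df["primaryName"] == actor).select("nconst")
    acting = principals_df.where(
        principals_df["category"].isin("actor", "actress")
    )
    return acting.join(person, on="nconst").select("tconst").distinct()


def movies_with_both(name_df, principals_df, title_df, actor_a, actor_b):
    """Return the titles of movies in which both people acted."""
    shared = titles_for_actor(name_df, principals_df, actor_a).intersect(
        titles_for_actor(name_df, principals_df, actor_b)
    )
    # Assumption: titleType == "movie" is what the question means by "movies".
    movies = title_df.where(title_df["titleType"] == "movie")
    return movies.join(shared, on="tconst").select("primaryTitle")
```

For example, `movies_with_both(name, principals, title, "Johnny Depp", "Helena Bonham Carter").show()` would print the matching titles. Everything before `show()` is a lazy transformation, so Spark only runs the joins when that action is called.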

This project is due Dec 13th, MIDNIGHT.

RUBRIC

Part 0 – Submission: 4 points

· The parent folder is named “project” and is exposed when unzipped (1 point)

· README exists under the project folder and describes the project (2 points)

· Project_Analysis.ipynb and Project_Analysis.pdf exist under the project folder (1 point)

Part I – Installation and Initial Setup: 7 points

· Any necessary packages are loaded into the environment (pandas, matplotlib, etc.) (1 point)

· All four datasets are loaded from the S3 bucket and saved as Spark DataFrames (1 point)

· Overview: the number of rows and columns in each Spark DataFrame is displayed (2 points)

· The remaining cells are filled properly (3 points)

Part II – Analyzing Genres: 15 points

· Association table (4 points)

· Total unique genres (4 points)

· Average rating per genre (4 points)

· A horizontal bar chart of top genres (3 points)

Part III – Analyzing Job Categories: 10 points

· Total unique job categories (4 points)

· Top job categories (2 points)

· A bar chart of top job categories (4 points)

Part IV – Answers to the following questions: 54 points

· What are the movies in which both Johnny Depp and Helena Bonham Carter have acted together? (6 points)

· What are the movies in which Brad Pitt has acted since 2010? (6 points)

· How many movies has Zendaya acted in each year? (6 points)

· Which movies, released in 2019, have an average rating exceeding 9.7? (6 points)

· Among the titles in which Clint Eastwood and Harrison Ford have acted, who has the higher average rating? (6 points)

· What is the movie(s) with the highest average rating among those in which Chris Evans has acted? (6 points)

· What is the percentage of adult titles in which actors and actresses have acted? (6 points)

· What are the top 10 movie genres with the shortest average runtime? (6 points)

· What are the most common character names for actors and actresses in Romance movies? (6 points)