Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Assignment 1

MET CS 777 - Big Data Analytics

Spark Data Wrangling (20 points)

1 Description

The goal of this assignment is to implement a set of Spark programs in python (using Apache Spark). Specifically, your Spark jobs will analyze a data set consisting ofNew York City Taxi trip reports in 2013. The dataset was released under the FOIL (The Freedom of Information Law) and made public by Chris Whong (https://chriswhong.com/open-data/foil_nyc_taxi/).

2 Taxi Data Set

The data set itself is a simple text file. Each taxi trip report is a different line in the file. Each trip report includes the starting point, the drop-off point, corresponding timestamps, and information related to the payment. The data are reported by the time that the trip ended, i.e., upon arriving in the order of the drop-off timestamps. The attributes present on each line of the file are, in order:

      Attribute      Description

0    medallion    an md5sum of the identifier of the taxi - vehicle bound (Taxi ID)

1   hack license   an md5sum of the identifier for the taxi license (Driver ID)

2   pickup datetime   time when the passenger(s) were picked up

3   dropoff datetime   time when the passenger(s) were dropped off

4   trip time in secs   duration of the trip

5   trip distance   trip distance in miles

6   pickup longitude   longitude coordinate of the pickup location

7   pickup latitude   latitude coordinate of the pickup location

8   dropoff longitude   longitude coordinate of the drop-off location

9   dropoff latitude   latitude coordinate of the drop-off location

10   payment type   the payment method -credit card or cash

11   fare amount   fare amount in dollars

12   surcharge   surcharge in dollars

13   mta tax   tax in dollars

14   tip amount   tip in dollars

15   tolls amount   bridge and tunnel tolls in dollars

16   total amount   total paid amount in dollars

Table 1: Taxi Data Set fields

The data files are in comma-separated values (CSV) format. Example lines from the file

are:

07290D3599E7A0D62097A346EFCC1FB5,E7750A37CAB07D0DFF0AF7E3573AC141,

2013-01-01,00:00:00,2013-01-01 00:02:00,120,0.44,-73.956528,40.716976,-73.962440,

40.715008,CSH,3.50,0.50,0.50,0.00,0.00,4.50

22D70BF00EEB0ADC83BA8177BB861991,3FF2709163DE7036FCAA4E5A3324E4BF,

2013-01-01,00:02:00,2013-01-01 00:02:00,0,0.00,0.00000,0.00000,0.0000,0.000000,

CSH,27.00,0.00,0.50,0.00,0.00,27.50

0EC22AAF491A8BD91F279350C2B010FD,778C92B26AE78A9EBDF96B49C67E4007,

2013-01-01,00:01:00,2013-01-01 00:03:00,120,0.71,-73.973145,40.752827,-73.965897

73.965897,40.760445,CSH,4.00,0.50,0.50,0.00,0.00,5.00

You can use the following notebook as a helper code to clean up the data, get the required field.

Assignment 1-Helper Code.ipynb - Colaboratory (google.com) (https://colab.research.google.com/drive/1Uaa1MzSNqgCpAXzhezTim_GOUcmrpsA0) You can also pre-process the data and store it in your cluster storage.

NOTE:

To submit the task to Cloud please use the provided code_template.py.

In your submission, only the code based on the code_template needs to be included.

3 Obtaining the Dataset

There are two versions of the dataset, the Small dataset. (93 MB compressed, uncompressed 384 MB) for implementation and testing purposes (roughly 2 million taxi trips) and a Large dataset (8GB compressed)

To download and used the dataset on your computers use the following HTTPS links:

●   Small dataset: https://storage.googleapis.com/met-cs-777-data/taxi-data-sorted-small.csv.bz2

●   Large dataset: https://storage.googleapis.com/met-cs-777-data/taxi-data-sorted-large.csv.bz2

When running your code on the cloud, you can have direct access to the files using the following internal links:

Small Data Set: gs://met-cs-777-data/taxi-data-sorted-small.csv.bz2

Large Data Set: gs://met-cs-777-data/taxi-data-sorted-large.csv.bz2

4 Assignment Tasks

4.1 Task 1: Top-10 Active Taxis (5 points)

Many different taxis have had multiple drivers. Write and execute a Spark Python program that computes the top ten taxis that have had the largest number of drivers. Your output should be a set of (medallion, number of drivers) pairs.

Note: You should consider that this is a real-world data set that might include wrongly formatted data lines. You should clean up the data before the main processing, a line might not include all of the fields. If a data line is not correctly formatted, you should drop that line and do not consider it.

4.2 Task 2 - Top-10 Best Drivers (15 Points)

We would like to figure out who the top 10 best drivers are in terms of their average earned money per minute spent carrying a customer. The total amount field is the total money earned on a trip. In the end, we are interested in computing a set of (driver, money per minute) pairs.

4.3 Task 3 - The best time of the day to Work on Taxi (For Advanced Students - no points)

We would like to know which hour of the day is the best time for drivers that has the highest profit per mile. Consider the surcharge amount in dollars for each taxi ride (without tip amount) and the distance in miles, and  sum up the rides for each hour of the day (24 hours) – consider the pickup time for your calculation. The profit ratio is the ration surcharge in dollars divided by the travel distance in miles for each specific time of the day.

Profit Ratio = (Surcharge Amount in US Dollar) / (Travel Distance in miles)

We are interested to know the time of the day that has the highest profit ratio.

4.4 Task 4 - (For Advanced Students - no points)

Here are some further tasks for advanced groups.

●   How many percent of taxi customers pay with cash, and how many percent use electronic cards? Analyze these payment methods for different times of the day and provide a list of percentages for each day time? As a result, provide two numbers for total percentages and a list like (hour of the day, percent paid card)

●   We would like to measure the efficiency of taxi drivers by finding out their average earned money per mile. (Consider the total amount which includes tips, as their earned money) Implement a Spark job that can find out the top- 10 efficient taxi divers.

What are the mean, median, first, and third quantiles of tip amount? How do you find the median?

Using the IQR outlier detection method, find out the top- 10 outliers.

5 Important Considerations

5.1 Machines to Use

One thing to be aware of is that you can choose virtually any configuration for your Cloud Cluster - you can choose different numbers of machines and different configurations of those machines. And each is going to cost you differently! Since this is real money, it makes sense to develop your code and run your jobs locally, on your laptop, using the small data set. Once things are working, you’ll then move to Cloud.

As a proposal for this assignment, you can use the e2-standard-4 machines on the Google Cloud, one for the Master node and two for worker nodes. You will have three machines with a total of 12 vCPU and 48GB RAM. 100 GB of disk space will be enough.

Remember to delete your cluster after the calculation is finished!!!

More       information       regarding       Google       Cloud       Pricing       can       be       found      here https://cloud.google.com/products/calculator. As you can see average server costs around 50 cents per hour. That is not much, but IT WILL ADD UP QUICKLY IF YOU FORGET TO SHUT OFF YOUR MACHINES. Be very careful, and stop your machine as soon as you are done working. You can always come back and start your machine or create a new one easily when you begin your work again. Another thing to be aware of is that Google and Amazon charge you when you move data around. To avoid such charges,  do everything in the Iowa (us-cental1)” region. That’s where data is, and that’s where you should put your data and machines.

You should document your code very well and as much as possible.

Your code should be compilable on a Unix-based operating system like Linux or macOS.

5.2 Academic Misconduct Regarding Programming

In a programming class like ours, there is sometimes a very fine line between cheating” and acceptable and beneficial interaction between peers. Thus, it is essential that you fully understand what is and what is not  allowed in collaboration with your classmates. We want to be  100% precise,  so there can be no confusion.

The  rule  on  collaboration  and  communication  with  your  classmates  is  very  simple:  you  cannot transmit or receive code from or to anyone in the class in any way—visually (by showing someone your code), electronically (by emailing, posting, or otherwise sending someone your code), verbally (by reading code to someone) or in any other way we have not yet imagined. Any other collaboration is acceptable.

The rule on collaboration and communication with people who are not your classmates (or your TAs or instructor) is also very  simple: it is not allowed in any way, period. This disallows  (for example) posting any questions of any nature to programming forums such as StackOverflow. As far as going to the web and using Google, we will apply the two-line rule . Go to any web page you like and do any search that you like. But you cannot take more than two lines of code from an external resource and actually  include  it  in  your  assignment  in any form. Note that changing variable names or otherwise transforming or obfuscating code you found on the web does not render the two-line rule” inapplicable. It is still a violation to obtain more than two lines of code from an external resource and turn it in, whatever you do to those two lines after you first obtain them.

Furthermore, you should cite your sources. Add a comment to your code that includes the URL(s) that you consulted when constructing your solution. This turns out to be very helpful when you’re looking at something you wrote a while ago and you need to remind yourself what you were thinking.

5.3 Turnin

Create a single document that has results for all three tasks.

To demonstrate that you did execute your code on the cloud, it is important to include URLs in the screenshots. Otherwise, there is no way for us to verify if the code was executed in your cloud account.

Also, for each task, for each Spark job you ran, include a screenshot of the Spark History.

Figure 1: Screenshot of Spark History

Please zip up all of your code and your document (use .zip only, please!), or attach each piece of code and your document to your submission individually.

Please have the latest version of your code on GitHub. Zip the GitHub files and submit your latest version of assignment work to Blackboard. We will consider the latest version on the Blackboard, but it should exactly match your code on the GitHub