
Lab 3: Association Rules, Cloud computing

In this lab you will complete two Parts:

1. Complete code to implement the naive and A-priori algorithms for finding association rules, and test the code “locally” in Google Colab.

2. Set up an account with Google Cloud Platform, deploy a Dataproc cluster, and test PySpark job submissions on the cluster. You will then gather empirical timing results for your A-priori algorithm on different cluster allocations.

Start with the student version of Lab 3 part 1 and make your own copy to edit:

https://colab.research.google.com/drive/1aKWQLXgKPmRpy8bCHc9kgrlQXJxvLiF4

During Part 2, please create a lab write-up document that you will submit to the Learn assignment as a PDF, in addition to your Colab notebook.

Note: this lab uses a script to create, run, and destroy the cluster. This saves your credit$. If you cancel the script or it crashes, you may need to check the GCP clusters page (https://console.cloud.google.com/dataproc/clusters) to make sure your cluster was destroyed. If it was not, destroy it manually.

Part 1 [50 points total]

Association Rules are frequently used for Market Basket Analysis (MBA) by retailers to understand the purchase behavior of their customers. This information can then be used for many different purposes, such as recommendations, cross-selling and up-selling of products, sales promotions, loyalty programs, store design, and discount plans, among others.

Evaluation of itemsets: Once you have found the frequent itemsets of a dataset, you need to choose a subset of them that is significant and interesting. Commonly used metrics for measuring the significance and interest of candidate rules for recommendations are:

● Confidence, conf(I -> j) = support(I ∪ j) / support(I)

● Interest, interest(I -> j) = conf(I -> j) - Pr[j]
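As a concrete illustration of these two formulas, here is a minimal sketch in plain Python, assuming you already know the raw counts involved; the function and argument names are ours, not part of the assignment:

def confidence(support_I_and_j, support_I):
    # conf(I -> j) = support(I union j) / support(I)
    return support_I_and_j / support_I

def interest(support_I_and_j, support_I, support_j, n_baskets):
    # interest(I -> j) = conf(I -> j) - Pr[j], where Pr[j] = support(j) / n_baskets
    return confidence(support_I_and_j, support_I) - support_j / n_baskets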

Application in product recommendations: The practice of selling additional products or services to existing customers is called cross-selling. Giving product recommendations is one example of cross-selling that is frequently used by online retailers. One simple method is to recommend products that are frequently browsed together by customers. Suppose we want to recommend new products to a customer based on the products they have already browsed online.

Question 1: Find products which are frequently browsed together [30 points total]

In the given browsing.txt, each line represents a web browsing session of a customer (a “basket”). On each line, each string of 8 characters represents the ID of an item browsed during that session. The items are separated by spaces. For example, the first line of browsing.txt:

FRO11987 ELE17451 ELE89019 SNA90258 GRO99222

represents a browsing session (a “basket”) with 5 item IDs.

1a) [10 points for correct top 5 most frequent pairs] Implement a naive Spark approach to finding frequent product pairs with support = 100 (i.e. product pairs need to occur together at least 100 times to be considered frequent):

1. Create an RDD from the lines in the file

2. Map each line into a list of all pairs of items in the basket (see the sketch after this list).
Hint: this is an N^2 operation and can be done by writing a function with a nested loop that processes a single basket and then mapping this function over the RDD.
Hint: you can ensure that a pair is counted regardless of whether it appears in a basket as …,A,…,B,… or …,B,…,A,… by outputting it as tuple(sorted((item1, item2))), which outputs (A,B) in both cases.

3. Reduce the pairs into pair counts and return them

4. Test by taking the top 5 most frequent pairs; you should get:

[(('DAI62779', 'ELE17451'), 1592), (('FRO40251', 'SNA80324'), 1412), (('DAI75645', 'FRO40251'), 1254), (('FRO40251', 'GRO85051'), 1213), (('DAI62779', 'GRO73461'), 1139)]
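For reference, here is a minimal sketch of the naive approach described above, assuming browsing.txt is available locally; all names (lines, basket_to_pairs, pair_counts) are illustrative and your own implementation may differ:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
lines = sc.textFile("browsing.txt")                    # step 1: RDD of raw session lines

def basket_to_pairs(line):
    # step 2: nested loop over one basket; sorting makes (A,B) and (B,A) identical
    items = line.split()
    pairs = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            pairs.append(tuple(sorted((items[i], items[j]))))
    return pairs

pair_counts = (lines.flatMap(basket_to_pairs)          # all pairs, one per (pair, basket)
                    .map(lambda p: (p, 1))
                    .reduceByKey(lambda a, b: a + b)   # step 3: pair -> count
                    .filter(lambda pc: pc[1] >= 100))  # keep pairs with support >= 100
top5 = pair_counts.takeOrdered(5, key=lambda pc: -pc[1])   # step 4: most frequent pairs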

1b) [20 points total] The naive approach can be slow and disk/memory intensive for large sets of browsing data. Improve this by implementing the two-step A-priori algorithm.

1. Create an RDD from the lines in the file

2. [5 points for correct top 5 most frequent items] Map each line into its items and then perform an item count on the entire RDD (this is step 1 of the A-priori algorithm). If you take the top 5 most frequent items you should get:

[('DAI62779', 6667), ('FRO40251', 3881), ('ELE17451', 3875), ('GRO73461', 3602), ('SNA80324', 3044)]

3. Filter the item counts and keep only those items with at least the given number of occurrences (the default support is 100). Collect these as a map and broadcast this map to every Spark worker (see the Spark documentation for broadcast).

4. We can now begin Step 2 of the A-priori algorithm. This follows the naive algorithm, except that when we map each line into a list of all pairs of items in the basket, we first check that both items in the pair occur in the broadcast map from Step 1. This significantly decreases the number of pairs output from this step (see the sketch after this list).

5. Reduce the pairs into pair counts

6. [10 points for correct top 5 most frequent pairs] Take the top 5 most frequent pairs and display them; you should still get:

[(('DAI62779', 'ELE17451'), 1592), (('FRO40251', 'SNA80324'), 1412), (('DAI75645', 'FRO40251'), 1254), (('FRO40251', 'GRO85051'), 1213), (('DAI62779', 'GRO73461'), 1139)]

7. [5 points for timing] The A-priori algorithm gains efficiency by generating only pairs in which each item meets the given support threshold, so a higher threshold should be faster. If we suspect the top 5 most frequent pairs each occur at least 1000 times, we can raise the threshold to 1000 and get the same result faster. In my reference implementation, running the timing code gives:

naive took 10.005960464477539 seconds, apriori(100) took 4.208028793334961 seconds, and apriori(1000) took 1.4184410572052002 seconds
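For reference, here is a minimal sketch of the broadcast-and-filter structure described in steps 2-4 above, reusing sc and the lines RDD from the naive sketch; all names are illustrative:

item_counts = (lines.flatMap(lambda line: line.split())        # step 2: item counts
                    .map(lambda item: (item, 1))
                    .reduceByKey(lambda a, b: a + b))

support = 100
frequent_items = dict(item_counts.filter(lambda ic: ic[1] >= support).collect())
frequent_bc = sc.broadcast(frequent_items)                      # step 3: ship to every worker

def basket_to_frequent_pairs(line):
    # step 4: drop infrequent items before forming any pairs
    items = [it for it in line.split() if it in frequent_bc.value]
    pairs = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            pairs.append(tuple(sorted((items[i], items[j]))))
    return pairs

pair_counts = (lines.flatMap(basket_to_frequent_pairs)          # step 5: pair counts
                    .map(lambda p: (p, 1))
                    .reduceByKey(lambda a, b: a + b)
                    .filter(lambda pc: pc[1] >= support))

For the timing in step 7, wrapping each variant in calls to time.time() before and after (as the reference output above suggests) is sufficient; the A-priori version should be faster because far fewer candidate pairs are generated and shuffled.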

Question 2: Find interesting pair rules [20 points total]

2a) [10 points for correct top 5 confidence score rules] Using the item pairs (X, Y) found in Question 1, compute the confidence scores for the corresponding association rules X->Y and Y->X. Here are the top 5 rules by confidence:

[(('DAI93865', 'FRO40251'), 1.0), (('GRO85051', 'FRO40251'), 0.9991762767710051), (('GRO38636', 'FRO40251'), 0.9906542056074765), (('ELE12951', 'FRO40251'), 0.9905660377358491), (('DAI88079', 'FRO40251'), 0.9867256637168142)]

2b) [10 points for correct top 5 interest score rules] Then compute the interest score for each of them and report the top 5. You will need to use maps, joins, and lambda expressions (one possible join pattern is sketched after these results). Here are the top 5 rules by interest:

[(('DAI43868', 'SNA82528'), 0.9538739086342056), (('DAI93865', 'FRO40251'), 0.8752130156586605), (('GRO85051', 'FRO40251'), 0.8743892924296656), (('GRO38636', 'FRO40251'), 0.865867221266137), (('ELE12951', 'FRO40251'), 0.8657790533945096)]
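For reference, here is one possible join pattern for 2a/2b (a sketch only; it reuses the pair_counts and item_counts RDDs from the earlier sketches and assumes n_baskets = lines.count(); your own solution may be structured differently):

n_baskets = lines.count()

# both rule directions for every frequent pair: (X, (Y, pair_count)) and (Y, (X, pair_count))
rules = pair_counts.flatMap(lambda pc: [(pc[0][0], (pc[0][1], pc[1])),
                                        (pc[0][1], (pc[0][0], pc[1]))])

# 2a: join on the left-hand item X, then conf(X -> Y) = pair_count / count(X)
conf = (rules.join(item_counts)                                 # (X, ((Y, pair_count), count_X))
             .map(lambda kv: ((kv[0], kv[1][0][0]), kv[1][0][1] / kv[1][1])))

# 2b: re-key by the right-hand item Y and join again, then interest = conf - count(Y) / n_baskets
interest = (conf.map(lambda kv: (kv[0][1], (kv[0][0], kv[1])))
                .join(item_counts)                              # (Y, ((X, conf), count_Y))
                .map(lambda kv: ((kv[1][0][0], kv[0]), kv[1][0][1] - kv[1][1] / n_baskets)))

top5_conf = conf.takeOrdered(5, key=lambda r: -r[1])
top5_interest = interest.takeOrdered(5, key=lambda r: -r[1])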

Part 2 [40 points total]

Step 1) Add Google Cloud Platform credits to your google account

You will need a Google account and a University of Canterbury email address to get started.
Click the following URL to request a Google Cloud Platform coupon. You will be asked to provide your school email address and name. An email will be sent to you to confirm these details before a coupon is issued. [Please do not share this link outside of this course, and please use only one coupon per student; we have a limited number of these for use in DATA301.]

Student Coupon Retrieval Link

Step 2) Set up a project [10 points]

1. Sign in to the Google Cloud Platform console at console.cloud.google.com and create a new project by clicking Select a project in the upper left corner (note: your screen may look different):

 

You can name it anything you want. I am using the following convention:


* Record the Project ID somewhere; you will need it later.

2. Next, you'll need to enable billing in the Cloud Console in order to use Google Cloud resources. You should be able to see the Billing Account for Education if you redeemed your credits earlier:

Now go back to the Cloud Console for the project you just created, click the drop-down Navigation Menu in the upper left, and choose Billing. Make sure your project is now billing to the Billing Account for Education:

3. Take a screenshot of your Billing page and add it to your write-up for Lab 3 to get credit for this step.

Step 3) Run apriori on a Google Cloud Storage file [10 points]

Start from this colab notebook and make your own copy:
https://colab.research.google.com/drive/1Gwcd0SwsCPTG31Nn3kPvACQ_OJkRmytD

1. Edit the cell that contains the apriori function and add your own working code. Note: the cell has a magic command %%writefile that makes Colab write the contents of the cell to a file instead of running the code (a minimal example is shown after this list). This is useful because we upload the code to Google Cloud later.

2. Edit the USERNAME to be your username.

USERNAME=#You need to set this to a string of your "username", i.e. USERNAME="jatlas"

NOTE: if you didn’t use the same format I did for creating your project then you will need to manually change the environment variables to match your project ID

3. Run the Google Cloud project setup code in the Colab notebook. It will provide a link to log in to Google and authorize your notebook to access your Google Cloud project. After that, it enables the required APIs on the project, creates a storage bucket to hold files on the cloud, and uploads the data file to the bucket.

4. Run the cluster create/execute/delete code. It creates a single-node cluster, executes pyspark_apriori.py, and deletes the cluster.
NOTE: it may take 5-10 minutes to start the cluster, run the code, and delete the cluster.

5. After running it, verify that you see the elapsed time in the output, that you can see the uploaded files in your Cloud Storage (https://console.cloud.google.com/storage/browser), and that the job completed successfully at https://console.cloud.google.com/dataproc/jobs.
Click on the job ID number to see the job detail view, which includes the output log.

6. Take a screenshot of the job details showing the job ID, the success status, and the output log, and add that to your write-up for Lab 3 to get credit for this step.
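As a reminder of how the cell magic mentioned in step 1 works, here is a minimal, hypothetical example (the actual cell in the provided notebook already contains the scaffolding; edit it rather than replacing it):

%%writefile pyspark_apriori.py
# Everything in this cell is saved to pyspark_apriori.py instead of being executed.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# ... your working apriori code from Part 1 goes here ...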

Step 4) Run several apriori frequent-pairs jobs with additional vCPU cores [20 points total]

In our original configuration we are using 1 master/worker node with 1 vCPU and 6 GB of memory. This is as close as we can get to a sequential timing for the code. This will be your data point for Tseq and P=1.

Based on my empirical tests, it should take approximately 18 seconds.

1. Modify the arguments for the gcloud dataproc clusters create command to test other configurations (or create a new cell; just be careful not to create multiple clusters at the same time). We want to test and record times for P=4, P=8, and P=16. To do this we will change to standard machine types, remove the --single-node flag, and add arguments for worker nodes. Here is an example for P=4:

!gcloud dataproc clusters create $CLUSTER --region=$REGION --bucket=$BUCKET --zone=$ZONE \
    --master-machine-type=n1-standard-2 --worker-machine-type=n1-standard-2 \
    --image-version=1.5 --max-age=30m --num-masters=1 --num-workers=2

This is P=4 because it creates 2 worker nodes with 2 vCPUs each. Re-run the apriori job, record the elapsed time, and take a screenshot of the output. Repeat this for P=8 and P=16 by scaling the number of vCPUs per worker (keep --num-workers=2; see the example commands after this list). [10 points: Add all of the screenshots to your write-up]

2. Compute speedup (Tseq / Tpar) for each setting of P, using Tseq = your P=1 data point. Discuss the trend that you see and what implications this has for how well our code scales vertically (i.e. as we add more processors).

[10 points: correct computation and discussion of trend including projection of what might happen for P>16]
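For reference, one way to reach P=8 and P=16 while keeping --num-workers=2 is to scale the worker machine type; the machine types below are an assumption (check what your zone offers and what your quota allows):

# P=8: 2 workers x 4 vCPUs each
!gcloud dataproc clusters create $CLUSTER --region=$REGION --bucket=$BUCKET --zone=$ZONE \
    --master-machine-type=n1-standard-2 --worker-machine-type=n1-standard-4 \
    --image-version=1.5 --max-age=30m --num-masters=1 --num-workers=2

# P=16: 2 workers x 8 vCPUs each
!gcloud dataproc clusters create $CLUSTER --region=$REGION --bucket=$BUCKET --zone=$ZONE \
    --master-machine-type=n1-standard-2 --worker-machine-type=n1-standard-8 \
    --image-version=1.5 --max-age=30m --num-masters=1 --num-workers=2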