Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

EECS 4415U Big Data Systems

Mining Frequent Itemsets

Description

Your project task it to conduct market-basket analysis by developing on your own and comparing experimentally various frequent itemset mining algorithms, including, Apriori, PCY, sampling and SON.  The goal is to find frequent pairs of items (since this is the most computationally involving step) and analyze the results.

Resources

- See Lecture 2 posted on eClass page

- Chapter 6 from MMS book posted on eClass (Association Rules Data Mining).

Programming Language

You can choose your favorite programming language (e.g., C++, Java, C#, Python and Go)

Tasks and Schedule

1. Setting up environment, implementation of Apriori algorithm for finding frequent pairs of elements (Week 1).

2. Finishing implementation of Apriori algorithm and conducting scalability experiments on given datasets (Week 2). Starting implementation of PCY algorithm.

3. Finishing implementation of PCY Algorithm for finding frequent pairs of elements. Compare experimentally performance of PCY vs Apriori (Week 3).

4. Extend your implementation to the Random Sampling and SON version of the algorithm (Week 4).  What is the efficiency of the method in comparison to algorithms developed in Points 1-3? Quantify the number of false positives.

5. Implement Multistage OR Multihash version of the PCY algorithm. Finish the implementation and write the final report (Week 5).

Datasets

a) The retail dataset contains the (anonymized) retail market basket data (~88200 baskets) from an anonymous Belgian retail store.

Note that since the dataset was anonymized the preprocessing step to map text labels into integers is done for you. Working with integers is more efficient than textual data as it saves the main memory.

Retails dataset is available to download at:

     Retail

Use Notepad++ or other software rather than Notepad to open the file for the correct formatting.

b) Netflix dataset

Conduct additionally experiments over the larger Netflix dataset to test the limitations of your implementations.

Netflix data set is available at:

     Netflix

Experiments

Provide the runtimes on the entire dataset for both Retail and Netflix for all algorithms listed in Weeks 1-5.

Use a support threshold of 1% and 2%.

Additionally, perform the scalability study on Netflix dataset for only Apriori algorithm for finding frequent pairs of items with a support threshold of 1% by dividing the data into the chunks: 20%, 40%, 60%, 80% and 100% and measuring the time performance. Provide the results in the diagram figure (with size of the chunk of data on X-axis and runtime on the Y-axis).

Note that if the running of the algorithm extends 5h, you can terminate your program, and report it.

Save the found frequent pairs of items. Quantify the number of false positives for the sampling methods.

Report

After Week 5 you will have to submit the entire report (for tasks in Weeks 1-5), which will be marked towards your project. Attach separately your code.

Award

The most efficient implementation will be determined and given an extra 1 bonus point.