COMPSCI 753 Algorithms for Massive Data Assignment 2 / Semester 2, 2021
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
COMPSCI 753
Algorithms for Massive Data
Assignment 2 / Semester 2, 2021
Graph Mining
1 Assignment problem (100 pts)
This assignment aims at exploring the PageRank algorithm on big real-world network data. By working on this assignment, you will learn how to implement the PageRank algorithms that we have learned in the class. Download the dataset1 ’Google-web.txt’ from the assign- ment page on Canvas. Each line of the file represents a directed edge from one node to the
other. Nodes are represented by ID ranging from 0 to 875712.
This assignment has the following tasks.
1. [30 pts] Implement the power iteration in matrix form without considering the dead-ends and spider traps: Let stop criteria be ||r(t+1) _ r(t)||1 < 0.02. Note that || . ||1 denotes L1-norm. Calculate the rank score for all the nodes and report:
(a) The running time and the number of iterations needed to stop (10
pts).
(b) The IDs and scores of the top-10 ranked nodes (20 pts).
Hints: The matrix M could be too big to be stored in the memory. Please consider to use sparse matrix (e.g., use scipy.sparse in Python). If you implement the PageRank algorithm efficiently (consider the sparse matrix formulation discussed in page 30 of the slides), the PageRank algorithm should stop within 10 seconds even on a 7 year old machine with Intel i5 :). If your algorithm can’t stop within several minutes, you may want to check your implementation.
2. [35 pts] Extend your PageRank code to handle spider traps and dead-ends: Let β = 0.9 by default and the stop criteria be ||r(t+1) _ r(t)||1 < 0.02. Run your code on the Google web data and report:
(a) The running time and the number of iterations needed to stop (10
pts).
(b) The IDs and scores of the top-10 ranked nodes (20 pts).
(c) By varying the teleport probability β in [1, 0.9, 0.8, 0.7, 0.6], report the number of iterations needed to stop for each β (5 pts), and explain your findings (5 pts).
3. [35 pts] Implement the biased PageRank algorithm: Let stop criteria be ||r(t+1) _ r(t)||1 < 0.02 and β = 0.9. You are given a teleport set of node IDs: [768277, 296506, 77624, 596601, 234218, 222596, 1376, 783154]. Note that, you need to make sure there is no PageRank score leaked because of dead-ends.
(a) Explain how you address the dead-end problem in biased PageRank
(5 pts)
Run your biased PageRank algorithm and report:
(b) The running time and and the number of iterations needed to stop (10
pts).
(c) The IDs and scores of the top-10 ranked nodes (20 pts).
2022-09-21