
COMP9024 22T2

Assignment

Data Structures and Algorithms

Change Log

We may make minor changes to the spec to address/clarify some outstanding issues. These may require minimal changes in your design/code, if at all. Students are strongly encouraged to check the online forum discussion and the change log regularly.

Version 0.1 (2022-07-11 15:00): Initial release.

Version 0.2 (2022-07-14 13:00): Assignment Files: added micro-web link/image. Subset 0: explicitly noted that string.h may be used and memory for list/graph structures should be allocated dynamically. Assessment: added submission and due date details.

Version 0.3 (2022-07-14 17:00): Subset 3: in crawler.c code stub replaced fgets and strlen lines with call to scanf.

Version 0.4 (2022-07-14 18:15): Subsets 1 and 2: added sample output for micro-web, nothing to ignore.

Version 0.5 (2022-07-21 02:15): Subsets 1 and 2: added testing.

Version 0.6 (2022-07-21 04:15): Subset 3: added sample output for micro-web, nothing to ignore.

Version 0.7 (2022-07-21 13:50): Subset 3: added testing.

Version 1.0 (2022-07-21 22:00): Subsets 1, 2, 3: added sample output and dryrun test cases with ignore lists.

Assignment Files

Download this zip file.

Unzipping the file will create a directory called 'files' with all the assignment start-up files.

You can also make note of the following URLs:

http://www.cse.unsw.edu.au/~cs9024/micro-web

http://www.cse.unsw.edu.au/~cs9024/mini-web

Here is a visual representation of the micro-web:

Once you read the spec, hopefully it will be clear to you what you could use these URLs for! You may also find it useful to construct a similar graph visualisation for the mini-web.

Background

As we have mentioned in lectures, the Internet can be thought of as a graph (a very large graph). Web pages represent vertices and hyperlinks represent directed edges.

With over 1.2 billion unique websites (as of June 2021), each website having multiple webpages, and each webpage having multiple hyperlinks, it is understandably very difficult to remember the URL of every website you want to visit.

In order to make life easier, from the very early days of the internet, there have been search engines that can be used to find websites.

But the job of a search engine is very difficult: first, it must index (create a representation of) the entire internet, or as close to it as possible; next, it must rank the webpages it finds.

In this assignment we will be implementing algorithms to solve each of these problems.

1.  To index the internet we will create a web crawler.

2.  To rank webpages we will implement the pagerank algorithm.

3.  To find the shortest path between two pages we will implement Dijkstra's algorithm.

Subset 0 - Dependencies

Before we can start crawling we need to be able to store our crawled data. As the internet is a Graph, this means we need a Graph ADT. We will also need a Set ADT and one of a Queue ADT or a Stack ADT in order to perform web scraping (for a BFS or DFS).

Subset 0a - Implement List (Queue, Stack, Set) ADT

Task

You have been provided with a file list.h. Examine the given list.h file carefully. It provides the interface to an ADT that will provide Queue, Stack, and Set functionality.

You should create the file list.c to implement this ADT.

Your list ADT should store string (char *) data.

You may use the string.h library and other standard libraries from the weekly exercises.

You are required to dynamically allocate memory in your ADT for full style marks.

You can use whatever internal representation you like for your list ADT.

You can reuse the code previously submitted for weekly assessments in your list ADT.

As a reminder:

Queue - First In, First Out

Stack - First In, Last Out

Set - Only stores unique values.

See list.h for more information about each function that you are required to implement.

Remember that you cannot modify list.h.

You should write programs that use the list ADT to test and debug your code.

Subset 0b - Implement Graph ADT

Task

You have been provided with a file graph.h. Examine the given graph.h file carefully. It provides the interface to an ADT that will provide Graph functionality.

You should create the file graph.c to implement this ADT.

Your graph ADT should store string (char *) data in its vertices.

You may use the string.h library and other standard libraries from the weekly exercises.

You are required to dynamically allocate memory in your ADT for full style marks.

You must use an adjacency list representation for your ADT. But the exact implementation is up to you.

You can reuse the code previously submitted for weekly assessments in your graph ADT.

See graph.h for more information about each function that you are required to implement.

Note that graph.h indicates that each edge has a weight.

Remember that you cannot modify graph.h.

You should write programs that use the graph ADT to test and debug your code.

Subset 1 - Web Crawler

Task

We are now going to use the list and graph ADTs you have created to implement a web scraper.

Once you have completed the appropriate list functions, carefully examine the code in crawler.c. After compiling with make crawler (using the Makefile given), run the executable and check that the output aligns with the navigation of the sample website.

Uncomment the block of code that uses scanf to read user input for the 'ignore list' in the provided implementation, so that users can provide some URLs as input.

The ignore list represents the URLs that we would like to completely ignore when you are calculating page ranks, as if they did not exist in the graph. This means that any incoming and outgoing edges from these URLs are treated as non-existent. You are required to implement this functionality locally - within the graph_show() function - and NOT to change the actual graph ADT. For more details see the graph.h file.

crawler.c requires external dependencies (libcurl and libxml2)

The provided Makefile will work on CSE servers (ssh and vlab) but might not work on your home computer.

If you have correctly implemented the ADTs from the previous tasks, this part should be mostly free.

Note: crawler.c is a complete implementation of a web crawler; you do not need to modify the utility functions, only the bottom part of the main function. However, you should look at the program carefully and understand it well so that you can use it (i.e., modify it appropriately) for later tasks.

Using a modified crawler.c that simply calls graph_show() on the micro-web, and without ignoring any pages, the output should be:

prompt$ ./crawler http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html

Enter a page to ignore or type  'done': done

http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html

http://www.cse.unsw.edu.au/~cs9024/micro-web/X.html

http://www.cse.unsw.edu.au/~cs9024/micro-web/Y.html

http://www.cse.unsw.edu.au/~cs9024/micro-web/Z.html

http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html

http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html

http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html

Now let's add index.html to the ignore list:

prompt$ ./crawler http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html

Enter a page to ignore or type 'done': http://www.cse.unsw.edu.au/~cs9024/micro-web/index.html

All traces of index.html have been removed. This means that only the remaining vertices are displayed, as there are no longer any edges. Note that the order of the output matters: it should follow the BFS performed by the crawler. If your result does not follow this order, it will be marked as incorrect, even if your graph is valid.

Testing

We have created a script to automatically test your list and graph ADTs. It expects to find list.c and graph.c in the current working directory. Limited test cases are provided, so you should always do your own, more thorough, testing.

Subset 2 - PageRank

Background

Now that we can crawl a network and build a graph, we need a way to determine what pages in our network (vertices) are important.

We haven't kept page content so the only metric we can use to determine importance of a page is to check how much other pages rely on its existence. That is: how easy is it to follow a sequence of one or more links (edges) and end up on the page.

In 1998 Larry Page and Sergey Brin created the PageRank algorithm to evaluate this metric.

Google still uses the PageRank algorithm to score every page it indexes on the internet to help order its search results.

Task

In graph.c implement the two new functions graph_pagerank() and graph_viewrank().

First, graph_pagerank() should calculate the PageRank of every page (vertex) in your graph. The algorithm should calculate and store the page rank in the ADT.

The algorithm must exclude the URLs that are provided in an 'ignore list' to the function. Do not remove the pages from the graph, only skip (i.e., ignore) them for the calculation. This means that you will need to understand which parts of the PageRank algorithm need to be modified.

Using the ignore list, you will be able to see what happens to the page ranks as certain pages are removed. What should happen to the pagerank of a certain page if you remove all pages pointing to it?

Second, graph_viewrank() should print the PageRank of every page (vertex) in your graph that is NOT in the ignore list.

Pages (vertices) should be printed from highest to lowest rank, if two pages have the same rank then they should be printed alphabetically.

You can add utility functions to graph.c

You can (and most likely will need to) modify your struct definitions in graph.c

You cannot modify the file graph.h

You cannot modify the function prototypes for graph_pagerank() and graph_viewrank()

Algorithm

For t = 0:

PR(pi; t) = 1 / N

For t > 0:

PR(pi; t) = (1 − d) / N + d × ( Σ[pj ∈ M(pi)] PR(pj; t − 1) / D(pj)  +  Σ[pj ∈ S] PR(pj; t − 1) / N )


Where:

N is the number of vertices

pi and pj are each some vertex

t is the "time" (iteration count)

PR(pi ;t) is the PageRank of vertex pi at some time t

d is the damping_factor

M(pi) is the set of vertices that have an outbound edge towards pi

PR(pj ;t − 1) is the PageRank of vertex pj at some time t − 1

D(pj) is the out-degree of vertex pj, i.e., the number of outbound edges of vertex pj

S is the set of sinks, i.e., the set of vertices with no outbound edges, i.e., where D(pj) is 0