CS 242 - Winter 2023
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
CS 242 - Winter 2023
Instructions: Submit paper in Canvas by 2/23. This is individual assignment.
Exercise A
Consider the following document D, taken from a collecion C.
"The University of California, Riverside is one of 10 universiies within the presigious University of California system, and the only UC located in Inland Southern California. Widely recognized as one of the most ethnically diverse research universiies in the naion."
Consider the following two queries:
Q1: university Riverside
Q2: diverse university
Characterisics of collecion C are as follows:
# docs in collecion C: 10000
# docs in C that contain "Riverside": 100
# docs in C that contain "university/ies": 500
# docs in C that contain "diverse": 250
Compute the scores of Q1 and Q2 for D, using (a) BM25, and (b) Unigram Language Model (with smoothing method of your choice). Make and state any assumpions necessary, e.g., about the constants in BM25.
Exercise B
Write a program, in the language of your choice, to compute the PageRank scores of a graph. Then, use your program to compute the PageRank score of each node in the graph below.
Show the output of your program for each iteraion, that is, the PR scores for each iteraion unil it
converges.
Use epsilon=0.01 as convergence threshold.
In how many iteraions does the computaion converge?
Exercise C
Show how MapReduce can be used to efficiently solve the following problem:
Given a collecion of input documents, output all bigrams with pointwise mutual informaion greater than a constant T.
The pointwise mutual informaion of two words a,b is computed as P(a,b)/(P(a)P(b)), where P(a,b) is the probability they appear together as bigram a,b.
Write pseudocode for map and reduce funcions.
How would you use a Combiner to optimize your program?
Exercise D
Consider query Q that has a total of 7 relevant results in the collecion, and a search engine that
returns results:
r r x x x x r x x r, where r is a relevant result and x is not relevant.
1. Compute Precision-at-5, Recall-at-5, F1-at-5, Average Precision, and DCG-at-5 (assuming relevant results have score 1 and non-relevant 0).
2. Menion an applicaion where higher precision is more important than higher recall and one for the opposite.
2023-02-10