COMPSCI5011 Information Retrieval M August 2021
SECTION A
1.
(a)
The following documents have been processed by an IR system where stemming is not applied:
DocID | Text
------|-----------------------------------------------------
Doc1  | breakthrough vaccine for effective covid19 treatment
Doc2  | new vaccine: covid19 vaccine is approved for treatment
Doc3  | new approach for treating patients
Doc4  | new hopes for new covid19 patients in the world
(i) Assume that the following terms are stopwords: in, is, for, the. Construct an inverted file for these documents, showing clearly the dictionary (aka lexicon) and posting list components. Your inverted file needs to store sufficient information for computing a simple tf*idf term weight, where wij = tfij * log2(N/dfi).
[5]
(ii) Compute the term weights of the two terms “effective” and “treatment” in
Doc1. Show your working.
[2]
(iii) Assuming the use of a best match ranking algorithm, rank all documents using
their relevance scores for the following query:
covid19 vaccine
Show your working. Note that log2(0.75) = -0.4150 and log2(1.3333) = 0.4150.
[3]
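As an illustrative sketch (not part of the expected exam answer), the best-match tf*idf scoring asked for above can be reproduced in a few lines of Python; the df and tf counts below are read off the four example documents:

```python
import math

# Statistics from the four example documents (N = 4 documents in total).
N = 4
df = {"covid19": 3, "vaccine": 2}            # document frequencies
tf = {                                       # term frequencies per document
    "Doc1": {"covid19": 1, "vaccine": 1},
    "Doc2": {"covid19": 1, "vaccine": 2},
    "Doc3": {},
    "Doc4": {"covid19": 1},
}

def score(doc, query_terms):
    """Best-match relevance score: sum of tf * log2(N/df) over query terms."""
    return sum(tf[doc].get(t, 0) * math.log2(N / df[t]) for t in query_terms)

scores = {d: score(d, ["covid19", "vaccine"]) for d in tf}
# Doc2 (2.415) > Doc1 (1.415) > Doc4 (0.415) > Doc3 (0)
```

Note that log2(4/3) = 0.4150 is exactly the value given in the question's hint.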
(iv) Typically, a log scale is applied to the tf (term frequency) component when
scoring documents using a simple tf*idf term weighting scheme. Explain why this is the case illustrating your answer with a suitable example in IR. Explain through examples how models such as BM25 and PL2 control the term frequency counts.
[4]
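To illustrate the saturation behaviour that part (iv) asks about, here is a minimal sketch of BM25's term-frequency component (document-length normalisation omitted for brevity; k1 = 1.2 is an assumed, conventional value):

```python
def bm25_tf(tf, k1=1.2):
    """BM25's saturating term-frequency component (length normalisation
    omitted): grows sub-linearly in tf and is bounded above by k1 + 1."""
    return tf * (k1 + 1) / (tf + k1)

# Doubling a large raw tf barely changes the score contribution:
low, high = bm25_tf(10), bm25_tf(20)
```

A document that repeats a query term hundreds of times therefore cannot dominate the ranking, unlike with raw tf.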
(b) Consider a query q for which there are 8 relevant documents. Consider an IR system whose ranking for this query is shown below. In particular, its top 10 ranked documents were judged for relevance, where R means the document was deemed relevant and N otherwise (the leftmost item is the top ranked search result).
System ranking: R R R N R N N N N N
(i) Compute the precision, recall, precision at rank 5 and the average precision of the system for this query. Show your calculations.
[3]
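A small sketch (for checking one's working, not a model answer) that computes these four metrics for the judged ranking above:

```python
def metrics(ranking, num_relevant):
    """ranking: booleans over the judged top-k (True = relevant)."""
    hits = sum(ranking)
    precision = hits / len(ranking)
    recall = hits / num_relevant
    p_at_5 = sum(ranking[:5]) / 5
    # AP: precision at each relevant rank, averaged over ALL relevant docs.
    ap = sum(sum(ranking[:i + 1]) / (i + 1)
             for i, rel in enumerate(ranking) if rel) / num_relevant
    return precision, recall, p_at_5, ap

ranking = [True, True, True, False, True, False, False, False, False, False]
p, r, p5, ap = metrics(ranking, num_relevant=8)
# p = 0.4, r = 0.5, p5 = 0.8, ap = (1 + 1 + 1 + 0.8) / 8 = 0.475
```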
(ii) Assume that you are using the IR system to find as many relevant documents as possible for your query q. Discuss the suitability of the above computed evaluation metrics in reflecting the quality of the rankings produced by the system. Justify your answer.
[2]
(c) Illustrating your answer with suitable examples of information needs and/or queries, provide two situations where pseudo-relevance feedback might not be effective and explain why. Suggest two possible solutions to enhance the performance of the system in those situations.
[5]
(d) Consider a query with two terms, whose posting lists are as follows:
term1 → [id=1, tf=3], [id=7, tf=2], [id=9, tf=1]
term2 → [id=1, tf=4], [id=3, tf=2], [id=7, tf=4]
Explain and provide the exact steps/order in which the posting lists will be traversed by the TAAT and DAAT query evaluation strategies, and state the memory requirements of both strategies for obtaining a result set of K documents from a corpus of N documents (K < N).
[6]
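A minimal sketch of the TAAT strategy over the two posting lists above (scoring by raw tf purely for illustration); DAAT, by contrast, advances one cursor per list in docid order and needs only the K-sized result heap plus the cursors:

```python
from collections import defaultdict
import heapq

postings = {
    "term1": [(1, 3), (7, 2), (9, 1)],   # (docid, tf), sorted by docid
    "term2": [(1, 4), (3, 2), (7, 4)],
}

def taat_top_k(postings, k):
    """Term-at-a-time: each posting list is traversed fully before the next
    one, accumulating partial scores per docid.  Worst-case memory: O(N)
    accumulators plus an O(K) result heap.  (DAAT instead merges the lists
    by docid, finalising each document's score in one pass.)"""
    acc = defaultdict(int)
    for plist in postings.values():      # one whole list at a time
        for docid, tf in plist:
            acc[docid] += tf             # raw-tf scoring, for illustration
    return heapq.nlargest(k, acc.items(), key=lambda item: item[1])

top2 = taat_top_k(postings, k=2)         # doc 1 (score 7), doc 7 (score 6)
```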
2.
(a) Assume that the set of relevant documents for a query q is Rq, where:
Rq = {doc2, doc3, doc4, doc8}
An IR system ranks the top 10 results for this query as follows (the leftmost item is the top ranked search result):
doc1 doc2 doc3 doc4 doc5 doc6 doc7 doc8 doc9 doc10
(i) Provide the precision and recall values at each rank of the produced ranking for the query q. Show your workings.
[2]
(ii) Provide the interpolated precision values at the 11 standard recall levels for the ranking produced by the system for the query q. Show your workings.
[2]
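The interpolation rule can be sketched as follows (an illustrative aid, not a model answer): the interpolated precision at a recall level is the maximum precision observed at that level or any higher one.

```python
def interpolated_precision(ranking, relevant):
    """Interpolated precision at the 11 standard recall levels 0.0, 0.1, ...,
    1.0: interp_p(level) = max precision observed at any recall >= level."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))  # (recall, prec)
    return {round(0.1 * j, 1): max((p for r, p in points if r >= round(0.1 * j, 1)),
                                   default=0.0)
            for j in range(11)}

ranking = [f"doc{i}" for i in range(1, 11)]
interp = interpolated_precision(ranking, {"doc2", "doc3", "doc4", "doc8"})
# 0.75 at recall levels 0.0-0.7, then 0.5 at 0.8-1.0
```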
(b) Consider the query “jackson music” and the following term frequencies for the
three documents D1, D2 and D3, where the search engine is using raw term frequency (TF) but no IDF:
   | indiana | jackson | life | michael | music | pop
D1 |    0    |    4    |  0   |    3    |   0   |  6
D2 |    4    |    0    |  3   |    4    |   0   |  0
D3 |    0    |    3    |  0   |    5    |   4   |  4
Assume that the system has returned the following ranking: D2, D3, D1. The user judges D3 to be relevant and both D1 and D2 to be non-relevant.
(i) Show the original query vector, clearly stating the dimensions (i.e. terms) of the vector.
[2]
(ii) Use Rocchio’s relevance feedback algorithm (with α=β=γ=1) to provide a revised query vector for “jackson music”. Show the dimensions (i.e. terms) in the revised query. Terms in the revised query that have negative weights can be dropped, i.e. their weights can be changed back to 0. Show all your calculations.
[4]
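Rocchio's update on this example can be sketched as below (an illustrative aid using the term frequencies from the table, with binary weights assumed for the original query):

```python
TERMS = ["indiana", "jackson", "life", "michael", "music", "pop"]
q  = [0, 1, 0, 0, 1, 0]          # "jackson music", binary weights
d1 = [0, 4, 0, 3, 0, 6]          # judged non-relevant
d2 = [4, 0, 3, 4, 0, 0]          # judged non-relevant
d3 = [0, 3, 0, 5, 4, 4]          # judged relevant

def rocchio(q, rel, nonrel, alpha=1, beta=1, gamma=1):
    """q' = alpha*q + beta*centroid(rel) - gamma*centroid(nonrel),
    with negative weights clipped back to 0."""
    rel_c = [sum(d[i] for d in rel) / len(rel) for i in range(len(q))]
    non_c = [sum(d[i] for d in nonrel) / len(nonrel) for i in range(len(q))]
    return [max(0.0, alpha * q[i] + beta * rel_c[i] - gamma * non_c[i])
            for i in range(len(q))]

q_new = rocchio(q, rel=[d3], nonrel=[d1, d2])
# -> [0, 2, 0, 1.5, 5, 1] over (indiana, jackson, life, michael, music, pop)
```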
(c) Suppose we have a corpus of documents with a dictionary of 6 words w1, ..., w6. Consider the table below, which provides the estimated language model p(w|C) using the entire corpus of documents C (second column) as well as the word counts for doc1 (third column) and doc2 (fourth column), where ct(w, doci) is the count of word w (i.e. its term frequency) in document doci. Let the query q be the following:
q = w1 w2
Word | p(w|C) | ct(w, doc1) | ct(w, doc2)
w1   | 0.8    | 2           | 7
w2   | 0.1    | 3           | 1
w3   | 0.025  | 2           | 1
w4   | 0.025  | 2           | 1
w5   | 0.025  | 1           | 0
w6   | 0.025  | 0           | 0
SUM  | 1.0    | 10          | 10
(i) Assume that we do not apply any smoothing technique to the language model for doc1 and doc2. Calculate the query likelihood for both doc1 and doc2, i.e. p(q|doc1) and p(q|doc2) (do not compute the log-likelihood; i.e. do not apply any log scaling). Show your calculations.
Provide the resulting ranking of documents and state the document that would be ranked the highest.
[3]
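The unsmoothed computation can be sketched as follows (illustrative only, using the counts from the table above):

```python
doc1 = {"w1": 2, "w2": 3, "w3": 2, "w4": 2, "w5": 1, "w6": 0}
doc2 = {"w1": 7, "w2": 1, "w3": 1, "w4": 1, "w5": 0, "w6": 0}

def query_likelihood(query, doc):
    """Unsmoothed (maximum-likelihood) query likelihood:
    p(q|d) = product over query words w of ct(w, d) / |d|."""
    doc_len = sum(doc.values())
    p = 1.0
    for w in query:
        p *= doc[w] / doc_len
    return p

p1 = query_likelihood(["w1", "w2"], doc1)   # (2/10) * (3/10) = 0.06
p2 = query_likelihood(["w1", "w2"], doc2)   # (7/10) * (1/10) = 0.07
```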
(ii) Suppose we instead smooth the language models for doc1 and doc2 using the
Dirichlet prior smoothing method with parameter µ = 10. Recall that for a given word w, the smoothed probability using the Dirichlet prior smoothing method is estimated as follows:
p(w|doci) = (ct(w, doci) + µ · p(w|C)) / (|doci| + µ)
where |doci| is the document length of doci in tokens.
Provide the probability of q according to the Dirichlet smoothed language model for both doc1 and doc2, i.e., p(q|doc1) and p(q|doc2). Do not compute the log-likelihood; i.e. do not apply any log scaling. Show your calculations.
Provide the resulting ranking of documents and state the document that would be ranked the highest.
[5]
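The Dirichlet-smoothed version of the same computation can be sketched as below (illustrative only; note how smoothing reverses the ranking, since w1 is common in the corpus and contributes less discriminative evidence):

```python
p_C  = {"w1": 0.8, "w2": 0.1, "w3": 0.025, "w4": 0.025, "w5": 0.025, "w6": 0.025}
doc1 = {"w1": 2, "w2": 3, "w3": 2, "w4": 2, "w5": 1, "w6": 0}
doc2 = {"w1": 7, "w2": 1, "w3": 1, "w4": 1, "w5": 0, "w6": 0}

def dirichlet_likelihood(query, doc, p_C, mu=10):
    """Dirichlet-smoothed query likelihood:
    p(w|d) = (ct(w, d) + mu * p(w|C)) / (|d| + mu), multiplied over q."""
    doc_len = sum(doc.values())
    p = 1.0
    for w in query:
        p *= (doc.get(w, 0) + mu * p_C[w]) / (doc_len + mu)
    return p

p1 = dirichlet_likelihood(["w1", "w2"], doc1, p_C)   # (10/20) * (4/20) = 0.100
p2 = dirichlet_likelihood(["w1", "w2"], doc2, p_C)   # (15/20) * (2/20) = 0.075
```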
(iii) Explain which document you think should reasonably be ranked higher (doc1 or doc2), and why.
[3]
(d) Assume that you are developing a crawler to discover documents written in different languages on the Web and to make them available for search. Your crawler will use the following components:
(A) a stemming process
(B) a language classifier to identify the language of each document
(C) a stopword removal process
(D) a filter that detects the format (PDF, Word, etc.) of the document
(E) a filter that detects if the document has already been discovered
(F) a parser that extracts content and outgoing links from a document
(G) a robots.txt filter that checks if a given document is OK to crawl
Provide the precise sequence in which your crawler will apply the components to a discovered document. Justify your answer.
[4]
(e) Neural retrieval models often use a re-ranking strategy over BM25 to reduce computational overhead.
What is the key limitation of this strategy? Describe in sufficient detail an approach that you might use to overcome this problem.
[5]
SECTION B
3.
(a) Consider a collection of newspaper articles ranging over a period of 10 years.
Assume that after text processing and indexing the collection of articles in the first 5 years, you observe that it contains 400 million tokens with a lexicon size (number of unique words) of 1 million words.
Assume that the collection grows uniformly every year (i.e. the number of tokens added each year is the same). Use Heaps' Law (V = k nβ with β = 0.5) to predict the number of tokens as well as the lexicon size of the collection after indexing its full content (10 years). Justify your answer.
[3]
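The extrapolation can be sketched as follows (an illustrative aid: k is fitted from the single 5-year observation, and uniform growth implies the token count doubles over 10 years):

```python
# Heaps' Law: V = k * n**beta.  Fit k from the 5-year observation, then
# extrapolate to 10 years (uniform growth => the token count doubles).
beta = 0.5
tokens_5y, vocab_5y = 400_000_000, 1_000_000

k = vocab_5y / tokens_5y ** beta     # k = 1e6 / 2e4 = 50
tokens_10y = 2 * tokens_5y           # 800 million tokens
vocab_10y = k * tokens_10y ** beta   # = 1e6 * sqrt(2) ~ 1.414 million words
```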
(b) The probability ranking principle is at the heart of probabilistic models in
information retrieval.
Using your own words, provide a short statement summarising the probability ranking principle. Explain why it is fundamental and needed in probabilistic IR modelling.
Provide two examples where the ranking produced by an IR system is optimal according to the probability ranking principle, but the ranking is not satisfactory from a user’s perspective. Justify your answers.
[4]
(c)
For a particular query q, the multi-grade relevance judgements of all documents are {(d1, 1), (d3, 3), (d6, 4), (d9, 3), (d11, 1), (d31, 2)}, where each tuple represents a document ID and relevance judgement pair, and all other documents are judged as non-relevant. Documents are judged on the scale 0-4 (0: not relevant – 4: highly relevant). An IR system returns the following ranked list with respect to this query (these are all the results it returned for this query, in that order):
d31, d22, d3, d6, d15
Assume that 1/log2 (rank) is used as the discount factor. You might wish to note that: log2 2 = 1; log2 3 = 1.59; log2 4 = 2; log2 5 = 2.32; log2 6 = 2.59.
(i) Compute the Normalised Discounted Cumulative Gain (NDCG) at each rank position for the given query. In your answer, provide the ideal DCG values for the perfect ranking for the given query. Show your calculations.
[4]
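The NDCG computation with the stated discount can be sketched as below (illustrative; the code uses exact log2 values rather than the rounded ones given in the hint):

```python
import math

def dcg_at_each_rank(gains):
    """DCG with the stated discount: gain at rank 1 undiscounted, gain at
    rank r >= 2 divided by log2(r).  Returns DCG at every rank cutoff."""
    running, out = 0.0, []
    for rank, g in enumerate(gains, start=1):
        running += g if rank == 1 else g / math.log2(rank)
        out.append(running)
    return out

judgements = {"d1": 1, "d3": 3, "d6": 4, "d9": 3, "d11": 1, "d31": 2}
ranking = ["d31", "d22", "d3", "d6", "d15"]

gains = [judgements.get(d, 0) for d in ranking]        # [2, 0, 3, 4, 0]
ideal = sorted(judgements.values(), reverse=True)[:5]  # [4, 3, 3, 2, 1]
ndcg = [d / i for d, i in zip(dcg_at_each_rank(gains), dcg_at_each_rank(ideal))]
# NDCG@1 = 2/4 = 0.5
```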
(ii) The DCG scores are normalised in NDCG. Explain why this normalisation
is needed. Why is such a normalisation not needed in MAP?
[3]
(d) URL length has been shown to be an important feature for some Web search tasks. Briefly discuss the types of information needs on the Web for which the URL length feature is most appropriate.
Consider a linear learning to rank model for Web search using 4 features: PL2, Proximity, PageRank and URL length. Using such a model, explain the main disadvantage of using linear learning to rank models in Web search.
[5]
(e) A posting list for a term in an inverted index contains the following three entries:
[id=4, tf=3], [id=8, tf=3], [id=10, tf=4]
Applying delta compression to the ids, show the binary form of the unary-compressed posting list. What is the resulting (mean) compression rate, in bits per integer?
[4]
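The encoding can be sketched as follows. Note the unary convention assumed here, (n-1) '1' bits followed by a '0' terminator, is one common choice; some texts use the complementary bit pattern, which yields the same code lengths:

```python
def unary(n):
    """Unary code for a positive integer n: (n-1) '1' bits followed by a
    '0' terminator (assumed convention; code length is n bits either way)."""
    return "1" * (n - 1) + "0"

postings = [(4, 3), (8, 3), (10, 4)]       # (docid, tf)

prev, parts = 0, []
for docid, tf in postings:
    parts.append(unary(docid - prev))      # delta-encoded id gaps: 4, 4, 2
    parts.append(unary(tf))                # tfs: 3, 3, 4
    prev = docid

encoded = "".join(parts)                   # 1110 110 1110 110 10 1110
bits_per_int = len(encoded) / (2 * len(postings))   # 20 bits / 6 ints ~ 3.33
```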
(f) You are working on a search engine, and you noticed that a document entitled "City of York's (England) new COVID restrictions" is retrieved when a user searches for "New York City COVID" (in this case, the user is looking for information about COVID in the US city).
Assuming documents suitable for the user's information need exist in the corpus, describe and explain how this problem could be addressed:
(i) without using machine learning,
[2]
(ii) using traditional learning to rank, and
[2]
(iii) using neural networks.
[3]