
INFORMATION RETRIEVAL M

COMPSCI 5011

October 2021

Section A

1. 

(a) Stemming is generally applied when indexing documents. Explain how stemming typically impacts the recall measure when retrieving documents. Justify your answer.

 [4]

(b) Consider the recall-precision graph below, showing the performance of two variants of a search engine that mimics Google Scholar on a collection of research papers. There is no difference between the two variants apart from how they score documents. Assume that you are a student looking to find all published papers on a given topic. In other words, you do not want to miss any of the relevant documents. Explain which search engine would be more suitable for your task, and why.

[5] 

[Figure: recall-precision graph comparing the two search engine variants]

(c) Assume that you have decided to modify the approach you use to parse the documents of your collection. You have developed a new parsing algorithm that uses recent natural language processing techniques to decide on which sentences to index within a document and which sentences to skip. Explain, in detail, the steps you need to undertake to determine whether your new parsing approach produces a better retrieval performance than the original approach that indexes each sentence in each document.

 [6]
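For illustration only: once both parsers have been used to index the same collection, both systems can be run over the same test topics with relevance judgements, and their per-query effectiveness scores compared with a paired significance test. The sketch below assumes hypothetical per-query average precision (AP) values and uses scipy; it is a sketch of the comparison step, not a prescribed answer.

```python
# A minimal sketch of comparing two parsing approaches, assuming both have
# indexed the same collection and been run over the same queries with shared
# relevance judgements (qrels). The AP values below are hypothetical.
from scipy import stats

ap_original = [0.21, 0.35, 0.10, 0.44, 0.30]   # per-query AP, original parser
ap_new      = [0.25, 0.33, 0.18, 0.51, 0.34]   # per-query AP, new NLP parser

t_stat, p_value = stats.ttest_rel(ap_new, ap_original)  # paired t-test
print(f"MAP original={sum(ap_original)/len(ap_original):.3f} "
      f"new={sum(ap_new)/len(ap_new):.3f}, p={p_value:.3f}")
```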

(d) Assume that you are given the task of developing a new search system for a commercial research centre. The underlying document collection contains a large number of documents from three different academic journals and your task is to build a search system that allows searches on the whole collection and also on the individual collections (i.e., documents from individual journals). In addition, you need to provide a search functionality based on the author, title and abstract parts of the documents. 

Discuss the issues involved in constructing an inverted index for this scenario. Explain the structure of an inverted index that you may use, and discuss the reasons behind such a structure. In particular, as part of your answer (500 words max), you should (i) describe and justify the procedure for constructing a suitable inverted index for this scenario, (ii) provide a suitable structure for the resulting inverted index, and (iii) explain why your described structure is a suitable one.

[10]
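As an aside, here is a minimal Python sketch of one possible posting layout that supports per-journal and per-field search. The field names and record layout are assumptions for illustration, not the expected answer.

```python
# One possible (illustrative) posting layout for a field- and journal-aware
# inverted index; a real system would store compressed postings on disk.
from collections import defaultdict

FIELDS = ("author", "title", "abstract")

# term -> list of postings. Each posting keeps the docid, the journal the
# document came from, and per-field term frequencies, so that queries can be
# restricted to a single journal and/or a single field at retrieval time.
index = defaultdict(list)

def add_document(docid, journal, fields):
    """fields: dict mapping a field name to its (already tokenised) terms."""
    per_term = defaultdict(lambda: {f: 0 for f in FIELDS})
    for field, terms in fields.items():
        for term in terms:
            per_term[term][field] += 1
    for term, tfs in per_term.items():
        index[term].append({"docid": docid, "journal": journal, "tf": tfs})

add_document(1, "journalA", {"author": ["smith"],
                             "title": ["neural", "ranking"],
                             "abstract": ["ranking", "models"]})
print(index["ranking"])   # one posting, with per-field term frequencies
```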

(e) Consider a query with two terms, whose posting lists are as follows: 

term1 → [id=4, tf=2], [id=9, tf=1]

term2 → [id=4, tf=5], [id=7, tf=4], [id=9, tf=2]

Explain and provide the exact steps/order in which the posting lists will be traversed by the TAAT & DAAT query evaluation strategies.

Compare and contrast the memory requirements of these two strategies for obtaining a result set of K documents from a corpus of N documents (K<N). Justify your answer.

[5]
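For illustration, the following minimal Python sketch (not part of the question) traverses the two posting lists above under each strategy; scoring is a placeholder assumption (a document's score is simply the sum of its term frequencies), not a real weighting model.

```python
# Minimal sketch of TAAT vs DAAT traversal over the posting lists above.

postings = {
    "term1": [(4, 2), (9, 1)],          # (docid, tf) pairs, sorted by docid
    "term2": [(4, 5), (7, 4), (9, 2)],
}

def taat(postings):
    """Term-At-A-Time: fully process one posting list before the next,
    keeping partial scores for every candidate document (accumulators)."""
    acc = {}
    for term, plist in postings.items():
        for docid, tf in plist:
            acc[docid] = acc.get(docid, 0) + tf   # accumulate partial score
            print(f"TAAT: {term} -> doc {docid}")
    return acc

def daat(postings):
    """Document-At-A-Time: advance all lists in parallel, fully scoring
    one document before moving on to the next docid."""
    iters = {t: iter(p) for t, p in postings.items()}
    heads = {t: next(it, None) for t, it in iters.items()}
    scores = {}
    while any(h is not None for h in heads.values()):
        cur = min(h[0] for h in heads.values() if h is not None)
        score = 0
        for t in heads:
            if heads[t] is not None and heads[t][0] == cur:
                score += heads[t][1]
                print(f"DAAT: {t} -> doc {cur}")
                heads[t] = next(iters[t], None)
        scores[cur] = score
    return scores

print(taat(postings))   # visits term1:4, term1:9, then term2:4, term2:7, term2:9
print(daat(postings))   # visits doc 4 (both terms), doc 7, then doc 9 (both terms)
```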

2.

(a) Suppose we have a corpus of documents with a dictionary of 8 words w1, ..., w8. Consider the table below, which provides the estimated language model p(w|C) using the entire corpus of documents C (second column) as well as the raw word counts in doc1 (third column), where ct(w, doci) is the raw count of word w (i.e. its term frequency) in document doci. The fourth column corresponds to a classical unigram language model for document doc1 estimated using the non-smoothed maximum likelihood estimator.  

Word | p(w|C) | ct(w, doc1) | plm(w|doc1)
w1   | 0.4    | 2           |
w2   | 0.15   | 2           |
w3   | 0.05   | 1           |
w4   | 0.1    | 2           |
w5   | 0.05   | 2           |
w6   | 0.15   | 0           |
w7   | 0.05   | 1           |
w8   | 0.05   | 0           |

(i) Provide the missing values in the table for the non-smoothed maximum likelihood probabilities plm(w|doc1) for each of the 8 words (fourth column). Show your calculations.

[4] 

(ii) Suppose we now smooth the language model for doc1 using the Dirichlet prior smoothing method with parameter m = 10. Recall that for a given word w, the smoothed probability using the Dirichlet prior smoothing method is estimated as follows:

p(w|doc1) = (ct(w, doc1) + m · p(w|C)) / (|doc1| + m)

    where |doc1| is the document length of doc1 in tokens.

Compute the Dirichlet smoothed probabilities for words w1 and w2 in doc1. Show your calculations.

 [2]
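A minimal Python sketch for checking such hand calculations, using the counts from the table above (the function name p_dirichlet is illustrative):

```python
# Sketch of Dirichlet prior smoothing for doc1, with the table's counts (m = 10).

p_C = {"w1": 0.4, "w2": 0.15, "w3": 0.05, "w4": 0.1,
       "w5": 0.05, "w6": 0.15, "w7": 0.05, "w8": 0.05}
ct  = {"w1": 2, "w2": 2, "w3": 1, "w4": 2, "w5": 2, "w6": 0, "w7": 1, "w8": 0}
doc_len = sum(ct.values())   # |doc1| in tokens

def p_dirichlet(w, m=10):
    """p(w|doc1) = (ct(w, doc1) + m * p(w|C)) / (|doc1| + m)"""
    return (ct[w] + m * p_C[w]) / (doc_len + m)

print(p_dirichlet("w1"))   # (2 + 10*0.4)  / (|doc1| + 10)
print(p_dirichlet("w2"))   # (2 + 10*0.15) / (|doc1| + 10)
```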

(iii) For the remaining 6 words of doc1 (w3, w4, w5, w6, w7, w8), explain whether the smoothed probability will be larger than (>), equal to (=), or smaller than (<) the initial non-smoothed maximum likelihood estimate. You do not have to compute the actual probabilities, but you need to indicate the expected change. You must justify your answer.

[3] 

(iv) Let q = w1w6 be the query issued by the user. Provide the probability of q according to the Dirichlet smoothed language model for doc1 (recall that m = 10).  Show your calculations.

[2] 

(v) Assume that we make the value of m larger (i.e. > 10). Explain whether the probability of q will become larger, become smaller, or change in an unpredictable manner. Justify your answer.

[2]

(vi) Assume another document doc2 in the corpus, which is identical to doc1 with the exception that one occurrence of w1 has been changed to word w5. Hence, we have ct(w1, doc2) = 1 and ct(w5, doc2) = 3.

Let q1 = w1w5 be the new query. 

If no smoothing is applied, using the query likelihood retrieval method, state which of the two documents (doc1 or doc2) will be ranked higher. Justify your answer.

Using the query likelihood retrieval method but this time with Dirichlet prior smoothing applied (m = 10), show which of the two documents (doc1 or doc2) would be ranked higher. Show your calculations.  

Discuss whether smoothing has an impact on the ranking order of doc1 and doc2 and, if so, how. Justify your answer.

 [6]

(b) In various DFR models, term frequency tf is replaced with a normalised term frequency tfn as follows:

tfn = tf · log2(1 + c · avg_l / l)

where l is the length of the document in tokens and avg_l is the average document length in the collection.

Using an example of at least two documents of varying length, explain why such a normalisation is needed. Describe in detail how the above formula will behave for your chosen example documents.

What is the role of the c parameter in the formula?

 [5]
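Assuming the formula takes the usual DFR Normalisation 2 form shown above, a short Python sketch with two hypothetical documents of different lengths illustrates its behaviour:

```python
# Sketch of DFR length normalisation on two example documents of equal raw tf.
import math

def tfn(tf, doc_len, avg_len, c=1.0):
    """tfn = tf * log2(1 + c * avg_len / doc_len)"""
    return tf * math.log2(1 + c * avg_len / doc_len)

avg_len = 500                                  # hypothetical collection average
print(tfn(5, doc_len=100,  avg_len=avg_len))   # short doc: tf is boosted
print(tfn(5, doc_len=1000, avg_len=avg_len))   # long doc: tf is dampened
```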

(c) Explain the difference between representation and interaction models in neural IR.

What is the difference between contextualised and non-contextualised models? Provide an example of a non-contextualised interaction model, and an example of a contextualised representation model studied in the lecture. Are contextualised or non-contextualised models likely to perform better in Web search, and why?

 [6]

Section B

3.

(a)   Consider the following vector space scoring formula:

score(q, d) = Σ_{w ∈ q} ct(w, d) · ct(w, q) · log(M / Nw)

where ct(w,d) and ct(w, q) are the raw counts of word w in document d and query q, respectively (in other words, the term frequency TF of w in d and q, respectively); Nw is the number of documents in the corpus that contain word w, and M is the total number of documents in the corpus. Provide 4 reasons why the retrieval formula above is very unlikely to perform well in a web search context. Justify your answers.

[5]
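For illustration, a toy Python sketch of the formula above, with hypothetical counts; note how the score grows linearly with raw term frequency and takes no account of document length:

```python
# Sketch of the scoring formula above:
#   score(q, d) = sum over query words of ct(w, q) * ct(w, d) * log(M / Nw)
# All counts below are illustrative only.
import math

M = 1000                                  # total documents in the corpus
Nw = {"web": 400, "retrieval": 20}        # document frequency of each word
ct_q = {"web": 1, "retrieval": 1}         # counts in the query
ct_d = {"web": 3, "retrieval": 2}         # counts in the document

score = sum(ct_q[w] * ct_d[w] * math.log(M / Nw[w]) for w in ct_q)
print(score)   # no tf saturation, no length normalisation, no web evidence
```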

(b) Consider a query q for which there are 10 relevant documents in a corpus of 1000 documents. Consider an IR system whose ranking for this query is shown below. In particular, its top 16 ranked documents were judged for relevance, where R means the document was deemed relevant and N otherwise (the leftmost item is the top ranked search result):

System ranking: R N N R N N N N N R N N N N N R


(i) Compute the precision, recall and the average precision of the system for this query. Show your calculations.

[3]
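A small Python sketch for verifying such calculations from the ranking above (precision and recall are taken at the cutoff of 16 judged documents):

```python
# Check precision, recall and average precision for the ranking above.

ranking = list("RNNRNNNNNRNNNNNR")   # the system ranking, left = rank 1
num_relevant = 10                    # relevant documents for q in the corpus

rel_so_far, precs = 0, []
for i, label in enumerate(ranking, start=1):
    if label == "R":
        rel_so_far += 1
        precs.append(rel_so_far / i)   # precision at each relevant rank

precision = rel_so_far / len(ranking)   # relevant retrieved / retrieved
recall = rel_so_far / num_relevant      # relevant retrieved / all relevant
avg_prec = sum(precs) / num_relevant    # AP averages over all 10 relevant docs
print(precision, recall, avg_prec)
```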

(ii) Now assume that the total number of documents in the corpus is 20,000 instead of 1000. Let's also assume that the system has returned the exact same ranking of documents for query q on this larger corpus. Explain whether the system's retrieval performance on this larger corpus should intuitively be considered better, equal or worse than in the case with the 1000 documents, and briefly discuss whether the metrics computed above convey the actual retrieval performance of the system on both corpora.

[3]

(c) Using a suitable example, explain why Precision at Rank R (P@R) is not a good metric to use for learning an effective learning-to-rank model. What metric would you suggest using instead, and why?

[5]

(d) Consider a query q, which returns all web pages shown in the hyperlink structure below. Using matrices, explain how you will find the authority scores of each returned web page using the HITS approach. You need to provide all required intermediary matrices and state the resulting equations, but you do not have to solve these equations.

[Figure: hyperlink structure of the web pages returned for q]

NB: You can indicate matrices in plain text like a table, or write matrices row by row, e.g. ([a,b];[c,d]) shows a matrix with two columns and two rows where the first row is [a,b] and the second row is [c,d].

[5]
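As an illustration of the matrix formulation, the following Python sketch runs HITS on an arbitrary 3-page adjacency matrix; the actual matrix must be read off the figure, which is not reproduced here.

```python
# Sketch of HITS in matrix form, where A[i][j] = 1 if page i links to page j.
import numpy as np

A = np.array([[0, 1, 1],     # illustrative 3-page link structure only
              [0, 0, 1],
              [1, 0, 0]])

a = np.ones(A.shape[0])      # authority scores
h = np.ones(A.shape[0])      # hub scores
for _ in range(50):          # iterate a = A^T h, h = A a, then normalise
    a = A.T @ h
    h = A @ a
    a, h = a / np.linalg.norm(a), h / np.linalg.norm(h)

print(a)   # converges to the principal eigenvector of A^T A
```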

(e) Query features are used in learning to rank. Assume that you are developing a search engine for Twitter using a learning to rank approach. 

Provide two examples of suitable query features for Twitter search and discuss how they might help improve the effectiveness of a search engine for Twitter. 

  [4]

(f) A web search engine has devised a new interface to present its search results. Briefly describe three possible methods that could be used by the search engine to evaluate the interface change. Which method would you recommend, and why?

[5]