COMPSCI5011 Information Retrieval M August 2021
SECTION A
1.
(a)
The following documents have been processed by an IR system where stemming is not applied:
DocID | Text
------|-----------------------------------------------------
Doc1  | breakthrough vaccine for effective covid19 treatment
Doc2  | new vaccine: covid19 vaccine is approved for treatment
Doc3  | new approach for treating patients
Doc4  | new hopes for new covid19 patients in the world
(i) Assume that the following terms are stopwords: in, is, for, the. Construct an inverted file for these documents, showing clearly the dictionary (aka lexicon) and posting list components. Your inverted file needs to store sufficient information for computing a simple tf*idf term weight, where wij = tfij * log2(N/dfi).
[5]
(ii) Compute the term weights of the two terms “effective” and “treatment” in
Doc1. Show your working.
[2]
(iii) Assuming the use of a best match ranking algorithm, rank all documents using
their relevance scores for the following query:
covid19 vaccine
Show your working. Note that log2(0.75) = -0.4150 and log2(1.3333) = 0.4150.
[3]
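As an illustrative sketch (not part of the expected exam answer), the best-match tf*idf scoring asked for above can be reproduced in a few lines of Python; the df and tf counts below are read off the four example documents:

```python
import math

# Statistics from the four example documents (N = 4 documents in total).
N = 4
df = {"covid19": 3, "vaccine": 2}            # document frequencies
tf = {                                       # term frequencies per document
    "Doc1": {"covid19": 1, "vaccine": 1},
    "Doc2": {"covid19": 1, "vaccine": 2},
    "Doc3": {},
    "Doc4": {"covid19": 1},
}

def score(doc, query_terms):
    """Best-match relevance score: sum of tf * log2(N/df) over query terms."""
    return sum(tf[doc].get(t, 0) * math.log2(N / df[t]) for t in query_terms)

scores = {d: score(d, ["covid19", "vaccine"]) for d in tf}
# Doc2 (2.415) > Doc1 (1.415) > Doc4 (0.415) > Doc3 (0)
```

Note that log2(4/3) = 0.4150 is exactly the value given in the question's hint.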
(iv) Typically, a log scale is applied to the tf (term frequency) component when
scoring documents using a simple tf*idf term weighting scheme. Explain why this is the case illustrating your answer with a suitable example in IR. Explain through examples how models such as BM25 and PL2 control the term frequency counts.
[4]
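To illustrate the saturation behaviour that part (iv) asks about, here is a minimal sketch of BM25's term-frequency component (document-length normalisation omitted for brevity; k1 = 1.2 is an assumed, conventional value):

```python
def bm25_tf(tf, k1=1.2):
    """BM25's saturating term-frequency component (length normalisation
    omitted): grows sub-linearly in tf and is bounded above by k1 + 1."""
    return tf * (k1 + 1) / (tf + k1)

# Doubling a large raw tf barely changes the score contribution:
low, high = bm25_tf(10), bm25_tf(20)
```

A document that repeats a query term hundreds of times therefore cannot dominate the ranking, unlike with raw tf.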
(b) Consider a query q for which there are 8 relevant documents. Consider an IR system whose ranking for this query is shown below. In particular, its top 10 ranked documents were judged for relevance, where R means the document was deemed relevant and N otherwise (the leftmost item is the top ranked search result).
System ranking: R R R N R N N N N N
(i) Compute the precision, recall, precision at rank 5 and the average precision of the system for this query. Show your calculations.
[3]
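A small sketch (for checking one's working, not a model answer) that computes these four metrics for the judged ranking above:

```python
def metrics(ranking, num_relevant):
    """ranking: booleans over the judged top-k (True = relevant)."""
    hits = sum(ranking)
    precision = hits / len(ranking)
    recall = hits / num_relevant
    p_at_5 = sum(ranking[:5]) / 5
    # AP: precision at each relevant rank, averaged over ALL relevant docs.
    ap = sum(sum(ranking[:i + 1]) / (i + 1)
             for i, rel in enumerate(ranking) if rel) / num_relevant
    return precision, recall, p_at_5, ap

ranking = [True, True, True, False, True, False, False, False, False, False]
p, r, p5, ap = metrics(ranking, num_relevant=8)
# p = 0.4, r = 0.5, p5 = 0.8, ap = (1 + 1 + 1 + 0.8) / 8 = 0.475
```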
(ii) Assume that you are using the IR system to find as many relevant documents as possible for your query q. Discuss the suitability of the above computed evaluation metrics in reflecting the quality of the rankings produced by the system. Justify your answer.
[2]
(c) Illustrating your answer with suitable examples of information needs and/or queries, provide two situations where pseudo-relevance feedback might not be effective and explain why. Suggest two possible solutions to enhance the performance of the system in those situations.
[5]
(d) Consider a query with two terms, whose posting lists are as follows:
term1 → [id=1, tf=3], [id=7, tf=2], [id=9, tf=1]
term2 → [id=1, tf=4], [id=3, tf=2], [id=7, tf=4]
Explain and provide the exact steps/order in which the posting lists will be traversed by the TAAT and DAAT query evaluation strategies, and state the memory requirements of both strategies for obtaining a result set of K documents from a corpus of N documents (K < N).
[6]
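A minimal sketch of the TAAT strategy over the two posting lists above (scoring by raw tf purely for illustration); DAAT, by contrast, advances one cursor per list in docid order and needs only the K-sized result heap plus the cursors:

```python
from collections import defaultdict
import heapq

postings = {
    "term1": [(1, 3), (7, 2), (9, 1)],   # (docid, tf), sorted by docid
    "term2": [(1, 4), (3, 2), (7, 4)],
}

def taat_top_k(postings, k):
    """Term-at-a-time: each posting list is traversed fully before the next
    one, accumulating partial scores per docid.  Worst-case memory: O(N)
    accumulators plus an O(K) result heap.  (DAAT instead merges the lists
    by docid, finalising each document's score in one pass.)"""
    acc = defaultdict(int)
    for plist in postings.values():      # one whole list at a time
        for docid, tf in plist:
            acc[docid] += tf             # raw-tf scoring, for illustration
    return heapq.nlargest(k, acc.items(), key=lambda item: item[1])

top2 = taat_top_k(postings, k=2)         # doc 1 (score 7), doc 7 (score 6)
```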
2.
(a) Assume that the set of relevant documents for a query q is Rq, where:
Rq = {doc2, doc3, doc4, doc8}
An IR system ranks the top 10 results for this query as follows (the leftmost item is the top ranked search result):
doc1 doc2 doc3 doc4 doc5 doc6 doc7 doc8 doc9 doc10
(i) Provide the precision and recall values at each rank of the produced ranking for the query q. Show your workings.
[2]
(ii) Provide the interpolated precision values at the 11 standard recall levels for the ranking produced by the system for the query q. Show your workings.
[2]
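The interpolation rule can be sketched as follows (an illustrative aid, not a model answer): the interpolated precision at a recall level is the maximum precision observed at that level or any higher one.

```python
def interpolated_precision(ranking, relevant):
    """Interpolated precision at the 11 standard recall levels 0.0, 0.1, ...,
    1.0: interp_p(level) = max precision observed at any recall >= level."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))  # (recall, prec)
    return {round(0.1 * j, 1): max((p for r, p in points if r >= round(0.1 * j, 1)),
                                   default=0.0)
            for j in range(11)}

ranking = [f"doc{i}" for i in range(1, 11)]
interp = interpolated_precision(ranking, {"doc2", "doc3", "doc4", "doc8"})
# 0.75 at recall levels 0.0-0.7, then 0.5 at 0.8-1.0
```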
(b) Consider the query “jackson music” and the following term frequencies for the
three documents D1, D2 and D3, where the search engine is using raw term frequency (TF) but no IDF:
   | indiana | jackson | life | michael | music | pop
D1 |    0    |    4    |  0   |    3    |   0   |  6
D2 |    4    |    0    |  3   |    4    |   0   |  0
D3 |    0    |    3    |  0   |    5    |   4   |  4
Assume that the system has returned the following ranking: D2, D3, D1. The user judges D3 to be relevant and both D1 and D2 to be non-relevant.
(i) Show the original query vector, clearly stating the dimensions (i.e. terms) of the vector.
[2]
(ii) Use Rocchio’s relevance feedback algorithm (with α=β=γ=1) to provide a revised query vector for “jackson music”. Show the dimensions (i.e. terms) in the revised query. Terms in the revised query that have negative weights can be dropped, i.e. their weights can be changed back to 0. Show all your calculations.
[4]
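Rocchio's update on this example can be sketched as below (an illustrative aid using the term frequencies from the table, with binary weights assumed for the original query):

```python
TERMS = ["indiana", "jackson", "life", "michael", "music", "pop"]
q  = [0, 1, 0, 0, 1, 0]          # "jackson music", binary weights
d1 = [0, 4, 0, 3, 0, 6]          # judged non-relevant
d2 = [4, 0, 3, 4, 0, 0]          # judged non-relevant
d3 = [0, 3, 0, 5, 4, 4]          # judged relevant

def rocchio(q, rel, nonrel, alpha=1, beta=1, gamma=1):
    """q' = alpha*q + beta*centroid(rel) - gamma*centroid(nonrel),
    with negative weights clipped back to 0."""
    rel_c = [sum(d[i] for d in rel) / len(rel) for i in range(len(q))]
    non_c = [sum(d[i] for d in nonrel) / len(nonrel) for i in range(len(q))]
    return [max(0.0, alpha * q[i] + beta * rel_c[i] - gamma * non_c[i])
            for i in range(len(q))]

q_new = rocchio(q, rel=[d3], nonrel=[d1, d2])
# -> [0, 2, 0, 1.5, 5, 1] over (indiana, jackson, life, michael, music, pop)
```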
(c) Suppose we have a corpus of documents with a dictionary of 6 words w1, ..., w6. Consider the table below, which provides the estimated language model p(w|C) using the entire corpus of documents C (second column) as well as the word counts for doc1 (third column) and doc2 (fourth column), where ct(w, doci) is the count of word w (i.e. its term frequency) in document doci. Let the query q be the following:
q = w1 w2
Word | p(w|C) | ct(w, doc1) | ct(w, doc2)
w1   | 0.8    | 2           | 7
w2   | 0.1    | 3           | 1
w3   | 0.025  | 2           | 1
w4   | 0.025  | 2           | 1
w5   | 0.025  | 1           | 0
w6   | 0.025  | 0           | 0
SUM  | 1.0    | 10          | 10
(i) Assume that we do not apply any smoothing technique to the language model for doc1 and doc2. Calculate the query likelihood for both doc1 and doc2, i.e. p(q|doc1) and p(q|doc2) (do not compute the log-likelihood; i.e. do not apply any log scaling). Show your calculations.
Provide the resulting ranking of documents and state the document that would be ranked the highest.
[3]
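The unsmoothed computation can be sketched as follows (illustrative only, using the counts from the table above):

```python
doc1 = {"w1": 2, "w2": 3, "w3": 2, "w4": 2, "w5": 1, "w6": 0}
doc2 = {"w1": 7, "w2": 1, "w3": 1, "w4": 1, "w5": 0, "w6": 0}

def query_likelihood(query, doc):
    """Unsmoothed (maximum-likelihood) query likelihood:
    p(q|d) = product over query words w of ct(w, d) / |d|."""
    doc_len = sum(doc.values())
    p = 1.0
    for w in query:
        p *= doc[w] / doc_len
    return p

p1 = query_likelihood(["w1", "w2"], doc1)   # (2/10) * (3/10) = 0.06
p2 = query_likelihood(["w1", "w2"], doc2)   # (7/10) * (1/10) = 0.07
```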
(ii) Suppose we instead smooth the language models for doc1 and doc2 using the
Dirichlet prior smoothing method with parameter µ = 10. Recall that for a given word w, the smoothed probability using the Dirichlet prior smoothing method is estimated as follows:
p(w|doci) = (ct(w, doci) + µ · p(w|C)) / (|doci| + µ)
where |doci| is the document length of doci in tokens.
Provide the probability of q according to the Dirichlet smoothed language model for both doc1 and doc2, i.e., p(q|doc1) and p(q|doc2). Do not compute the log-likelihood; i.e. do not apply any log scaling. Show your calculations.
Provide the resulting ranking of documents and state the document that would be ranked the highest.
[5]
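The Dirichlet-smoothed version of the same computation can be sketched as below (illustrative only; note how smoothing reverses the ranking, since w1 is common in the corpus and contributes less discriminative evidence):

```python
p_C  = {"w1": 0.8, "w2": 0.1, "w3": 0.025, "w4": 0.025, "w5": 0.025, "w6": 0.025}
doc1 = {"w1": 2, "w2": 3, "w3": 2, "w4": 2, "w5": 1, "w6": 0}
doc2 = {"w1": 7, "w2": 1, "w3": 1, "w4": 1, "w5": 0, "w6": 0}

def dirichlet_likelihood(query, doc, p_C, mu=10):
    """Dirichlet-smoothed query likelihood:
    p(w|d) = (ct(w, d) + mu * p(w|C)) / (|d| + mu), multiplied over q."""
    doc_len = sum(doc.values())
    p = 1.0
    for w in query:
        p *= (doc.get(w, 0) + mu * p_C[w]) / (doc_len + mu)
    return p

p1 = dirichlet_likelihood(["w1", "w2"], doc1, p_C)   # (10/20) * (4/20) = 0.100
p2 = dirichlet_likelihood(["w1", "w2"], doc2, p_C)   # (15/20) * (2/20) = 0.075
```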
(iii) Explain which document you think should reasonably be ranked higher (doc1 or doc2), and why.
[3]
(d) Assume that you are developing a crawler to discover documents written in different languages on the Web and to make them available for search. Your crawler will use the following components:
(A) a stemming process
(B) a language classifier to identify the language of each document
(C) a stopword removal process
(D) a filter that detects the format (PDF, Word, etc.) of the document
(E) a filter that detects if the document has already been discovered
(F) a parser that extracts content and outgoing links from a document
(G) a robots.txt filter that checks if a given document is OK to crawl
Provide the precise sequence in which your crawler will apply the components to a discovered document. Justify your answer.
[4]
(e) Neural retrieval models often use a re-ranking strategy over BM25 to reduce computational overhead.
What is the key limitation of this strategy? Describe in sufficient detail an approach that you might use to overcome this problem.
[5]
SECTION B
3.
(a) Consider a collection of newspaper articles ranging over a period of 10 years.
Assume that after text processing and indexing the collection of articles in the first 5 years, you observe that it contains 400 million tokens with a lexicon size (number of unique words) of 1 million words.
Assume that the collection grows uniformly every year (i.e. the number of tokens added each year is the same). Use Heaps' Law (V = k nβ with β = 0.5) to predict the number of tokens as well as the lexicon size of the collection after indexing its full content (10 years). Justify your answer.
[3]
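The extrapolation can be sketched as follows (an illustrative aid: k is fitted from the single 5-year observation, and uniform growth implies the token count doubles over 10 years):

```python
# Heaps' Law: V = k * n**beta.  Fit k from the 5-year observation, then
# extrapolate to 10 years (uniform growth => the token count doubles).
beta = 0.5
tokens_5y, vocab_5y = 400_000_000, 1_000_000

k = vocab_5y / tokens_5y ** beta     # k = 1e6 / 2e4 = 50
tokens_10y = 2 * tokens_5y           # 800 million tokens
vocab_10y = k * tokens_10y ** beta   # = 1e6 * sqrt(2) ~ 1.414 million words
```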
(b) The probability ranking principle is at the heart of probabilistic models in
information retrieval.
Using your own words, provide a short statement summarising the probability ranking principle. Explain why it is fundamental and needed in probabilistic IR modelling.
Provide two examples where the ranking produced by an IR system is optimal according to the probability ranking principle, but the ranking is not satisfactory from a user’s perspective. Justify your answers.
[4]
(c)
For a particular query q, the multi-grade relevance judgements of all documents are {(d1, 1), (d3, 3), (d6, 4), (d9, 3), (d11, 1), (d31, 2)}, where each tuple represents a document ID and relevance judgement pair, and all other documents are judged as non-relevant. Documents are judged on the scale 0-4 (0: not relevant – 4: highly relevant). An IR system returns the following ranked list with respect to this query (these are all the results it returned for this query, in that order):
d31, d22, d3, d6, d15
Assume that 1/log2 (rank) is used as the discount factor. You might wish to note that: log2 2 = 1; log2 3 = 1.59; log2 4 = 2; log2 5 = 2.32; log2 6 = 2.59.
(i) Compute the Normalised Discounted Cumulative Gain (NDCG) at each rank position for the given query. In your answer, provide the ideal DCG values for the perfect ranking for the given query. Show your calculations.
[4]
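The NDCG computation with the stated discount can be sketched as below (illustrative; the code uses exact log2 values rather than the rounded ones given in the hint):

```python
import math

def dcg_at_each_rank(gains):
    """DCG with the stated discount: gain at rank 1 undiscounted, gain at
    rank r >= 2 divided by log2(r).  Returns DCG at every rank cutoff."""
    running, out = 0.0, []
    for rank, g in enumerate(gains, start=1):
        running += g if rank == 1 else g / math.log2(rank)
        out.append(running)
    return out

judgements = {"d1": 1, "d3": 3, "d6": 4, "d9": 3, "d11": 1, "d31": 2}
ranking = ["d31", "d22", "d3", "d6", "d15"]

gains = [judgements.get(d, 0) for d in ranking]        # [2, 0, 3, 4, 0]
ideal = sorted(judgements.values(), reverse=True)[:5]  # [4, 3, 3, 2, 1]
ndcg = [d / i for d, i in zip(dcg_at_each_rank(gains), dcg_at_each_rank(ideal))]
# NDCG@1 = 2/4 = 0.5
```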
(ii) The DCG scores are normalised in NDCG. Explain why this normalisation
is needed. Why is such a normalisation not needed in MAP?
[3]
(d) URL length has been shown to be an important feature for some Web search tasks. Briefly discuss the types of information needs on the Web for which the URL length feature is most appropriate.
Consider a linear learning to rank model for Web search using 4 features: PL2, Proximity, PageRank and URL length. Using such a model, explain the main disadvantage of using linear learning to rank models in Web search.
[5]
(e) A posting list for a term in an inverted index contains the following three entries:
[id=4, tf=3], [id=8, tf=3], [id=10, tf=4]
Applying delta compression to the ids, show the binary form of the unary-compressed posting list. What is the resulting (mean) compression rate, in bits per integer?
[4]
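The encoding can be sketched as follows. Note the unary convention assumed here, (n-1) '1' bits followed by a '0' terminator, is one common choice; some texts use the complementary bit pattern, which yields the same code lengths:

```python
def unary(n):
    """Unary code for a positive integer n: (n-1) '1' bits followed by a
    '0' terminator (assumed convention; code length is n bits either way)."""
    return "1" * (n - 1) + "0"

postings = [(4, 3), (8, 3), (10, 4)]       # (docid, tf)

prev, parts = 0, []
for docid, tf in postings:
    parts.append(unary(docid - prev))      # delta-encoded id gaps: 4, 4, 2
    parts.append(unary(tf))                # tfs: 3, 3, 4
    prev = docid

encoded = "".join(parts)                   # 1110 110 1110 110 10 1110
bits_per_int = len(encoded) / (2 * len(postings))   # 20 bits / 6 ints ~ 3.33
```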
(f) You are working on a search engine, and you noticed that a document entitled "City of York's (England) new COVID restrictions" is retrieved when a user searches for "New York City COVID" (in this case, the user is looking for information about COVID in the US city).
Assuming documents suitable for the user's information need exist in the corpus, describe and explain how this problem could be addressed:
(i) without using machine learning,
[2]
(ii) using traditional learning to rank, and
[2]
(iii) using neural networks.
[3]