
INFS7410 Project - Part 2

version 1.1

Preamble

The due date for this assignment is 27 October 2023, 16:00 Eastern Australia Standard Time.

This part of the project is worth 20% of the overall mark for INFS7410 (part 1 + part 2 = 40%). A detailed marking sheet for this assignment is provided alongside this notebook. The project is to be completed individually.

We recommend that you make an early start on this assignment and proceed in steps. There are several activities you may have already tackled, including setting up the pipeline, manipulating the queries, implementing some retrieval functions, and performing evaluation and analysis. Most of the assignment relies on knowledge and code you have already encountered in the computer practicals; however, there are some hidden challenges here and there that may take some time to solve.

Aim

Project aim: The aim of this project is for you to implement several neural information retrieval methods, evaluate them and compare them in the context of a multi-stage ranking pipeline.

The specific objectives of Part 2 are to:

   Set up your infrastructure to index the collection and evaluate queries.

   Implement neural information retrieval models (inference only).

   Examine your ability to perform evaluation and analysis when different neural models are used.

The Information Retrieval Task: Web Passage Ranking

As in part 1 of the project, in part 2 we consider the problem of open-domain passage ranking in answer to web queries. In this context, users pose queries to the search engine and expect answers in the form of a ranked list of passages (a maximum of 1000 passages is to be retrieved).

The provided queries are actual queries submitted to the Microsoft Bing search engine. There are approximately 8.8 million passages in the collection, and the goal is to rank them based on their relevance to the queries.

What we provide you with:

Files from practical

   A collection of 8.8 million text passages extracted from web pages ( collection.tsv — provided in Week 1).

   A PyTorch checkpoint for the ANCE model (refer to  week10-prac ).

   The standard DPR model; use  BertModel.from_pretrained("ielabgroup/StandardBERT-DR").eval()  to load it.

Extra files for this project

   A query dev file that contains 30 queries for you to perform retrieval experiments with ( data/dev_queries.tsv ).

   A query dev file that contains the same 30 queries (same query ids as the previous file), but with typos in the query text ( data/dev_typo_queries.tsv ).

   A qrel file that contains relevance judgements, which can be used to tune your methods on the dev queries ( data/dev.qrels ).

   A leaderboard system for you to evaluate how well your system performs.

   A test query file that contains 60 queries for you to generate run files to submit to the leaderboard ( data/test_queries.tsv ).

   This Jupyter notebook, in which you will include your implementation, evaluation and report.

   An hdf5 file that contains pre-computed TILDEv2 term weights for the collection. Download it from this link.

   The typo-aware DPR model; use  BertModel.from_pretrained("ielabgroup/StandardBERT-DR-aug").eval()  to load it (a short loading sketch for both DPR checkpoints is given below).

Put this notebook and the provided files under the same directory.
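
For reference, loading the two provided DPR checkpoints looks like the minimal sketch below. The checkpoint names come from the list above; pairing them with the bert-base-uncased tokenizer and moving the models to a GPU are assumptions for illustration.

import torch
from transformers import BertModel, BertTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Tokenizer choice is an assumption: a plain BERT tokenizer, as in the TILDEv2 example later in this notebook.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Standard DPR model (provided checkpoint).
standard_dpr = BertModel.from_pretrained("ielabgroup/StandardBERT-DR").eval().to(device)

# Typo-aware DPR model (provided checkpoint).
typo_dpr = BertModel.from_pretrained("ielabgroup/StandardBERT-DR-aug").eval().to(device)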

What you need to produce

You need to produce:

   Correct implementations of the methods required by this project's specifications.

   An explanation of the retrieval methods used, including the formulas that represent the models you implemented and the code that implements those formulas, an explanation of the evaluation settings followed, and a discussion of the findings. Please refer to the marking sheet to understand how each of these requirements is graded.

You are required to produce both of these within this Jupyter notebook.

Required methods to implement

In Part 2 of the project, you are required to implement the following retrieval methods as two-stage ranking pipelines (BM25 + one dense retriever). All implementations should be based on your own code, except for BM25, where you can use the Pyserini built-in SimpleSearcher. A sketch of the overall two-stage pipeline shape is given at the end of this section.

1. ANCE Dense Retriever: Use ANCE to re-rank BM25 top-k documents. See the practical in  Week 10  for background information.

2. Standard DPR Dense Retriever: Use standard DPR to re-rank BM25 top-k documents. See the practical in  Week 10  for background information.

3. Typo-aware DPR Dense Retriever: the typo-aware DPR is a DPR model fine-tuned with typo-augmented training samples. Use this model (provided with the project) to re-rank BM25 top-k documents; inference is the same as for the standard DPR Dense Retriever.

4. TILDEv2: Use TILDEv2 to re-rank BM25 top-k documents. See the practical in  Week 10  for background information.

For TILDEv2, unlike what you did in the practical, we provide pre-computed term weights for the whole collection (for more details, see the  Initial packages and functions  cell). This makes TILDEv2 re-ranking fast; use this advantage when trading off effectiveness and efficiency in your ranking pipeline implementation.

You should have already attempted many of these implementations above as part of the computer prac exercises.
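
To make the expected structure concrete, here is a minimal sketch of one such two-stage pipeline: BM25 candidate retrieval with Pyserini's SimpleSearcher followed by a DPR-style dot-product re-ranking step. The index path, the cut-off k, the tokenizer, and the use of the [CLS] vector as the dense representation are assumptions for illustration; follow what you used in the practicals for the actual models. The per-query timing shown is one way to record the average query latency required in the evaluation section below.

import time
import torch
from pyserini.search import SimpleSearcher
from transformers import BertModel, BertTokenizer

searcher = SimpleSearcher("indexes/msmarco-passage")  # assumed index path: point this at your own index
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed tokenizer
model = BertModel.from_pretrained("ielabgroup/StandardBERT-DR").eval()

def encode(text):
    # Encode text and take the [CLS] vector as its dense representation (pooling choice is an assumption).
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

def two_stage_rank(query, k=1000):
    start = time.time()
    hits = searcher.search(query, k=k)  # stage 1: BM25 top-k candidates
    q_emb = encode(query)
    scored = []
    for hit in hits:
        # Raw stored text of the candidate; depending on how the index was built this may be JSON to parse.
        passage = searcher.doc(hit.docid).raw()
        p_emb = encode(passage)
        scored.append((hit.docid, torch.dot(q_emb, p_emb).item()))  # stage 2: dot-product re-ranking
    scored.sort(key=lambda x: x[1], reverse=True)
    latency = time.time() - start  # per-query latency, for the efficiency column of your results table
    return scored, latency

The same skeleton applies to ANCE (with its own checkpoint and encoding from the Week 10 practical) and to TILDEv2, where the passage-side computation is replaced by a lookup into the pre-computed weights described in the Initial packages and functions section.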

Required evaluation to perform

In Part 2 of the project, you are required to perform the following evaluation. We consider two types of queries: one set contains typos (i.e. typographical mistakes, like writing  iformation  for  information ), and the other has the typos resolved. An important aspect of the evaluation in this project is to compare the retrieval behaviour of search methods on queries with and without typos (note this is the same as in project part 1).

1. For all methods, evaluate their performance on  data/dev_typo_queries.tsv  (queries with typos) and  data/dev_queries.tsv  (the same queries, but with the typos corrected), using  data/dev.qrels  and the four evaluation metrics listed below.

2. Report every method's effectiveness and efficiency (average query latency) on  data/dev_queries.tsv  (no need for the typo queries), together with the corresponding cut-off k used for re-ranking, in a table. Perform statistical significance analysis across the results of the methods and report it in the table.

3. Produce a gain-loss plot that compares the most and the least effective of the four required methods above in terms of nDCG@10 on  data/dev_typo_queries.tsv  (a plotting sketch is given after this list).

4. Comment on the trends and differences observed when comparing your findings. In particular:

   Does the typo-aware DPR model outperform the others on the  data/dev_typo_queries.tsv  queries?

   When evaluating the  data/dev_queries.tsv queries, is there any indication that this model loses its effectiveness?

   Is this gain/loss statistically significant? (Remember to perform a t-test for this task as well.)

5. (Optional) Submit runs on  data/test_queries.tsv , generated with the methods you implemented on the dev sets, to the leaderboard system. This is not counted in your mark for this assignment, but the top-ranked student on the leaderboard may request a recommendation letter from Professor Guido Zuccon. The submission link is https://infs7410.uqcloud.net/leaderboard/; for other instructions, refer to Project 1.
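
For the gain-loss plot in item 3, the following sketch may help. It assumes you already have two flat dictionaries mapping query ids to nDCG@10 values for the two methods being compared (for example, extracted from per-query evaluation output), and it plots the per-query differences sorted from largest gain to largest loss.

import matplotlib.pyplot as plt

def gain_loss_plot(ndcg_a, ndcg_b, label_a="method A", label_b="method B"):
    # ndcg_a / ndcg_b: dict mapping query id -> nDCG@10 for the two methods being compared.
    qids = sorted(set(ndcg_a) & set(ndcg_b))
    diffs = sorted((ndcg_a[q] - ndcg_b[q] for q in qids), reverse=True)
    plt.bar(range(len(diffs)), diffs)
    plt.axhline(0, color="black", linewidth=0.8)
    plt.xlabel("queries (sorted by difference)")
    plt.ylabel(f"nDCG@10 ({label_a} minus {label_b})")
    plt.title("Gain-loss plot")
    plt.show()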

Regarding evaluation measures, evaluate the retrieval methods with respect to nDCG at 10 ( ndcg_cut_10 ), reciprocal rank at 1000 ( recip_rank ), MAP ( map ) and recall at 1000 ( recall_1000 ).
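
One way to compute these measures inside the notebook is with the pytrec_eval package, as in the sketch below; the metric names match the ones listed above, while the nested-dictionary formats for the qrels and the run are assumptions about how you hold them in memory (running trec_eval on a run file is an equally valid route).

import pytrec_eval

def evaluate_run(qrels, run):
    # qrels: {qid: {docid: relevance (int)}}, parsed from data/dev.qrels
    # run:   {qid: {docid: score (float)}}, produced by one of your pipelines
    metrics = {"ndcg_cut_10", "recip_rank", "map", "recall_1000"}
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, metrics)
    per_query = evaluator.evaluate(run)  # {qid: {metric: value}}
    averages = {m: sum(q[m] for q in per_query.values()) / len(per_query) for m in metrics}
    return per_query, averages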

For all statistical significance analysis, use a paired t-test and distinguish between p<0.05 and p<0.01.
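
A paired t-test over the per-query scores can be run with scipy, as in this sketch (the per-query dictionaries are assumed to have the shape returned by the evaluation sketch above):

from scipy import stats

def paired_ttest(per_query_a, per_query_b, metric="ndcg_cut_10"):
    # per_query_a / per_query_b: {qid: {metric: value}} for two methods over the same query set.
    qids = sorted(set(per_query_a) & set(per_query_b))
    a = [per_query_a[q][metric] for q in qids]
    b = [per_query_b[q][metric] for q in qids]
    t, p = stats.ttest_rel(a, b)
    # Report the two significance levels required by the project.
    if p < 0.01:
        flag = "significant at p < 0.01"
    elif p < 0.05:
        flag = "significant at p < 0.05"
    else:
        flag = "not significant"
    return t, p, flag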

How to submit

You will have to submit one file:

1. A zip file containing this notebook (.ipynb) and this notebook exported as a PDF report. We should be able to execute your code. Remember to include all your discussion and analysis in this notebook and report, not in a separate file.

   Tip: to print as a PDF, you can first  Save and Export as HTML  in Jupyter and then use the browser's print function to save as a PDF.

2. It needs to be submitted via the link in the INFS7410 Blackboard site by 27 October 2023, 16:00 Eastern Australia Standard Time, unless you have been granted an extension (according to UQ policy) before the due date of the assignment.

Initial packages and functions 

Unlike the week 10 practical, where we computed contextualized term weights with TILDEv2 on the fly, in this project we provide an hdf5 file that contains pre-computed term weights for all the passages in the collection.

First, pip install the h5py library:

In [2]: !pip install h5py

Collecting h5py

Downloading h5py-3.4.0-cp37-cp37m-macosx_10_9_x86_64.whl (2.9 MB)

Collecting cached-property

Using cached cached_property-1.5.2-py2.py3-none-any.whl (7.6 kB)

Requirement already satisfied: numpy>=1.14.5 in /Users/s4416495/anaconda3/envs/infs7410/lib/python3.7/site-packages (from h5py) (1.21.1)

Installing collected packages: cached-property, h5py

Successfully installed cached-property-1.5.2 h5py-3.4.0

The following cell gives you an example of how to use the file to access token weights and their corresponding token ids given a document id.

Note: make sure you have already downloaded the hdf5 file introduced above and placed it in a valid location


In [18]: import h5py
from transformers import BertTokenizer

f = h5py.File("tildev2_weights.hdf5", 'r')
weights_file = f['documents'][:]  # load the hdf5 file into memory.

docid = 0
token_weights, token_ids = weights_file[docid]

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
for token_id, weight in zip(token_ids.tolist(), token_weights):
    print(f"{tokenizer.decode([token_id])}: {weight}")


presence: 3.62109375

communication: 7.53515625

amid: 5.79296875

scientific: 6.140625

minds: 6.53515625

equally: 3.400390625

important: 6.296875

success: 7.19140625

manhattan: 9.015625

project: 5.45703125

scientific: 5.1640625

intellect: 7.328125

cloud: 6.1171875

hanging: 3.318359375

impressive: 6.5234375

achievement: 6.48828125

atomic: 8.421875

researchers: 4.9375

engineers: 6.203125

what: -1.1708984375

success: 6.421875

truly: 3.67578125

meant: 4.25

hundreds: 3.19140625

thousands: 2.98828125

innocent: 5.12890625

lives: 3.029296875

ob: 2.35546875

##lite: 1.427734375

##rated: 2.828125

importance: 7.96484375

purpose: 4.69140625

quiz: 3.28515625

scientists: 5.0390625

bomb: 3.7109375

genius: 3.8828125

development: 2.55859375

solving: 3.224609375

significance: 3.90625

successful: 5.0703125

intelligence: 5.35546875

solve: 2.751953125

effect: 1.2392578125

objective: 2.2265625

research: 1.953125

_: -2.36328125

accomplish: 2.759765625

brains: 4.046875

progress: 1.6943359375

scientist: 3.0234375

Note: these token_ids include stopword ids; remember to remove stopword ids from your query tokens.
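
To show how these pre-computed weights can drive the TILDEv2 re-ranking stage, here is a minimal scoring sketch. It assumes exact-match scoring as in the week 10 practical: tokenise the query, drop stopword and special-token ids, and sum each remaining query token's weight in the candidate passage, keeping the maximum weight when a token id occurs more than once; check the details against your practical code before relying on it.

def tildev2_score(query_token_ids, docid, weights_file):
    # query_token_ids: BERT token ids of the query, with stopword/special-token ids already removed.
    token_weights, token_ids = weights_file[docid]
    # Keep the best (maximum) weight per token id present in this passage.
    best = {}
    for tid, w in zip(token_ids.tolist(), token_weights):
        if w > best.get(tid, float("-inf")):
            best[tid] = w
    # Exact-match scoring: sum the weights of query token ids that appear in the passage.
    return sum(best.get(tid, 0.0) for tid in query_token_ids)

Because the document-side weights are already computed, this stage reduces to dictionary lookups per candidate, which is what makes a larger BM25 cut-off affordable for TILDEv2 than for the BERT-based re-rankers.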

In [ ]: # Import all your python libraries and put setup code here.

Double-click to edit this markdown cell and describe the first method you are going to implement, e.g., ANCE

In [ ]: # Put your implementation of methods here.

When you have described and provided implementations for each method, include a table with statistical analysis here.

For convenience, you can use a tool like https://www.tablesgenerator.com/markdown_tables to make this easier, or, if you are using pandas, convert your dataframes to markdown with https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_markdown.html