

# INFS7410 Project - Part 1

 

_version 1.0_

 

### Preamble

 

The due date for this assignment is **9 September 2022 16:00 Eastern Australia Standard Time**.

 

This part of the project is worth 20% of the overall mark for INFS7410 (part 1 + part 2 = 40%). A detailed marking sheet for this assignment is provided alongside this notebook. The project is to be completed individually.

 

We recommend that you make an early start on this assignment and proceed in steps. There are a number of activities you may have already tackled, including setting up the pipeline, manipulating the queries, implementing some retrieval functions, and performing evaluation and analysis. Most of the assignment relies on knowledge and code you should already have encountered in the computer practicals; however, there are some hidden challenges here and there that may require some time to solve.

 

### Aim

 

Project aim: to implement a number of classical information retrieval methods, and to evaluate and compare them in the context of a real use case.

 

##### Project Part 1 aim

 

The aim of Part 1 is to:

 

* Set up your infrastructure to index the collection and evaluate queries.

* Implement common information retrieval baselines.

* Tune your retrieval implementations to improve their effectiveness.

* Implement rank fusion methods.

 

### The Information Retrieval Task: Web Passage Ranking

 

In this project we will consider the problem of open-domain passage ranking in response to web queries. In this context, users pose queries to the search engine and expect answers in the form of a ranked list of passages (a maximum of 1000 passages to be retrieved).

 

The provided queries are real queries submitted to the Microsoft Bing search engine. In the collection, there are approximately 8.8 million passages and the goal is to rank them based on their relevance to the queries.

 

 

### What we provide you with:

 

#### Files from practical

* A collection of 8.8 million text passages extracted from web pages (`collection.tsv`— provided in Week 1).

* A query file that contains 43 queries for you to perform retrieval experiments (`queries.tsv`— provided in Week 2).

* A qrel file that contains relevance judgements for you to tune your methods (`qrels.txt`— provided in Week 2).

 

#### Extra files for this project

* A leaderboard system for you to evaluate how well your system performs.

* A test query file that contains 54 queries for you to generate run files to submit to the leaderboard (`test_queries.tsv`).

* This Jupyter notebook, in which you will include your implementation and report.

 

Put this notebook and the provided files in the same directory.

 

#### What you need to produce

 

You need to produce:

 

* Correct implementations of the methods required by this project's specification.

* An explanation of the retrieval methods used, including the formulas that represent the models you implemented and the code that implements those formulas, an explanation of the evaluation settings followed, and a discussion of the findings.

 

You are required to produce both of these within this Jupyter notebook.

 

#### Required methods to implement

 

In Part 1 of the project you are required to implement the following retrieval methods. All implementations should be based on your own code.

 

1. BM25: create your own implementation; do not use the Pyserini API's implementation. See the videos in Week 3 for background information (a reference formulation is also given after this list).

2. Pseudo-relevance feedback using BM25 for query expansion: create your own implementation using the Pyserini API to extract index statistics. See Week 5 practical for background information.

3. IDF-r query reduction: create your own implementation using the Pyserini API to extract index statistics. See Week 5 practical for background information.

4. The rank fusion method Borda; you need to create your own implementation of this. See Week 4 practical for background information.

5. The rank fusion method CombSUM; you need to create your own implementation of this. See Week 4 practical for background information.

6. The rank fusion method CombMNZ; you need to create your own implementation of this. See Week 4 practical for background information.
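
For reference, a common formulation of BM25 is given below (the exact IDF variant presented in the Week 3 videos may differ slightly; follow the lecture version in your implementation):

$$
\mathrm{BM25}(q, d) = \sum_{t \in q} \log\!\left(\frac{N - df_t + 0.5}{df_t + 0.5} + 1\right) \cdot \frac{tf_{t,d}\,(k_1 + 1)}{tf_{t,d} + k_1\left(1 - b + b\,\frac{|d|}{avgdl}\right)}
$$

where $N$ is the number of passages in the collection, $df_t$ is the document frequency of term $t$, $tf_{t,d}$ is the frequency of $t$ in passage $d$, $|d|$ is the passage length, $avgdl$ is the average passage length, and $k_1$ and $b$ are the tunable parameters.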

 

 

For your BM25, query expansion and query reduction implementations, you are also required to tune the parameters of these methods. You must perform a parameter search over at least 5 sensibly chosen parameter values depending on the method (10 when the method has two parameters).
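
As a rough illustration of the tuning loop only (not a prescribed experimental design), a grid search over BM25's $k_1$ and $b$ might look like the sketch below; `bm25`, `run_all_queries` and `mean_ndcg10` are hypothetical helpers standing in for your own scoring, retrieval and evaluation code.

```python
from itertools import product

# Candidate values are assumptions; choose your own sensible ranges.
k1_values = [0.6, 0.9, 1.2, 1.5, 2.0]
b_values = [0.4, 0.75]

best_params, best_score = None, float("-inf")
for k1, b in product(k1_values, b_values):
    # run_all_queries / mean_ndcg10 / bm25 are hypothetical placeholders.
    run = run_all_queries(queries, scorer=lambda tf, df, doc_len: bm25(tf, df, doc_len, k1=k1, b=b))
    score = mean_ndcg10(run, qrels)
    if score > best_score:
        best_params, best_score = (k1, b), score

print("Best (k1, b):", best_params, "mean nDCG@10:", best_score)
```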

 

For the rank fusion methods, consider fusing the highest performing tuned run from each of the BM25, query expansion and query reduction implementations.
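
For illustration, the general shape of a score-based fusion method is sketched below (CombSUM with min-max normalisation, which is only one possible design choice; Borda and CombMNZ follow the same overall pattern but combine ranks, or multiply by the number of runs a document appears in). Each input run is assumed to be a list of `(docid, score)` tuples, as returned by the `search` function defined later in this notebook.

```python
def min_max_normalise(run):
    # run: list of (docid, score) tuples for one query, best first.
    scores = [score for _, score in run]
    lo, hi = min(scores), max(scores)
    return {docid: (score - lo) / (hi - lo) if hi > lo else 0.0 for docid, score in run}

def combsum(runs, k=1000):
    # Sum the normalised score a document receives across all input runs.
    fused = {}
    for run in runs:
        for docid, score in min_max_normalise(run).items():
            fused[docid] = fused.get(docid, 0.0) + score
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:k]
```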

 

You should have already attempted many of these implementations as part of the computer prac exercises.

 

#### Required evaluation to perform

 

In Part 1 of the project you are required to perform the following evaluation:

 

1. For all methods, tune using `queries.tsv` and `qrels.txt` (i.e., use this data to tune any parameter of a retrieval model, e.g. $b$ and $k_1$ for BM25), and submit your runs on `test_queries.tsv` to the leaderboard system, using the parameter values you selected from `queries.tsv`.

2. Report the results of every method on `queries.tsv` (only for the run from which you selected the tuned parameters) separately, in a table. Perform statistical significance analysis across the results of the methods and report it in the tables.

3. Produce a gain-loss plot that compares BM25 vs. pseudo-relevance feedback query expansion using BM25, and plots that compare BM25 vs. each rank fusion method, on the dev set.

4. Comment on trends and differences observed when comparing your findings. Is there a method that consistently outperforms the others on `queries.tsv` and `test_queries.tsv`?

5. Provide insights into whether rank fusion works or not, e.g., with respect to the runs considered in the fusion process, the queries, etc.

 

In terms of evaluation measures, evaluate the retrieval methods with respect to nDCG at 10 (`ndcg_cut_10`). You should use this measure as the target measure for tuning. Also compute reciprocal rank at 1000 (`recip_rank`),  MAP (`map`) and Recall at 1000 (`recall_1000`).
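
One possible way to compute these measures is with the `pytrec_eval` package, as sketched below. This assumes `qrels.txt` is in the standard TREC qrels format (`qid 0 docid relevance`) and that your run has been saved as a standard 6-column TREC run file (the file name here is hypothetical); adjust the parsing to match your actual files.

```python
import pytrec_eval

def load_qrels(path):
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def load_run(path):
    run = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, rank, score, _tag = line.split()
            run.setdefault(qid, {})[docid] = float(score)
    return run

qrels = load_qrels('qrels.txt')
run = load_run('my_bm25_run.txt')  # hypothetical run file name

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels, {'ndcg_cut', 'recip_rank', 'map', 'recall'})
per_query_results = evaluator.evaluate(run)

# Average the per-query scores for the measures required in this project.
for measure in ['ndcg_cut_10', 'recip_rank', 'map', 'recall_1000']:
    mean = sum(q[measure] for q in per_query_results.values()) / len(per_query_results)
    print(f'{measure}: {mean:.4f}')
```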

 

 

Produce all gain-loss plots with respect to nDCG at 10.
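
One common way to draw a gain-loss plot is to sort the per-query differences in nDCG@10 and show them as bars, as in the sketch below; `per_query_a` and `per_query_b` are hypothetical dictionaries mapping each query id to its nDCG@10 score for the two runs being compared (for example, extracted from the per-query `pytrec_eval` output above).

```python
import matplotlib.pyplot as plt

def gain_loss_plot(per_query_a, per_query_b, label_a="BM25", label_b="Other"):
    # Plot the per-query nDCG@10 difference (label_b minus label_a),
    # sorted from the largest gain to the largest loss.
    qids = sorted(set(per_query_a) & set(per_query_b))
    diffs = sorted((per_query_b[q] - per_query_a[q] for q in qids), reverse=True)
    plt.bar(range(len(diffs)), diffs)
    plt.axhline(0, color="black", linewidth=0.8)
    plt.xlabel("Queries (sorted by difference)")
    plt.ylabel(f"nDCG@10 difference ({label_b} - {label_a})")
    plt.title(f"Gain-loss: {label_b} vs. {label_a}")
    plt.show()
```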

 

For all statistical significance analyses, use the paired t-test; distinguish between $p<0.05$ and $p<0.01$.
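
A paired t-test over per-query scores can be run with `scipy.stats.ttest_rel`, as in the sketch below; `per_query_a` and `per_query_b` are the same hypothetical per-query nDCG@10 dictionaries as in the plot sketch above.

```python
from scipy import stats

qids = sorted(set(per_query_a) & set(per_query_b))
scores_a = [per_query_a[q] for q in qids]
scores_b = [per_query_b[q] for q in qids]

# Paired t-test over the per-query scores of the two systems.
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
if p_value < 0.01:
    significance = "significant at p < 0.01"
elif p_value < 0.05:
    significance = "significant at p < 0.05"
else:
    significance = "not significant"
print(f"t = {t_stat:.3f}, p = {p_value:.4f} ({significance})")
```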

 

#### How to submit

 

You will have to submit one file:

 

1. A zip file containing this notebook (.ipynb) and this notebook **as a PDF document**. We should be able to execute the code. Remember to include all your discussion and analysis in this notebook, and not in a separate file.

 

It needs to be submitted via the relevant Turnitin link on the INFS7410 Blackboard site by **9 September 2022, 16:00 Eastern Australia Standard Time**, unless you have been granted an extension (according to UQ policy) *before* the due date of the assignment.

 

---

 

### Leaderboard Challenge

 

As part of this project, we present the leaderboard challenge, where you can submit your best runs and compare them with those of others. You can submit multiple times throughout the project.

 

For models, you can use any retrieval, re-ranking or fusion method (including those you implement for the pracs and for the assessment), and you can even combine multiple models together; you are not limited to the models covered in this course. You are allowed to be creative and come up with your own retrieval or re-ranking models.

 

Although multiple run files can be submitted, we recommend that you first run your models on the queries provided in the pracs to check that your model is valid. Then run them on the test queries for your project and submit the resulting run file to the leaderboard.

 

__The submission link is:__ https://infs7410.uqcloud.net/leaderboard/

 

After opening the link, you will see the main page as below:

 

![main](images/main.png)

 

The submitted runs will appear here. You can see we have already submitted a sample run named `pyserini_BM25`. The name of a run is shown in the format `student name(runtag)`, so make sure you name your run correctly in the run file's 6th column. If you use the same run tag for all the runs you submit, later runs will overwrite earlier ones, and you will only have one run at the end. To submit different runs, make sure you use different run tags.
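
For reference, a run file in the standard 6-column TREC format (`qid Q0 docid rank score runtag`) can be written with a small helper like the sketch below; `all_results` is a hypothetical dictionary mapping each query id to its ranked list of `(docid, score)` tuples, e.g. as produced by the `search` function defined later in this notebook.

```python
def write_run(all_results, path, run_tag):
    # Write results in TREC run format; the 6th column is the run tag
    # that will be displayed on the leaderboard.
    with open(path, "w") as f:
        for qid, results in all_results.items():
            for rank, (docid, score) in enumerate(results, start=1):
                f.write(f"{qid} Q0 {docid} {rank} {score} {run_tag}\n")

# Example usage (hypothetical file name and tag):
# write_run(all_results, "my_bm25_run.txt", "my_bm25")
```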

 

If you want to make a submission, type the password into the textbox and hit enter (__the password is your student number__). This will take you to your private submission page, shown below:

 

![submit](images/submit.png)

 

On this page, choose the run file you wish to upload by clicking the `Browse` button, then click `Submit` to submit the run.

 

We highly encourage you to submit your best runs to the leaderboard, so you will know how your model performs compared to your classmates' models.

 

We will also give prizes to the students with the top 3 runs. More specifically:

 

Leaderboard winners:

* A certificate of award to the students with the best 3 runs (the winner plus 2 runners-up) on the leaderboard (test queries) by the time the assessment closes (end of semester).

* A +2% final mark to the student whose run has the highest effectiveness (`ndcg_cut_10`), and a +1% final mark to the student whose run has the second highest effectiveness (`ndcg_cut_10`) on the leaderboard by the end of the semester.

* A/Prof. Guido can provide recommendation letters to the students with the top 3 runs.

 

This leaderboard will be running throughout the semester.

 

Have fun!

## Initialise packages and functions

----

In the following cell, you will need to decide which index to use, which stemming algorithm to apply, and whether or not to keep stopwords. You may want to try out different indexes; if you don't have one, follow the Week 1 prac to create one (remember to add `-storeDocvectors`).

stemming = None    # None or 'porter' or anything else

stopwords = False  # False or True

index = 'indexes/lucene-index-msmarco-passage-xxxx/'

Run the following cell to import some useful packages and to cache some statistics that you will use later.

from pyserini.search import SimpleSearcher

from pyserini.analysis import Analyzer, get_lucene_analyzer

from pyserini.index import IndexReader

from tqdm import tqdm

 

lucene_analyzer = get_lucene_analyzer(stemming=stemming, stopwords=stopwords)

analyzer = Analyzer(lucene_analyzer)

searcher = SimpleSearcher(index)

searcher.set_analyzer(lucene_analyzer)

index_reader = IndexReader(index)

 

 

# Create a document frequency dictionary to speed up scoring later; this will take around 2 minutes.

df_dict = {}

for term in tqdm(index_reader.terms(), desc="loading idf dictionary:"):

    df_dict[term.term] = term.df

 

# Cache document lengths and docids for the collection; this will take around 2 minutes.

doc_len_dict = {}

doc_id_dict = {}

with open ('collection/collection.tsv', 'r') as f:

    lines = f.readlines()

    for line in tqdm(lines, desc="loading doc_length dictionary:"):

        docid, text = line.split('\t')

        doc_len_dict[docid] = len(text.split())

        internal_id = index_reader.convert_collection_docid_to_internal_docid(docid)

        doc_id_dict[internal_id] = docid

Understand and run the following cell to define the search function.

 

**NOTE**: This search function is different from the one in the Week 3 prac. When you implement methods yourself, make sure to use this search function, which we implemented by iterating over posting lists; do not use the search function from the Week 3 prac, as that one only re-ranks BM25 results.

def search(query: str, k: int=1000, scorer=None):

    """

    Inputs:

        query (str): the query string to perform the search.

        k (int): the number of documents to be returned.

        scorer: your implemented scoring function, such as bm25.

    

    Output:

        results (list): the sorted result list, a list of tuples.

                        The first element in the tuples is the docid,

                        the second is the doc score.

    """

    

    assert scorer is not None

    print("-----------------------------------------------------")

    print("Current query:", query)

 

    # get the analyzed term list

    q_terms = analyzer.analyze(query)

    doc_scores = {}

    for term in q_terms:

        # get the posting list for the current term

        postings_list = index_reader.get_postings_list(term, analyzer=None)

        if postings_list is not None:

            # get the document frequency of the current term

            df = df_dict[term]

 

            # iterate the posting list

            for posting in tqdm(postings_list, desc=f"Iterate posting for term '{term}'"):

                internal_id = posting.docid

                # convert pyserini internal docid to the actual docid

                docid = doc_id_dict[internal_id]

                tf = posting.tf

                doc_len = doc_len_dict[docid]

 

                # Call the scoring function (you will implement these below).

                score = scorer(tf, df, doc_len)

                if docid in doc_scores:

                    doc_scores[docid] += score

                else:

                    doc_scores[docid] = score

 

    # Sort the results by the score.

    results = list(doc_scores.items())

    results = sorted(results, key=lambda x: x[1], reverse=True)[:k]

    print("-----------------------------------------------------")

    return results

After this line, feel free to edit this notebook however you like.

You can use the following cells as a template to guide you in completing this project.

 

---

# Import all your python libraries and put setup code here.

Double-click to edit this markdown cell and describe the first method you are going to implement, e.g., BM25

# Put your implementation of BM25 here, including parameter tuning.

When you have described and provided implementations for each method, include a table with statistical analysis here.

 

For convenience, you can use tools like this one to make it easier: https://www.tablesgenerator.com/markdown_tables, or if you are using pandas, you can convert dataframes to markdown https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_markdown.html
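
For example, a results table can be assembled with pandas and converted to markdown as in the sketch below (the column names and zero placeholder values are illustrative only; `DataFrame.to_markdown` requires the `tabulate` package).

```python
import pandas as pd

# Placeholder values for illustration; fill in your actual evaluation results.
results_table = pd.DataFrame([
    {"Method": "BM25",         "nDCG@10": 0.0, "RR": 0.0, "MAP": 0.0, "Recall@1000": 0.0},
    {"Method": "BM25 + PRF",   "nDCG@10": 0.0, "RR": 0.0, "MAP": 0.0, "Recall@1000": 0.0},
    {"Method": "IDF-r + BM25", "nDCG@10": 0.0, "RR": 0.0, "MAP": 0.0, "Recall@1000": 0.0},
])
print(results_table.to_markdown(index=False))
```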

Then you can edit this cell to describe the gain-loss plots you will create below.

# Put your implementations for the gain-loss plots here.

Then you can edit this cell to provide some analysis of whether there is a method that consistently outperforms the others on the dev set and the test set.

Then you can edit this cell to provide insights into whether or not rank fusion works.