Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

COMP6714 2023T3 Project: Ranked Retrieval

In this project,you are going to implement (using Python3 in CSE linux machines)a simple search engine that ranks the output documents based on the promixity of the matching     terms.A search query in this project is a list of space-separated search terms and each

search term may contain any numeric digits or uppercase/lowercase letters,and will not    contain any punctuations.You will need to implement an indexer (to index the files)and a search program(to search the files based on the index generated by your indexer).As a     core requirement for this project,you must implement your solution using an inverted

index with positional information (for example,the positional index described in Lecture Week 1).You may also implement any additional indexes as appropriate,if you wish.

Given a search query,each matching document from the search result must contains terms that match all the search terms following the below term matching rules:

·Search is case insensitive.

·Full stops for abbreviations are ignored.e.g..U.S.,US are the same.

·Singular/Plural  is  ignored.e.g.,cat,cats,cat's,cats'are  all  the  same.

·Tense  is  ignored.e.g.,breaches,breach,breached,breaching  are  all  the  same.

·A sentence can only end with a full stop,a question mark,or an exclamation mark.

·Numeric tokens such as years,integers should be indexed accordingly and searchable. ·Commas in numeric tokens are ignored,e.g.,1,000,000 and 1000000 are the same.

·Except the above,all other punctuation should be treated as token dividers.

·Numbers with decimal places can be ignored from the index,if you wish,as a decimal number is not a valid search term (since'.is not allowed).

As a requirement of this project,the matching documents in a search result are ranked

according to the distances between the matching terms in these documents,such that

matching terms closer to each other will be ranked higher than those further apart.Further details are described below in the Ranking section.

You are provided with approximately 1000 small documents (named with their document

IDs)available in ~cs6714/reuters/data. You can find these files by logging into CSE

machines and going to folder ~cs6714/reuters/data. Your submitted project will be tested

against a similar collection of up to 1000 documents (i.e.,we may replace some of these documents to avoid any hard-coded solutions).

Your submission must include 2 main programs:index.py and search.py as described

below.It is your responsibility to submit any other auxiliary Python files if they are needed for the 2 main programs to work properly.

This project will be marked based on auto marking,and will then be checked manually for other requirements described in this specification (for example,if a positional index is

implemented).To ensure your project satisfies the input and output formatting

requirements,a simple sanity test script is available in ~cs6714/reuters/sanity.You should  run the sanity test script before you submit your solution.To run the sanity test script on a

CSE linux machine,simply go inside the folder that contains your index,py and search.py

and type:~cs6714/reuters/sanity that will run tests based on examples presented below.

Note that it is only a sanity test primarily for formatting,and you are expected to test your project more thoroughly.

The Indexer

Your indexer is run by

python3              index.py               [folder-of-documents][folder-of-indexes]

where [folder-of-documents]is the path to the directory for the collection of documents to be indexed and [folder-of-indexes]is the path to the directory where the index file(s)

should be created.All the files in [folder-of-documents]should be opened as read-only,as you may not have the write permission for these files.If [folder-of-indexes]does not exist,   create a new directory as specified.You may create multiple index files although too many   index files may slowdown your performance.The total size of all your generated index files shall not exceed 20MB (which should be plenty for this project).

After the indexing is completed,it will output the total number of documents,the total number of tokens (after any preprocessing and filtering)to be indexed,and the total

number of terms to be indexed.The following example illustrates the required input and

output formats:

$python3            index.py             ~cs6714/reuters/data./MyTestIndex

Total    number    of   documents:  1000

Total     number     of     tokens:260759

Total   number    of   terms:  9509

$

Note:Each line of the output of index py ends  with one newline (ln)character;and except for the total number of documents,your indexer may have different numbers of tokens

and terms depending on your preprocessing choices.

The Search

Your search program is run by

python3         search.py          [folder-of-indexes]

where [folder-of-indexes]is the path to the directory containing the index file(s)that are

generated by the indexer.After the above command is executed,it will accept a search

query from the standard input and output the result to the standard output as a sequence   of document names (the same as their document IDs),one per line and sorted according to their ranking as decribed in the Ranking section below.It will then continue to accept the     search queries from the standard input and output the results to the standard output until   the end (i.e.,a Ctrl-D).The following example illustrates the required input and output

formats:

$python3

Apple

1361

Malaysia 311

search.py        ~/Proj/MyTestIndex

1356

1675

2956

3051

3438

5169

5195

5216

5258

5285

5382

Australia  Technology

3454

271 billions

880

$

Ranking

A proximity distance is defined as the number of terms (ignoring decimal numbers when

counting) between each pair of terms that match two search terms.The search result is   sorted by its minimum sum of the proximity distances between the matching terms (left- to-right by query terms)in an ascending order,then by the number of matching terms

occurring in the same order as the search terms,then by the numeric values of the

documentIDs (e.g..72 will be output before 125).To elaborate this further,consider an example of 4 documents with document IDs 1-4:

DocID

3

4

Content

apple durian cherry bread egg fennel garlic ham

bread garlic ham

egg   bread   cherry   apple   egg   fennel   ham    garlic   bread

ham garlic bread

Given a search query of "garlic bread",all documents contain these two search terms.The

minimum proximity distance between the matching terms for these two terms are calculated as below:

Query:garlic bread

Expected     output:

DocID     Distance

3      0

0

2          0

2

For example,in Document 3,bread appears twice.We pick the second bread to calculate    the proximity distance as it will result in a minimum value since it is next to garlic,i.e.,with  a distance 0.We define the matching term that is used to calculate the minimum proximity

distance as the closest matching term.Although Documents 2,3,4 are all having the

minimum distance of 0,Documents 3 and 4 rank higher than Document 2 since the two

closest matching terms "garlic bread"appear in the same order as the search query,while   Document 2 has a different order (i.e.,"bread garlic").Furthermore,since Documents 3 and 4 have the same minimum distance and the closest matching terms are in the same order,  they are then sorted by their documentlDs.Finally,Document 1 has the largest minimum    distance (2)and hence has the lowest ranking.

Consider another search query "egg ham bread"for the example documents above.Only two documents contain terms matching all three search terms.

Query:egg ham bread

Expected

DocID

output:

Distance

1+1=2

2+3=5

For Document 3,there is more than one term matching egg and bread respectively.The

second egg and the second bread are chosen as the closest matching terms to achieve the

minimum sum of the proximity distances between the 3 matching terms for the 3 search terms.As a result,Document 3 has a smaller minimum distance than Document 1 and

hence it ranks higher.

Using the real documents available in ~cs6714/reuters/data,the following examples

present their expected ranked output from your search.py and you can study their ranking further by inspecting the content of their corresponding documents.

$python3                     search.py                     ~/Proj/MyTestIndex

australia    technology

4a(54)nk   expect   distribution

3077

4367

4019

875

$

Since the input is from the standard input,your search engine should be able to accept

input redirected from a file as shown below:

$cat                       testl.txt

US finance COMPANY investor

$python3                                                               search.py~/Proj/MyTestIndex<testl.txt

5171

3396

3023

5778

1682

$

Regarding the number of matching terms occurring in the same order as the search terms, the following post extracted from the Ed forum provides a good example.Consider two     documents:

1.butter            apple            duck            chicken

2.chicken            butter            apple            duck

and a search query "apple butter chicken duck".

Both documents have 1 pair in matching order.You may consider it in a similar way to the proximity calculation,i.e.pair-wise,so if apple-butter is in correct order,then +1,etc.For   this example:

·butter apple duck chicken:+1 as butter-chicken is correct order

·chicken butter apple duck:+1 as chicken-duck is correct order

Displaying Lines Containing Matching Terms

Given a search query starting with'>'and a space,in addition to the displaying the

matching document IDs,the lines of text that contain the closest matching terms are also displayed according to the following rules:

·Each matching document  ID is displayed with'>'and a space in front,followed by lines of text that containing the closest matching terms.

·For each matching document,only one line of text that contains its corresponding

·    closest matching term is displayed for each search term.

line 1 before line 2.

· In case more than one closest matching term is determined for a search term and

they are at different lines,only the first line of these lines is displayed.This includethe case when a search query only consists of one search term such that matching  terms can be found in multiple lines (refer to the apple example below).

The examples below illustrate some of these rules and the display format.

$python3      search.py      ~/Proj/MyTestIndex

>3077

The bank said it expects the distribution will be made in

>4367

Closing   is    expected   to    take   place    in    early   April    and   the

The partnership will acquire the refining and distribution

facility with U.S.and foreign banks to finance inventories and

>4019

It said it did not know when distributions would be made.

The bank said it expected to report positive earnings in

>875

prepared to negotiate a new distribution based on objective

bank debt,have increased political pressure on the country to

The   expected    drop   in   prices    could   result    in   losses    of   as

>AUStralia Technology

>3454

marketing    of    high-technology     smelting    processes     invented     in

Australia,notably the Siromelt Zinc Fuming Process.

>apple

>1361

The   department    said    stocks    of   fresh    apples    in   cold    storage

$

Marking

This project is worth 40 points. Your submission will be tested and marked on CSE linux

machines using Python3.Therefore,please make sure you have tested your solution on

these machines using Python3 before you submit.You will not receive any marks if your    program does not work on CSE linux machines and only works in other environment such   as your own laptop.Full marks will be awarded to submissions that follow this specification and pass all the test cases.

Although we do not measure the runtime speed,your indexing program will be terminated if it does not end after one minute,and you will receive zero marks for the project (since    we cannot get the index generated successfully for further testing);and your search

program will be terminated if it does not end after 10 seconds per search query,and you will receive zero marks for that search query.

Partial Marks

For this project,a search result from your search engine is only considered correct if it

contains exactly the same set of document names as the expected answer and their order

must be exactly the same as well.We will grant you some partial marks based on F-

measure to evaluate how close your result is to the expected answer (in which any correct  document names that are in wrong order will be treated as wrong documents).F-measure

=2*(precision*recall)/(precision  +recall)will  be  used  to  calculate  the  final  score  for

each test when your search results differ from the expected output.The final score for each test will be calculated by:[F-measure round to 2 decimal places]*[fullMarks of that test].    The following examples further illustrate how this scheme works.Given the expected

answer Ans,suppose that there are 6 results returned from 6 different search programs.

Ans

l

2

3

4

Searchl

Search2

3

Search3

4

3

Search4

2

4

5

Search5

2

4

Search6

2

3

5

6

Their F-measure based on their precision and recall are calculated as below:

Search1-recall:0.75    precision:1.00

fmeasure:0.86

Search2   -recall:0.25   precision:0.33

fmeasure:0.28

Search3   -recall:0.50   precision:0.50

fmeasure:0.50

Search4   -recall:1.00   precision:0.80

fmeasure:0.89

Search5   -recall:  1.00  precision:0.80

fmeasure:0.89

Search6   -recall:0.75   precision:0.60

fmeasure:0.67

If this search query is worth 2 pts,Search6 will get 0.67*2 =1.34 pts.

For search queries starting with'>'(the queries with matching lines displayed),no partial marks will be awarded.In order to get full marks,a search result has to exactly match the output of the expected answer (including the document IDs,the matching lines and their order).Therefore,please check your solution with the provided sanity test before

submitting the project (and do not leave this till last minute).For your information,search  queries starting with'>'will contribute in total less than 10 points (out of 40 points)of this project.

Python3    Libraries

As a requirement of this project,you are not allowed to use any Python libraries other than the Python Standard Library (https://docs.python.org/3.9/library/index.html)    and NLTK.

For NLTK,you may assume the entire collection (using "all")has been downloaded before

the marking process starts.i.e.,you should delete/comment all nltk.download

statements,if any,in your code before your project submission.Note that during your

project development,you may want to download only individual required packages rather than "all",as your CSE account has limited space.Please also set the quiet parameter to

True in case you forgot to remove thenltk.download statements and the download output

affects the auto-marking.For example,

nltk.download('package_name',quiet=True)

Submission

Deadline:Tuesday 7th November 12:00pm AEST (noon).

The penalty for late submission of assignments will be 5%(of the worth of the assignment) subtracted from the raw mark per day of being late.In other words,earned marks will be    lost.No assignments will be accepted later than 5 days after the deadline.

Use the give command below to submit the assignment:

give    cs6714    proj*.py

It is your responsibility to use classrun to check your submission to make sure that you have submitted all the required files for index.py and search.py to run properly.

6714 classrun -check proj

Plagiarism

The work you submit must be your own work.Group submissions will not be allowed.Your program must be entirely your own work.Plagiarism detection software will be used to

compare all submissions pairwise (including submissions for similar assignments in

previous years,if applicable)and serious penalties will be applied,particularly in the case of repeat offences.Your submissions will also be checked against any relevant code available   on the Internet.In particular:

·Do not copy ideas or code from others.

·Do not use a publicly accessible repository or allow anyone to see your code.

Please refer to the Academic Honesty and Plagarism section under Other Useful

Information  in  the  course  outline to help you understand what plagiarism is and how it is dealt with at UNSW.

Finally,reproducing,publishing,posting,distributing  or  translating  this  assignment  is  an    infringement of copyright and will be referred to UNSW Student Conduct and Integrity for action.