Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

COMP6714 2022T3 Project Specification

COMP6714 2022T3 Project: Westlaw alike queries

As presented in the lecture, Westlaw is a popular commercial information retrieval system. You can search for documents by Boolean Terms and Connector queries. For example:

STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM

where STATUTE, ACTION, FEDERAL, TORT, CLAIM are the search terms and  space, /S, /2, /3 are the connectors.

In this project, you are going to implement a retrieval system in Python3 called                   SimplyBoolean, which you have already encountered in the assignment. As a core               requirement for this project, you must implement SimplyBoolean using a positional index.

SimplyBoolean is a retrieval system that supports Westlaw alike queries. It supports the following reduced set of connectors:

" ", space, +n, /n, +s, /s, &

as well as parentheses. Note that the connectors of your query will be processed in exactly the order above. Further details of these connectors can be found in the Quick Reference  Guide of WestLaw available from WebCMS.

Different to Westlaw, SimplyBoolean does not support various forms of search terms, except a normal search term (i.e. single-word, those without " ") and a phrase.

Term matching (including terms in a phrase) in SimplyBoolean follows the below:

Search in SimplyBoolean is case insensitive.

Full stops for abbreviations are ignored. e.g., U.S., US are the same.

Singular/Plural is ignored. e.g., cat, cats, cat's, cats' are all the same.

Tense is ignored. e.g., breaches, breach, breached, breaching are all the same.              A sentence can only end with a full stop, a question mark, or an exclammation mark.   Except the above, all other punctuation should be treated as token dividers.                 All (whole) numeric tokens such as years, decimals, integers are ignored. You should    not index these tokens and hence should not consider them for proximity queries       such as +n. E.g. you should not index '123' (wholly numeric) but should index 'abc123' (partially numeric).

You are provided with approximately 1000 small documents (named with their document IDs) available in ~cs6714/reuters/data. You can find these files by logging into CSE            machines and going to folder ~cs6714/reuters/data. Your submitted project will be tested against a similar collection of up to 1000 documents (i.e., we may replace some of these   documents to avoid any hard-coded solutions).

Your submission must include 2 main programs: index.py and search.py as described below.

The Indexer

$ python3 index.py [folder-of-documents] [folder-of-indexes]

where [folder-of-documents] is the path to the directory for the collection of documents to be indexed and [folder-of-indexes] is the path to the directory containing the index file(s)   be created. All the files in [folder-of-documents] should be opened as read-only, as you      may not have the write permission for these files. If [folder-of-indexes] does not exist,          create a new directory as specified. You may create multiple index files although too many  index files may slow down your performance. The total size of all your index files generated shall not exceed 20MB (which should be plenty for this project).

After the indexing is completed, it will output the total number of documents, the total  number of tokens (after any preprocessing and filtering) to be indexed, and the total      number of terms to be indexed. The following example illustrates the required input and output formats:

$ python3 index.py ~/Desktop/MyDataFolder ./MyTestIndex

Total number of documents : 672

Total number of tokens : 638321

Total number of terms : 13297

Note:the output ofindex.py endswith one newline ('\n') character.

Searching

$ python3 search.py [folder-of-indexes]

where [folder-of-indexes] is the path to the directory containing the index file(s) that are     generated by the indexer. After the above command is executed, it will accept a search        query from the standard input and output the result to the standard output as a sequence  of document names (the same as their document IDs) one per line and sorted in an             ascending order by their numeric values (e.g., 72 will be output before 125). It will then       continue to accept the search queries from the standard input and output the results to the standard output until the end (i.e., a Ctrl-D). The following example illustrates the required  input and output formats:

$ python3 search.py ~/Proj/MyTestIndex

company inc & revenue

9

17

33

185

share +5 investor & US

3

67

271

365

499

625

$

Chaining Mixed Connectors

Example:

a b /s c

Explanation: As per the WestLaw guide, following the precedence rules for this example,   OR (the space) has higher priority. Because OR is a boolean connector, we can re-write the query into an equivalent form below:

(a /s c) (b /s c)

Chaining Non-boolean Connectors

Example:

a +n b /s c

Explanation: The connector precedence here lies with '+n' first, then '/s'. This query can be understood as the equivalent of doing (a +n b) first, then only among the documents (and more importantly, their postings), we will output the documents for which (a /s c) (for the  same posting 'a') or (b /s c) (for the same posting 'b') is true. To further explain in english,  the query wants documents where either:

there is 'a' which sentence with 'c' there is 'a' which sentence with 'c'

precedes 'b' by at most 'n' terms, that same occurrence of 'a' is in a precedes 'b' by at most 'n' terms, that same occurrence of 'b' is in a

Another example:

a +n (b /s c)

Explanation: With the presence of parentheses, this query will have (b /s c) processed first. From the resulting document postings, we will output the documents for which (a +n b)   (for the same posting 'b') or (a +n c) (for the same posting 'c') is true. To further explain in english, the query wants documents where either:

there is 'b' in a after 'a'           there is 'b' in a after 'a'

sentence with 'c', that same occurrence of 'b' occurs at most 'n' terms sentence with 'c', that same occurrence of 'c' occurs at most 'n' terms

Marking

This assignment is worth 30 points.

Your submission will be tested and marked on CSE linux machines using Python3.              Therefore, please make sure you have tested your solution on these machines using           Python3 before you submit. You will not receive any marks if your program does not work on CSE linux machines and only works in other environment such as your own laptop.

Full marks will be awarded to submissions that follow this specification and pass all the test cases.

Although we do not measure the runtime speed, your indexing program will be terminated if it does not end after one minute, and you will receive zero marks for the project (since   we cannot get the index generated successfully for further testing); and your search program will be terminated if it does not end after 10 seconds per search query, and you will receive zero marks for that search query.

There will be test cases for each connector in addition to the test cases for mixing several  connectors. Therefore, if you are unable to implement all the required connectors, try your best to implement as many as you can.

Submission

Deadline: Monday 7th November 12:00pm AEST (noon).

The penalty for late submission of assignments will be 5% (of the worth of the assignment) subtracted from the raw mark per day of being late. In other words, earned marks will be   lost. No assignments will be accepted later than 5 days after the deadline.

Use the give command below to submit the assignment:

give cs6714 proj *.py

Make sure to use classrun to check your submission to make sure that you have submitted all the required files for index.py and search.py to run properly.

6714 classrun -check proj

Plagiarism

The work you submit must be your own work. Group submissions will not be allowed. Your program must be entirely your own work. Plagiarism detection software will be used to       compare all submissions pairwise (including submissions for similar assignments in              previous years, if applicable) and serious penalties will be applied, particularly in the case of repeat offences.

Do not copy ideas or code from others.

Do not use a publicly accessible repository or allow anyone to see your code.

Please refer to the Student Conduct section in the course outline to help you understand what plagiarism is and how it is dealt with at UNSW.

Finally, reproducing, publishing, posting, distributing or translating this assignment is an   infringement of copyright and will be referred to UNSW Student Conduct and Integrity for action.