IFN647 Text, Web and Media Analytics

Assignment 1

2022

Required to be submitted:

1. Please put your outputs into text files specified in the question descriptions and put all .py files, data and a “readme.txt” in a folder (e.g., Your Surname_code), where "readme.txt" contains a short user manual to help your tutor run your Python code. Then zip all .txt files and the folder into a zip file named as “your student ID_Surname_Asm1.zip”.

2. Submit your zip file for this assignment in BB before 11.59pm on 6 May 2022.

3. Answer all three questions (9 tasks).

4. See the marking guide for more details on the distribution of marks and marking criteria.

Individual working: You should work on this assignment individually.

Due date: Friday week 8 (6 May 2022)

Weighting: 20% of the assessment for IFN647.

Dataset (Rnews_v1 document collection)

• You will be working with a sample dataset, a small subset of XML documents from the TREC RCV1 data collection, provided in pre-tokenized form (for convenience and for copyright reasons). The dataset can be downloaded from Blackboard.

You are asked to design Python code for three questions (9 tasks). You can add new variables, functions, methods, or update function parameters. However, you should provide comments to clearly describe why you are doing this.

Question 1. Document & query parsing

The motivation for Question 1 is to design your own document and query parsers, so please don't use Python packages that we didn't use in the workshop.

Task 1.1: Define a document parsing function parse_rcv_coll(inputpath, stop_words) to parse a data collection (e.g., Rnews_v1 dataset), where parameter inputpath is the folder that stores a set of XML files, and parameter stop_words is a list of common English words (you may use the file 'common-english-words.txt' to find all stop words). The following are the major steps in the document parsing function:

Step 1) The function reads XML files from inputpath (e.g., Rnews_v1). For each file, it finds the docID (document ID) and index terms, and then represents it in a BowDoc Object.

You need to define a BowDoc class by using Bag-of-Words to represent a document:

• BowDoc needs a docID variable which is simply assigned by the value of ‘itemid’ in <newsitem …>.

• In this task, BowDoc can be initialized with three attributes: docID; terms, an empty dictionary of key-value pairs (String term: int frequency); and doc_len (the document length).

• You may define your own methods, e.g., getDocId() to get the document ID.
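
The three bullets above can be sketched as a small class. This is only a minimal illustration of the described attributes, not a reference solution; the sample docID value is taken from the example file name used later in this document.

```python
class BowDoc:
    """Bag-of-Words representation of a single document."""

    def __init__(self, docID):
        self.docID = docID   # value of 'itemid' in <newsitem ...>
        self.terms = {}      # {term (str): frequency (int)}
        self.doc_len = 0     # document length, filled in while parsing

    def getDocId(self):
        # Accessor suggested in the task description.
        return self.docID


doc = BowDoc("807606")
print(doc.getDocId(), doc.terms, doc.doc_len)
```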

Step 2) It then builds up a collection of BowDoc objects for the given dataset. This collection can be a dictionary structure (as we used in the workshop), a linked list, or a class BowColl for storing a collection of BowDoc objects. Please note that the remaining descriptions are based on the dictionary structure with docID as key and BowDoc object as value.

Step 3) Finally, it returns the collection of BowDoc objects.

You also need to follow the following requirements to define this parsing function:

Please use the basic text pre-processing steps, such as tokenizing, stop-word removal and stemming of terms.

Tokenizing – (please provide a definition of a word, and describe it in a Python comment)

• You need to tokenize at least the ‘<text>…</text>’ part of document, exclude all tags, and discard punctuations and/or numbers based on your definition of words.

• Define method addTerm() for class BowDoc to add a new term or increase the term frequency when the term occurs again.

Stop-word removal and stemming of terms –

• Use the given stop-word list (“common-english-words.txt”) to ignore/remove all stop words. Open and read the given file of stop words and store them in a list stopwordList. When adding a term, please check whether the term exists in the stopwordList, and ignore it if it does.

• Please use porter2 stemming algorithm to update BowDoc’s terms.
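
The pre-processing requirements above could be wired together roughly as follows. Several things here are assumptions made for illustration: a "word" is defined as a run of ASCII letters (so digits and punctuation are dropped), doc_len counts every token before stop-word removal, the helper name index_text is hypothetical, and the stem parameter defaults to an identity function so the sketch runs without extra packages. In the assignment, a Porter2 stemming function would be passed in instead.

```python
import re


def tokenize(text):
    # Assumed definition of a word: a maximal run of ASCII letters,
    # lower-cased. Digits and punctuation act as separators.
    return re.findall(r"[a-z]+", text.lower())


class BowDoc:
    def __init__(self, docID):
        self.docID = docID
        self.terms = {}
        self.doc_len = 0

    def addTerm(self, term):
        # Add a new term, or increase its frequency if it occurs again.
        self.terms[term] = self.terms.get(term, 0) + 1


def index_text(doc, text, stopwordList, stem=lambda w: w):
    # Hypothetical helper; `stem` is a placeholder for a Porter2 stemmer.
    for word in tokenize(text):
        doc.doc_len += 1          # assumption: every token counts as a word
        if word in stopwordList:
            continue              # ignore stop words
        doc.addTerm(stem(word))
    return doc


doc = index_text(BowDoc("1"), "The bank's bid: 7 banks, the BID!", ["the", "s"])
print(doc.terms, doc.doc_len)
```

With a real stemmer, 'banks' and 'bank' would normally collapse into one term; with the identity placeholder they stay separate.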

Task 1.2: Define a query parsing function parse_query(query0, stop_words), where we assume the original query is a simple sentence or a title in a String format (query0), and stop_words is a list of stop words that you can get from 'common-english-words.txt'.

For example, let query0 =

'CANADA: Sherritt to buy Dynatec, spin off unit, canada.' the function will return a dictionary

{'canada': 2, 'sherritt': 1, 'buy': 1, 'dynatec': 1, 'spin': 1, 'unit': 1}

Please note you should use the same text transformation technique as the document, i.e., tokenizing steps for queries must be identical to steps for documents.
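
A minimal sketch of such a query parser is shown below. It assumes the same letters-only tokenizer idea as for documents, and it uses an identity function in place of the Porter2 stemmer so that it runs without extra packages; the two-word stop list is only a stand-in for the contents of 'common-english-words.txt'.

```python
import re


def parse_query(query0, stop_words, stem=lambda w: w):
    # `stem` is a placeholder; pass the same stemmer used for documents.
    terms = {}
    for word in re.findall(r"[a-z]+", query0.lower()):
        if word in stop_words:
            continue
        term = stem(word)
        terms[term] = terms.get(term, 0) + 1
    return terms


query0 = "CANADA: Sherritt to buy Dynatec, spin off unit, canada."
result = parse_query(query0, ["to", "off"])  # stand-in stop list
print(result)
```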

Task 1.3: Define a main function to test function parse_rcv_coll(). The main function uses the provided dataset and calls function parse_rcv_coll() to get a collection of BowDoc objects. For each document in the collection, it first prints out its docID, the number of index terms and the total number of words in the document (doc_len). It then sorts the index terms (by frequency) and prints out a term:freq list. Finally, it saves the output into a text file (file name is “your full name_Q1.txt”).

Sample output for file “807606newsML.xml”

Document 807606 contains 60 terms and have total 187 words

bid : 7

bank : 6

insur : 5

great : 5

west : 5

royal : 5

london : 4

quot : 4

trilon : 3

tender : 3

stake : 3

offer : 3

tuesday : 2

percent : 2

billion : 2

canada : 2

match : 2

myhal : 2

june : 2

sharehold : 2

per : 2

share : 2

financi : 1

corp : 1

group : 1

lifeco : 1

inc : 1

doe : 1

posit : 1

wait : 1

see : 1

fail : 1

…
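
The report layout in the sample above could be produced by a helper along these lines. The function name format_report is hypothetical, and the sketch only builds the text; writing it to the output file is left to the main function.

```python
def format_report(doc_id, terms, doc_len):
    # Header line mirrors the wording of the sample output.
    lines = [
        f"Document {doc_id} contains {len(terms)} terms "
        f"and have total {doc_len} words"
    ]
    # Sort by frequency, highest first; ties keep their insertion order.
    for term, freq in sorted(terms.items(), key=lambda kv: kv[1], reverse=True):
        lines.append(f"{term} : {freq}")
    return "\n".join(lines)


report = format_report("807606", {"bank": 6, "bid": 7, "insur": 5}, 187)
print(report)
```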

Question 2. Tf*idf based IR model

Tf*idf is a popular term weighting method, which uses the following Eq. (1) to calculate a weight for term k in a document i, where the base of log is 10. You may review lecture notes to get the meaning of each variable in the equation.

(1)

Task 2.1: Define a function calc_df(coll) to calculate document-frequency (df) for a given BowDoc collection coll and return a {term:df, … } dictionary.
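
One way to compute document frequency is to count, for every term, the number of documents whose terms dictionary contains it. In the sketch below, SimpleNamespace objects stand in for BowDoc objects; the only assumption is that each document exposes a terms dictionary.

```python
from types import SimpleNamespace


def calc_df(coll):
    # coll: {docID: object with a .terms dict of {term: freq}}
    df = {}
    for doc in coll.values():
        for term in doc.terms:    # each document counts once per term
            df[term] = df.get(term, 0) + 1
    return df


coll = {
    "d1": SimpleNamespace(terms={"share": 2, "market": 1}),
    "d2": SimpleNamespace(terms={"share": 1}),
}
df = calc_df(coll)
print(f"There are {len(coll)} documents in this dataset")
print(df)
```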

Example of output for this task

There are 10 documents in this dataset

The following are the terms’ document-frequency:

share: 5

market: 4

compani: 4

three: 4

royal: 4

aug: 1

articl: 1

deviat: 1

swap: 1

Task 2.2: Use Eq. (1) to define a function tfidf(doc, df, ndocs) to calculate the tf*idf value (weight) of every term in a BowDoc object, where doc is a BowDoc object or a dictionary of {term:freq, … }, df is a {term:df, … } dictionary, and ndocs is the number of documents in a given BowDoc collection. The function returns a {term:tfidf_weight , … } dictionary for the given document doc.

Task 2.3: Define a main function to print out the top 12 terms (with their tf*idf weights) for each document in Rnews_v1 if it has more than 12 terms, and save the output into a text file (file name is “your full name_Q2.txt”). You also need to implement a tf*idf based IR model. You can assume titles of XML documents (the <title> …</title> part) are the original queries, and test at least three titles. You need to use function parse_query() that you defined for Question 1 to parse original queries. For each query, please use the abstract model of ranking (Eq. (2)) to calculate a score for each document.

(2)

At last, append the output (in descending order) into the text file (“your full name_Q2.txt”).
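
Eq. (1) and Eq. (2) are not reproduced in this text, so the following sketch rests on assumptions and may differ from the lecture notes. It assumes a commonly used normalized weighting, (log10(tf) + 1) * log10(N / df) divided by the Euclidean norm of the document's weights, and a ranking score that sums the document weights of the query terms, each multiplied by its query frequency. The function name rank is hypothetical.

```python
import math


def tfidf(doc, df, ndocs):
    # Accept either a {term: freq} dict or an object with a .terms dict.
    terms = doc if isinstance(doc, dict) else doc.terms
    weights = {}
    for term, freq in terms.items():
        # Assumed form of Eq. (1): (log10(tf) + 1) * log10(N / df)
        weights[term] = (math.log10(freq) + 1) * math.log10(ndocs / df[term])
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {t: 0.0 for t in weights}
    return {t: w / norm for t, w in weights.items()}


def rank(query_terms, doc_weights):
    # Assumed form of Eq. (2): sum of query frequency * document weight.
    scores = {}
    for doc_id, weights in doc_weights.items():
        scores[doc_id] = sum(
            qf * weights.get(t, 0) for t, qf in query_terms.items()
        )
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


df = {"bid": 1, "bank": 2}
doc_weights = {
    "d1": tfidf({"bid": 1}, df, 10),
    "d2": tfidf({"bank": 1}, df, 10),
}
ranked = rank({"bid": 1}, doc_weights)
print(ranked)
```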

Example of output for this task

bid : 0.27477268692397266

insur : 0.2433890479622151

great : 0.2433890479622151

west : 0.2433890479622151

myhal : 0.2259384978055295

per : 0.2259384978055295

trilon : 0.19574301597552804

stake : 0.19574301597552804

offer : 0.19574301597552804

billion : 0.1579242327908045

match : 0.1579242327908045

bank : 0.14824878002268424

tender : 0.14642954913050904

royal : 0.13856709051308835

doe : 0.13344291648101658

wait : 0.13344291648101658

fail : 0.13344291648101658

georg : 0.13344291648101658

…

The Ranking Result for query: BELGIUM: MOTOR RACING-LEHTO AND SOPER HOLD ON FOR GT VICTORY.

741299 : 0.7258206779073599

809481 : 0.06517635855336815

807600 : 0.038674645645810815

780723 : 0

741309 : 0

780718 : 0

783803 : 0

809495 : 0

783802 : 0

807606 : 0

…

Question 3. BM25-based IR model

The BM25 IR model is a popular and effective ranking algorithm, which uses the following Eq. (3) to calculate a document score or ranking for a given query Q and a document D, where the base of log is 2. You may review lecture notes to get the meaning of each variable in the equation.

(3)

You can use the BowDoc collection to work out some variables, such as N and ni (you may assume R = ri = 0).

Task 3.1: Define a Python function avg_doc_len(coll) to calculate and return the average document length of all documents in the collection coll.

• In the BowDoc class, for the variable doc_len (the document length), add accessor (get) and mutator (set) methods for it.

• You may modify your code defined in Question 1 by calling the mutator method of doc_len to save the document length in a BowDoc object when creating the BowDoc object. At the same time, sum up every BowDoc’s doc_len as totalDocLength, then at the end, calculate the average document length and return it.
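
A minimal sketch of the accessor, the mutator and the averaging step is given below. The method names getDocLen and setDocLen are hypothetical (the task only asks for a get and a set method), and returning 0.0 for an empty collection is an assumption made to avoid a division by zero.

```python
class BowDoc:
    def __init__(self, docID):
        self.docID = docID
        self.terms = {}
        self.doc_len = 0

    def getDocLen(self):
        # Accessor for the document length.
        return self.doc_len

    def setDocLen(self, doc_len):
        # Mutator, called once the document has been parsed.
        self.doc_len = doc_len


def avg_doc_len(coll):
    # coll: {docID: BowDoc}
    if not coll:
        return 0.0
    totalDocLength = sum(doc.getDocLen() for doc in coll.values())
    return totalDocLength / len(coll)


coll = {"d1": BowDoc("d1"), "d2": BowDoc("d2")}
coll["d1"].setDocLen(100)
coll["d2"].setDocLen(200)
print(avg_doc_len(coll))
```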

Task 3.2: Use Eq (3) to define a python function bm25(coll, q, df) to calculate documents’ BM25 score for a given original query q, where df is a {term:df, … } dictionary. Please note you should parse query using the same method as parsing documents (you can call function parse_query() that you defined for Question 1). For the given query q, the function returns a dictionary of {docID: bm25_score, … } for all documents in collection coll.
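
Eq. (3) is not reproduced in this text, so the sketch below assumes the widely used BM25 form with R = ri = 0, in which the first factor reduces to log2((N - ni + 0.5) / (ni + 0.5)). The constants k1 = 1.2, k2 = 100 and b = 0.75 are assumed typical values, not values taken from this document. The small parse_query here is only a stand-in for the Question 1 parser (no stop-word removal or stemming), and SimpleNamespace objects stand in for BowDoc objects.

```python
import math
import re
from types import SimpleNamespace


def parse_query(query0):
    # Stand-in for the Question 1 parser: letters only, lower-cased.
    terms = {}
    for word in re.findall(r"[a-z]+", query0.lower()):
        terms[word] = terms.get(word, 0) + 1
    return terms


def bm25(coll, q, df, k1=1.2, k2=100, b=0.75):
    query = parse_query(q)
    N = len(coll)
    if N == 0:
        return {}
    avdl = sum(d.doc_len for d in coll.values()) / N
    scores = {}
    for doc_id, doc in coll.items():
        # Length normalization; fall back to k1 if all lengths are zero.
        K = k1 * ((1 - b) + b * doc.doc_len / avdl) if avdl else k1
        score = 0.0
        for term, qf in query.items():
            f = doc.terms.get(term, 0)
            if f == 0:
                continue          # term absent: contributes nothing
            n = df.get(term, 0)
            idf = math.log2((N - n + 0.5) / (n + 0.5))  # R = ri = 0
            score += idf * ((k1 + 1) * f / (K + f)) * ((k2 + 1) * qf / (k2 + qf))
        scores[doc_id] = score
    return scores


coll = {
    "d1": SimpleNamespace(terms={"fashion": 2, "award": 1}, doc_len=10),
    "d2": SimpleNamespace(terms={"stock": 3}, doc_len=10),
    "d3": SimpleNamespace(terms={"market": 1}, doc_len=10),
}
df = {"fashion": 1, "award": 1, "stock": 1, "market": 1}
scores = bm25(coll, "fashion award", df)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Under this form the idf factor turns negative once a term appears in more than half of the documents, which is why small collections can yield negative scores.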

Task 3.3: Define a main function to implement a BM25-based IR model to rank documents in the given document collection Rnews_v1 using your functions.

• You are required to test all the following queries:

o This British fashion

o All fashion awards

o The stock markets

o The British-Fashion Awards

• The BM25-based IR model needs to print out the ranking result (in descending order) of top-5 possible relevant documents for a given query and append outputs into the text file (“your full name_Q3.txt”).

Example of output for this question (Note that you may get negative BM25 scores because N is not large enough and ni can be close to N. You can fix this by increasing N)