
CSCI381/CSCI780 Natural Language Processing Homework 2

Published: 2023-04-25



Due 05/10/2023 at 1:40pm

Due Wednesday, 05/10/2023 3:10pm in class.

Please submit your report and a printout of your code. The sheets should be stapled together. If you are unable to attend class, please slide a copy of the assignment under the door of my office (SB A332) by the due date and time.

Please also email your report as a PDF file, as a separate attachment, and a tarball of your code to [email protected]. The email should be sent before 1:40 pm. If you send multiple emails, only your first (earliest) submission will be graded.

If you do not submit a paper copy of the report and code, the assignment will get a grade of 0. If you do not submit an electronic version, the assignment will get a grade of 0. If you do not submit code both electronically and as a hard copy, the assignment will get a grade of 0.

• Feel free to talk to other members of the class in doing the homework. You should, however, write down your solution yourself. Please try to keep the solution brief and clear.

• The programming assignment is to be done in Python. Only standard Python libraries are to be used for this homework. For the programming assignment, in addition to the results (see below), you need to turn in a short report describing what you did, what the difficulties were, and what your conclusions were.

1. [Movie review classification using Naïve Bayes - 10 points]

Assume that you have trained a Naïve Bayes classifier for the task of sentiment classification (please refer to Chapter 4 in the J&M book). The classifier uses only bag-of-word features. Assume the following likelihoods for each word given a positive or negative movie review, with prior probabilities of 0.4 for the positive class and 0.6 for the negative class.

        I       always  like    foreign  films
pos     0.09    0.07    0.29    0.04     0.08
neg     0.16    0.06    0.06    0.15     0.11

Question: What class will Naïve Bayes assign to the sentence “I always like foreign films”? Show your work.
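The decision rule from Chapter 4 picks the class that maximizes the prior times the product of the word likelihoods. The following is a minimal sketch of that computation in log space, using the table values above; the names PRIORS, LIKELIHOODS, log_score, and classify are illustrative and not part of the assignment.

```python
import math

# Parameters copied from the table in the problem statement.
PRIORS = {"pos": 0.4, "neg": 0.6}
LIKELIHOODS = {
    "pos": {"I": 0.09, "always": 0.07, "like": 0.29, "foreign": 0.04, "films": 0.08},
    "neg": {"I": 0.16, "always": 0.06, "like": 0.06, "foreign": 0.15, "films": 0.11},
}


def log_score(words, label):
    """Return log P(label) plus the sum of log P(word | label)."""
    total = math.log(PRIORS[label])
    for word in words:
        total += math.log(LIKELIHOODS[label][word])
    return total


def classify(words):
    """Return the label with the highest log score, plus all scores."""
    scores = {label: log_score(words, label) for label in PRIORS}
    return max(scores, key=scores.get), scores


sentence = "I always like foreign films".split()
predicted, scores = classify(sentence)
for label, value in scores.items():
    # math.exp recovers the unnormalized value P(c) * prod P(w | c).
    print(label, math.exp(value))
print("predicted:", predicted)
```

Working in log space is not required for a five-word sentence, but it avoids numerical underflow on full-length reviews.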

2. [Implementing the Naïve Bayes classifier for movie review classification - 90 points]

In this assignment, you will write 2 scripts: NB.py and pre-process.py. NB.py should take the following parameters: the training file, the test file, the file where the parameters of the resulting model will be saved, and the output file where you will write predictions made by the classifier on the test data (one example per line). The last line in the output file should list the overall accuracy of the classifier on the test data. The training and the test files should have the following format: one example per line; the first column is the label, and the other columns are feature values.
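One possible shape for the NB.py command line and file handling is sketched below. It assumes whitespace-separated columns, integer feature values, and an "Accuracy:" prefix on the last output line; none of these details are fixed by the assignment, and the function names are hypothetical.

```python
import argparse


def build_parser():
    """Command-line interface with the four parameters described above."""
    parser = argparse.ArgumentParser(description="Naive Bayes classifier")
    parser.add_argument("train_file", help="training data in vector format")
    parser.add_argument("test_file", help="test data in vector format")
    parser.add_argument("model_file", help="where model parameters are saved")
    parser.add_argument("output_file", help="where predictions are written")
    return parser


def parse_example(line):
    """Split one line into (label, feature values).

    Assumes whitespace-separated columns and integer feature values.
    """
    columns = line.split()
    return columns[0], [int(value) for value in columns[1:]]


def write_predictions(path, predictions, gold_labels):
    """Write one prediction per line, then the overall accuracy last."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    accuracy = correct / len(gold_labels)
    with open(path, "w", encoding="utf-8") as handle:
        for prediction in predictions:
            handle.write(prediction + "\n")
        handle.write(f"Accuracy: {accuracy:.4f}\n")
    return accuracy
```

With this layout the script would be invoked with four positional arguments in the order given in the problem statement, for example build_parser().parse_args() inside a main function.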

pre-process.py should take the training (or test) directory containing movie reviews, perform pre-processing on each file, and output the files in the vector format to be used by NB.py.

a) Implement in Python a Naïve Bayes classifier with bag-of-word (BOW) features and Add-one smoothing. Note: Do not use smoothing for the prior parameters. You should implement the algorithm from scratch and should not use off-the-shelf software. [35 points]
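As a rough illustration of what training with add-one smoothing involves, here is a minimal sketch using only the standard library. It works on token lists rather than the vector file format, and the function names and the toy data at the bottom are invented for illustration. The smoothed estimate is (count(w, c) + 1) / (total words in c + |V|), while the priors are unsmoothed relative frequencies.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples, vocabulary):
    """Estimate log priors and add-one smoothed log likelihoods.

    examples: list of (label, list_of_tokens) pairs.
    vocabulary: iterable of word types that count as features.
    """
    vocabulary = set(vocabulary)
    doc_counts = Counter(label for label, _ in examples)
    word_counts = defaultdict(Counter)
    for label, tokens in examples:
        word_counts[label].update(t for t in tokens if t in vocabulary)

    # Priors are plain relative frequencies (no smoothing).
    log_prior = {c: math.log(n / len(examples)) for c, n in doc_counts.items()}

    log_likelihood = {}
    for c in doc_counts:
        denominator = sum(word_counts[c].values()) + len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / denominator) for w in vocabulary
        }
    return log_prior, log_likelihood


def predict(tokens, log_prior, log_likelihood):
    """Return (best label, log scores); out-of-vocabulary tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )
    return max(scores, key=scores.get), scores


# Toy data, invented for illustration only.
toy_examples = [("pos", ["good", "good", "fun"]), ("neg", ["bad", "boring"])]
toy_vocabulary = ["good", "fun", "bad", "boring"]
log_prior, log_likelihood = train_nb(toy_examples, toy_vocabulary)
label, scores = predict(["good"], log_prior, log_likelihood)
print(label, scores)
```

Skipping out-of-vocabulary tokens at prediction time is one common convention; the assignment does not say how unknown words should be handled.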

b) Use the following small corpus of movie reviews to train your classifier. Save the parameters of your model in a file called movie-review-small.NB (you can manually convert this small corpus into the vector format, so that you can run NB.py on it). [10 points]

i. fun, couple, love, love (class: comedy)

ii. fast, furious, shoot (class: action)

iii. couple, fly, fast, fun, fun (class: comedy)

iv. furious, shoot, shoot, fun (class: action)

v. fly, fast, shoot, love (class: action)
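The conversion of this small corpus into the vector format can also be scripted rather than done by hand. The sketch below assumes space-separated columns and an alphabetical feature order; both are illustrative choices, and any fixed order works as long as the training and test files agree on it.

```python
from collections import Counter

# The five training documents listed above, as (label, words) pairs.
SMALL_CORPUS = [
    ("comedy", ["fun", "couple", "love", "love"]),
    ("action", ["fast", "furious", "shoot"]),
    ("comedy", ["couple", "fly", "fast", "fun", "fun"]),
    ("action", ["furious", "shoot", "shoot", "fun"]),
    ("action", ["fly", "fast", "shoot", "love"]),
]

# Assumption: features are ordered alphabetically.
VOCABULARY = sorted({word for _, words in SMALL_CORPUS for word in words})


def to_vector_line(label, words, vocabulary):
    """Format one example as: label followed by one count per vocabulary word."""
    counts = Counter(words)
    return " ".join([label] + [str(counts[w]) for w in vocabulary])


lines = [to_vector_line(label, words, VOCABULARY) for label, words in SMALL_CORPUS]
print("feature order:", " ".join(VOCABULARY))
for line in lines:
    print(line)
```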

c) Test your classifier on the new document below: {fast, couple, shoot, fly}. Compute the most likely class. Report the probabilities for each class. [5 points]

d) Now use the movie review dataset provided with this homework to train a Naïve Bayes classifier for the real task. You will train your classifier on the training data and will test it on the test data. The dataset contains movie reviews; each review is saved as a separate file in the folder “neg” or “pos” (both folders appear inside each of the “train” and “test” folders). You should use these raw files and represent each review using a vector of bag-of-word features, where each feature corresponds to a word from the vocabulary file (also provided), and the value of the feature is the count of that word in the review file.
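A sketch of how the count vectors might be built from the raw files follows. It assumes the vocabulary file lists one word per line and that the tokenizer is supplied separately (plain whitespace splitting is only a placeholder default here); the function names are hypothetical.

```python
from collections import Counter
from pathlib import Path


def load_vocabulary(path):
    """Read the vocabulary, assuming one word per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def bow_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def vectorize_directory(root, vocabulary, tokenize=str.split):
    """Yield (label, vector) for every review under root/pos and root/neg."""
    for label in ("pos", "neg"):
        for review in sorted(Path(root, label).iterdir()):
            # errors="replace" is a defensive choice in case a review
            # contains bytes that are not valid UTF-8.
            text = review.read_text(encoding="utf-8", errors="replace")
            yield label, bow_vector(tokenize(text), vocabulary)
```

Words that are not in the vocabulary file are simply not counted, which keeps every vector the same length as the vocabulary.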

Pre-processing: prior to building feature vectors, you should separate punctuation from words and lowercase the words in the reviews. You will train a Naïve Bayes classifier on the training partition using the BOW features (use add-one smoothing, as we did in class). You will evaluate your classifier on the test partition. In addition to BOW features, you should experiment with additional features. In that case, please provide a description of the features in your report. Save the parameters of your BOW model in a file called movie-review-BOW.NB. Report the accuracy of your program on the test data with BOW features.
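The two pre-processing steps named above (separating punctuation and lowercasing) can be sketched with a single regular expression. The token definition used here, including keeping contractions such as "isn't" as one token, is an assumption rather than something the assignment specifies.

```python
import re

# Assumption: a token is either a run of word characters (optionally with an
# internal apostrophe, as in "isn't") or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def preprocess(text):
    """Lowercase the text and split punctuation off from words."""
    return TOKEN_PATTERN.findall(text.lower())


print(preprocess("Great film, isn't it?"))
# ['great', 'film', ',', "isn't", 'it', '?']
```

Whether contractions should stay whole depends on how the provided vocabulary file was built, so it is worth checking a few vocabulary entries before settling on a pattern.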

Investigate your results. For the reviews for which your program made incorrect predictions, were there any trends that you observed? That is, can you explain why these incorrect predictions were made? [40 points]

Submission

Important: Submissions that do not follow the submission guidelines will not receive full credit. Submissions that do not attach the report as a separate attachment will lose 20 points. Submissions that do not have a report receive 0 credit. Submissions that do not include the code or provide a non-working link will lose 100 points. Please include your report as a separate attachment in the email.