COMP34711 Natural Language Processing

Coursework 2

You are provided with the product review corpus. Check the README file and observe the content, format and structure of the corpus. You are asked to design and evaluate solutions for two NLP tasks using this corpus. To implement your design, you may use only functions available in the NLTK framework and the machine learning libraries specified in the instructions, e.g., Weka, scikit-learn, PyTorch (above version 1) and TensorFlow (below version 2). Overall, this coursework is marked on the basis of

• rigorous experimentation,

• knowledge displayed in report,

• independent problem-solving skill,

• self-learning ability,

• how informative your analysis is,

• language and ease of reading of the report,

• code quality based on correctness and readability (which includes comments).

You should solve all the tasks on your own. You are not permitted to collaborate with other students on this coursework. In lab support sessions, you can ask TAs to explain knowledge taught in the lecture or seek advice on how to use a natural language processing or machine learning library. But you are not permitted to ask TAs to help with the solution design, or to check the correctness of your solution.

Your submission should include both code and a report. For your code, provide comments where you see fit; it will be marked on both correctness and readability (which includes comments). For your report, use Arial, font size 11. Your main report should be no more than 3 pages: up to 2 pages for Task 1 and up to 1 page for Task 2. If needed, you can additionally include up to 2 pages of screenshots (e.g., of your results) as an Appendix to your report.

Task 1: Distributional Semantics (15 marks)

The following experiment is designed to evaluate the performance of a distributional semantic approach.

• Step 1: Clean and pre-process all reviews in your text corpus as you see fit. Choose the 50 most frequently occurring words (after removing the stop words) as the target words. You are free to use functions that are available in the NLTK framework to help with your text pre-processing.

• Step 2: For each of the 50 target words, uniformly sample half of its occurrences in the corpus and substitute these with a made-up reversed word, e.g., half of the occurrences of "canon" will be transformed into "nonac". Refer to these 50 new words as pseudowords.

• Step 3: Construct a d-dimensional feature vector to characterise each of the 50 target words and 50 pseudowords (N=50+50=100) using a distributional semantic approach (more detailed requirements on this are provided later). Store your obtained feature vectors in a 100×d matrix X.

• Step 4: Take the feature matrix X as the input, and apply a clustering algorithm to cluster the 100 words. You may use any clustering algorithm implementation as you see fit, for instance, the clustering modules from NLTK (https://www.nltk.org/api/nltk.cluster.html), or the machine-learning frameworks for clustering from Weka (http://www.cs.waikato.ac.nz/ml/weka/) and scikit-learn (https://scikit-learn.org/stable/modules/clustering.html#clustering).

• Step 5: For each pair of the target word and its corresponding pseudoword, if these two are grouped into the same cluster, it is defined as a correct pair. Among the 50 pairs, check the percentage of the correct pairs, denoted by p.

• Step 6: Repeat this whole process multiple times, e.g., 5-10, and calculate the mean and standard deviation of the obtained percentages p.
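Steps 2, 5 and 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the function names `make_pseudowords` and `correct_pair_percentage`, the toy token list, and the hard-coded cluster labels are all invented for the example, and the clustering itself (Step 4) is left out.

```python
import random
import statistics


def make_pseudowords(tokens, targets, rng):
    """Step 2: swap a uniformly sampled half of each target's occurrences
    for the reversed word (odd counts are rounded down)."""
    tokens = list(tokens)
    for word in targets:
        positions = [i for i, tok in enumerate(tokens) if tok == word]
        for i in rng.sample(positions, len(positions) // 2):
            tokens[i] = word[::-1]
    return tokens


def correct_pair_percentage(labels, targets):
    """Step 5: percentage p of targets sharing a cluster with their pseudoword.
    `labels` maps every word (target or pseudoword) to a cluster id."""
    correct = sum(labels[w] == labels[w[::-1]] for w in targets)
    return 100.0 * correct / len(targets)


# Toy data (assumed for illustration only).
tokens = "canon lens good canon lens canon zoom canon lens".split()
targets = ["canon", "lens"]
mixed = make_pseudowords(tokens, targets, random.Random(0))

# Hypothetical clustering output: "canon"/"nonac" agree, "lens"/"snel" do not.
labels = {"canon": 0, "nonac": 0, "lens": 1, "snel": 2}
p = correct_pair_percentage(labels, targets)

# Step 6: aggregate p over several runs (the extra values are made up).
runs = [p, 100.0, 50.0]
mean_p, std_p = statistics.mean(runs), statistics.stdev(runs)
```

The sketch ignores two edge cases that a real corpus may raise: palindromic target words, whose pseudoword is identical to the word itself, and reversed words that already exist in the corpus.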

Applying what you have learned on lexical processing and distributional semantics, you should come up with 2 different approaches for constructing the distributional semantic representations. For instance, they can differ in how the dictionary is constructed (e.g., stems vs. words) and how the context features are extracted, or differ in their underlying principles. You should aim to achieve a good clustering performance and to understand the reasons behind it.

Here are the requirements of the 2 approaches:

• They should differ significantly. For instance, the same context feature extraction approach with different window sizes is considered as one approach.

• They should include one sparse approach and one dense approach.

• They should be evaluated and compared thoroughly, e.g., their performance, and effect of their hyperparameter setting.
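As an illustration of what a sparse representation can look like, the sketch below builds window-based co-occurrence count vectors and compares two of them with cosine similarity. It covers only the sparse side; the function names `cooccurrence_vectors` and `cosine`, the window size, and the toy sentences are assumptions made for the example, and it is one possible design rather than the expected one.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, words, window=2):
    """Count the context words within +/- `window` tokens of each word of interest.
    Returns a dict mapping word -> Counter of context counts (a sparse vector)."""
    wanted = set(words)
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok not in wanted:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[tok][sent[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Toy, already tokenised sentences (assumed for illustration only).
sentences = [
    ["good", "canon", "camera"],
    ["good", "nonac", "camera"],
    ["cheap", "lens", "cap"],
]
vecs = cooccurrence_vectors(sentences, ["canon", "nonac", "lens"], window=1)
same_context = cosine(vecs["canon"], vecs["nonac"])  # identical contexts
different_context = cosine(vecs["canon"], vecs["lens"])  # no shared context
```

The window size is the main hyperparameter here. Per the first requirement above, changing only the window size would not count as a second approach.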

1. Submission Instruction

Your implementation should be well-structured, defining a function for each step and executing the functions in a main file.

You should submit the implementation and evaluation of your 2 approaches as 2 separate Jupyter notebook files, named “Task1_Approach1” and “Task1_Approach2”. The TA will run each file separately during marking.

You should prepare a report (up to 2 pages) containing two sections:

• Methods: Explanation of your text cleaning and pre-processing steps, as well as the 2 approaches for constructing the distributional semantic representations.

• Result Analysis: Analyse and discuss the obtained clustering results for each approach. You should discuss hyperparameter relevant issues if your approach requires any hyperparameter setting, e.g., setting context window size, determining feature dimensionality d for a dense approach, etc.

2. Mark Allocation

Marks are allocated as below:

• 1 mark for text cleaning, pre-processing, target words selection, pseudo words construction.

• 10 marks for implementation, description, and result analysis of the 2 approaches, where 5 marks for each approach.

• 2 marks for the clustering performance award: at least one approach must achieve a satisfactory clustering performance exceeding a percentage threshold, and you must explain the reason behind your success. This percentage threshold will only be released to you after the marking.

• 2 marks for design novelty in an approach to constructing the distributional semantic representations. This can be either an improvement of what has been taught or a new, reasonable approach not taught in the “Distributional Semantics” Chapter. To gain these marks, you need to highlight in the report what the novelty is.

Task 2: Neural Network for Classifying Product Reviews (10 marks)

The product review corpus contains reviews scored as positive or negative opinions. Pre-process your text and prepare the review examples for training and evaluation. Implement, train and evaluate a neural network that classifies an input review as either positive or negative. You are free to choose any neural network/deep learning technique taught in the Chapter “Deep Learning for NLP”, e.g., multi-layer perceptron, LSTM, bi-directional LSTM, etc. You should evaluate your classifier’s accuracy using 5-fold cross validation (CV). You can use either the PyTorch (above version 1) or the TensorFlow (below version 2) library.
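The 5-fold cross validation protocol can be sketched independently of the network itself. In the minimal example below, `k_fold_indices`, `cross_validate` and `majority_baseline` are hypothetical names, the data are made up, and the majority-class baseline is only a placeholder for where a PyTorch or TensorFlow model would be trained and scored on each fold.

```python
import random


def k_fold_indices(n_examples, k=5, seed=0):
    """Yield (train_idx, test_idx) for each of k folds; assumes n_examples >= k."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]


def cross_validate(data, train_and_score, k=5, seed=0):
    """Run `train_and_score(train, test)` on every fold.
    Returns the mean accuracy and the list of per-fold accuracies."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(data), k, seed):
        train = [data[i] for i in train_idx]
        test = [data[i] for i in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores), scores


def majority_baseline(train, test):
    """Placeholder for a neural model: predict the most common training label."""
    train_labels = [label for _, label in train]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(label == majority for _, label in test) / len(test)


# Made-up (review, label) pairs: 1 = positive, 0 = negative.
data = [("review %d" % i, 1 if i % 3 else 0) for i in range(10)]
mean_acc, fold_accs = cross_validate(data, majority_baseline, k=5)
```

Each example appears in exactly one test fold, so every review contributes once to the reported accuracy. Any pre-processing that learns from the data (for example, building a vocabulary) would need to be fitted on the training part of each fold only.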

1. Submission Instruction

Your implementation should be well-structured with comments. You should submit the implementation and its evaluation as a single file, named as “Task2”.

Prepare a report (up to 1 page) containing 2 short sections:

• Method: Explanation of your classification model design and training.

• Experiment and Result Analysis: Describe your experiment and evaluation approach. You should discuss hyperparameter relevant issues if your approach requires any hyperparameter setting. Report and analyse classification accuracy.

2. Mark Allocation

Marks are allocated as below:

• 2 marks for text cleaning, pre-processing, and preparing the input data for the classifier.

• 7 marks for implementation, classification accuracy evaluation by 5-fold cross validation, method description, and result analysis of the classifier.

• 1 mark for classification accuracy award, which is to achieve a satisfactory classification accuracy exceeding an accuracy threshold. This threshold will only be released to you after the marking.

-----------------------------

Submission Checklist

A .zip file named as “34711-Cwk-S-DeepLearning” containing

• Three code files: Task1_Approach1, Task1_Approach2, Task2.

• One .pdf file, combining reports for both Task 1 and Task 2.