Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

COMP723 Data Mining and Knowledge Engineering

Assignment 2 – Text Classification (S2, 2022)

Assignment Details Summary

Due Date: 28 October, midnight (Friday of week 13) Assignment group: To be done as a Pair Submission: One copy from a pair, via Canvas.

Assignment Weight: 50%

Objective

To develop a broad understanding of text mining by performing a representative task, Text Classification.

Task Specification

The assignment requires classification of “free text” snippets into categories using machine learning algorithms. You are required to use three different classification algorithms on the given two datasets, analyse the results, and present a report of your findings.

3.1 Due Dates and Submission

· This assignment should be done in pairs. If you are an online student, you should contact other students via the canvas discussion board or other means to find a partner to work with.

· Only one person from the pair should submit the assignment. The report should clearly state the name and student ID of both members of the team. Furthermore, the contributions made by each team member must be clearly stated in the section

“Contributions – group name” where the group name is “surname1/surname2”, at the beginning of the report.

· You are required to submit only an electronic copy of the assignment via the Turnitin assignment Submission tab (in the assignments tab) on the course page in Canvas.

3.2 Marking

· This assignment will be marked out of 100 marks and is worth 50% of the overall mark for the paper.

Assignment Overview

The objective of the assignment is to do text classification using three machine learning algorithms on two datasets.

4.1 Dataset

The following datasets are to be used for the assignment.

1. Covid New Articles

2. Fake and Real News

3. Emotion Detection

4. Depression: Reddit Dataset

Assignment Tasks

1. Read about the 4 datasets and choose 2 that you want to use for this assignment.

2. Either download the 2 datasets from the sites directly, or from Canvas assignment package. If the dataset doesn’t give you training and test sets, then do a stratified 80/20 train/test split of the dataset.

3. The objective of this assignment is to compare the performance of three classification algorithms for the task of text classification. You will compare two given algorithms below, and the third one will be of your choice. The two algorithms are:

a) Naïve Bayes

b) Neural Network

c) Your choice. Some examples are SVM, CRF, J45, a variation of Neural Network such as CNN, RNN, etc.

4. Implement each of the 3 algorithms to train a model with the 2 datasets as Google Colab files, a separate file for each dataset. Ensure that you thoroughly document your code. Also ensure that your datasets are hosted on a url so that I can run your code using the submitted file link.

5. Fine tune your algorithms by varying the parameter values to obtain the best possible results for each of the two datasets.

6. Tabulate your results using Precision, Recall and F1 value relative to a chosen category for the dataset.

7. Now we want to try to further improve the results by pre-processing the datasets for noise and/or features. Experiment with a combination of the pre-processing tasks, in order to build features to be used for the two machine learning algorithms. They should be consistent for the three algorithms.

Some of the pre-processing you can use are :

a) Stop word processing

b) Stemming/lemmatization

c) Feature size

d) Type of vectorisation

Generate 2 well documented Google Colab files for each of the 2 datasets, each file containing training for the 3 algorithms. Tabulate your results appropriately to communicate your results from this task.

8. For submission you should have the 4 files and names (each file containing code for 3 algorithms:

a) Dataset_”name of dataset 1”.ipynb b)Dataset_”name of dataset 2”.ipynb

c) Dataset_”name of dataset 1”_preprocessed.ipynb d)Dataset_”name of dataset 2”_preprocessed.ipynb

Written Report

· You will write a minimum of 6 and a maximum of 12 page report (excluding the references and appendix) describing the results of your experiment. You should write a conference publication style coherent paper with appropriate academic language and referencing.

· The report should include the following components.

1. Introduction

· You should first describe the task of text classification and how it is similar/different to numerical data classification tasks.

· You should then describe the purpose of the experiments, the algorithms. Include overview about the algorithms used and the main differences/similarities between them.

2. Data Description

· You should describe the datasets, the sources and all aspects of the datasets for the reader to be able to understand everything about the data.

3. Experiments

· Describe each of the experiments in detail, done as described in Section 5 above. Your description should be such that it explains the rationale for doing the tasks and should include enough detail so that the reader should be able to recreate your experiments.

4. Results and Discussion

· Include links to the code files for each of the experiments. Verify that the links work and the reader would be able to run the code by just clicking them.

· In this section appropriately tabulate all your results and discussions/analysis of all the results. Include:

a) Any high/low Precision/Recall values.

b) Variations between the algorithms.

c) Any consistent patterns in results across the algorithms and/or datasets.

d) Any aberrations in results.

e) Anything that particularly worked well or not well at all.

f) Any other results that’s worth pointing out to the reader that came out of the experiments.

g) The best algorithm and pre-processing for the datasets.

4. Conclusion and Reflections

· Describe which algorithms are appropriate for which type of datasets to get the best possible results. Which of the pre-processing tasks are effective and perhaps include computation times for training as well.

· What would you different if you were to do the whole project again. What did you do well?

Marking Scheme

The following approximate matrix would be used to grade your assignment.

Written Report

 

Formatting, Language, Presentation, References

5

Introduction

10

Data Description

10

Experiments

25

Results and Discussion

30

Conclusion and Reflections

5

Code

15

Total

100