闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

FIT1043 Assignment 2: Specification

Due date: Monday 2nd October 2023 - 11:55 pm

Aim

The main objective of Assignment 2 is to conduct predictive analytics, by building predictive models on a dataset using Python in the Jupyter Notebook environment.

This assignment will test your ability to:

● Read and describe the data using basic statistics,

● Split the dataset into training and testing,

● Conduct multi-class classification using Support Vector Machine (SVM)**,

● Evaluate and compare predictive models,

● Explore different datasets and select a particular dataset that meets certain criteria

● Deal with missing data,

● Conduct clustering using k-means

** Not taught in this unit, you are to explore and elaborate these in your report submission. This will be a mild introduction to life-long learning to learn by yourself.

Data

We will explore the following datasets in Part A (plus a dataset of your choice in Part B):

1. FIT1043-Essay-Features.csv

2. FIT1043-Essay-Features-Submission.csv

Format: each file is a single comma separated (CSV) file

Description: These two datasets were derived from a set of essays and are used to describe the essay features in numeric information.

Columns:

Column’s header Description
essayid	a unique id to identify the essay
chars	number of characters in the essay, including spaces
words	number of words in the essay
commas	number of commas in the essay
apostrophes	number of apostrophes in the essay
punctuations	number of punctuations (other than commas, apostrophes, period, questions marks in the essay
avg_word_length	the average length of the words in the essay
sentences	number of sentences in the essay, determined by the period (fullstops)
questions	number of questions in the essay, determined by the question marks

avg_word_senten ce	the average number of words in a sentence in the essay
POS	total number of Part-of-Speech discovered
POS/total_words	fraction of the POS in the total number of words in the essay
prompt_words	words that are related to the essay topic
prompt_words/to tal_words	fraction of the prompt words in the total number of words in the essay
synonym_words	words that are synonymous
synonym_words/ total_words	fraction of the synonymous words in the total number of words in the essay
unstemmed	number of words that were not stemmed in the essay
stemmed	number of words that were stemmed (cut to the based word) in the essay
score	the rating grade, ranging from 1 – 6

This data is pre-processed data on a set of essays that were provided on Kaggle. You DO NOT have to download or process/wrangle the data from the original source.

Hand-in Requirements

Please hand in the following 4 files (including a PDF file containing your code, answers and explanations to questions, a Jupyter notebook file (.ipynb) containing your Python code to all the questions, CSV file for your prediction in task A4 and the video file respectively):

● The PDF file should contain:

o Answers and explanations to the questions. Make sure to include screenshots/images of the graphs you generat. Also, copy/paste your Python code to justify your answers for all the questions.

o You can use Microsoft Word or other word processing software to format your submission. Alternatively, generate your PDF from your jupyter notebook formatted using markdown. Either way save the final copy to a PDF before submitting.

● The .ipynb file should contain:

o A copy of your work using python code to answer all the questions.

● The video file should contain:

o A recording of yourself, explaining your answers to Part B.

o You can use Zoom to prepare your recording.

o Note each student is required to explain their approach in Part B only. Please see

Note: Zip, rar or any other similar file compression format is not acceptable and will have a penalty of 10%.

Assignment Tasks:

Note: You need to use Python to complete all tasks.

Part A: Classification

A1. Supervised Learning

1. Explain supervised machine learning, the notion of labelled data, and train and test datasets.

2. Read the ‘ FIT1043-Essay-Features.csv’ file and separate the features and the label (Hint: the label, in this case, is the ‘score’)

3. Use the sklearn.model_selection.train_test_split function to split your data for training and testing.

A2. Classification (training)

1. Explain the difference between binary and multi-class classification.

2. In preparation for classification, your data should be normalised/scaled.

a. Describe what you understand from this need to normalise data (this is in your Week 7 applied session).

b. Choose and use the appropriate normalisation functions available in sklearn.preprocessing and scale the data appropriately.

3. Use the Support Vector Machine algorithm to build the model.

a. Describe SVM. Again, this is not in your lecture content, you need to do some self-learning.

b. In SVM, there is something called the kernel. Explain what you understand from it.

c. Write the code to build a predictive SVM model using your training dataset.

(Note: You are allowed to engineer or remove features as you deem appropriate)

4. Repeat Task A2.3.c by using another classification algorithm such as Decision Tree or Random Forest algorithms instead of SVM.

A3. Classification (prediction)

1. Using the testing dataset you created in Task A1.3 above, conduct the prediction for the ‘score’ (label) using the two models built by SVM and your other classification algorithm in A.2.4.

2. Display the confusion matrices for both models (it should look like a 6x6 matrix). Unlike the lectures, where it is just a 2x2, you are now introduced to a multi-class classification problem setting.

3. Compare the performance of SVM and your other classifier and provide your justification of which one performed better.

A4. Independent evaluation (Competition )

1. Read the ‘ FIT1043-Essay-Features-Submission.csv’ file and use the best model you built earlier to predict the ‘score’ for the essays in this file.

2. Unlike the previous section in which you have a testing dataset where you know the ‘score’ and will be able to test for the accuracy, in this part, you don’thave a ‘score’ and you have to predict it and submit the predictions along with other required submission files.

a. Output of your predictions should be submitted in a CSV file format. It should contain 2 columns: ‘essayid’ and ‘score’. It should have a total of 200 lines (1 header, and 199 entries).

Part B: Selection of Dataset, Clustering and Video Preparation

B1. Selection of a Dataset with missing data, Clustering

We have demonstrated a k-means clustering algorithm in week 7. Your task in this part is to find an interesting dataset and apply k-means clustering on it using Python. For instance, Kaggle is a private company which runs data science competitions and provides a list of their publicly available datasets: https://www.kaggle.com/datasets

1. Select a suitable dataset that contains some missing data and at least two numerical features. Please note you cannot use the same data set used in the applied sessions/lectures in this unit. Please include a link to your dataset in your report. You may wish to:

● provide the direct link to the public dataset from the internet, or

● place the data file in your Monash student - google drive and provide its link in the submission.

2. Perform wrangling on the dataset to handle the missing data and explain your procedure

3. Perform k-means clustering, choosing two numerical features in your dataset, and apply k-means clustering to your data to create k clusters in Python (k>=2)

4. Visualise the data as well as the results of the k-means clustering, and describe your findings about the identified clusters.

B2. Video Preparation

Presentation is one of the important steps in a data science process. In this task you will need to prepare a short video of yourself (you can share your code on screen) and

describe your approach on the above task (Task B1).

● Please make sure to keep your camera on (show yourself) during recording. You may want to share your screen with your code while you talk.)

2023-09-28

Java

物理(Physical)

LINUX

C++

Python

Processing

sas

ios

maths