SCC403 – Data Mining

Coursework 2 Assignment

1 Introduction

The objective of the assignment is to conduct data analysis on the two real-life data sets that were used in Coursework 1: clustering the climate data and classifying the data extracted from the video stream. This includes selecting and justifying specific methods for clustering and classification, implementing them, analysing the results, and providing well-annotated code. Your analysis is expected to include:

● clustering the climate data;

● classification of different objects detected from the video stream.

All choices must be justified through analysis and comparison. Analysis and understanding of the methods, algorithms and the overall process are the most important elements in addition to the implementation skills (code, presentation) and the results. You are expected to critically analyse the results of applying these techniques, and demonstrate a clear understanding of the purpose and processes of data analysis.

In addition to your report, including comments, plots, and an analysis of the results, please also submit your well-annotated source code.

We expect the use of Python, the most widely used language for machine learning, which we also use in the labs. If you prefer to use a different language, we may need to contact you for clarification if we believe that your code is not running correctly.

2 Tasks description

2.1 Data sets

You are expected to use the two sets of data used in Coursework Assignment 1, namely:

● the climate data provided in the file ClimateDataBasel.csv;

● the video provided in OriginalVideoStream.m4v, its binary version with detected foreground in BinaryVideo.avi, the foreground objects detected as discussed in Coursework 1 and stored in the file WLA.csv, and the labels from the file Labels.csv.

2.2 Clustering

Choose at least two clustering algorithms and apply them to the climate data set. To achieve top marks, one of the methods should come from independent research.

Develop the programme and explain the functionality of the algorithms in as much detail as you can. Compare the results and limitations of each of the algorithms that you have used.
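As an illustration of the level of detail meant by "explain the functionality", the following is a minimal sketch of one possible clustering algorithm (k-means, Lloyd's iteration) written in plain Python. It is not a prescribed method, and the function name `kmeans`, its parameters and the toy points are assumptions made for this example only; they are not the climate data.

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k of the data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged, so converged
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids

# Toy data (assumed for illustration): two well-separated groups.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
labels, centroids = kmeans(data, k=2)
print(labels)
```

A sketch like this also makes limitations easy to discuss: the result depends on the initial centroids, the number of clusters must be chosen in advance, and unscaled features with large ranges dominate the distance.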

2.3 Classification

This task applies to the second data set (the video stream). Train at least two classifiers of your choice on a part of the data (you may choose what proportion of the data to use for training and what proportion for testing/validation), perform cross-validation, evaluate the performance of the classifiers and report it.
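The cross-validation step can be sketched as follows. This is a minimal, library-free illustration of k-fold splitting, assuming a shuffled split into equal-sized folds; the function name `k_fold_indices` and the sample count in the usage lines are hypothetical, and a library routine may be used instead.

```python
import random

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # shuffle once, reproducibly
    folds = [idx[i::n_folds] for i in range(n_folds)]  # near-equal fold sizes
    for i, test_idx in enumerate(folds):
        # Training indices are all folds except the held-out one.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Usage with an assumed sample count of 10.
splits = list(k_fold_indices(10, n_folds=5))
for train_idx, test_idx in splits:
    print(sorted(test_idx), len(train_idx))
```

Each classifier would be trained on `train_idx` and scored on `test_idx` in every fold, and the scores averaged across folds.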

Hints:

1. The first 16 lines of the files WLA.csv and Labels.csv contain only one of the two objects of interest (the motorbike) and, therefore, these 16 cases will perhaps not be very useful for training.

2. The remaining 172 lines of the files WLA.csv and Labels.csv have to be considered in pairs (86 pairs).

3. As you probably realised from Coursework 1, it may not be the best option to use all three values from the file WLA.csv. Why?
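Hints 1 and 2 can be sketched as a small preprocessing step. The example assumes that "pairs" means consecutive rows and that the files have no header row; both are assumptions to check against the actual files. The helper name `drop_and_pair` and the placeholder rows are hypothetical and stand in for the contents of WLA.csv.

```python
def drop_and_pair(rows, n_skip=16):
    """Drop the first n_skip rows, then group the rest into consecutive pairs."""
    rest = rows[n_skip:]
    if len(rest) % 2:
        raise ValueError("expected an even number of remaining rows")
    return [(rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]

# Placeholder for the 16 + 172 = 188 rows (three values each, not real data).
rows = [(i, i, i) for i in range(188)]
pairs = drop_and_pair(rows)
print(len(pairs))  # 86
```

The same row selection would need to be applied to Labels.csv so that features and labels stay aligned.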

When analysing the performance of the classifiers you should use precision/recall, F1 score and classification accuracy. You may also report the time required to train each classifier as a measure of computational complexity. Note that this time always depends on the hardware you use (laptop, desktop, CPU/GPU, etc.) and is not an absolute measure, but it can be useful when making comparisons.
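For reference, the measures named above can be computed from the counts of true positives (TP), false positives (FP) and false negatives (FN). The following is a minimal sketch for the two-class case; the function names `binary_metrics` and `timed`, and the toy label vectors, are assumptions for illustration and not real results.

```python
import time

def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 and accuracy for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Toy labels (assumed): TP = 3, FP = 2, FN = 1, TN = 2.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
scores, seconds = timed(binary_metrics, y_true, y_pred)
print(scores)
```

Here precision is 3/5, recall is 3/4 and accuracy is 5/8. In practice `timed` would wrap the training call of each classifier rather than the metric computation.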

The deadline for submission is 4pm, Friday 17 December 2021. The cut-off deadline is 4pm, Monday 20 December 2021 (a late submission penalty of 1 letter grade or 10% applies). Submissions after the cut-off deadline cannot be accepted according to the University regulations.

3 Marking Scheme

For the overall report, marks are allocated as follows:

● Structure and presentation (5%)

● Language and style (4%)

● Use of literature and references (5%)

Each of the two parts (clustering and classification) will be marked as follows:

● Level of understanding (9%)

● Depth of analysis (8%)

● Justification of selected methods (8%)

● Independent research and use of methods not given in the lectures (9%)

● Working, well annotated code and results (9%)

At the end of this document there is an Appendix, which explains what a mark means at Lancaster University and includes suggestions for a well-written report.

The length of the report should not exceed 5 pages. You can use double column format, e.g. the so-called IEEE style as described in the Appendix. You may include an Appendix (2 pages maximum) after the main report.

4 Additional Comments

You must report in an “acknowledgements” section the use of any libraries, readily available online code, and code from online tutorials. Additionally, you are free to discuss your work with colleagues, but you must also report in the “acknowledgements” section if anyone has helped you significantly. Remember that using others’ work without giving due credit is an act of plagiarism and is not good academic practice.