INF6028 Coursework 2022/23 Mining and Evaluating a Structured Dataset

Published: 2024-05-14



1. Introduction

The assessment for INF6028 Data Mining consists of a single piece of individual coursework to assess your ability to understand key data mining, analysis and evaluation concepts. You will be assigned a single dataset with an associated data mining problem to solve (e.g., a regression problem). You should first use data exploration techniques to explore the data, conduct appropriate data preparation, and then choose two supervised data mining techniques available in KNIME to predict certain data values and evaluate and compare their performance. You will need to select appropriate techniques, justify the choices made at different stages of your workflow, and demonstrate that you have knowledge of the necessary underlying data mining techniques.

You should write a 2,500-word structured report (see Section 3) that includes the following headings (more details on how the report will be assessed are provided below):

. Introduction - introduce the prediction problem.

. Data mining theory - provide a theoretical description of the two supervised data mining methods used in the workflow (for example, the classification or regression techniques that have been used), why they are appropriate to the prediction task, and how their performance can be assessed. This should include citations to relevant prior literature.

. Data exploration and preparation - describe the approaches used in the workflow to explore the data and to perform feature selection, transformation and normalisation, where appropriate.

. Experimental setup - describe the experimental setup and the evaluation measures used in the workflow and how the data has been handled to ensure that the models were not over-fitted. You should explain which nodes were used in KNIME and provide a rationale for the various parameter settings that were used. You should not, however, simply list all the modules in your workflow and their parameters - be selective and discuss the modules most critical to solving the data mining task.

. Results - present the results for each data mining method and compare the performance of the different methods using graphical and tabular methods. What insights can you gain from the models? For example, which are the most important features, and are there any outliers in the predictions?

. Conclusion and reflections - summarise the main findings of your report and reflect on the methods used.
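The experimental setup heading above asks how the data was handled so that the models were not over-fitted. One common approach is k-fold cross-validation, sketched below in plain Python purely to illustrate the idea; the coursework itself must be carried out in KNIME, and the function name `k_fold_indices`, the row count and the seed are assumptions made for this example.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Training rows are all rows that are not in the held-out fold.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Illustrative run on 10 rows with 5 folds.
splits = list(k_fold_indices(n_rows=10, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each row is held out exactly once, so every performance estimate comes from data the model did not see during training.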

Charts and tables (and their associated captions), references and appendices are not included in the word count.

Remember: your report should be a critical evaluation of the workflow in the context of the data mining problem posed, it should not be merely a description of what was done.

This assessment is worth 100% of the overall module mark for INF6028. A pass mark of 50 is required to pass the module. Submission deadline: 1st of June at 4 PM, via Turnitin. See Section 4 for more general information about Coursework Submission Requirements within the Information School.

2. The Datasets

You will choose a single dataset to base your analyses and report on. Please choose one of the two datasets below and ensure before you start working on the assessment that you are using the correct dataset.

The datasets have been derived from Kaggle competitions and are downloadable from Blackboard in the Assessment section. A brief description of the attributes in each dataset is given at the end of this document.

Note that in both cases the data are different to the standard Kaggle datasets – they have been extensively modified for this year’s run of INF6028. Do not attempt to use the datasets from Kaggle or to use/copy any of the workbooks available there – this would constitute unfair means.

Titanic Dataset (Binary Classification)

The data is split across two files, each of which contains 1,204 entries representing 1,204 passengers, although it should be noted that the passengers are not necessarily the same in the two files. The two files are titanic_ticket_data.csv and titanic_personal_data.csv.

The aim of this challenge is to build a model that is able to predict whether or not a passenger will survive the sinking of the Titanic.
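Because the passengers are not necessarily the same in the two files, it is worth checking how many records actually match before modelling. The sketch below illustrates an inner join outside KNIME, using small in-memory stand-ins for the two files. It assumes that the files share an identifier column; the column names `passenger_id`, `fare` and `age` and all values are hypothetical, so consult the attribute descriptions for the real ones.

```python
import csv
import io

# Hypothetical stand-ins for the two files; column names and values are
# invented for illustration and are not taken from the real datasets.
ticket_csv = "passenger_id,fare\n1,7.25\n2,71.28\n3,8.05\n"
personal_csv = "passenger_id,age\n2,38\n3,26\n4,35\n"


def read_by_key(text, key):
    """Return {key value: row dict} for a CSV supplied as text."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


tickets = read_by_key(ticket_csv, "passenger_id")
people = read_by_key(personal_csv, "passenger_id")

# Inner join: keep only passengers that appear in both files.
shared = sorted(tickets.keys() & people.keys())
joined = [{**tickets[k], **people[k]} for k in shared]

# Passengers that appear in one file only are worth reporting.
only_tickets = sorted(tickets.keys() - people.keys())
only_people = sorted(people.keys() - tickets.keys())

print(len(joined), only_tickets, only_people)
```

Whether unmatched passengers are dropped or kept with missing values is a data preparation decision that the report should justify.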

Song Popularity Dataset (Regression)

The data is split across two files, each of which contains 603 entries representing 603 popular songs from the Spotify platform. The two files are song_details.csv and song_acoustic_analysis.csv.

The aim of the challenge is to build a model to predict the popularity of each song on Spotify.
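For a regression task such as this one, performance is usually summarised with error measures rather than accuracy. The sketch below computes three standard ones (mean absolute error, root mean squared error and the coefficient of determination R²) in plain Python. The function name and the popularity scores are invented for illustration and are not taken from the dataset.

```python
import math


def regression_metrics(actual, predicted):
    """Return MAE, RMSE and R^2 for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot,
    }


# Invented popularity scores, for illustration only.
actual = [60, 70, 80, 90]
predicted = [62, 68, 83, 87]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE penalises large errors more heavily than MAE, so reporting both helps to reveal whether a few badly predicted songs dominate the error.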

3. Report Structure

You are required to produce a structured report that includes all the sections detailed in Table 1. You must state the word count somewhere in the report. As there is a word count limit you should aim to make your writing as concise and informative as possible. The emphasis of the report should be on the clarity, accuracy and quality in communicating your findings. Where helpful, you may wish to state specifically which KNIME nodes you have used but you should avoid simply listing nodes used and their settings - be selective.

Table 1: Required content of the structured report.

Section	Description	Maximum allocated marks
Structured abstract	This should provide a summary of your report in a structured manner. This is not included in the word count.	Required, but 0 marks
Introduction	This section should introduce the data mining task that is addressed in the report. You should indicate the property/data value that is predicted and give a brief overview of the dataset and methods used.	10 marks

Data Mining Theory	This section should provide an overview of the algorithms for predictive data mining used in the workflow from a theoretical aspect. Explain why they are relevant to the prediction problem. Support your rationale by providing references to the literature where the techniques have been applied to similar problems. Include a short discussion of the most appropriate methods for evaluating the performance of these data mining methods.	25 marks
Data Exploration and Preparation	This section should provide a brief description of the data and of the approaches used to pre-process the data. You should present an investigation of the attributes (including the data value to be predicted) and describe any data cleaning employed, including handling of missing data, data transformations and data aggregations.	10 marks
Experimental Setup	This section should describe the experimental design in the workflow. You should describe the process followed in order to find the best performing model for each method and how this was validated. For example, which KNIME nodes were used? How were they configured? Was any cross-validation or a separate validation set used and why?	20 marks
Results and Discussion	Present the results of the data mining process including the results of experiments to find the best model for each data mining method. Compare the best performance of the different methods and, if appropriate, consider which attribute contributes most to each model. Discuss the advantages and disadvantages of the data mining methods. Which of the chosen methods produced the best model and why?	20 marks

Conclusion and reflections	Summarise the main findings of the analysis and reflect on the choice of methods for the problem; for example, how might the models be improved with hindsight? Use evidence from the literature to support your arguments.	15 marks
KNIME workflow	You should submit your KNIME workflow(s) as a .knwf or .knar file. Note that this can consist of separate workflows, but they should all be saved to one file. Include your best setup for each data mining method.	Required, but 0 marks. Note that 5 marks will be deducted if this is not submitted, and it may make it difficult for your marker to assess your work.
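The Data Mining Theory row of Table 1 asks for a discussion of appropriate evaluation methods. For a binary classification task, the usual measures are derived from the confusion matrix, as in the plain-Python sketch below. It is an illustration only: the function name and the labels are invented, with 1 standing for the positive class.

```python
def classification_metrics(actual, predicted, positive=1):
    """Return confusion-matrix counts and the measures derived from them."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels, for illustration only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
scores = classification_metrics(actual, predicted)
print(scores)
```

Accuracy alone can be misleading when the classes are imbalanced, which is why precision, recall and F1 are normally reported alongside it.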