
Module: CMP-7023B - Data Mining

Assignment: Data Mining the Healthcare Dataset

Date set: Week 6

Value: 65%

Date due: Wednesday 08/05/2024 3 pm [week 11]

Returned by: Assessment Period Week 2

Submission: Blackboard (A Turnitin point will be provided in Blackboard)

Learning outcomes

.    Competence in using KDD software tools in medium to large databases.

.    Competence in applying relevant techniques at each stage of the KDD process.

.    Ability to evaluate the suitability of software tools in the context of different data analysis tasks.

.    Competence in combining data manipulation and analysis approaches to improve the quality of input data.

.    Understanding and identification of problems in input data such as outliers, missing data, unreliable data, differences in granularity, and others, and identifying an adequate strategy to deal with the problem data.

.    Presentation of knowledge induced in a format suitable for the target audience and for the particular application.

Specification

Overview

Aim:

.    To obtain an overall view of the complex process of Knowledge Discovery in Databases and understand the need for a methodical approach to KDD.

.    To explore tools and algorithms available at each stage of the KDD process.

.    To gain experience in using KDD software tools in a medium-sized database.

.    To learn to combine data manipulation and analysis approaches to improve the quality of input data.

.    To produce a suitable report describing the methods applied and the discussion of the findings.

Coursework Description

See attached at page 4.

Relationship to formative assessment

Lecture slides, lab exercises and links to resources provide the baseline to design experiments with data mining techniques applied to real data.

Deliverables and Handing in Procedure

Document Format:

Your report should be typed. We recommend using LaTeX for efficient reference and cross-reference management (a LaTeX template is available on Blackboard). However, if you choose not to use LaTeX, please use a standard 12-point font such as Arial or Comic Sans with a standard page layout, including margins.

1.   Report Submission and code (Approx 95%):

.   Collate all responses to the provided questions into a report.

.   The report should follow the structure/sections according to the components of the marking scheme and must not exceed 15 pages including a bibliography and references (appendix excluded from page counts).

.   Include a clear and concise abstract in the report summarizing your findings.

.   Append the cleaned code/notebook, produced to accomplish your tasks, in an appendix to the report.

.   Submit the entire report as a single PDF file to the Turnitin submission point on Blackboard {002} {CMP-7023B-22-SEM2-B}.

.   Ensure you have successfully submitted the report to the Turnitin point and received proof of submission.

2.   Submission of predicted labels (Approx 5%):

.   Submit a separate file containing the predicted labels generated by your best classification model for the test set ‘disease_test.csv’.

.   Save this information in the “predictedTarget.csv” file.

.   Copy the predicted labels (single column) of your best model to the second column labelled “predicted_target” in the “predictedTarget.csv” file.

.   Do NOT change or overwrite the file name, the column names, the order of patient identifiers, or the existing data within the “predictedTarget.csv” file. The code used during marking relies on these details to verify your predicted labels.

.   In summary, submit your “predictedTarget.csv” (with patient ID and predicted labels included) to a designated separate submission point provided on Blackboard labelled "Predicted label for your best classification model".
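The steps above can be sketched as follows. This is a minimal illustration rather than the marking code: the helper name fill_predictions, the "id" header of the first column, and the toy patient IDs are assumptions made for the example, and only the "predicted_target" column name and the three risk labels come from this brief.

```python
import csv
import tempfile
from pathlib import Path

ALLOWED_LABELS = {"Low risk", "Moderate risk", "High risk"}


def fill_predictions(template_path, predictions):
    """Write one label per patient into the second column of the template."""
    with open(template_path, newline="") as handle:
        header, *body = list(csv.reader(handle))
    if len(predictions) != len(body):
        raise ValueError("need exactly one prediction per patient row")
    unexpected = set(predictions) - ALLOWED_LABELS
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    # Keep the header and the first column (patient ID) exactly as supplied,
    # in the supplied order; only the second column is filled in.
    filled = [header] + [[row[0], label] for row, label in zip(body, predictions)]
    with open(template_path, "w", newline="") as handle:
        csv.writer(handle).writerows(filled)


# Toy template standing in for the real file; the "id" header is assumed.
workdir = Path(tempfile.mkdtemp())
template = workdir / "predictedTarget.csv"
template.write_text("id,predicted_target\n17,\n3,\n42,\n")

fill_predictions(template, ["Low risk", "High risk", "Moderate risk"])
result_rows = template.read_text().splitlines()
print(result_rows)
```

Validating the label spelling and the row count before writing is a cheap way to catch a mismatch between the model output and the template before submission.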

Resources

You can use the weekly lab documentation, lecture notes, library resources and other sources to accomplish your tasks. Do not forget to cite any external and online resources used. Students are expected to work independently, and any plagiarism or collusion will be heavily penalized.

Marking scheme

Assessment criteria marks distributed as follows:

Criterion: Marks (approx.)

Data exploration, visualisation, and summary: 10%

Data Cleaning and pre-processing: 20%

Supervised model training, tuning and evaluation, including the predicted labels, result interpretation and comparative analysis: 35%

Unsupervised learning using clustering algorithms, including result interpretation and comparative analysis: 15%

Overall presentation encompassing references, cross-references, interpretation, comparative analysis, and conclusions of the report: 20%

Total: 100%

Plagiarism, collusion, and contract cheating

The University takes academic integrity very seriously. You must not commit plagiarism, collusion, or contract cheating in your submitted work. Our Policy on Plagiarism, Collusion, and Contract Cheating explains:

.    what is meant by the terms ‘plagiarism’, ‘collusion’, and ‘contract cheating’

.    how to avoid plagiarism, collusion, and contract cheating

.    using a proof reader

.    what will happen if we suspect that you have breached the policy.

It is essential that you read this policy and you undertake (or refresh your memory of) our school’s training on this. You can find the policy and related guidance here:

https://my.uea.ac.uk/departments/learning-and-teaching/students/academic-cycle/regulations-and-discipline/plagiarism-awareness

The policy allows us to make some rules specific to this assessment. Note that:

In this assessment, working with others is not permitted. All aspects of your submission, including but not limited to: research, design, development and writing, must be your own work according to your own understanding of topics. Please pay careful attention to the definitions of contract cheating, plagiarism and collusion in the policy and ask your module organiser if you are unsure about anything.

CMP-7023B - Data Mining

Second Assessed Exercise

Data Mining the Healthcare Dataset

Date due: Wednesday 08/05/2024 3 pm [week 11] Value: 65%

Exercise Description

In healthcare, the development of a model with precise capabilities to identify risk conditions for specific diseases holds significant potential for influencing decisions about patient care.

To complete this coursework, you will be working with a healthcare dataset focusing on a specific type of disease. A copy of the dataset is uploaded on Blackboard as ‘disease_train.csv’ and ‘disease_test.csv’, which contain anonymized patient-related data.

The ‘disease_train.csv’ file has 4250 observations and 24 variables, including the target variable. In the given data file, there are various attributes related to patient status such as id, age, gender, sickness, pregnancy, tumor presence, surgery, and more (refer to page 6 for a simple description of dataset features).

Your primary tasks involve accurately classifying and clustering patients considering their health status using the provided attributes. Subsequently, you are required to report and interpret your findings.

To accomplish your task, you need to perform the following operations:

1.   Data Exploration, Visualisations, and Summary:

.   Download the dataset, prepare a summary of features, including data type (numerical/categorical), and assess the amount of missing data and outliers in individual features. Conduct initial exploration with visualisation and statistical analysis of the features.

.   Introduce the dataset in your report including any interesting visualisations.
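A per-feature summary of this kind might be sketched as below. This is only an illustration on synthetic data: the toy columns reuse names from the feature list, the summarise helper is hypothetical, and the 1.5 x IQR rule is one common outlier heuristic chosen for the example, not a method prescribed by this brief.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (values are made up for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(50, 15, 200).round(),
    "test_X1": rng.normal(2.0, 0.5, 200),
    "gender": rng.choice(["F", "M"], 200),
})
df.loc[:9, "test_X1"] = np.nan   # inject 10 missing values
df.loc[0, "age"] = 455           # inject one implausible age


def summarise(frame):
    """Per-feature data type, missing-value count and IQR-based outlier count."""
    numeric = frame.select_dtypes(include="number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    # Flag values beyond 1.5 * IQR from the quartiles (assumed heuristic).
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "kind": ["numerical" if c in numeric.columns else "categorical"
                     for c in frame.columns],
            "missing": frame.isna().sum(),
            "missing_pct": (frame.isna().mean() * 100).round(1),
            "iqr_outliers": outside.sum().reindex(frame.columns),
        },
        index=frame.columns,
    )


summary = summarise(df)
print(summary)
```

Categorical features have no IQR, so their outlier count is left as NaN rather than reported as zero.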

2.   Data Cleansing and Pre-processing:

.   Undertake any cleansing or pre-processing you think is necessary on the dataset.

.   In your report, explain clearly what you have done and why you have done it.

Examples of cleaning include removing any feature/column in which 60% of the values are missing or whose value is constant, and removing duplicate or highly correlated information.
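One way these cleaning rules could look in code is sketched below on a toy frame. The basic_clean helper and its parameter names are hypothetical, the 0.95 correlation cut-off is an assumption (the brief gives no figure for "highly correlated"), and treating "60% missing" as "60% or more" is also an interpretation made for the example.

```python
import numpy as np
import pandas as pd


def basic_clean(frame, missing_threshold=0.6, corr_threshold=0.95):
    """Drop duplicate rows, mostly-missing, constant and redundant columns."""
    out = frame.drop_duplicates()
    # Keep only columns whose share of missing values is below the threshold.
    out = out.loc[:, out.isna().mean() < missing_threshold]
    # Drop columns that hold a single distinct value.
    out = out.loc[:, out.nunique(dropna=True) > 1]
    # For each highly correlated numeric pair, drop the later column.
    corr = out.select_dtypes(include="number").corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return out.drop(columns=redundant)


# Toy data: "b" duplicates the information in "a", "mostly_missing" is 70%
# empty, "constant" never varies, and the first row is repeated once.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a,
    "c": rng.normal(size=100),
    "mostly_missing": [1.0] * 30 + [np.nan] * 70,
    "constant": 1,
})
toy = pd.concat([toy, toy.iloc[[0]]], ignore_index=True)

cleaned = basic_clean(toy)
print(cleaned.columns.tolist(), len(cleaned))
```

Whichever thresholds are used, the report should state them and justify them, since the choice affects which features survive.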

3.   Supervised Model Training, Tuning, and Evaluation:

.   Split the data into a training set and a validation set once cleansing is done.

.   Use suitable toolkits and libraries to train models (e.g. k-NN, Decision Tree, Random Forest, SVM, or more sophisticated models …) from the training set to build the health condition Classifier. Make sure you perform adequate parameter tuning for each model.

.   Experiment with balancing the data, feature selection, and other adjustments/tuning to enhance model quality.

.   Evaluate the performance of the models on the test/validation set.

.   Use appropriate tools to clearly illustrate and identify the features that were deemed most crucial or had the most significant impact on decision-making within the best-performing model.

.   Produce the predicted labels for your best classification model on the separate test set that you have been given and submit them in a separate CSV file (predictedTarget.csv). The labels must be written exactly as Low risk, Moderate risk, or High risk, NOT in any other format.

.   Present comparative analysis and interpretation of results:

-     In your final report, describe and justify the decisions made during data processing.

-     Present and discuss the results, including model validation/evaluation techniques.

-     Discuss the effectiveness of the best model using metrics such as confusion matrices, ROC curve, precision, and recall performance.

-     Ensure you include comparative analyses of your trained classifiers and, if you applied multiple pre-processing techniques, of each technique for each classifier.
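The split, tune, evaluate and inspect workflow described in this section could be sketched with scikit-learn as below. Everything specific here is an assumption for illustration: the data are synthetic, a random forest is used as just one of the suggested model families, the small parameter grid and macro-F1 scoring are arbitrary choices, and class_weight="balanced" stands in for one of several possible balancing strategies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

LABELS = ["Low risk", "Moderate risk", "High risk"]

# Synthetic, imbalanced three-class data standing in for the cleaned dataset.
X, y_codes = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.6, 0.3, 0.1],
    random_state=0,
)
y = np.array(LABELS)[y_codes]

# Stratify so each class keeps its proportion in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Cross-validated grid search on the training split only.
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, None]},
    scoring="f1_macro",
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate the tuned model on the held-out validation split.
y_pred = search.predict(X_val)
cm = confusion_matrix(y_val, y_pred, labels=LABELS)
print(search.best_params_)
print(cm)
print(classification_report(y_val, y_pred, labels=LABELS))

# Impurity-based importances of the best model, most important first.
importances = search.best_estimator_.feature_importances_
ranking = np.argsort(importances)[::-1]
print(ranking)
```

Repeating the same split and scoring for each model family keeps the comparative analysis fair, because every classifier is then judged on identical validation rows.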

4.   Unsupervised Learning Using Clustering Algorithms:

.   Exclude the medical diagnosis (target) field during clustering to ensure unsupervised learning.

.   Apply unsupervised clustering algorithms (e.g. k-Means, hierarchical clustering, …) to the pre-processed dataset based on your judgment of the task.

.   Use scatter plots or t-SNE plots of the clusters to visually represent the data.

.   Analyse the formed clusters, observing distinct patterns or groups related to different categories of health status (Low risk, Moderate risk, High risk).

.   Discuss the effectiveness of the clustering models.

.   Present comparative analysis and interpretation of results.

.   Compare the clustering and classification outcomes and discuss your observations, i.e. do the clustering results align with the classification outcomes?
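A rough sketch of the clustering comparison follows. It runs on synthetic blobs rather than the coursework data, and the choice of three clusters, the silhouette score, and the adjusted Rand index are assumptions picked for illustration. The target labels are used only after clustering, to check how well the clusters line up with the known categories.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

LABELS = ["Low risk", "Moderate risk", "High risk"]

# Synthetic features; y_codes plays the role of the withheld target column.
X, y_codes = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
labels_true = np.array(LABELS)[y_codes]

# Scale first because both algorithms rely on distances.
X_scaled = StandardScaler().fit_transform(X)

models = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "hierarchical": AgglomerativeClustering(n_clusters=3),
}

results = {}
for name, model in models.items():
    clusters = model.fit_predict(X_scaled)  # the target is never passed in
    results[name] = {
        # Internal quality: cohesion versus separation, in [-1, 1].
        "silhouette": silhouette_score(X_scaled, clusters),
        # External agreement with the known labels (1.0 is a perfect match).
        "ari": adjusted_rand_score(labels_true, clusters),
    }
    print(name, results[name])
    print(pd.crosstab(labels_true, clusters,
                      rownames=["target"], colnames=["cluster"]))
```

The cross-tabulation shows which risk categories each cluster mostly contains, which is a direct way to discuss whether clustering and classification outcomes align.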

Dataset Features:

-     'id': Unique patient ID.

-     'age': Age of the patient.

-     'gender': Gender of the patient.

-     'sick': Is the patient currently sick?

-     'pregnant': Is the patient currently pregnant?

-     'test_X1' to 'test_X6': Related to various medical tests.

-     'concern_type1' and 'concern_type2': Related to concerns.

-     'enlargement': Indicating enlargement.

-     'tumor': Does the patient have a tumor?

-     'disorder': Indicating a certain gland disorder.

-     'medication_A': Is the patient on medication A?

-     'medication_B': Is the patient on medication B?

-     'mental_health': Is the patient undergoing psychiatric evaluation?

-     'mood_stabiliser': Related to mood stabilization medication.

-     'surgery': Has the patient undergone surgery?

-     'treatment_type1': Is the patient undergoing treatment A?

-     'suspect': Does the patient suspect disease?

-     'target': Medical diagnosis (target variable/label).