CMP-7023B Data Mining

The Diabetes Mellitus Dataset - Reassessment Task

2022

Learning outcomes

Competence in using KDD software tools on medium to large databases.

Competence in applying relevant techniques at each stage of the KDD process.

Ability to evaluate the suitability of software tools in the context of different data analysis tasks.

Competence in combining data manipulation and analysis approaches to improve the quality of input data.

Understanding and identification of problems in input data, such as outliers, missing data, unreliable data, and differences in granularity, and identification of an adequate strategy for dealing with the problem data.

Presentation of knowledge induced in a format suitable for the target audience and for the particular application.

Specification

Overview

Aim

To obtain an overall view of the complex process of Knowledge Discovery in Databases and understand the need for a methodical approach to KDD.

To explore the tools and algorithms available at each stage of the KDD process.

To gain experience of using KDD software tools on a medium-sized database.

To learn to combine data manipulation and analysis approaches to improve the quality of input data.

To produce a suitable report describing the methods applied and discussing the findings.

Description

To complete this reassessment coursework, you will use the same patient dataset you used for the original coursework. Your task is to predict whether a patient has been diagnosed with a particular type of diabetes, Diabetes Mellitus, using data from the first 24 hours of intensive care. A curated version of the dataset is available on Blackboard as ‘DiabetesClassificationDataset2022.csv’.

The file has 79,160 observations and 87 variables (memory usage: 30+ MB). If your computer has memory restrictions, feel free to complete the experiment with a smaller sample of the provided data.
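One possible way to draw such a smaller sample is sketched below, assuming Python with pandas (version 1.1 or later) is used. The helper name stratified_sample, the 25% default fraction and the random seed are illustrative assumptions, not requirements of the coursework; the column name diabetes_mellitus is the target field named in this specification. Sampling within each class keeps the diabetic / non-diabetic ratio of the subset close to that of the full file.

```python
import pandas as pd


def stratified_sample(df, target="diabetes_mellitus", frac=0.25, seed=42):
    """Return a random subset of df that preserves the class ratio of `target`.

    Hypothetical helper for illustration only. Rows whose target value is
    missing are dropped, because groupby ignores NaN keys by default.
    """
    # Draw the same fraction from each class so that the proportions of
    # diabetic and non-diabetic patients match those of the full data.
    return df.groupby(target, group_keys=False).sample(
        frac=frac, random_state=seed
    )


# Small synthetic frame standing in for the real data (20 positive, 80 negative).
demo = pd.DataFrame(
    {"age": range(100), "diabetes_mellitus": [1] * 20 + [0] * 80}
)
subset = stratified_sample(demo, frac=0.5)
print(subset["diabetes_mellitus"].value_counts())

# With the real file, the call would look like this (not run here):
# full = pd.read_csv("DiabetesClassificationDataset2022.csv")
# small = stratified_sample(full, frac=0.25)
```

If a sample is used, the report should state the fraction and the seed so that the experiment can be repeated.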

The data file contains information on patient status in the ICU: demographics (such as age, weight and BMI), APACHE (Acute Physiology and Chronic Health Evaluation) covariates and related comorbidities, and vital-sign and laboratory test results collected within 24 hours of admission. A further description of the fields can be found in the Data Dictionary for the dataset. Your task is to accurately classify the diabetes_mellitus status of the patient from the given fields and report back on your findings. Intensive Care Units (ICUs) often lack verified medical histories for incoming patients, and a model that can accurately indicate chronic conditions such as diabetes can help inform decisions about patient care.

To accomplish your task, you need to perform the following operations:

1.  Give a short introduction to the dataset and prepare a summary of the features available in the dataset, including the data type (numerical/categorical) of each feature, count plots for diabetic and non-diabetic people, and the amount of missing data in individual fields.