CMP-7023B Data Mining 2022
Data Mining Second Assessed Exercise
Assessment elements/marks for the written report (to be completed by the marker)
Elements/Criteria and marks |
Comments |
Mark |
Part 1: Summary of features (10%) |
Only the size of the dataset is given in the summary. You should have included a description of the dataset and summary tables giving the number of features of each type, the proportion of missing data, etc. |
3 |
Part 2: Data pre-processing Stages (25%) |
Some features, such as encounter_id and hospital_id, were correctly removed, but so were gender and ethnicity, which can be medically significant and so should have been retained. It is difficult to understand what you are describing under 'Removal of missing values..': you appear to be writing about removing features with missing values, but you also write about removing duplicated data. You need to be clearer in your descriptions. In particular, you should state how many features or samples were removed at each stage of preprocessing. Class balancing was performed on the whole dataset, but it should only be applied to the training data. The number of samples of each class (diabetic/non-diabetic) was stated to be less than 200 by this point. Assuming this is correct, something has gone very wrong with the preprocessing/cleansing, but because the submitted Jupyter notebook could not be opened, I was unable to look at your code. |
6 |
Part 3: Supervised Model Training and Evaluation (35%) |
This section is very confusing. It presents results from a number of classifiers with accuracies in the 80% range, accompanied by a discussion that does not always make sense, and then under model evaluation presents the best classifier as having an accuracy of 45%. |
14 |
Part 4: Unsupervised Clustering (15%) |
Unsupervised learning was applied, but there is almost no description of what was done, and the presented figure has no caption and is not discussed. The discussion in this section appears to be a series of paraphrased quotations, and overall the section makes little sense. |
3 |
Overall Presentation, Conclusions (15%) |
I was unable to open the submitted Jupyter notebook, so I could not look at the code. Parts of this report are only semi-comprehensible. Some of the expressions used, such as 'Forest at spontaneous', 'arbitrary forest' and 'k-next-door neighbor' classifiers, appear to be paraphrased from other sources. |
6 |
Overall Comment |
|
32 |
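The Part 2 comment that class balancing should only be applied to the training data can be illustrated with a minimal sketch. It assumes random oversampling of the minority class as the balancing method (the report's actual method is not known) and uses hypothetical toy data; the helper names `stratified_split` and `oversample_minority` are illustrative only and do not come from any library. The point is the order of operations: split first, then balance the training portion, so that the test set keeps the original class proportions and contains no duplicated samples.

```python
import random
from collections import Counter


def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split rows into train/test so each class keeps roughly its original share."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's list is not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def oversample_minority(rows, label_of, seed=0):
    """Randomly duplicate rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)
    return balanced


def label_of(row):
    """Return the class label stored in the second field of a row."""
    return row[1]


# Hypothetical toy data: (sample_id, label), 1 = diabetic, 0 = non-diabetic.
rows = [(i, 1 if i % 5 == 0 else 0) for i in range(1000)]

# 1) Split first, 2) balance the training portion only.
train, test = stratified_split(rows, label_of)
train_balanced = oversample_minority(train, label_of)

train_counts = Counter(label_of(r) for r in train_balanced)
test_counts = Counter(label_of(r) for r in test)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Balancing before the split would let copies of the same sample appear in both the training and test sets, which inflates the measured accuracy. Printing the class counts after each stage, as in the last lines, is also a simple way to report how many samples remain at each step of preprocessing.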
2022-07-20