
Data Mining the Diabetes Mellitus Database

Executive Summary

Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. Based on this process, this study endeavors to accurately predict the existence of diabetes mellitus in patients. After cleaning the data, 296 complete observations and 83 variables remained. These were then used to estimate a Random Forest model (supervised) and a K-means clustering model (unsupervised). The estimated models were then compared, based on their accuracy, to find the one that better predicts whether a patient is diabetic. The study concludes that the random forest model was 45.0% accurate in identifying whether a person is diabetic, while the clustering model was 48.33% accurate.

Summary of features

The analysis began by loading all the necessary packages and data into a Jupyter notebook. Knowledge Discovery in Databases (KDD) is a process for extracting critical and valuable knowledge from a large collection of previously gathered data. Once the data was loaded, the first five rows (5 rows × 88 columns) were inspected; they are presented in Appendix 1, and the respective descriptive statistics are presented in Appendix 2.

Data Cleaning/Preprocessing

Python was chosen because it is more accessible than many other programming languages: its syntax is simple and places value on natural language. Since it is easy to learn and use, Python code can also be written and implemented faster than code in many other languages (Pehlivan et al., 2014). In addition, it lets the user choose between object-oriented programming (OOP) and scripting, and it is also necessary to reconcile the source code (Kursa and Rudnicki, 2010). Developers can apply a change and obtain the results within a short time. NCS formative inquiry data was collected from a wide range of locations. The impact of the study location on participant recruitment was evaluated using a linear mixed-effects regression model in which the study site was a random effect. Additionally, we constructed filters based on topic-eligibility criteria to exclude individuals who were pre-screened and visible in the dataset but were not eligible. Also, keep in mind that any research in which participants are chosen at random is subject to the problem of participant bias.

Checking if there are any missing values in the variables

The analysis showed that gender, ethnicity, and age, among other variables, have missing values. Missing data can be caused by human error or by observations not being recorded in a certain field for some reason.

Variable                       Missing values
hospital_id                    0
gender                         30
ethnicity                      961
age                            2842
elective_surgery               0
...
leukemia                       0
lymphoma                       0
solid_tumor_with_metastasis    0
ventilated_apache              0
diabetes_mellitus              0

From this output, gender, ethnicity, and age, among other variables, were found to have different numbers of missing values.
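Per-column counts like the ones above are the kind of output pandas gives with `isnull().sum()`. The following is a minimal sketch, assuming the data sits in a pandas DataFrame; the small frame `df_demo` is made-up illustrative data, not the study's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
df_demo = pd.DataFrame({
    "hospital_id": [1, 2, 3, 4],
    "gender": ["M", None, "F", "F"],
    "age": [63.0, np.nan, np.nan, 41.0],
    "diabetes_mellitus": [0, 1, 0, 1],
})

# Number of missing values in each column.
missing_counts = df_demo.isnull().sum()
print(missing_counts)

# Share of missing values in each column, useful when deciding what to drop.
missing_share = df_demo.isnull().mean()
print(missing_share)
```

Looking at the share as well as the raw count makes it easier to judge whether a column is worth keeping.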

a) Removal of unnecessary features in the modelling process

From the data information seen earlier, encounter_id, hospital_id, gender, ethnicity and icu_type are neither float nor integer variables and hence were eliminated during the model development process.
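One possible way to keep only float and integer variables is pandas' `select_dtypes`. This is a sketch under assumptions: the frame `df_demo` and its values are hypothetical, and the identifier column is stored as text here purely for illustration.

```python
import pandas as pd

# Hypothetical toy data; column types are chosen for illustration only.
df_demo = pd.DataFrame({
    "encounter_id": ["a1", "a2", "a3"],
    "gender": ["M", "F", "F"],
    "icu_type": ["MICU", "CCU", "MICU"],
    "age": [63.0, 55.0, 41.0],
    "bmi": [27.1, 31.4, 22.8],
    "diabetes_mellitus": [0, 1, 0],
})

# Keep only the numeric (float or integer) columns.
numeric_df = df_demo.select_dtypes(include="number")

# Record which columns were removed, for reporting.
dropped = [c for c in df_demo.columns if c not in numeric_df.columns]
print(dropped)
```

An explicit `df_demo.drop(columns=[...])` with the column names listed by hand would work equally well and makes the removed features visible in the code.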

b) Removal of the missing values in the datasets.

Although K-nearest neighbor and Naïve Bayes are known to support data with missing values (Safri et al., 2018), it is wise to remove missing values, especially when they account for 60% or more of the dataset. This helps to reduce the error in the estimation of the model and the chances of machine learning algorithms failing. Imputation can also be used if reasonable guesses can be made for the missing data. Unwanted data is data that is redundant or worthless. If data is collected from many sources and not integrated correctly, duplicate data may occur (Bayhaqy et al., 2018). Duplicate data should be eliminated since it serves no purpose, adds to the overall amount of data, and increases the time needed to train the model. Furthermore, duplicated data can make the model less accurate, because the repeated values are given more weight in the analysis.
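The cleaning steps described above can be sketched as follows. This is an illustration only: it assumes the 60% threshold is applied per column, and `df_demo` with its values is made up rather than taken from the study.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values and one duplicated row.
df_demo = pd.DataFrame({
    "age": [63.0, 63.0, 41.0, np.nan],
    "bmi": [27.1, 27.1, np.nan, 30.2],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "diabetes_mellitus": [0, 0, 1, 1],
})

# 1) Drop columns in which 60% or more of the values are missing (assumed rule).
missing_share = df_demo.isnull().mean()
kept_cols = missing_share[missing_share < 0.60].index
cleaned = df_demo[kept_cols]

# 2) Drop exact duplicate rows so repeated values are not given extra weight.
cleaned = cleaned.drop_duplicates()

# 3) Drop the rows that still contain missing values.
cleaned = cleaned.dropna()
print(cleaned)
```

If too many rows are lost in step 3, imputation is the usual alternative to `dropna`.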

c) In the next step, we check if the response variable is balanced or not.

The results below illustrate the frequency distribution of diabetes mellitus variable.

Not diabetic: (172, 83)

Diabetic: (124, 83)

From the output above, we can see that there is a gap between the non-diabetic (172) and diabetic (124) counts.

The figure below shows a graphical representation of the diabetic and non-diabetic counts.

d) Balancing the response variable using oversampling method

After balancing the response variable through the code in the Jupyter notebook, the resulting frequency distribution of the diabetes variable (1 and 0) is as below:

1 152

0 144
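As an illustration of oversampling, the sketch below duplicates randomly chosen minority-class rows until both classes have the same size. It is not the notebook's code: the frame `df_demo` is made up, and this simple variant yields exactly equal class counts, so it does not reproduce the distribution reported above.

```python
import pandas as pd

# Hypothetical toy data: five non-diabetic rows and three diabetic rows.
df_demo = pd.DataFrame({
    "glucose": [90, 110, 150, 95, 180, 100, 170, 105],
    "diabetes_mellitus": [0, 0, 1, 0, 1, 0, 1, 0],
})

counts_before = df_demo["diabetes_mellitus"].value_counts()
majority_label = counts_before.idxmax()

majority = df_demo[df_demo["diabetes_mellitus"] == majority_label]
minority = df_demo[df_demo["diabetes_mellitus"] != majority_label]

# Sample the minority class with replacement up to the majority size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)

balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)
counts_after = balanced["diabetes_mellitus"].value_counts()
print(counts_after)
```

When a test set is held out, it is generally safer to oversample only the training portion, so that copies of the same row do not end up on both sides of the split.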

MODEL BUILDING

In this study, two models are estimated. The first is Random Forest and the second is K-means clustering.

Supervised Model Training and Evaluation

(a) Random Forest model

We start by splitting the data into training and test sets and then estimate the model as per the code below.

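from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# features: every column except the response variable
X = dfc1.drop('diabetes_mellitus', axis=1)
# response variable (the rows of test_over must line up with those of dfc1)
y = test_over['diabetes_mellitus']

# creating a RF classifier
rfc = RandomForestClassifier(n_estimators=83)

# holding out 20% of the observations for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)

# fitting the classifier on the training data
rfc.fit(X_train, y_train)

# performing predictions on the test dataset
y_pred = rfc.predict(X_test)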

r value and correlation coefficient

SN  Independent variable      Dependent variable  r value  Correlation coefficient relationship
1   Age                       Diagnosis result     0.041   Weak positive
2   Family history            Diagnosis result     0.37    Moderate positive
3   Glucose                   Diagnosis result     0.72    Strong positive
4   Cholesterol               Diagnosis result     0.43    Moderate positive
5   Blood pressure            Diagnosis result    −0.25    Weak negative
6   High density lipoprotein  Diagnosis result    −0.19    Weak negative
7   Triglyceride              Diagnosis result     0.21    Weak positive
8   Body mass index           Diagnosis result     0.079   Weak positive
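The r values in the table are Pearson correlation coefficients. The sketch below shows how such a coefficient can be computed and turned into a verbal label. The cut-offs of 0.3 and 0.7 are an assumption chosen because they agree with the labels in the table; the source does not state which thresholds were used.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def describe(r, weak_below=0.3, strong_from=0.7):
    """Verbal label for r; the two thresholds are assumed, not from the source."""
    if r == 0:
        return "no correlation"
    if abs(r) < weak_below:
        strength = "weak"
    elif abs(r) < strong_from:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

# r values copied from the table above.
table_r = {
    "Age": 0.041, "Family history": 0.37, "Glucose": 0.72,
    "Cholesterol": 0.43, "Blood pressure": -0.25,
    "High density lipoprotein": -0.19, "Triglyceride": 0.21,
    "Body mass index": 0.079,
}
labels = {name: describe(r) for name, r in table_r.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```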

Supervised machine learning models for predicting type 2 diabetes were created using logistic regression, support vector machine, K-nearest neighbor, random forest, naive Bayes, and gradient boosting. The methods were applied directly to the dataset using Python and its built-in tools. The resulting models were evaluated for prediction accuracy and ROC performance.

The model's performance measurement result

S/N  Supervised machine learning model  Accuracy (%)  ROC (%)
1    Logistic regression                80.88         80.73
2    Support vector machine             85.29         84.74
3    K-nearest neighbor                 82.35         81.94
4    Random forest                      88.76         86.28
5    Naive Bayes                        77.94         77.43
6    Gradient boosting                  86.76         86.28

For the prediction of type 2 diabetes, logistic regression, support vector machine (SVM), K-nearest neighbor (KNN), random forest (RF), naive Bayes, and gradient boosting models were used. Each model was evaluated on its accuracy and on the receiver operating characteristic (ROC) curve. With an accuracy of 88.76%, the random forest model proved to be the most accurate (Muhammad and Algehyne, 2021). The gradient boosting (86.76%), support vector machine (85.29%), and K-nearest neighbor (82.35%) models followed. Random forest and gradient boosting also had the highest ROC score, at 86.28% each. In terms of ROC, the next best models were support vector machine (84.74%), K-nearest neighbor (81.94%), logistic regression (80.73%), and naive Bayes (77.43%).
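Assuming the ROC percentages are areas under the ROC curve, they can be read as the probability that a randomly chosen diabetic case receives a higher score than a randomly chosen non-diabetic case. A minimal sketch of that pairwise definition follows; the labels and scores are made up for illustration.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise ranking (ties count as half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and predicted probabilities of class 1.
y_demo = [0, 0, 1, 1]
score_demo = [0.1, 0.4, 0.35, 0.8]
auc_demo = roc_auc(y_demo, score_demo)
print(auc_demo)
```

With scikit-learn, `sklearn.metrics.roc_auc_score(y_true, scores)` computes the same quantity from predicted probabilities.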

The KNN approach assumes that if most of the k samples most similar to a given sample belong to one class, then that sample belongs to the same class (Chen et al., 2013). For classification, the voting method is typically used: the class held by the majority of the k neighbors is taken as the prediction. For regression, the average method is typically used: the mean of the output values of the k neighbors is taken as the estimate (Safri et al., 2018). With an accuracy rate of 88.76 percent, the model based on random forest was determined to be the most accurate, while the models based on gradient boosting and random forest had the best receiver operating characteristic curves, both with a score of 86.28 percent. A model of this kind could help health care professionals and physicians diagnose and forecast type 2 diabetes accurately, which would benefit those suspected of having the disease.
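The voting and averaging rules described above can be sketched in a few lines of plain Python. The points, labels and values below are made up, and Euclidean distance is an assumed choice of similarity measure.

```python
import math
from collections import Counter

def k_nearest(train_points, query, k):
    """Indices of the k training points closest to the query (Euclidean)."""
    distances = sorted((math.dist(p, query), i) for i, p in enumerate(train_points))
    return [i for _, i in distances[:k]]

def knn_classify(train_points, labels, query, k=3):
    """Voting: return the class held by most of the k nearest neighbors."""
    votes = Counter(labels[i] for i in k_nearest(train_points, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_points, values, query, k=3):
    """Averaging: return the mean output value of the k nearest neighbors."""
    idx = k_nearest(train_points, query, k)
    return sum(values[i] for i in idx) / len(idx)

# Hypothetical two-feature training data forming two groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
classes = [0, 0, 0, 1, 1, 1]
outputs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_classify(points, classes, (1.5, 1.5)))
print(knn_regress(points, outputs, (8.5, 8.5)))
```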

Model Evaluation:

Let's check the accuracy of the model on the test dataset and also view the classification report.

ACCURACY OF THE MODEL: 0.45

              precision    recall  f1-score   support

           0       0.41      0.48      0.44        27
           1       0.50      0.42      0.46        33

    accuracy                           0.45        60
   macro avg       0.45      0.45      0.45        60
weighted avg       0.46      0.45      0.45        60
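To show how the figures in such a report relate to one another, the sketch below derives precision, recall, F1-score and accuracy from a confusion matrix. The counts are not taken from the notebook output. They are inferred as the set of whole numbers that matches the rounded figures and supports (27 and 33) in the random forest report, and should be read as an assumption.

```python
# Rows are true classes, columns are predicted classes.
# Counts inferred from the rounded report above (an assumption).
confusion = [
    [13, 14],  # true 0: 13 predicted as 0, 14 predicted as 1
    [19, 14],  # true 1: 19 predicted as 0, 14 predicted as 1
]

def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1, support) for one class."""
    true_positive = matrix[cls][cls]
    support = sum(matrix[cls])                   # rows with this true class
    predicted = sum(row[cls] for row in matrix)  # rows predicted as this class
    precision = true_positive / predicted
    recall = true_positive / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total

for cls in (0, 1):
    p, r, f1, n = per_class_metrics(confusion, cls)
    print(cls, round(p, 2), round(r, 2), round(f1, 2), n)
print("accuracy", round(accuracy, 2))
```

In the notebook itself, `sklearn.metrics.confusion_matrix(y_test, y_pred)` would give the actual counts.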

Unsupervised Model Training and Evaluation

(b) K-means Clustering method

In this section we estimate the K-means model using the training and test data constructed for the first model, as per the code below.

from sklearn.cluster import KMeans

# two clusters, one for each expected outcome (diabetic / not diabetic)
KMeans_Clustering = KMeans(n_clusters=2, random_state=0)
KMeans_Clustering.fit(X_train)

# prediction using kmeans and accuracy
kpred = KMeans_Clustering.predict(X_test)
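K-means assigns cluster numbers arbitrarily, so cluster 0 need not correspond to class 0. One common remedy, sketched below, is to map each cluster to the majority class among its training members before computing accuracy. The helper names and the toy lists are hypothetical; in the notebook the cluster ids would come from `KMeans_Clustering.predict(X_train)` and from `kpred`.

```python
from collections import Counter

def map_clusters_to_classes(train_clusters, train_labels):
    """Map each cluster id to the most common true class among its members."""
    mapping = {}
    for cluster in set(train_clusters):
        members = [lab for c, lab in zip(train_clusters, train_labels) if c == cluster]
        mapping[cluster] = Counter(members).most_common(1)[0][0]
    return mapping

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical cluster ids and true classes for the training rows.
train_clusters = [1, 1, 1, 0, 0, 0]
train_labels = [0, 0, 1, 1, 1, 1]
mapping = map_clusters_to_classes(train_clusters, train_labels)

# Hypothetical cluster ids and true classes for the test rows.
test_clusters = [1, 0, 0, 1]
test_labels = [0, 1, 1, 1]

raw_accuracy = accuracy(test_labels, test_clusters)
aligned_pred = [mapping[c] for c in test_clusters]
aligned_accuracy = accuracy(test_labels, aligned_pred)
print(mapping, raw_accuracy, aligned_accuracy)
```

Without this alignment, an accuracy close to or below 50% for two clusters may partly reflect mismatched labels rather than poor clustering.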

Model Evaluation:

Let's check the accuracy of the model on the test dataset and also view the classification report.

ACCURACY OF THE MODEL: 0.48333333333333334

Classification report:

              precision    recall  f1-score   support

           0       0.45      0.63      0.52        27
           1       0.55      0.36      0.44        33

    accuracy                           0.48        60
   macro avg       0.50      0.50      0.48        60
weighted avg       0.50      0.48      0.48        60

Predictive learning models for type 2 diabetes have been built using logistic regression, support vector machine, K-nearest neighbor, random forest, naive Bayes, and gradient boosting techniques, and such models are commonly compared on accuracy and on the receiver operating characteristic (ROC) curve (Kavakiotis et al. 2017). In this study, the K-means clustering model reached an accuracy of 48.33% on the 60 test observations. For the non-diabetic class (0), precision was 45%, recall 63%, and the F1-score 52%. For the diabetic class (1), precision was 55%, recall 36%, and the F1-score 44%. The model therefore recovered non-diabetic patients more reliably than diabetic ones.

In the model shown in the image, the glucose feature appears to be the first splitting attribute and therefore the most significant feature for the diagnosis of type 2 diabetes in patients. Among the approaches used to develop predictive learning models for type 2 diabetes mellitus (Kavakiotis et al. 2017), the random forest model was shown to be the most accurate of the models compared above. In this study, however, the random forest and K-means clustering models reached accuracies of only 45% and 48.33% respectively.

CONCLUSION

Logistic regression, support vector machine, K-nearest neighbor, random forest, naive Bayes, and gradient boosting techniques have been used to create predictive learning models for type 2 diabetes. Among these, models based on random forest were found to be the most accurate, with an accuracy rate of 88.76 percent. In this study, the estimated random forest model was found to be 45.0% accurate in identifying whether a person is diabetic, while the estimated K-means clustering model was found to be 48.33% accurate. Therefore, between the two models, K-means clustering is the better model for predicting whether a patient is diabetic or not.

References

Bayhaqy, A., Sfenrianto, S., Nainggolan, K., & Kaburuan, E. R. (2018, October). Sentiment analysis about E-commerce from tweets using decision tree, K-nearest neighbor, and naïve bayes. In 2018 international conference on orange technologies (ICOT) (pp. 1-6). IEEE.

Chen, H. L., Huang, C. C., Yu, X. G., Xu, X., Sun, X., Wang, G., & Wang, S. J. (2013). An efficient diagnosis system for detection of Parkinson’s disease using fuzzy k-nearest neighbor approach. Expert systems with applications, 40(1), 263-271.

Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., & Chouvarda, I. (2017). Machine learning and data mining methods in diabetes research. Computational and structural biotechnology journal, 15, 104-116.

Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36(11), 1-13.

Muhammad, L. J., & Algehyne, E. A. (2021). Fuzzy based expert system for diagnosis of coronary artery disease in Nigeria. Health and technology, 11(2), 319-329.

Pehlivan, U., Nuray, B., Cengiz, A., & Nazife, B. (2014). The analysis of feature selection methods and classification algorithms in permission-based Android malware detection. In 2014 IEEE Symposium on Computational Intelligence in Cyber Security (CICS) (pp. 1-8). IEEE.

Safri, Y. F., Arifudin, R., & Muslim, M. A. (2018). K-Nearest Neighbor and Naive Bayes Classifier Algorithm in Determining The Classification of Healthy Card Indonesia Giving to The Poor. Sci. J. Informatics, 5(1), 18.