
CSE 5243 - Introduction to Data Mining

Homework 1: Exploratory Data Analysis

Fall 2022

Introduction

This homework will focus on a modified version of the Kaggle dataset "Pima Indians Diabetes Database". It can be found here. The overarching objective is to diagnostically predict whether or not a patient has diabetes based upon several other covariates. The full description is shown on the website.

Your task will be to: 1) Do the prerequisite EDA to understand the dataset you will be working on.

2) Fit an appropriate logistic model and analyze it.

While some of the questions have exact answers, a few others are more open to interpretation. However, what we're looking for is correct thinking and analysis. For the objective questions, while some points are awarded for "the correct number", the majority of the points will be awarded for a proper analysis and logical investigation.

Note: The data has been modified in both some subtle and not-so-subtle ways. You're welcome to look at other previous work online (on Kaggle, Stack Overflow, etc. -- and in fact that's critical to learning how to write good code!) but be wary about just using other people's work. Not only would it be a violation of the academic code of conduct, but it may also lead you down the wrong path.

Collaboration

For this assignment, you should work as an individual. You may informally discuss ideas with classmates, but your work should be your own.

What you need to turn in:

1) Code

For this homework, the code is the Jupyter Notebook. Use the provided Jupyter Notebook template, and fill in the appropriate information.


You may use common Python libraries for I/O, data manipulation, data visualization, etc. (e.g., NumPy, Pandas, MatPlotLib,… See reference below.)

You may not use library operations that perform, in effect, the core computations for this homework (e.g., if the assignment is to write a K-Means algorithm, you may not use a library operation that, in effect, does the core work needed to implement a K-Means algorithm). When in doubt, ask the grader or instructor.

The code must be written by you, and any significant code snippets you found on the Internet and used to understand how to do your coding for the core functionality must be attributed. (You do not need to attribute basic functionality: matrix operations, I/O, etc.) The code must be commented sufficiently to allow a reader to understand the algorithm, step by step, without reading the actual Python.

When in doubt, ask the grader or instructor.

2) Written Report

For this homework, the report is the Jupyter Notebook. The report should be well-written. Please proof-read and remove spelling and grammar errors and typos.

The report should discuss your analysis and observations. Key points and findings must be written in a style suitable for consumption by non-experts. Present charts and graphs to support your observations. If you performed any data processing, cleaning, etc., please discuss it within the report.

Grading

1. Overall readability and organization of your report (10%)

Is it well organized and does the presentation flow in a logical manner?

Are there no grammar and spelling mistakes?

Do the charts/graphs relate to the text?

Are the summarized key points and findings understandable by non-experts?

Do the Overview and Conclusions provide context for the entire exercise?

2. Domain Understanding Phase (10%)

Did you provide a reasonable level of information?

3. Data Understanding Phase (30%)

Did you find novel and/or interesting insights, or did you solely focus on simple summarizations of the data?

Did you draw and present potential conclusions or observations from your analysis of the data?

Did the statistics and visualizations you used make sense in the context of the data?



4. Data Analysis Phase (40%)

Did you correctly do the data cleaning steps and perform the appropriate logistic regression?

Was your analysis of the significant variables appropriate?

How have you justified your feature transformation and/or feature creation steps?

5. Conclusions (10%)

Did you summarize your critical findings appropriately?

Did you provide appropriate conclusions and next steps?

How to turn in your work on Carmen:

Submit to Carmen the Jupyter Notebook, the HTML printout of your Jupyter Notebook, and any supporting files that you used to process and analyze this data. You do not need to include the input data. All submitted files (code and/or report) except for the data should be archived in a *.zip file and submitted via Carmen. Use this naming convention:

• Project1_Surname_DotNumber.zip

The submitted file should be less than 10 MB.

Section 0: Setup

Add any needed imports, helper functions, etc., here.

In this section, the necessary libraries are imported. The pandas library is used for data manipulation and analysis, numpy for numerical computing, and matplotlib.pyplot for plotting and data visualization. The seaborn library provides a higher-level interface to matplotlib. The statsmodels and scikit-learn imports provide the logistic regression model and the evaluation metrics (confusion matrix, ROC/AUC) used later in the analysis. The %matplotlib inline command displays the plots directly in the Jupyter Notebook.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score, roc_curve
%matplotlib inline
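As a minimal sketch of the data loading step (the actual file name and path are not given in this section, so diabetes.csv is an assumption), the modified dataset could be read and given a quick structural check as follows:

# Load the modified Pima Indians Diabetes data (file name is an assumption)
df = pd.read_csv("diabetes.csv")

# Quick structural checks: number of rows/columns, column types, and first rows
print(df.shape)
print(df.dtypes)
df.head()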

Section: 1 - Domain Understanding

Write a few paragraphs providing an overview of the data. Some questions you should consider are: Where did the data come from? What do the rows represent? Why and how was the data collected? Who might use this data? What types of questions might users be able to analyze with this data?

You should review the dataset description information on the webpage to get some context. Of course you will only have limited background on this topic (and you are not expected to become an expert), so do your best to imagine the context for the work, making reasonable assumptions as appropriate. At this stage, you are not analyzing individual attributes, but discussing the dataset in aggregate.

The dataset is from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnose whether or not a patient has diabetes based on various medical predictor variables. The data was collected from female patients at least 21 years old of Pima Indian heritage.

The dataset consists of 7 predictor variables (number of pregnancies, plasma glucose concentration, blood pressure, skin thickness, insulin, Body Mass Index (BMI), and diabetes pedigree function) and one target variable (Outcome). The outcome variable is a binary class variable, with 268 of the 768 instances being 1 and the rest being 0.

This dataset is useful for medical researchers and health professionals to analyze and diagnose diabetes in patients based on various medical factors. The data could be used to build machine learning models for diagnosing diabetes, identifying risk factors, and determining the most effective treatment methods. With this data, researchers can analyze the relationship between predictor variables and the outcome, and make informed decisions about patient care.
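As a small hedged check of the class balance described above (assuming the DataFrame is named df and the target column is Outcome, as in the loading sketch), the counts could be verified with:

# Verify the distribution of the binary target (expected here: 268 ones out of 768 rows)
print(df["Outcome"].value_counts())
print(df["Outcome"].value_counts(normalize=True))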

Section: 2 - Data Understanding

Perform exploratory data analysis of the dataset by looking at individual attributes and/or combinations of attributes. You should focus on identifying and describing interesting observations and insights that you might uncover from the data.

You should not simply provide the basic EDA information for all attributes in the data (although that's a good first step!). Instead, you should focus on those that are more interesting or important, and provide some discussion of what you observe. Pay particular attention to potentially interesting bivariate (two-variable) relationships, as well as the relationship between each attribute and the outcome.
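A minimal EDA sketch along these lines (assuming the df and Outcome names from the setup above; the specific attribute and plots shown are illustrative choices, not requirements) might look like:

# Per-attribute summary statistics, overall and split by outcome
print(df.describe())
print(df.groupby("Outcome").mean())

# Bivariate relationships: correlation heatmap across all attributes and the outcome
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Pairwise correlations")
plt.show()

# Relationship between one attribute and the outcome (Glucose used as an example)
sns.boxplot(x="Outcome", y="Glucose", data=df)
plt.show()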

Section: 2.1 - Describe the meaning and type of data for each feature.

The features of the dataset include:

Pregnancies: This feature represents the number of times a patient has been pregnant. It is numerical, integer-type data.

Glucose: This feature represents the plasma glucose concentration after