CSE 5243 - Introduction to Data Mining Homework 1
CSE 5243 - Introduction to Data Mining
Homework 1: Exploratory Data Analysis
Fall 2022
Introduction
This homework will focus on a modified version of the Kaggle dataset "Pima Indians Diabetes Database". It can be found here. The overarching objective is to diagnostically predict whether or not a patient has diabetes based upon several other covariates. The full description is shown on the website.
Your task will be to:
1) First, do the prerequisite EDA to understand the data set you will be working on.
2) Fit an appropriate logistic model and analyze it.
While some of the questions have exact answers, a few others are more open to interpretation. However, what we're looking for is the correct thinking and analysis. For the objective questions, while some points are awarded for "the correct number", the majority of the points will be awarded for a proper analysis and logical investigation.
Note: The data has been modified in both subtle and not-so-subtle ways. You're welcome to look at previous work online (on Kaggle, Stack Overflow, etc.; in fact, that's critical to learning how to write good code!), but be wary about just using other people's work. It would be a violation of the academic code of conduct, and it may also lead you down the wrong path.
Collaboration
For this assignment, you should work as an individual. You may informally discuss ideas with classmates, but your work should be your own.
What you need to turn in:
1) Code
For this homework, the code is the Jupyter Notebook. Use the provided Jupyter Notebook template, and fill in the appropriate information.
You may use common Python libraries for I/O, data manipulation, data visualization, etc. (e.g., NumPy, Pandas, MatPlotLib,… See reference below.)
You may not use library operations that perform, in effect, the “core” computations for this
homework (e.g., If the assignment is to write a K-Means algorithm, you may not use a library operation that, in effect, does the core work needed to implement a K-Means
algorithm.). When in doubt, ask the grader or instructor.
The code must be written by you, and any significant code snippets you found on the Internet and used to understand how to do your coding for the core functionality must be attributed. (You do not need to attribute basic functionality such as matrix operations, IO, etc.)
The code must be commented sufficiently to allow a reader to understand the algorithm without reading the actual Python, step by step.
When in doubt, ask the grader or instructor.
2) Written Report
For this homework, the report is the Jupyter Notebook. The report should be well-written. Please proof-read and remove spelling and grammar errors and typos.
The report should discuss your analysis and observations. Key points and findings must be written in a style suitable for consumption by non-experts. Present charts and graphs to support your observations. If you performed any data processing, cleaning, etc., please discuss it within the report.
Grading
1. Overall readability and organization of your report (10%)
Is it well organized and does the presentation flow in a logical manner?
Are there no grammar and spelling mistakes?
Do the charts/graphs relate to the text?
Are the summarized key points and findings understandable by non-experts?
Do the Overview and Conclusions provide context for the entire exercise?
2. Domain Understanding Phase (10%)
Did you provide a reasonable level of information?
3. Data Understanding Phase (30%)
Did you find novel and/or interesting insights, or did you solely focus on simple summarizations of the data?
Did you draw and present potential conclusions or observations from your analysis of the data?
Did the statistics and visualizations you used make sense in the context of the data?
4. Data Analysis Phase (40%)
Did you correctly do the data cleaning steps and perform the appropriate logistic regression?
Was your analysis of the significant variables appropriate?
How have you justified your feature transformation and/or feature creation steps?
5. Conclusions (10%)
Did you appropriately summarize your critical findings?
Did you provide appropriate conclusions and next steps?
How to turn in your work on Carmen:
Submit to Carmen the Jupyter Notebook, the HTML printout of your Jupyter Notebook, and any supporting files that you used to process and analyze this data. You do not need to include the input data. All submitted files (code and/or report) except for the data should be archived in a *.zip file, and submitted via Carmen. Use this naming convention:
• Project1_Surname_DotNumber.zip
The submitted file should be less than 10MB.
Section 0: Setup
Add any needed imports, helper functions, etc., here.
In this section, the necessary libraries are imported. The pandas library is used for data manipulation and analysis. The numpy library is used for numerical computing. The matplotlib.pyplot library is used for plotting and data visualization. The seaborn library provides a higher-level interface to matplotlib. The statsmodels and scikit-learn imports are used for fitting and evaluating the logistic regression model (confusion matrix and ROC/AUC metrics). The %matplotlib inline command displays the plots inside the Jupyter Notebook.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score, roc_curve
%matplotlib inline
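A small helper of the kind this section asks for might look like the sketch below. It is only an illustration: the function name `first_look`, the inline `SAMPLE_CSV` text, and its values are made up here as stand-ins, since the real homework file and its path are not given in this document.

```python
import io

import pandas as pd

# Tiny inline stand-in for the real CSV. The rows are invented for
# illustration only and are NOT taken from the homework data.
SAMPLE_CSV = """Pregnancies,Glucose,BMI,Outcome
2,120,31.5,1
0,95,24.0,0
4,160,36.2,1
"""


def first_look(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "n_unique": df.nunique(),
        }
    )


# With the real data, the StringIO object would be replaced by the CSV path.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = first_look(df)
print(summary)
```

Calling `first_look` right after loading gives a quick check of which columns are integers, which are floats, and whether any values are recorded as missing, before any plotting starts.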
Section: 1 - Domain Understanding
Write a few paragraphs providing an overview of the data. Some questions you should
consider are: Where did the data come from? What do the rows represent? Why and how was the data collected? Who might use this data? What types of questions might users be
able to analyze with this data?
You should review the dataset description information on the webpage to get some
context. Of course you will only have limited background on this topic (and you are not expected to become an expert), so do your best to imagine the context for the work, making reasonable assumptions as appropriate. At this stage, you are not analyzing individual attributes, but discussing the dataset in aggregate.
The dataset is from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnose whether or not a patient has diabetes based on various medical predictor variables. The data was collected from female patients at least 21 years old of Pima Indian heritage. The dataset consists of 7 predictor variables (number of pregnancies, plasma glucose concentration, blood pressure, skin thickness, insulin, Body Mass Index (BMI), and diabetes pedigree function) and one target variable (Outcome). The outcome variable is a binary class variable, with 268 of 768 instances being 1 and the rest being 0. This dataset is useful for medical researchers and health professionals who want to analyze and diagnose diabetes in patients based on various medical factors. The data could be used to build machine learning models for diagnosing diabetes and identifying risk factors. With this data, researchers can analyze the relationship between the predictor variables and the outcome, and make informed decisions about patient care.
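The class balance quoted above (268 positive cases out of 768) works out to roughly 35% positives. The sketch below shows one way to check this with pandas. The helper name `class_balance` is hypothetical, and the series is a stand-in built from the quoted counts rather than the actual homework data, which has been modified and may differ.

```python
import pandas as pd


def class_balance(outcome: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class label in a binary outcome column."""
    counts = outcome.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Stand-in series using the counts quoted in the text: 500 zeros and 268 ones.
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")
balance = class_balance(outcome)
print(balance)
```

On the real data, the same helper would be called on the `Outcome` column of the loaded DataFrame.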
Section: 2 - Data Understanding
Perform exploratory data analysis of the dataset by looking at individual attributes and/or combinations of attributes. You should focus on identifying and describing interesting observations and insights that you might uncover from the data. You should not simply provide the basic EDA information for all attributes in the data (although that's a good first step!). Instead, you should focus on those that are more interesting or important, and provide some discussion of what you observe. Pay particular attention to potentially interesting bivariate (two-variable) relationships, as well as the relationship between each attribute and the outcome.
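As a rough illustration of looking at the relationship between each attribute and the outcome, the sketch below compares per-class means and ranks attributes by their correlation with the outcome. The DataFrame is synthetic, generated here only so the example runs on its own. The column names `Glucose`, `BMI`, and `Outcome` are assumed to match the homework file, and the numbers it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: NOT the homework dataset.
rng = np.random.default_rng(0)
n = 200
outcome = rng.integers(0, 2, size=n)
df = pd.DataFrame(
    {
        "Glucose": 100 + 30 * outcome + rng.normal(0, 15, size=n),
        "BMI": 30 + 3 * outcome + rng.normal(0, 6, size=n),
        "Outcome": outcome,
    }
)

# Mean of each attribute within each outcome class.
group_means = df.groupby("Outcome").mean()

# Pearson correlation of each attribute with the binary outcome, strongest first.
corr_with_outcome = (
    df.corr()["Outcome"].drop("Outcome").sort_values(ascending=False)
)

print(group_means)
print(corr_with_outcome)
```

Attributes whose class means differ clearly, or whose correlation with the outcome is comparatively large, are natural candidates for closer plots such as per-class histograms or box plots.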
Section: 2.1 - Describe the meaning and type of data for each feature.
The features of the dataset include:
Pregnancies: This feature represents the number of times a patient has been pregnant. It is numerical, integer-type data.
Glucose: This feature represents the plasma glucose concentration after
2023-02-09