HW1

January 22, 2024

1 Homework 1

In this assignment, we will explore a car dataset and analyze fuel efficiency. Specifically, we will do some exploratory analysis with visualizations, then build one model each for Simple Linear Regression, Polynomial Regression, and Logistic Regression. The given dataset has already been modified and cleaned, but you can find the original information here.

1.1 Dataset Attribute Information

1. mpg : Miles per gallon. This is one primary measurement of car fuel efficiency.

2. displacement : The cylinder volumes in cubic inches.

3. horsepower : Engine power.

4. weight : In pounds.

5. acceleration : The elapsed time in seconds to go from 0 to 60mph.

6. origin : Region of origin.

1.1.1 Libraries that can be used: numpy, pandas, scikit-learn, seaborn, plotly, matplotlib

Any libraries used in the discussion materials are also allowed.

Other Notes

•  Don’t worry about not achieving high accuracy; it is neither the goal nor the grading standard of this assignment.

•  If not specified, you are not required to do hyperparameter tuning, but feel free to do so if you’d like.

•  Discussion materials should be helpful for doing the assignments.

2 Exercises

2.1 Exercise 1 - Exploratory Analysis (20 points in total)

2.1.1 Exercise 1.1 - Correlation Matrix (10 points)

Generate a Pearson correlation matrix plot in the form of a heatmap. See the link for an idea of what this visualization should look like. After generating the plot, answer the following question: If we are going to predict mpg with Simple Linear Regression (i.e., y = ax + b), which attribute are you most UNLIKELY to pick as the independent variable? Explain why.

Requirements & notes

•  When computing correlation, make sure to drop the column origin to avoid errors.

•  The computed correlation values should be shown on the plot.

•  Use a diverging color scale with the color range being [-1, 1] and center being 0 (if applicable).
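The requirements above can be covered in a few lines with pandas and seaborn. The following is a minimal sketch rather than a reference solution: the synthetic DataFrame only stands in for the real dataset (its file name is not given here), and the `coolwarm` colormap is just one possible diverging scale.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the car dataset (assumption: same column names as the handout).
rng = np.random.default_rng(0)
n = 120
weight = rng.uniform(1600, 5000, n)
df = pd.DataFrame({
    "mpg": 46 - 0.007 * weight + rng.normal(0, 2, n),
    "displacement": 0.08 * weight - 50 + rng.normal(0, 20, n),
    "horsepower": 0.03 * weight + rng.normal(0, 10, n),
    "weight": weight,
    "acceleration": rng.normal(15.5, 2.5, n),
    "origin": rng.choice(["USA", "Japan", "Europe"], n),
})

# Drop the non-numeric column before computing Pearson correlations.
corr = df.drop(columns=["origin"]).corr(method="pearson")

# annot=True prints the values; vmin/vmax/center fix the diverging scale to [-1, 1] around 0.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, center=0, ax=ax)
ax.set_title("Pearson correlation matrix")
# plt.show()  # uncomment when running outside a notebook
```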

2.1.2 Exercise 1.2 - Pairplot (10 points)

Generate a pairplot (a.k.a. scatter plot matrix) of the given dataset. After generating the plot, answer the following question: If we are using horsepower to predict mpg, which method could lead to the best performance (Linear Regression, Polynomial Regression, or Logistic Regression)? Explain why.

Note that there is no requirement on the diagonals. You can leave them empty or use other representations based on your preference. However, showing origin-grouped data distributions on the diagonals will help you answer some questions in the later exercises.

Requirements - The points should be colored based on the column origin.

2.2   Exercise 2 - Linear and Polynomial Regression  (30 points in total)

2.2.1 Exercise 2.1 - Splitting Dataset (5 points)

Split the data into training and testing sets with a ratio of 80:20.
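An 80:20 split can be sketched with scikit-learn's `train_test_split`. The DataFrame below is synthetic, and `random_state=42` is an arbitrary choice made only so the split is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 100 rows (assumption: the real dataset is loaded elsewhere).
rng = np.random.default_rng(2)
weight = rng.uniform(1600, 5000, 100)
df = pd.DataFrame({
    "weight": weight,
    "mpg": 46 - 0.007 * weight + rng.normal(0, 2, 100),
})

# test_size=0.2 keeps 20% of the rows for testing and 80% for training.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_df), len(test_df))
```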

2.2.2   Exercise 2.2 - Simple Linear Regression  (10 points)

Using one of the other attributes (excluding origin) of your choice, build a simple linear regression model that predicts mpg.

Requirements - Report the testing MSE.
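As an illustration of the workflow, the sketch below fits a one-feature linear model and reports the testing MSE. It uses synthetic data, and picking `weight` as the predictor is purely an assumption for the example, not a recommended answer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: mpg falls linearly with weight, plus noise (illustrative only).
rng = np.random.default_rng(3)
weight = rng.uniform(1500, 5000, 200)
mpg = 45 - 0.007 * weight + rng.normal(0, 2, 200)

X = weight.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
X_train, X_test, y_train, y_test = train_test_split(
    X, mpg, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"slope={model.coef_[0]:.4f}, intercept={model.intercept_:.2f}, test MSE={test_mse:.3f}")
```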

2.2.3   Exercise 2.3 - Polynomial Regression  (15 points)

Build polynomial regression models that predict mpg using the same attribute you chose in Exercise 2.2. Specifically, build one model for each degree from degree=2 to degree=4. Then, based on the reported errors from only these three degrees, do you think there is a sign of overfitting? Provide your reasoning.

Requirements

•  Report the training MSE for each of the three degrees.

•  Report the testing MSE for each of the three degrees.
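A possible way to organize the three fits is a loop over degrees with a scikit-learn pipeline. This is a sketch on synthetic data; the `StandardScaler` step is an added assumption that keeps high powers of a large-valued feature numerically stable, and it is not required by the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a curved relationship (illustrative only).
rng = np.random.default_rng(4)
weight = rng.uniform(1500, 5000, 200)
mpg = 60 * np.exp(-weight / 3000) + rng.normal(0, 2, 200)

X_train, X_test, y_train, y_test = train_test_split(
    weight.reshape(-1, 1), mpg, test_size=0.2, random_state=42
)

results = {}
for degree in (2, 3, 4):
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    results[degree] = {
        "train_mse": mean_squared_error(y_train, model.predict(X_train)),
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
    }
    print(degree, results[degree])
```

Comparing how the training and testing errors move as the degree grows is what the overfitting question asks you to reason about.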

2.3 Exercise 3 - Logistic Regression (30 points in total)

Now we are going to build a classification model on origin using all the other 5 attributes. Note that Logistic Regression is a binary classification algorithm.

2.3.1   Exercise 3.1 - Processing and Splitting the Dataset  (10 points)

In Exercise 3, we only consider observations that originate from either “USA” or “Japan”, so please remove the observations that originate from “Europe”. Then split the data into training and testing sets with a ratio of 80:20.
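The filtering and splitting steps might look like the sketch below. The data is synthetic, and passing `stratify=y` (to keep the class proportions similar in both sets) is an optional choice introduced here, not something the exercise asks for.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car dataset (assumption: same column names as the handout).
rng = np.random.default_rng(5)
n = 150
df = pd.DataFrame({
    "mpg": rng.normal(23, 7, n),
    "displacement": rng.normal(190, 100, n),
    "horsepower": rng.normal(105, 38, n),
    "weight": rng.normal(2970, 850, n),
    "acceleration": rng.normal(15.5, 2.5, n),
    "origin": rng.choice(["USA", "Japan", "Europe"], n),
})

# Keep only the USA and Japan rows.
subset = df[df["origin"] != "Europe"]
X = subset.drop(columns=["origin"])
y = subset["origin"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test), sorted(y.unique()))
```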

2.3.2 Exercise 3.2 - Logistic Regression (20 points)

Using all the other 5 attributes, build a Logistic Regression model that distinguishes between cars from Japan and cars from the USA. Then, if we were instead distinguishing between Japan and Europe, how do you think the model performance (in terms of accuracy) would change? Provide your reasoning. (Hint: Exercise 1)

Requirements - Report the testing precision and recall for both regions.
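Per-region precision and recall can be obtained with `precision_recall_fscore_support`. The sketch below runs on synthetic two-class data; the scaling step and `max_iter=1000` are assumptions made to help the solver converge, not part of the exercise statement.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data: the two regions differ mainly in weight (illustrative only).
rng = np.random.default_rng(6)
n = 200
origin = rng.choice(["USA", "Japan"], n, p=[0.6, 0.4])
is_usa = origin == "USA"
weight = np.where(is_usa, rng.normal(3400, 600, n), rng.normal(2200, 300, n))
X = pd.DataFrame({
    "mpg": 46 - 0.007 * weight + rng.normal(0, 2, n),
    "displacement": 0.08 * weight - 50 + rng.normal(0, 20, n),
    "horsepower": 0.03 * weight + rng.normal(0, 10, n),
    "weight": weight,
    "acceleration": rng.normal(15.5, 2.5, n),
})

X_train, X_test, y_train, y_test = train_test_split(
    X, origin, test_size=0.2, random_state=42
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

labels = ["Japan", "USA"]
precision, recall, _, _ = precision_recall_fscore_support(
    y_test, y_pred, labels=labels, zero_division=0
)
for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
```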

2.4   Exercise 4 - Overfitting and Underfitting  (10 points in total)

The fitting dataset contains the actual train and test data spread for a model along with three rotations of the same. The dataset is provided in the Canvas file.

2.4.1 Exercise 4.1 - SSE and variance

Calculate the SSE (sum of squared errors) and variance for the three predictions based on the actual data. Show the calculation for these metrics. Highlight the values you get for all three predictions and the actual data.
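The two metrics can be sketched with NumPy as follows. The arrays are made-up toy values, the prediction names are hypothetical, and using the population variance (`ddof=0`) is an assumption; pass `ddof=1` to `np.var` if the sample variance is intended.

```python
import numpy as np

def sse(actual, predicted):
    """Sum of squared errors between actual values and predictions."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sum((actual - predicted) ** 2))

# Toy values, not taken from the fitting dataset.
actual = np.array([1.0, 2.0, 3.0, 4.0])
predictions = {
    "prediction_1": np.array([1.0, 2.0, 3.0, 5.0]),
    "prediction_2": np.array([2.5, 2.5, 2.5, 2.5]),
    "prediction_3": np.array([1.0, 2.0, 3.0, 4.0]),
}

print("actual variance:", np.var(actual))
for name, pred in predictions.items():
    print(name, "SSE:", sse(actual, pred), "variance:", np.var(pred))
```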

2.4.2 Exercise 4.2 - Justification

Based on the values calculated above, classify the predictions into three categories: base prediction, overfitting prediction, and underfitting prediction. Also provide appropriate justifications for the classifications.

2.5 Exercise 5 - Outliers (10 points in total)

Now we are going to perform outlier detection using the diabetes dataset. The dataset is provided in the Canvas file.

2.5.1 Exercise 5.1 - box plot

Extract the ‘BloodPressure’ attribute from the diabetes dataset. Create a box plot with the s1 serum attribute. Highlight the outliers in the box plot with special colors.
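Highlighting outliers in a matplotlib box plot can be done through `flierprops`. The sketch below uses a small made-up series in place of the real column (any numeric column can be swapped in) and also lists the outliers using the usual 1.5 × IQR rule, which is the default whisker rule in matplotlib.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values with two obvious extremes (not from the diabetes dataset).
bp = pd.Series(np.append(np.arange(60, 91), [10, 150]), name="BloodPressure")

# Points beyond 1.5 * IQR from the quartiles are treated as outliers.
q1, q3 = bp.quantile(0.25), bp.quantile(0.75)
iqr = q3 - q1
outliers = bp[(bp < q1 - 1.5 * iqr) | (bp > q3 + 1.5 * iqr)]
print(outliers.tolist())

# flierprops controls how the outlier markers are drawn.
fig, ax = plt.subplots()
ax.boxplot(
    bp.to_numpy(),
    flierprops=dict(marker="o", markerfacecolor="red", markeredgecolor="red"),
)
ax.set_ylabel(bp.name)
# plt.show()  # uncomment when running outside a notebook
```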

2.5.2 Exercise 5.2 - anomaly detection

Extract the features ‘BMI’ and ‘Insulin’ from the diabetes dataset. Implement anomaly detection using the One-Class SVM algorithm. Create a scatter plot similar to Lecture 2 Slide 11, annotating the outlier data points.
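A rough sketch of One-Class SVM anomaly detection with scikit-learn is shown below. The data is synthetic, and the hyperparameters (`nu=0.05`, RBF kernel, `gamma="scale"`) as well as the feature scaling are assumptions for illustration; the plot styling is not meant to reproduce the lecture slide.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the two features (not the real diabetes data).
rng = np.random.default_rng(7)
n = 200
data = pd.DataFrame({
    "BMI": rng.normal(32, 6, n),
    "Insulin": rng.gamma(2.0, 60.0, n),
})

# Scale first so both features contribute comparably to the RBF kernel.
X = StandardScaler().fit_transform(data[["BMI", "Insulin"]])
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
labels = detector.fit_predict(X)  # +1 = inlier, -1 = outlier
is_outlier = labels == -1

fig, ax = plt.subplots()
ax.scatter(data.loc[~is_outlier, "BMI"], data.loc[~is_outlier, "Insulin"],
           s=12, label="inlier")
ax.scatter(data.loc[is_outlier, "BMI"], data.loc[is_outlier, "Insulin"],
           s=24, color="red", label="outlier")
for _, row in data[is_outlier].iterrows():
    ax.annotate("outlier", (row["BMI"], row["Insulin"]),
                textcoords="offset points", xytext=(4, 4), fontsize=7)
ax.set_xlabel("BMI")
ax.set_ylabel("Insulin")
ax.legend()
# plt.show()  # uncomment when running outside a notebook
```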
