IEOR E4525

Assignment 1

2022

All files referred to in this homework can be found on CourseWorks.

Your hand-in should be made on Gradescope.

For theory questions, you will need to write math. Please make sure to show your derivations. To see examples of how to write LaTeX code in Jupyter, see this link. You may also take pictures of handwritten solutions if you find that easier. In that case, it is your responsibility to ensure that the pictures and handwriting are sufficiently legible for grading.

You must submit two files:

1. A file (i.e., a Jupyter notebook file or a Python script) with your data analysis for answering the questions.

2. A single PDF file with all your answers to all the questions, including the output of the above notebook. You can create a PDF file of your notebook using the workflow described here, or you can insert screenshots of the notebook in another PDF. Alternatively, you can insert your answers to theoretical questions as part of your notebook.

1 Lab 3.6 from ISLR

Go through the lab exercise in Section 3.6 of ISLR. The book is written to use the programming language R for these exercises. Since we will use Python for assignments, you must use Python to complete each of these data analysis tasks.

I have provided a corresponding set of Python commands that you can use. These are given in the form of a worked Jupyter notebook. You must write your own data analysis, but you are welcome to use my notebook as a reference (and it is also OK to use exactly the same commands in your notebook). If you figure out how to do the data analysis without the help of my notebook, you will most likely learn more. In that case, please have a look at the very first cell of my notebook to see the recommended Python libraries.

Questions

1. Compare the plots of the residuals vs. the fitted values for the regression medv ~ lstat + np.square(lstat) and the regression using only lstat as a predictor. What is the qualitative difference?

2. Does the fifth-order polynomial from your Python regression correspond to the one from the ISLR book? If not, why might this occur?
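As a rough illustration of what Question 1 is probing, the sketch below fits a straight line and a quadratic to synthetic curved data (made up here, not the Boston data from the lab) and measures how much curvature is left in the residuals. The helper name fit_ols and all numeric constants are assumptions chosen for illustration only; your own notebook may use a different fitting routine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic curved relationship (a stand-in, NOT the Boston data set).
x = rng.uniform(0.0, 10.0, size=200)
y = 30.0 - 4.0 * x + 0.3 * x**2 + rng.normal(0.0, 1.0, size=200)


def fit_ols(columns, y):
    """Least-squares fit with an intercept; returns (fitted values, residuals)."""
    X = np.column_stack([np.ones_like(y)] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, y - fitted


fitted_lin, resid_lin = fit_ols([x], y)                  # y ~ x
fitted_quad, resid_quad = fit_ols([x, np.square(x)], y)  # y ~ x + x^2

# Correlation between the residuals and the centered squared predictor.
# A large value means the residuals still contain a U-shaped pattern.
curvature = np.square(x - x.mean())
curv_lin = np.corrcoef(resid_lin, curvature)[0, 1]
curv_quad = np.corrcoef(resid_quad, curvature)[0, 1]
print(f"linear fit:    corr(residual, curvature) = {curv_lin:.3f}")
print(f"quadratic fit: corr(residual, curvature) = {curv_quad:.3f}")

# To see the same thing graphically (assuming matplotlib is installed):
#   import matplotlib.pyplot as plt
#   plt.scatter(fitted_lin, resid_lin); plt.axhline(0.0); plt.show()
```

In a residuals vs. fitted plot, the leftover curvature of the linear-only fit would typically appear as a systematic bend around zero, while the quadratic fit's residuals should look closer to an unstructured band.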

2 EDA with the Spam Filtering Data Set

The CSV file spam.csv contains a data set for emails that were categorized as spam or not spam. The documentation for this data set is in the file spam-info.pdf.

1. Look at the documentation. What is the variable of interest, i.e. the dependent variable?

2. For each of the independent variables, report something about it. Specifically, you should report on each variable's relationship with the response (i.e., dependent) variable. Pay special attention to variable type (binary, ordinal, real) when doing this. Your comments should contain at least some tables and graphs.

3. Investigate the variable 'spampct'.

(a) How many missing values does it have?

(b) Compare graphically the distribution of time.of.day for the cases where spampct is missing against the distribution of time.of.day when spampct is present. Do you see any differences?

(c) Make a scatter plot of time.of.day vs. spampct. How many unique points (x, y coordinates) are plotted? Explain a technique you might use to deal with the overplotting.
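The mechanics behind parts (a) and (c) can be sketched on made-up data. The arrays below only borrow the names time.of.day and spampct from the assignment; their values, the 10% missing rate, and the jitter widths are all assumptions, and jittering is just one of several possible responses to overplotting.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up stand-ins for the two columns (NOT read from spam.csv).
n = 500
time_of_day = rng.integers(0, 24, size=n).astype(float)       # whole hours
spampct = rng.choice([0.0, 25.0, 50.0, 75.0, 100.0], size=n)  # coarse values
spampct[rng.random(n) < 0.1] = np.nan                         # inject missing values

# (a) Count missing values (with pandas: df["spampct"].isna().sum()).
n_missing = int(np.isnan(spampct).sum())

# (c) Count distinct (x, y) coordinates among rows where spampct is present.
present = ~np.isnan(spampct)
points = np.column_stack([time_of_day[present], spampct[present]])
n_unique = np.unique(points, axis=0).shape[0]
print(f"{n_missing} missing, {points.shape[0]} plotted rows, {n_unique} unique points")

# Jittering: add small uniform noise so that coincident points separate.
# The half-widths (0.4 hours, 2.0 percentage points) are arbitrary choices.
jitter = rng.uniform(-1.0, 1.0, size=points.shape) * np.array([0.4, 2.0])
jittered = points + jitter
```

Plotting jittered instead of points (optionally with partial transparency) makes the density of repeated coordinates visible, at the cost of slightly misplacing each point.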

3 Exploring the Relationship Between Overfitting and Noise

Do exercise 13 from Section 3.7 of ISLR. The example code is for R, but below I provide a table of translations to Python. You will need to use the numpy documentation to look up how to use the various commands. Make sure you look up the documentation for your version of numpy.

R command      Python command
set.seed(1)    np.random.seed(1)
rnorm()        np.random.randn()
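A minimal sketch of how the two translations above might be used together. The wrapper rnorm below is a hypothetical helper written for this example (it is not part of numpy), and the values n = 100 and sd = 2.0 are arbitrary. Note that np.random.randn() always draws from a standard normal, so a different mean or standard deviation has to be applied by shifting and scaling.

```python
import numpy as np


def rnorm(n, mean=0.0, sd=1.0):
    """Hypothetical analogue of R's rnorm(n, mean, sd) built on np.random.randn."""
    return mean + sd * np.random.randn(n)


np.random.seed(1)                    # R: set.seed(1)
x = np.random.randn(100)             # R: rnorm(100)
eps = rnorm(100, mean=0.0, sd=2.0)   # R: rnorm(100, mean = 0, sd = 2)

# Re-seeding reproduces the same numpy draws.
np.random.seed(1)
x_again = np.random.randn(100)
print(np.array_equal(x, x_again))
```

The same seed will generally not reproduce the numbers that R generates, since the two environments use different generators, so expect your estimates to differ somewhat from any R output.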

4 Naive Bayes and Spam Filtering

1. Use the spam data from Question 2 and Naive Bayes to build a classifier that distinguishes spam from non-spam. You can use Naive Bayes from sklearn for this. Your code should split the data into training and test sets and then estimate the generalization error of your classifier.

2. Randomly assign 80% of your data to the training set and 20% to the test set, and estimate the test error, E_test, of your classifier. Repeat this 10 times. How much variability do you see in E_test? What conclusions can you draw from this?

3. There are two types of error that a spam classifier can make. Should these errors be treated equally when constructing a classifier? Can we adapt our Naive Bayes classifier to reflect this?
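The workflow in the three questions above can be sketched as follows, on synthetic data rather than spam.csv. Several things here are assumptions: GaussianNB is only one of the Naive Bayes variants in sklearn (the real features may call for a different one), the five Gaussian features are made up, and the 0.9 posterior threshold is an arbitrary illustration of treating the two error types unequally.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic two-class data standing in for the spam features (NOT spam.csv).
n = 1000
y = rng.integers(0, 2, size=n)                   # 1 = spam, 0 = not spam
X = rng.normal(size=(n, 5)) + y[:, None] * 1.0   # class 1 is shifted by 1

# Repeat a random 80/20 split 10 times and record the test error each time.
test_errors = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    clf = GaussianNB().fit(X_tr, y_tr)
    test_errors.append(1.0 - clf.score(X_te, y_te))

test_errors = np.array(test_errors)
print(f"E_test mean = {test_errors.mean():.3f}, std = {test_errors.std(ddof=1):.3f}")

# Unequal error costs: flag a message as spam only when the estimated
# posterior probability of spam is high (threshold chosen arbitrarily).
threshold = 0.9
p_spam = clf.predict_proba(X_te)[:, 1]   # column 1 corresponds to class 1
y_strict = (p_spam >= threshold).astype(int)
y_default = clf.predict(X_te)

fp_default = int(np.sum((y_default == 1) & (y_te == 0)))
fp_strict = int(np.sum((y_strict == 1) & (y_te == 0)))
print(f"false positives: default = {fp_default}, strict = {fp_strict}")
```

Raising the threshold can only remove messages from the flagged set, so it trades fewer legitimate emails marked as spam for more spam that gets through. Another option along the same lines would be to change the class priors passed to the classifier.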

5 Least Squares Linear Regression is MLE for Gaussian noise

Consider the linear regression model

$$Y = X^T \beta + \epsilon,$$

where $\beta, X \in \mathbb{R}^d$ are fixed, and the error $\epsilon \sim N(0, \sigma^2)$ is distributed according to a Gaussian distribution.

In class we saw how to derive the least squares estimator. In this exercise, you must prove that the least squares estimator is also the maximum-likelihood estimator, given that the error is Gaussian.
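One possible starting point, written under an assumption that is not stated in the exercise: suppose we observe n independent pairs $(x_i, y_i)$, $i = 1, \dots, n$, each generated by the model above with the same $\beta$ and $\sigma^2$. The likelihood and log-likelihood as functions of $\beta$ would then be:

```latex
\begin{align*}
L(\beta) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
            \exp\!\left( -\frac{(y_i - x_i^T \beta)^2}{2\sigma^2} \right), \\
\log L(\beta) &= -\frac{n}{2} \log(2\pi\sigma^2)
                 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 .
\end{align*}
```

The first term of the log-likelihood does not involve $\beta$, which suggests comparing the maximizer of $\log L(\beta)$ with the minimizer of the sum of squared residuals. Your write-up should justify each step of that comparison.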

6 k Nearest Neighbors and the Curse of Dimensionality