DS 303 Homework 9

Due: Nov. 14, 2022 on Canvas by 11:59 pm (CT)

Instructions: Homework is to be submitted on Canvas by the deadline stated above. Please clearly print your name and student ID number on your HW.

Show your work (including calculations) to receive full credit. Please work hard to make your submission as readable as you possibly can - this means no raw R output or code (unless it is asked for specifically or needed for clarity).

Code should be submitted with your homework as a separate file (for example, a .R file, text file, Word file, or .Rmd are all acceptable). You should mark sections of the code that correspond to different homework problems using comments (e.g. ##### Problem 1 #####).

Problem 0: Final Project

a. Who are your team members for the final project? List their names here.

b. What is your team name?

c. What topic have you chosen to present?

Problem 1: Concept Review

a. Suppose you just took on a new consulting client. He tells you he has a large dataset (say 100,000 observations) and he wants to use this to classify whether or not to invest in a stock based on a set of p = 10,000 predictors. He claims KNN will work really well in this case because it is non-parametric and therefore makes no assumptions on the data. Present an argument to your client on why KNN might fail when p is large relative to the sample size.

b. For each of the following classification problems, state whether you would advise a client to use LDA, logistic regression, or KNN and explain why:

i. We want to predict gender based on height and weight. The training set consists of heights and weights for 82 men and 63 women.

ii. We want to predict gender based on annual income and weekly working hours. The training set consists of 770 men and 820 women.

iii. We want to predict gender based on a set of predictors where the decision boundary is complicated and highly non-linear. The training set consists of 960 men and 1040 women.

c. If the true decision boundary between two groups is linear and the constant variance assumption holds, do you expect LDA or QDA to perform better on the testing set? Explain using concepts from the bias/variance tradeoff.

d. Same question as (c), but what if we compare the performance of LDA and QDA on the training set? Which will perform better?

e. True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

f. Create a data set that consists of two predictors (X1, X2) and a binary response variable Y. Let n = 16 and Y = 0 for 8 observations and Y = 1 for the remaining 8 observations. Create this data set in such a way that logistic regression cannot converge when applied to this data set. Explain why logistic regression cannot converge on this data set. Using logistic regression, obtain the predicted probabilities for this data set and report them here. You may copy/paste your output.

g. Apply LDA/QDA to the dataset you created in part (f). Are you able to get meaningful results? Report the misclassification rate for LDA and QDA.
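The convergence issue in part (f) can be seen on a much smaller toy example. The sketch below is only an illustration, assuming one hypothetical predictor x and 8 observations (not the two-predictor, n = 16 data set the problem asks for): when the classes are perfectly separated, the likelihood keeps increasing as the slope grows, so the fitted probabilities are pushed to 0 and 1 and R warns instead of settling on finite estimates.

```r
# Hypothetical toy data: every y = 0 has x <= 4 and every y = 1 has x >= 7,
# so the two classes are perfectly separated by x.
x <- c(1, 2, 3, 4, 7, 8, 9, 10)
y <- c(0, 0, 0, 0, 1, 1, 1, 1)

# glm() normally emits warnings here (non-convergence and/or fitted
# probabilities numerically 0 or 1); they are suppressed to keep output short.
fit <- suppressWarnings(glm(y ~ x, family = binomial))

# Fitted probabilities are pushed to (numerically) 0 or 1
p_hat <- fitted(fit)
print(round(p_hat, 4))

# The coefficient estimates are very large in magnitude and not meaningful
print(coef(fit))
```

Running the same fit without suppressWarnings() shows the warnings R reports for separated data.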

Problem 2: Practicing data simulations

Let us simulate data where we know the true P(Y = 1|X). Suppose Y can only take on 0 or 1. We have 3 predictors of interest. Fill in the following code to simulate classification data.

a. set.seed(1)
   x1 = rnorm(1000) # create 3 predictors
   x2 = rnorm(1000)
   x3 = rnorm(1000)

   # true population parameters
   B0 = 1
   B1 = 2
   B2 = 3
   B3 = 2

   # construct the true probability of Y = 1 using the logistic function.
   pr = ??

   # randomly generate our response y based on these probabilities
   y = rbinom(1000, 1, pr)
   df = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)

b. On the simulated data, fit a logistic regression model with Y as the response and X1, X2, X3 as the predictors. Compute the confusion matrix and the misclassification rate.

c. On the simulated data, apply LDA. Compute the confusion matrix and the misclassification rate.

d. On the simulated data, apply Naive Bayes. Compute the confusion matrix and the misclassification rate.

e. How do the 3 methods compare?
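The confusion matrix and misclassification rate asked for in parts (b)-(d) are computed the same way whichever classifier produced the predictions. A minimal sketch, assuming a small made-up set of labels and predicted probabilities (the vectors actual and prob are hypothetical and are not the simulated data above; the 0.5 cutoff is also an assumption):

```r
# Hypothetical true labels and predicted probabilities of Y = 1
actual <- c(0, 0, 1, 1, 1, 0, 1, 0)
prob   <- c(0.1, 0.6, 0.8, 0.4, 0.9, 0.2, 0.7, 0.3)

# Classify as 1 when the predicted probability exceeds the cutoff
pred <- as.integer(prob > 0.5)

# Confusion matrix: rows are predicted classes, columns are true classes
conf_mat <- table(Predicted = pred, Actual = actual)
print(conf_mat)

# Misclassification rate: share of observations where prediction != truth
misclass_rate <- mean(pred != actual)
print(misclass_rate)
```

For LDA or Naive Bayes, pred would be replaced by the predicted classes from that model; the last two steps stay the same.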

Problem 3: k-NN

Assume our outcome Y can take on Y = 0, Y = 1 or Y = 2 (3 categories). Suppose we have a training data set with 5 observations. We want to classify a test observation using KNN. Below are the distances between each of the 5 observations in the training set and the test observation.

Training Observation (i): 1 2 3 4 5

Yi label:

distance:

a. Based on the above, how would we classify our test observation using K = 1?

b. How would we classify our test observation using K = 3?

c. KNN is highly dependent on the choice of K. Discuss the bias/variance consideration we make in choosing K.
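The KNN rule used in parts (a) and (b) is a majority vote among the K training observations closest to the test point. A minimal sketch of that vote, assuming invented labels and distances (the values below are hypothetical and are not the ones from the table in this problem; knn_vote is an illustrative helper, not a library function):

```r
# Hypothetical labels and distances for 5 training observations
labels <- c(2, 0, 2, 1, 1)
dists  <- c(1.5, 0.4, 2.0, 1.1, 0.9)

# Majority vote among the k nearest training observations.
# Ties are broken in favor of the smallest label (which.max takes the first).
knn_vote <- function(labels, dists, k) {
  nearest <- order(dists)[seq_len(k)]   # indices of the k smallest distances
  votes <- table(labels[nearest])       # count labels among those neighbors
  as.numeric(names(votes)[which.max(votes)])
}

print(knn_vote(labels, dists, k = 1))
print(knn_vote(labels, dists, k = 3))
```

With these invented values the prediction changes between K = 1 and K = 3, which is the kind of sensitivity to K that part (c) asks about.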

Problem 4: Email Spam Part 2

Use the Spam data set, from HW 8, for this problem. Repeat your code from Problem 2 parts (a), (b), and (c).

a. What type of mistake do we think is more critical here: reporting a meaningful email as spam (false positive) or a spam email as meaningful (false negative)?

b. Fit a logistic regression model here and apply it to the test set. Based on your answer to part (a), plot the ROC curve of true positive rate vs. false positive rate or true negative rate vs. false negative rate.

c. Output the confusion matrix. What are the false positive and false negative rates when we set the threshold to be 0.5?

d. Adjust the threshold such that your chosen error (false positive or false negative) is no more than 0.03. You should choose the threshold carefully so that the true positive and true negative rates are also maximized. Report that threshold here.

e. Implement LDA and repeat parts (b)-(d).

f. Carry out QDA, Naive Bayes and KNN on the training set. You should experiment with values for K in the KNN classifier using cross-validation. Remember to standardize your predictors for KNN. For each classifier, report the confusion matrix and the overall test error rate.

g. Which classifier would you recommend for this data? Justify your answer.
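The threshold search in part (d) amounts to scanning candidate cutoffs, computing both error rates at each, and keeping the cutoffs that satisfy the constraint. A rough sketch under stated assumptions: the labels and scores below are simulated for illustration (not the Spam data), the constrained error is taken to be the false positive rate, and the grid of thresholds is an arbitrary choice.

```r
set.seed(1)

# Hypothetical labels and predicted probabilities (not the Spam data)
actual <- rbinom(500, 1, 0.4)
prob   <- plogis(rnorm(500, mean = ifelse(actual == 1, 1.5, -1.5)))

# Error rates at each candidate threshold
thresholds <- seq(0.01, 0.99, by = 0.01)
rates <- t(sapply(thresholds, function(th) {
  pred <- as.integer(prob > th)
  c(threshold = th,
    fpr = mean(pred[actual == 0] == 1),   # false positive rate
    fnr = mean(pred[actual == 1] == 0))   # false negative rate
}))

# Keep thresholds meeting the constraint, then pick the one with the
# smallest false negative rate (equivalently, the highest true positive rate)
feasible <- rates[rates[, "fpr"] <= 0.03, , drop = FALSE]
best <- feasible[which.min(feasible[, "fnr"]), ]
print(best)
```

If the false negative rate is the error being constrained instead, the roles of fpr and fnr in the last two steps are swapped.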
