DS 303 Homework 8

Fall 2022

Problem 1: Concept Review

a. Explain in plain language how we obtain estimates for β when fitting a logistic regression model.

b. When using logistic regression, what threshold will give us the smallest overall misclassification rate? Explain briefly why.

c. Suppose we are trying to build a classifier where Y can take on two classes: ‘sick’ or ‘healthy’. In this context, we consider a positive result to be testing sick (you have the virus) and a negative result to be testing healthy (you don’t have the virus). After fitting the model with LDA in R, we compare our classifier's predictions with the actual outcomes of the individuals, as shown below:

#rows are predicted, columns are true outcomes
#so the number of actually sick people is 65
lda.pred  sick healthy
  sick      40      32
  healthy   25     121

What is the misclassification rate for the LDA classifier above? In the context of this problem, which is more troubling: a false positive or a false negative? Depending on your answer, how could you go about decreasing the false positive or false negative rate? Comment on how this will likely affect the overall misclassification rate (consider which threshold will have the lowest overall misclassification rate).

Consider the dataset:

x y

-2

-1

red

blue

red

blue

We use logistic regression to fit a model to this data: that is, Y is a binary variable that is either red or blue. Our model is estimating:

$$\frac{1}{1 + \exp(\beta_0 + \beta_1 x_i)} \quad \text{and} \quad \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}$$

for all i = 1, 2, 3, 4, 5. What value(s) of β0 and β1 would maximize the likelihood (and therefore be the estimates we would get from fitting this model)? Recall that our likelihood looks like:

$$\ell(\beta_0, \beta_1, X) = P(Y_1 = \text{red} \mid \beta_0, \beta_1, x_1) \times P(Y_2 = \text{blue} \mid \beta_0, \beta_1, x_2) \times \cdots \times P(Y_5 = \text{blue} \mid \beta_0, \beta_1, x_5).$$

Hint: What is P(Yi = blue | xi > 4)? Now what is P(Y2 = blue | x2 = 5)? What values of β0 and β1 will get us close to this probability?

d. Suppose we collect data for a group of students in a statistics class with variables X1 = hours studied, X2 = undergrad GPA, and Y = receive an A. We fit a logistic regression model and produce estimated regression coefficients, β̂0 = -6, β̂1 = 0.05, and β̂2 = 1.

i. Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A in the class.

ii. How many hours would the student in part (i) need to study to have a 50% chance of getting an A in the class?
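The calculations in part d rest on the inverse-logit form of the logistic regression model. A minimal sketch in R, assuming made-up coefficients (b0, b1, and b2 below are hypothetical and deliberately differ from the values in part d, and the helper names are illustrative only), might look like this:

```r
# Hypothetical coefficients for illustration only (not the values from part d).
b0 <- -4
b1 <- 0.1
b2 <- 0.5

# Estimated probability P(Y = 1 | x1, x2) under a logistic regression model.
# plogis(z) computes exp(z) / (1 + exp(z)).
prob_success <- function(x1, x2) {
  plogis(b0 + b1 * x1 + b2 * x2)
}

# Value of x1 at which the probability equals p, holding x2 fixed.
# Solves b0 + b1 * x1 + b2 * x2 = qlogis(p), where qlogis(p) = log(p / (1 - p)).
x1_for_prob <- function(p, x2) {
  (qlogis(p) - b0 - b2 * x2) / b1
}

p_hat   <- prob_success(x1 = 20, x2 = 3)
x1_half <- x1_for_prob(p = 0.5, x2 = 3)
print(p_hat)
print(x1_half)
```

Since the log-odds are zero when the probability is 0.5, the second helper reduces to finding where the linear predictor equals zero.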

Problem 2: Email Spam

We will use a well-known dataset to practice classification. You can find it here: https://archive.ics.uci.edu/ml/datasets/Spambase. Read the attribute information and download the dataset onto your computer. To load this data into R, use the following code:

spam = read.csv('.../spambase.data', header=FALSE)

The last column of the spam data set, called V58, denotes whether the e-mail was considered spam (1) or not (0).

a. What proportion of emails are classified as spam and what proportion of emails are non-spam?

b. Carefully split the data into training and testing sets. Check to see that the proportions of spam vs. non-spam in your training and testing sets are similar to what you observed in part (a). Report those proportions here.

c. Fit a logistic regression model on the training set and apply it to the test set. Use the predict() function to predict the probability that each email in the test set is spam. Print the first ten predicted probabilities here.

d. We can convert these probabilities into labels. If the predicted probability is greater than 0.5, then we predict the email is spam (ŷi = 1), otherwise it is not spam (ŷi = 0). Create a confusion matrix based on your results. What’s the overall misclassification rate? Break this down and report the false negative rate and false positive rate.

e. What type of mistake do we think is more critical here: reporting a meaningful email as spam or a spam email as meaningful? How can we adjust our classifier to accommodate this?
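The steps in parts (b) through (d) can be sketched on simulated data. The block below is only an illustration under stated assumptions: the data frame sim, its columns, the 70/30 split, and the simulated coefficients are all made up, and none of the printed numbers say anything about the spam data.

```r
set.seed(1)

# Simulated stand-in for a binary classification data set (NOT the spam data).
n <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * sim$x1 - 1 * sim$x2))

# Stratified 70/30 split: sample training rows separately within each class
# so both sets keep roughly the same class proportions.
train_idx <- unlist(lapply(split(seq_len(n), sim$y), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

prop_train <- prop.table(table(train$y))
prop_test  <- prop.table(table(test$y))

# Fit logistic regression on the training set only.
fit <- glm(y ~ x1 + x2, data = train, family = binomial)

# type = "response" returns predicted probabilities rather than log-odds.
probs <- predict(fit, newdata = test, type = "response")

# Convert probabilities to labels with a 0.5 threshold.
pred <- ifelse(probs > 0.5, 1, 0)

# Confusion matrix: rows are predicted, columns are true outcomes.
conf <- table(predicted = factor(pred, levels = c(0, 1)),
              actual    = factor(test$y, levels = c(0, 1)))

misclass <- mean(pred != test$y)
fnr <- conf["1" == "0" | TRUE, ][1, "1"] / sum(conf[, "1"])  # actual 1s predicted as 0
fpr <- conf["1", "0"] / sum(conf[, "0"])                     # actual 0s predicted as 1

print(conf)
print(c(misclass = misclass, fnr = fnr, fpr = fpr))
```

Raising or lowering the 0.5 cutoff in the ifelse() call trades false positives against false negatives, which is the adjustment part (e) asks about.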

Problem 3: Weekly data set

We will use the Weekly data set from the ISLR2 library. It contains 1,089 weekly returns for 21 years, from the beginning of 1990 to 2010.

a. Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to report your results.

b. Set a threshold that minimizes the overall misclassification rate. Compute the confusion matrix and overall correct classification rate. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

c. Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Set a threshold that minimizes the overall misclassification rate. Compute the confusion matrix and overall correct classification rate on the test set (that is, data from 2009 and 2010).
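One way to examine how the misclassification rate varies with the threshold is to sweep a grid of candidate values. The sketch below uses simulated data with a Year column rather than the Weekly data; the helper misclass_at, the grid, the year ranges, and the choice to pick the threshold on the training period are all assumptions made for illustration, not a prescribed method.

```r
set.seed(2)

# Simulated stand-in with a Year column and a binary response (NOT the Weekly data).
n <- 600
sim <- data.frame(Year = sample(2001:2006, n, replace = TRUE), x = rnorm(n))
sim$y <- factor(ifelse(rbinom(n, 1, plogis(0.3 + 0.8 * sim$x)) == 1, "Up", "Down"),
                levels = c("Down", "Up"))

# Time-based split: earlier years for training, later years held out.
train <- sim[sim$Year <= 2004, ]
test  <- sim[sim$Year >  2004, ]

# glm models the probability of the second factor level ("Up").
fit <- glm(y ~ x, data = train, family = binomial)

# Hypothetical helper: misclassification rate at a given threshold.
misclass_at <- function(threshold, probs, actual) {
  pred <- ifelse(probs > threshold, "Up", "Down")
  mean(pred != as.character(actual))
}

# Sweep thresholds on the training period and keep the one with the lowest error.
grid <- seq(0.05, 0.95, by = 0.05)
train_probs <- predict(fit, type = "response")
train_err <- sapply(grid, misclass_at, probs = train_probs, actual = train$y)
best <- grid[which.min(train_err)]

# Evaluate the chosen threshold on the held-out years.
test_probs <- predict(fit, newdata = test, type = "response")
test_pred <- factor(ifelse(test_probs > best, "Up", "Down"),
                    levels = c("Down", "Up"))
conf <- table(predicted = test_pred, actual = test$y)
accuracy <- sum(diag(conf)) / sum(conf)

print(best)
print(conf)
print(accuracy)
```

Fixing the factor levels before calling table() keeps the confusion matrix 2 x 2 even if one class is never predicted.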

2022-11-03
