MATH 3333 3.0 - Winter 2022-23
Assignment 2
Question 1: In maximum likelihood estimation, the estimator is obtained by maximizing the log likelihood function. However, in most cases the log likelihood has to be optimized numerically by the Newton-Raphson algorithm. In this question, we will learn to program the Newton-Raphson algorithm for a univariate function. Consider the function f(θ) = (3θ^2 - 1)/(1 + θ^3), θ > 0. Use the Newton-Raphson method to find the maximizer of the function. Implement the algorithm in R and run your code to obtain the maximizer. (Attach screenshots of the output in R.) You can use an online symbolic differentiation calculator to obtain f′(θ) and f″(θ) (https://www.symbolab.com/solver/second-derivative-calculator).
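For reference, the Newton-Raphson update for maximizing f iterates θ_{k+1} = θ_k - f′(θ_k)/f″(θ_k) until |θ_{k+1} - θ_k| falls below a small tolerance. In the code below, gx, gp and gpp implement f, f′ and f″ respectively.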
gx<-function(x){
p<-(3*x^2-1)/(1+x^3)
return(p)
}
gp<-function(x){
p<-(6*x-3*x^4+3*x^2)/(1+x^3)^2
return(p)
}
gpp<-function(x){
p<-6*(x^6-2*x^4-7*x^3+x+1)/(1+x^3)^3
return(p)
}
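### plot f on (0, 10] to locate a good starting value for Newton-Raphson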
x<-(1:1000)/100
y<-numeric(1000)
for (i in 1:1000){
y[i]<-gx(x[i])
}
plot(x,y,type="l")
### Newton-Raphson iteration: the starting value is 1.5
diff<-4
iter<-0
x<-1.5
while ((diff>0.001) && (iter<30)) {
oldx<-x
x<-x-gp(x)/gpp(x)
diff<-abs(x-oldx)
iter<-iter+1
print(c(iter,diff))
}
gx(x)
[1] 1.00000000 0.02065303
[1] 2.0000000000 0.0007258094
> gx(x)
[1] 1.314596
> x
[1] 1.521379
We choose the starting point to be 1.5 by examining the plot of the function. The maximizer is x = 1.521379, and the maximum value is f(x) = 1.314596.
Question 2: In the following marketing data set, we have 9 years with the sales (in units of 10 million euro) and the advertising expenditure (in million euro).
Year Sales Advertisement
1 61 19
2 73 26
3 85 30
4 106 34
5 120 43
6 129 48
7 142 52
8 144 57
9 161 68
a) Based on the 9 observations, perform a ridge regression. Program it with R. Output the ridge regression results at a few different values of λ.
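The code below evaluates the closed-form ridge estimator θˆ = (X^T X + λI)^{-1} X^T Y directly, where X (xmat in the code) is the design matrix with an intercept column and I is the 2 by 2 identity matrix.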
xmat<-cbind(rep(1,9),c(19,26,30,34,43,48,52,57,68))
y<-c(61,73,85,106,120,129,142,144,161)
lambda=0.1
solve(t(xmat)%*%xmat+lambda*diag(2))%*%(t(xmat)%*%y)
[,1]
[1,] 21.910647
[2,] 2.179345
###now change lambda
lambda=100
solve(t(xmat)%*%xmat+lambda*diag(2))%*%(t(xmat)%*%y)
[,1]
[1,] 0.2989559
[2,] 2.6217873
b) In your ridge regression, when λ increases, what do you observe from the values of the estimated coefficients? Do any of the estimated coefficients shrink to exactly zero, as in the L1-penalized LASSO regression? Describe the difference between the output of a ridge regression and the output of a LASSO regression.
Ridge regression penalizes the L2 norm of the estimates, so the estimates have a smaller L2 norm compared to the least squares estimates. On the other hand, LASSO regression penalizes the L1 norm of the estimates, so the estimates have a smaller L1 norm compared to the least squares estimates. Unlike LASSO, even though the ridge estimates shrink towards zero, they are never exactly zero. LASSO produces sparse estimates where some of the parameters are estimated to be exactly zero.
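A short sketch (reusing xmat and y from part a) that traces this shrinkage over a grid of λ values; both estimates move towards zero as λ becomes very large, but neither is ever exactly zero:
for (lambda in c(0.1, 1, 10, 100, 1000, 10000)) {
est <- solve(t(xmat) %*% xmat + lambda * diag(2)) %*% (t(xmat) %*% y)
print(c(lambda, est))   ### lambda, intercept estimate, slope estimate
}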
Question 3:
a. In this question, we will investigate the problem of multiple testing. Consider the hypothesis testing of H0 : µ = 0 vs Ha : µ ≠ 0. Under the null hypothesis, the Z test statistic is a standard normal random variable. We reject the null hypothesis when |Z| is greater than 1.96 at the significance level of 0.05. Write an R program to simulate 1000 Z test statistics from the standard normal N(0, 1). (You can use the sample code in our lecture notes.)
ztests=rnorm(1000,0,1)
(Note: there are many other ways of coding this. You can also follow my lecture notes and compute each Z test statistic from a simulated data set.)
b. If we perform hypothesis testing at the significance level of 0.05, how many of the 1000 test statistics you generated are rejected?
totalrejection<-sum(abs(ztests)>=1.96)
> totalrejection
[1] 42
c. Apply the Bonferroni method to control the overall Type I error rate to be 0.05. Based on your program, how many rejections do you obtain from your simulation? To use the Bonferroni correction, we have to use the cutoff z_{0.05/(2*1000)}, the upper 0.05/(2*1000) quantile of the standard normal.
bonferroni<-qnorm((1-0.05/(2*1000)),0,1)
totalrejection<-sum(abs(ztests)>=bonferroni)
> totalrejection
[1] 0
d. As the Zs are generated under the null hypothesis, we consider these rejections to all be false positive discoveries. Please use a short paragraph to summarize the problem we face when we perform multiple tests.
As we can see, if we perform many tests at a fixed significance level, the chance of making at least one Type I error accumulates across the tests. To control the overall Type I error rate against committing even one error, we can use Bonferroni's method. However, Bonferroni's method is very stringent: while it protects us from committing even one false rejection, it diminishes our power to detect true positives. Thus, in real high-dimensional applications, people tend to control the overall false discovery rate instead of the overall Type I error rate.
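As an optional illustration (not required by the assignment), the false discovery rate can be controlled with the Benjamini-Hochberg procedure, available in base R through p.adjust:
pvals <- 2 * pnorm(-abs(ztests))        ### two-sided p-values for the 1000 tests
fdr <- p.adjust(pvals, method = "BH")   ### Benjamini-Hochberg adjusted p-values
sum(fdr <= 0.05)                        ### number of discoveries at FDR level 0.05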
Question 4: Consider a data set with the response vector Y = (y1, ..., yn)^T and the data matrix
X =
( 1    x11    x12 )
( ...  ...    ... )
( 1    xn1    xn2 )
We model the relationship between X and Y using the linear regression model: yi = θ0 + θ1 xi1 + θ2 xi2 + εi, i = 1, ..., n, where εi ~ N(0, σ^2). Let the parameter vector be denoted as θ = (θ0, θ1, θ2)^T. We wish to minimize the weighted sum of squared residuals: SSE = Σ_{i=1}^n wi (yi - (θ0 + θ1 xi1 + θ2 xi2))^2. Derive the formula for the solution of θ which minimizes the weighted sum of squared errors.
We wish to minimize the weighted sum of squared residuals SSE = Σ_{i=1}^n wi (yi - (θ0 + θ1 xi1 + θ2 xi2))^2, where wi is the weight for the ith observation. Let W be the diagonal matrix with w1, ..., wn as its diagonal entries. Then
SSE = (Y - Xθ)^T W (Y - Xθ)
    = Y^T W Y - θ^T X^T W Y - Y^T W X θ + θ^T X^T W X θ
    = Y^T W Y - 2 Y^T W X θ + θ^T X^T W X θ.
Derivatives with respect to a vector follow rules that are very similar to the differentiation rules for one variable. Let X be a vector of variables of length n, let Y be a vector of constants of length n, and let A be a symmetric constant matrix of dimension n by n. Then
∂(X^T X)/∂X = 2X,   ∂(Y^T X)/∂X = Y,   ∂(X^T A X)/∂X = 2AX.
Applying these rules to SSE and setting the derivative equal to zero:
∂SSE/∂θ = ∂(Y^T W Y - 2 Y^T W X θ + θ^T X^T W X θ)/∂θ = -2 X^T W Y + 2 (X^T W X) θ = 0,
which gives θ = (X^T W X)^{-1} X^T W Y.
The solution is denoted as θˆ = (X^T W X)^{-1} X^T W Y. This is the weighted least squares solution to the regression model.
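As a quick numerical sanity check (with hypothetical simulated data and arbitrarily chosen weights, not part of the assignment), the closed-form solution can be evaluated in R and compared against lm with a weights argument, which minimizes the same weighted SSE:
set.seed(1)
n <- 10
X <- cbind(1, rnorm(n), rnorm(n))             ### hypothetical design matrix with intercept
y <- as.vector(X %*% c(1, 2, 3) + rnorm(n))   ### hypothetical response
w <- runif(n, 0.5, 2)                         ### hypothetical positive weights
W <- diag(w)
thetahat <- solve(t(X) %*% W %*% X) %*% (t(X) %*% W %*% y)
thetahat
coef(lm(y ~ X[, 2] + X[, 3], weights = w))    ### cross-check: same estimates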
Question 5: Analyze the German data set from the site:
https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data).
a) Perform the logistic regression on the dataset. Build a predictive model using some of the predictors in the model. Please use 900 observations as the training set and use your model to predict the default status of the remaining 100 loans. Choose one of the regression coefficients and interpret it. What cutoff value of the probability do you use for your analysis? How many default loans are predicted to be non-default (number of false negatives)? How many non-default loans are predicted to be default (number of false positives)? Then you need to improve your model by adding more predictors, higher order terms, or interaction terms. Please demonstrate that your new model makes fewer errors than your first model on the 100 testing cases.
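A note on setup (an assumption, since the loading step is not shown here): the code below expects the data in a data frame named germancredit with a 0/1 response column Default and named predictors such as duration and amount; the for loop converts the categorical columns to factors. One hypothetical way to load the raw UCI file:
### hypothetical loading step; column names must be assigned to match the course's prepared data
germancredit <- read.table("german.data")
### the raw UCI file codes the class as 1 = good, 2 = bad; recode to 0/1 if needed:
### germancredit$Default <- germancredit$V21 - 1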
index<-c(1,2,4,5,7,8,10,11,13,15,16,18,20,21)
for( i in index){
germancredit[,i]<-factor(germancredit[,i])
}
model<-glm(Default~duration+amount, data=germancredit[1:900,], family=binomial(link=logit))
> summary(model)
Call:
glm(formula = Default ~ duration + amount, family = binomial(link = logit),
data = germancredit[1:900, ])
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4431 -0.8599 -0.7219 1.3020 1.8121
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.636e+00 1.535e-01 -10.659 < 2e-16 ***
duration 3.399e-02 7.771e-03 4.374 1.22e-05 ***
amount 1.295e-05 3.301e-05 0.392 0.695
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1096.1  on 899  degrees of freedom
Residual deviance: 1059.5  on 897  degrees of freedom
AIC: 1065.5
Number of Fisher Scoring iterations: 4
####interpretation of the regression coefficients
###
exp( 1.295e-05)
[1] 1.000013
This means that if the loan amount increases by 1 dollar, the odds of defaulting on the loan are multiplied by exp(1.295e-05) ≈ 1.000013, with the other predictor held fixed.
model1<-glm(Default~., data=germancredit[1:900,], family=binomial(link=logit))
pred<-predict(model,newdata=germancredit[901:1000,],type="response")
predstatus<-pred>=1/2
tp<-sum((predstatus==1)*(germancredit[901:1000,]$Default==1))
tn<-sum((predstatus==0)*(germancredit[901:1000,]$Default==0))
fp<-sum((predstatus==1)*(germancredit[901:1000,]$Default==0))
fn<-sum((predstatus==0)*(germancredit[901:1000,]$Default==1))
> tp
[1] 5
> tn
[1] 66
> fp
[1] 2
> fn
[1] 27
### the number of false negatives is 27 and the number of false positives is 2.
misrate<-(fp+fn)/100
sensitivity<-tp/(tp+fn)
specificity<-tn/(tn+fp)
########
> misrate
[1] 0.29
> sensitivity
[1] 0.15625
> specificity
[1] 0.9705882
#######
pred<-predict(model1,newdata=germancredit[901:1000,],type="response")
predstatus<-pred>=1/2
tp<-sum((predstatus==1)*(germancredit[901:1000,]$Default==1))
tn<-sum((predstatus==0)*(germancredit[901:1000,]$Default==0))
fp<-sum((predstatus==1)*(germancredit[901:1000,]$Default==0))
fn<-sum((predstatus==0)*(germancredit[901:1000,]$Default==1))
misrate<-(fp+fn)/100
sensitivity<-tp/(tp+fn)
specificity<-tn/(tn+fp)
######
> misrate
[1] 0.26
> sensitivity
[1] 0.5625
> specificity
[1] 0.8235294
###### We note that the second model has better performance than the first model.
### If we change the cutoff to 1/6
predstatus<-pred>=1/6
tp<-sum((predstatus==1)*(germancredit[901:1000,]$Default==1))
tn<-sum((predstatus==0)*(germancredit[901:1000,]$Default==0))
fp<-sum((predstatus==1)*(germancredit[901:1000,]$Default==0))
fn<-sum((predstatus==0)*(germancredit[901:1000,]$Default==1))
misrate<-(fp+fn)/100
sensitivity<-tp/(tp+fn)
specificity<-tn/(tn+fp)
> misrate
[1] 0.39
> sensitivity
[1] 0.90625
> specificity
[1] 0.4705882
b) Please investigate how the sensitivity and specificity change with respect to different cutoff values of the probability.
From the results above, we can see that if we lower the cutoff value, the sensitivity will increase, but the specificity will decrease.
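A short sketch that traces this trade-off over a grid of cutoffs (reusing pred and the test set from above):
truth <- germancredit[901:1000,]$Default == 1
for (cutoff in seq(0.1, 0.9, by = 0.1)) {
predstatus <- pred >= cutoff
sens <- sum(predstatus & truth) / sum(truth)      ### true positive rate
spec <- sum(!predstatus & !truth) / sum(!truth)   ### true negative rate
print(c(cutoff, sens, spec))
}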
Question 6:
In logistic regression, we assume Y = (Y1, ..., Yn)^T is a collection of n binary observations. For each Yi, we observe predictors Xi = (Xi1, Xi2, Xi3)^T. We assume
log(pi/(1 - pi)) = Xi^T θ,
where pi = P(Yi = 1) and θ = (θ1, ..., θ3)^T is the vector of regression coefficients.
a) Formulate the overall log likelihood of the dataset, l(Y). l(Y) = Σ_{i=1}^n {(1 - Yi) log(1 - pi) + Yi log pi}.
b) Derive the first derivative ∂l(Y)/∂θ2 .
Using the chain rule, we can obtain ∂l(Y)/∂θ2 = Σ_{i=1}^n (Yi - pi) Xi2.
c) Derive the second derivative ∂^2 l(Y)/∂θ2 ∂θ3.
∂^2 l(Y)/∂θ2 ∂θ3 = -Σ_{i=1}^n pi (1 - pi) Xi2 Xi3.
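(A brief sketch of this step, for completeness: since pi = exp(Xi^T θ)/(1 + exp(Xi^T θ)), differentiating gives ∂pi/∂θ3 = pi(1 - pi)Xi3; substituting this into ∂l(Y)/∂θ2 = Σ_{i=1}^n (Yi - pi)Xi2 yields the expression above.)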
The detailed derivation is not provided here; for the full derivation, please refer to our lecture notes on logistic regression.
d) Suppose θˆ = (0.1, 0.2, 0.3)^T, and we have a new observation X_{n+1} = (3, 2, 4)^T. Predict the probability p of success for this new observation.
X_{n+1}^T θˆ = 3 * 0.1 + 2 * 0.2 + 4 * 0.3 = 1.9
p = exp(1.9)/(1 + exp(1.9)) = 0.8698915.
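A quick check of the arithmetic in R:
theta <- c(0.1, 0.2, 0.3)
xnew <- c(3, 2, 4)
eta <- sum(theta * xnew)      ### linear predictor: 1.9
exp(eta) / (1 + exp(eta))     ### predicted probability: 0.8698915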