闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

MATH 3333 3.0 - Winter 2021-2022

Assignment 1

For all the questions involving R programming, please submit your R code and attach the screenshot of the R output.

Question 1: Please download the bike sharing data set from the following website: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset. (I also include the data set on crowdmark.) This dataset contains the hourly and daily count of rental bikes between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and sea- sonal information. There are two data sets and we will use the day.csv. Import the data into R and perform the following exploratory analysis.

a) The variable ”registered” records the number of registered users used the bike sharing service on a particular day. Please provide the mean value of the variable ”registered” for each day of the week.

b) Plot the conditional density plot of the variable ”registered” conditional on each month of the year.

c) Produce a two-dimensional levelplot of the variable ”registered” against the combina- tion of temperature (variable ”temp”) and humidity (variable ”hum”).

day<-read.csv("C:\\Users\\xingao\\Documents\\teaching\\winter2022\\math3333\\day.csv") tapply(day$registered, day$weekday,mean)

0 1 2 3 4 5 6

2890.533 3663.990 3954.481 3997.394 4076.298 3938.000 3085.286

densityplot(~registered|month,data=day,layout=c(1,12),

plot.points=FALSE)

t6=tapply(day$registered,INDEX=list(cut(day$temp,breaks=10),

cut(day$hum,breaks=10)),FUN=mean,na.rm=TRUE)

levelplot(t6,scales = list(x = list(rot = 90)))

(note: For density plot, if the students plot all the densities on the same panel instead of 12 panels, that is still correct. Please give full mark to diﬀerent ways of plotting densities.) Question 2: Perform linear regression model on the bike sharing data set from Question 2.

a. Provide the summary result of the regression model with ”registered” as the response variable and ”temp”, ”hum” as the predictors.

b. What other predictors do you think might be important for the modelling of the variable ”registered”? Please construct another linear model including more predictors and provide the summary result of the second model.

c. Perform 100 times of 5-fold cross validation and each time you randomly partition the dataset into ﬁve equal parts. You will use 80% of the data as the training data and 20% as the validating data. For models in a) and b), calculate the total sum of squared prediction error divived by the size of the validation data and by the number

of cross-validations. Which model has better predictive power?

model<-lm(registered~temp+hum,data=day)

summary(model)

Call:

lm(formula = registered ~ temp + hum, data = day)

Residuals:

Min 1Q Median 3Q Max

-3687.4 -994.5 -177.7 968.0 3170.3

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) temp

hum

2405.1

4778.5

-1777.6

239.4 10.046 < 2e-16 ***

263.1 18.162 < 2e-16 ***

338.1 -5.257 1.93e-07 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1291 on 728 degrees of freedom

Multiple R-squared: 0.3175,Adjusted R-squared: 0.3156

F-statistic: 169.3 on 2 and 728 DF, p-value: < 2.2e-16

day$weekday<-factor(day$weekday)

model2<-lm(registered~temp+hum+weekday,data=day)

summary(model2)

Call:

lm(formula = registered ~ temp + hum + weekday, data = day)

Residuals:

Min 1Q Median 3Q Max

-4053.4 -991.5 -35.5 851.8 2842.9

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1754.1 253.3 6.925 9.66e-12 ***

temp 4677.5 251.0 18.633 < 2e-16 ***

hum -1793.6 323.6 -5.543 4.18e-08 ***

weekday1 745.3 169.8 4.389 1.31e-05 ***

weekday2 992.8 170.3 5.830 8.37e-09 ***

weekday3 1040.4 170.3 6.108 1.65e-09 ***

weekday4 1056.3 170.4 6.200 9.50e-10 ***

weekday5 966.6 170.3 5.677 1.99e-08 ***

weekday6 187.7 169.8 1.105 0.269

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1230 on 722 degrees of freedom

Multiple R-squared: 0.3853,Adjusted R-squared: 0.3785

F-statistic: 56.57 on 8 and 722 DF, p-value: < 2.2e-16

The new model has much higher rsquare.

c) As total number of observation is 731, we have 731*0.8=584.8. So we are going to have 585 obs in training and remaining ones in testing.

crossvalidationerror1<-0

for (i in 1:100){

trainingID<-sample(1:731,585)

testid<-(1:731)[-trainingID]

model1<-lm(registered~temp+hum,data=day[trainingID,])

x<-day[testid,]$registered-predict(model1,day[testid,])

crossvalidationerror1<-crossvalidationerror1+sum(x^2)

}

crossvalidationerror2<-0

MSEcv1<-

for (i in 1:100){

trainingID<-sample(1:731,585)

testid<-(1:731)[-trainingID]

model2<-lm(registered~temp+hum+weekday,data=day[trainingID,]) x<-day[testid,]$registered-predict(model2,day[testid,]) crossvalidationerror2<-crossvalidationerror2+sum(x^2)

}

crossvalidationerror1/100

239848195

crossvalidationerror2/100

225411404

The second model has much smaller crossvalidation errors.

(note: for random partitioning, there are diﬀerent ways of implimentations. All methods that are correct will be given full mark. For calculating crossvalidation errors, diﬀerent from the sample codes here, we can alternatively use the command predict.lm. This will be given full mark as well. )

Question 3: In the following marketing set, we have 9 years with the sales in 10 million euro and the advertising expenditure in million euro.

Year	Sales	Advertisement
1	34	23
2	56	36
3	65	45
4	86	52
5	109	53
6	109	58
7	122	63
8	124	68
9	131	70

a) Formulate the response vector y. which has nine entries.

╱．5(3)6(4) 、．

．．

．173(7)71．

b) Formulate the data matrix of X, the ﬁrst column should be all ones corresponding to the intercept, and the second column should be the predictors. The dimension of X should be

9 一 27

╱ 1(1)

．71(7)7

c) Write R code to compute XtX7

3(2)6(3)、．

7770(7)．

xmat<-cbind(rep(1,9),c(23,36,45,52,53,58,63,68,70))

t(xmat)%*%xmat

[1,]

[2,]

[,1] [,2] 9 468 468 26220

d) Write R code to compute i = (XtX)− Xt y7 This is the estimated linear regression coeﬃcient of the linear model with Y as the response and X as the data matrix.

y<-c(34,56,65,86,109,109,122,124,131)

solve(t(xmat)%*%xmat)%*%(t(xmat)%*%y)

[1,]

[2,]

[,1] -20.550602 2.181529

e)Run the linear regression using Y as the response and X as the predictor using lm command in R and compare the output with your own calculation.

res<-lm(y~xmat[,2])

Call:

lm(formula = y ~ xmat[, 2])

Coefficients:

(Intercept) xmat[, 2]

-20.551 2.182

The r output agrees with result in part d.

f) Now two additional data points arrived. They are Year 10, Sales 96, and Advertisement 53; Year 11, Sales 107, and Advertisement 63. Please use the online algorithm to update the linear model. Use the two new observations together to perform the sequential learning and update the model using stochastic gradient descent algorithm using the learning rate 9 = 0701. Note here in the updating scheme iˆnew = iˆold ← 9|En . the term En will be the sum of the squared prediction errors over the two new observations:

En = (夕10 ← 夕ˆ10 )2 + (夕11 ← 夕ˆ11 )2 7

In this example, we implement the online algorithm when the new data come in batches, not one at a time.

夕10 ← 夕ˆ10 = 96 ← ( ←207551 + 27182 ( 53) = 07905.

夕11 ← 夕ˆ11 = 107 ← ( ←207551 + 27182 ( 63) = ←97915.

|En = ←2(夕10 ← 夕ˆ10 ) ( ╱ 53(1)、 ← 2(夕11 ← 夕ˆ11 ) ( ╱63(1)、= 、

So the updated estimate is

、 ← 0701 、= 、7

Question 4: Consider a data set with the response vector y = (夕1 . 7 7 7 . 夕n )t and the data

matrix

╱ 1 α 11 α 12 、

．． 7

． 1(7) αn(7)1 αn(7)2 ．

We model the relationship between X and Y using the linear regression model: 夕i = i0 + i1 αi1 + i2 αi2 + ∈i . i = 1. 7 7 7 . n. where ∈ ì N (0. A2 )7 Let the parameter vector be denoted as i = (i0 . i1 . i2 )t 7 We wish to minimize the sum of squared residuals: ssE = (夕i ← (i0 + i1 αi1 + i2 αi2))2 7 Let the ﬁtted value be denoted as 夕ˆi = i0 + i1 αi1 + i2 αi2 .

and let the ﬁtted value vector be denoted as .

a) Show that ssE = (夕i ← 夕ˆi )2 7

b) Show that ssE = (y ← )t (y ← )7

c) Show that = Xi7

d) Simplify the derivative equation = 07

e) Find the solution of i which solves the equation in part d. ssE = (夕i ← (i0 + i1 αi1 + i2 αi2))2 7

ssE = (夕i ← 夕ˆi )2 7

Since each

夕ˆi = (1. αi1. αi2) ╱i(i)2(1)、

we have = Xi7

ssE = (y ← )t (y ← )

= (y ← Xi)t (y ← Xi)

= yt y ← itXt y ← ytXi + itXtXi

= yt y ← 2ytXi + itXtXi

assE

= a(yt y ← 2ytXi + itXtXi)/ai

= ←2Xt y + 2(XtX)i

(1)

(2)

assE

0 =

0 = a(yt y ← 2ytXi + itXtXi)/ai

0 = ←2Xt y + 2(XtX)i

i = (XtX)− Xt y

(3)

The solution is denoted as iˆ = (XtX)− Xt y7 This is the famous least square solution to the regression model.

Question 5: Analyze the QSAR ﬁsh toxicity data set from the site:

https://archive.ics.uci.edu/ml/datasets/QSAR+ﬁsh+toxicity. Perform variable selection on the six predictors using the lasso package.

a) Based on the output of “lars”, please provide the sequence of candidate models. For example, the ﬁrst model is ≠X5 礻, the second model is ≠X5 . X3 礻 and the third model is ≠X5 . X3 . X10 礻. etc.

lasso <-lars(x=as.matrix(qsar_fish_toxicity[,1:6]),y=as.numeric(qsar_fish_toxicity[,7]),

LASSO sequence

Computing X’X .....

LARS Step 1 : Variable 6 added

LARS Step 2 : Variable 2 added

LARS Step 3 : Variable 3 added

LARS Step 4 : Variable 4 added

LARS Step 5 : Variable 5 added

LARS Step 6 : Variable 1 added

Computing residuals, RSS etc .....

Based on the output, we know the sequence of models are: ≠X6 礻, ≠X6 . X2 礻, ≠X6 . X2 . X3 礻, ≠X6 . X2 . X3 . X4 礻, ≠X6 . X2 . X3 . X4 . X5 礻, ≠X6 . X2 . X3 . X4 . X5 . X1 礻7 (note: some students im- ported the data set and the variables are called V1-V7. This is ok and it won’t aﬀect the grading.)

b) Use the cross validation method, select the best value for the fraction 5 based on the plot of cross validation error againt the fraction 57 The fraction 5 measures the ratio of the L1 norm of the penalized estimate over the L1 norm of the regular penalized estimate.

Based on the cross validation plot, we see that when s=0.8 to 1.0, the cv error attains the minimum level. So we choose s=0.8.

(Note: some students used my code in the class and simulated 90-100 more unimportant variables. The optimum s will be diﬀerent from my answer here. But it will be also correct. )

c) Use the optimum 5 you select, perform the penalized regression and output the opti- mum model and the estimated coeﬃcients.

coef(lasso,s=0.8,mode="fraction")

> coef(lasso,s=0.8,mode="fraction")

V1 V2 V3 V4

0.2077567 0.9921246 -0.4651972 0.2890276

V5 0.0354972

0.4324455

The result above is the optimum model with the coeﬃcients. It is observed that all the variables are actually included in the model. (note: some students worked with the bigger data set with an extra collection of 90-100 variables. So the result will be slightly diﬀerent from my solution here. But the result is still correct.)

Question 6: Simulate a data set with 100 observations 夕i = 30 + 5α + 2α2 + 3α3 + ∈i . where ∈i follows independent normal distribution N (0. 1)7

a) Perform polynomial regression on your simulated data set and using α. I(α2 ). I(α3 ) as the predictors. Compare the estimated coeﬃcients with the true model and report the R-square.

b) Formulate the design matrix of this regression and write down the ﬁrst two rows of the design matrix based on your data set.

c) Perform polynomial regression on your simulated data set and using α. I(α2 ). I(α3 ). I(α4 ) as the predictors. Compare the estimated coeﬃcients with the true model and report the R- square. Is the R-square increased or decreased compared to the model in part (a)? Explain why?

x <- seq(1:100)

y <- 30+5*x+2*x^2+3*x^3

noise <- rnorm(length(x), mean=0, sd=1)

noisy.y <- y + noise

model <- lm(noisy.y ~ x + I(x^2) + I(x^3))

summary(model)

> summary(model)

Call:

lm(formula = noisy.y ~ x + I(x^2) + I(x^3))

Residuals:

Min 1Q Median 3Q Max

-2.43154 -0.54062 -0.03222 0.62901 2.88750

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 2.995e+01 4.269e-01 70.16 <2e-16 ***

x 5.025e+00 3.643e-02 137.95 <2e-16 ***

I(x^2) 1.999e+00 8.358e-04 2392.31 <2e-16 ***

Residual standard error: 1.028 on 96 degrees of freedom Multiple R-squared: 1,Adjusted R-squared: 1

F-statistic: 2.388e+13 on 3 and 96 DF, p-value: < 2.2e-16

In comparison with the true coeﬃcients, the slope of x is 5 and estimated to be 5.025, the slope of α2 is 2, estimated to be 1.99, the slope of α3 is 3 and estimated to be 3. The R2 is 1.