STAT 4052 Introduction to Statistical Learning LAB I
Reference: ISL BOOK Section 5.1, 5.3
Make Generation Reproducible
To make our random number generation reproducible, we set the seed of the generation algorithm. This is done with the set.seed function, which takes a user-selected integer argument. Let’s pick 4052 for this argument and set the seed:
set.seed(4052)
Cross Validation
In this lab, we consider a class of methods that estimate the test error rate by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to the held-out observations. We will go over three such approaches for evaluating the performance of statistical models/statistical learning methods.
1. Validation set approach: randomly divide the available set of observations into two parts, a training set and a validation set or hold-out set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation set error rate (typically assessed using MSE in the case of a quantitative response) provides an estimate of the test error rate.
2. Leave-One-Out Cross Validation.
3. K-fold Cross Validation.
We will use Auto Data set from the library ISLR. Load the data as follows
# install.packages("ISLR")
library(ISLR)
data(Auto)
head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
We will compare the following regression models:
M1 : mpg = β0 + β1 * horsepower + ε
M2 : mpg = β0 + β1 * horsepower + β2 * horsepower^2 + β3 * horsepower^3 + ε
Validation set approach
We will first divide the data set into two separate datasets, one for training and one for validation, as follows:
set.seed(4052)
train=sample(392,196)
Auto_train=Auto[train,]
Auto_val=Auto[-train,]
Note that function set.seed is used to obtain reproducible results while using a function returning non-deterministic output. Next, we fit the required models using the training set and calculate an estimate for test MSE using the validation set as follows
#M1
lm.fit=lm(mpg~horsepower,data=Auto_train)
lm.pred=predict(lm.fit,Auto_val)
(val_err1=mean((Auto_val$mpg-lm.pred)^2))
## [1] 23.71675
#M2
lm.fit2=lm(mpg~poly(horsepower,3),data=Auto_train)
lm.pred2=predict(lm.fit2,Auto_val)
(val_err2=mean((Auto_val$mpg-lm.pred2)^2))
## [1] 20.4072
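Note that these estimates depend on the particular random split. A minimal sketch (the seeds 1 to 3 are chosen arbitrarily for illustration, not part of the lab) that repeats the split a few times shows how the validation MSE for M1 varies:

```r
# Sketch: repeat the validation split with several seeds to see how much
# the validation MSE estimate varies with the split (seeds are arbitrary).
for (s in 1:3) {
  set.seed(s)
  train <- sample(392, 196)
  fit <- lm(mpg ~ horsepower, data = Auto[train, ])
  pred <- predict(fit, Auto[-train, ])
  print(mean((Auto$mpg[-train] - pred)^2))
}
```

This variability is one motivation for the cross-validation approaches below, which average over many splits.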
K-fold Cross Validation
In order to perform K-fold cross validation, we need to assign each data point to a fold at random. We will use the function createFolds in the package caret for this task as follows:
# install.packages("caret")
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(4052)
nfolds=10 #Perform 10 fold cross validation
fold=createFolds(1:392,k=nfolds,list=FALSE)
fold #check if assignment looks random
## [1] 6 2 8 7 8 3 9 10 7 3 5 3 8 3 8 5 6 7 10 4 10 8 5 4 6
## [26] 7 4 6 7 8 6 1 8 9 5 1 10 1 8 4 6 3 10 5 8 4 1 9 7 4
## [51] 6 2 3 5 9 9 9 3 9 4 10 10 3 7 10 1 9 3 10 2 9 7 1 4 9
## [76] 8 6 2 5 5 5 7 4 2 3 1 1 4 7 6 2 1 10 2 2 2 6 1 5 1
## [101] 8 7 10 3 8 2 6 4 7 6 4 8 7 3 1 8 10 3 5 8 1 7 5 1 7
## [126] 9 3 10 2 8 10 5 5 10 3 9 10 3 8 2 3 4 6 6 10 1 2 3 1 1
## [151] 9 4 2 4 4 2 6 9 2 3 4 1 7 8 10 6 5 10 4 4 7 4 6 6 9
## [176] 2 9 9 7 5 8 5 6 7 9 6 10 9 2 3 9 1 5 2 5 7 9 6 2 9
## [201] 1 7 2 3 3 10 6 7 2 4 2 2 8 10 8 9 7 6 9 5 4 4 4 5 5
## [226] 10 2 5 7 1 6 9 3 1 1 10 5 8 9 3 6 10 8 6 2 3 4 1 9 1
## [251] 4 2 10 3 9 6 9 10 4 5 10 1 7 6 8 1 5 8 9 4 3 8 8 8 1
## [276] 4 7 5 10 2 6 2 7 10 5 3 7 3 6 7 3 4 5 7 8 2 2 8 1 6
## [301] 2 9 9 6 5 4 6 9 6 1 6 5 7 4 10 2 4 5 1 7 6 7 10 1 10
## [326] 3 6 9 8 8 7 6 5 1 10 4 10 7 4 8 8 7 3 2 7 1 10 3 5 9
## [351] 1 4 6 8 7 3 1 4 10 10 9 7 9 5 4 1 3 5 3 3 2 3 1 2 2
## [376] 2 4 3 5 7 2 8 10 6 5 9 10 3 4 8 5 9
#Number of data points in each fold is roughly equal
table(fold)
## fold
## 1 2 3 4 5 6 7 8 9 10
## 38 39 40 40 39 40 40 37 39 40
Now, we run a loop: at each iteration we train the model using all the data points except those in one fold, and calculate the test MSE estimate on that held-out fold. In the end, we take the average of the test MSE estimates across folds.
#initialize error to 0
kCV_err1=0
kCV_err2=0
for(i in 1:nfolds)
{
m1=lm(mpg~horsepower,data=Auto[fold!=i,])
m2=lm(mpg~poly(horsepower,3),data=Auto[fold!=i,])
pred1=predict(m1,Auto[fold==i,])
pred2=predict(m2,Auto[fold==i,])
kCV_err1=kCV_err1 + mean((Auto$mpg[fold==i]-pred1)^2)
kCV_err2=kCV_err2 + mean((Auto$mpg[fold==i]-pred2)^2)
}
(kCV_err1=kCV_err1/nfolds)
## [1] 24.15139
(kCV_err2=kCV_err2/nfolds)
## [1] 19.19001
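The same loop generalizes to comparing more than two models. As a sketch, reusing the fold assignment from above (the degree grid 1 to 5 is our choice for illustration, not part of the lab), we can compute the 10-fold CV error for each polynomial degree:

```r
# Sketch: reuse the fold vector above to compare polynomial degrees 1-5.
kCV_err <- numeric(5)
for (d in 1:5) {
  for (i in 1:nfolds) {
    m <- lm(mpg ~ poly(horsepower, d), data = Auto[fold != i, ])
    pred <- predict(m, Auto[fold == i, ])
    kCV_err[d] <- kCV_err[d] + mean((Auto$mpg[fold == i] - pred)^2)
  }
}
(kCV_err <- kCV_err / nfolds)
```

The degree with the smallest CV error would then be the preferred model under this criterion.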
Leave-One-Out Cross Validation
For Leave-One-Out Cross Validation, we do not need to create folds as before. We simply loop over all the data points, leaving one point out of the training set at each step and calculating the squared error on that held-out point. In the end, we take the average of all the estimates.
#Initialize error to 0
LOO_err1=0
LOO_err2=0
for (i in 1:392)
{
m1=lm(mpg~horsepower,data=Auto[-i,])
m2=lm(mpg~poly(horsepower,3),data=Auto[-i,])
pred1=predict(m1,Auto[i,])
pred2=predict(m2,Auto[i,])
LOO_err1=LOO_err1 + (Auto$mpg[i]-pred1)^2
LOO_err2=LOO_err2 + (Auto$mpg[i]-pred2)^2
}
(LOO_err1=LOO_err1/392)
## 1
## 24.23151
(LOO_err2=LOO_err2/392)
## 1
## 19.33498
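For least-squares linear models, LOOCV does not actually require fitting the model n times. The identity CV(n) = (1/n) Σ ((y_i − ŷ_i)/(1 − h_i))², where h_i is the leverage of observation i (ISL Eq. 5.2), gives the same answer from a single fit. A sketch for M1:

```r
# Sketch: the leverage shortcut (ISL Eq. 5.2) computes LOOCV from one fit,
# using hatvalues() to extract the leverages h_i.
fit <- lm(mpg ~ horsepower, data = Auto)
h <- hatvalues(fit)
mean(((Auto$mpg - fitted(fit)) / (1 - h))^2)  # matches LOO_err1 above
```

This shortcut is exact for least squares, which is why LOOCV is cheap for linear models despite naively looking like n separate fits.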
Using cv.glm for K-fold Cross Validation
For regression models, the function cv.glm in the package boot allows us to calculate the K-fold cross validation error without manually writing a loop. It can be used as follows:
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
glm.fit1=glm(mpg~horsepower, data=Auto)
cv.err1=cv.glm(Auto,glm.fit1,K=10)
cv.err1$delta[1]
## [1] 24.21377
glm.fit2=glm(mpg~poly(horsepower,3), data=Auto)
cv.err2=cv.glm(Auto,glm.fit2,K=10)
cv.err2$delta[1]
## [1] 19.57233
Note that the function cv.glm takes three arguments: the dataset, an object of class glm (a model fitted using the function glm, not lm), and K. The function returns a list; the CV error is the first element of the component called delta. Since Leave-One-Out Cross Validation is a special case of K-fold Cross Validation, the LOO error can also be obtained using cv.glm.
# default value of K is equal to the number of data points in the dataset
LOO.err1=cv.glm(Auto,glm.fit1,K=392)
LOO.err1$delta[1]
## [1] 24.23151
LOO.err2=cv.glm(Auto,glm.fit2)
LOO.err2$delta[1]
## [1] 19.33498
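The delta component actually contains two numbers: the raw cross-validation estimate and an adjusted (bias-corrected) version (see ?cv.glm). For LOOCV the two are essentially identical; for K < n they can differ slightly. A sketch:

```r
# Sketch: inspect both components of delta -- the raw K-fold CV estimate
# and the bias-corrected estimate (they are usually very close).
set.seed(4052)
fit <- glm(mpg ~ horsepower, data = Auto)
cv.glm(Auto, fit, K = 10)$delta
```

In practice the first component, as used throughout this lab, is the standard K-fold CV error estimate.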
2024-01-27