STAT 4052 Introduction to Statistical Learning LAB I
Reference: ISL BOOK Section 5.1, 5.3
Make Generation Reproducible
To make our random number generation reproducible, we set the seed of the generation algorithm. This is done with the set.seed function, which takes a user-selected integer argument. Let’s pick 4052 for this argument and set the seed:
set.seed(4052)
Cross Validation
In this lab, we consider a class of methods that estimate the test error rate by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to the held-out observations. We will go over three such approaches for evaluating the performance of statistical models/statistical learning methods.
1. Validation set approach: randomly divide the available set of observations into two parts, a training set and a validation set or hold-out set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation set error rate (typically assessed using MSE in the case of a quantitative response) provides an estimate of the test error rate.
2. Leave-One-Out Cross Validation.
3. K-fold Cross Validation.
We will use Auto Data set from the library ISLR. Load the data as follows
# install.packages("ISLR")
library(ISLR)
data(Auto)
head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
We will compare the following regression models:
M1 : mpg = β0 + β1 * horsepower + ε
M2 : mpg = β0 + β1 * horsepower + β2 * horsepower^2 + β3 * horsepower^3 + ε
Validation set approach
We will first divide the data set into two separate datasets, one for training and one for validation, as follows:
set.seed(4052)
train=sample(392,196)
Auto_train=Auto[train,]
Auto_val=Auto[-train,]
Note that function set.seed is used to obtain reproducible results while using a function returning non-deterministic output. Next, we fit the required models using the training set and calculate an estimate for test MSE using the validation set as follows
#M1
lm.fit=lm(mpg~horsepower,data=Auto_train)
lm.pred=predict(lm.fit,Auto_val)
(val_err1=mean((Auto_val$mpg-lm.pred)^2))
## [1] 23.71675
#M2
lm.fit2=lm(mpg~poly(horsepower,3),data=Auto_train)
lm.pred2=predict(lm.fit2,Auto_val)
(val_err2=mean((Auto_val$mpg-lm.pred2)^2))
## [1] 20.4072
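Note that these estimates depend on the particular random split. A minimal sketch (the seeds 1 to 3 are chosen arbitrarily for illustration, not part of the lab) that repeats the split a few times shows how the validation MSE for M1 varies:

```r
# Sketch: repeat the validation split with several seeds to see how much
# the validation MSE estimate varies with the split (seeds are arbitrary).
for (s in 1:3) {
  set.seed(s)
  train <- sample(392, 196)
  fit <- lm(mpg ~ horsepower, data = Auto[train, ])
  pred <- predict(fit, Auto[-train, ])
  print(mean((Auto$mpg[-train] - pred)^2))
}
```

This variability is one motivation for the cross-validation approaches below, which average over many splits.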
K-fold Cross Validation
In order to perform K-fold cross validation, we need to assign each data point to a fold at random. We will use the function createFolds in the package caret for this task as follows:
# install.packages("caret")
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(4052)
nfolds=10 #Perform 10 fold cross validation
fold=createFolds(1:392,k=nfolds,list=FALSE)
fold #check if assignment looks random
## [1] 6 2 8 7 8 3 9 10 7 3 5 3 8 3 8 5 6 7 10 4 10 8 5 4 6
## [26] 7 4 6 7 8 6 1 8 9 5 1 10 1 8 4 6 3 10 5 8 4 1 9 7 4
## [51] 6 2 3 5 9 9 9 3 9 4 10 10 3 7 10 1 9 3 10 2 9 7 1 4 9
## [76] 8 6 2 5 5 5 7 4 2 3 1 1 4 7 6 2 1 10 2 2 2 6 1 5 1
## [101] 8 7 10 3 8 2 6 4 7 6 4 8 7 3 1 8 10 3 5 8 1 7 5 1 7
## [126] 9 3 10 2 8 10 5 5 10 3 9 10 3 8 2 3 4 6 6 10 1 2 3 1 1
## [151] 9 4 2 4 4 2 6 9 2 3 4 1 7 8 10 6 5 10 4 4 7 4 6 6 9
## [176] 2 9 9 7 5 8 5 6 7 9 6 10 9 2 3 9 1 5 2 5 7 9 6 2 9
## [201] 1 7 2 3 3 10 6 7 2 4 2 2 8 10 8 9 7 6 9 5 4 4 4 5 5
## [226] 10 2 5 7 1 6 9 3 1 1 10 5 8 9 3 6 10 8 6 2 3 4 1 9 1
## [251] 4 2 10 3 9 6 9 10 4 5 10 1 7 6 8 1 5 8 9 4 3 8 8 8 1
## [276] 4 7 5 10 2 6 2 7 10 5 3 7 3 6 7 3 4 5 7 8 2 2 8 1 6
## [301] 2 9 9 6 5 4 6 9 6 1 6 5 7 4 10 2 4 5 1 7 6 7 10 1 10
## [326] 3 6 9 8 8 7 6 5 1 10 4 10 7 4 8 8 7 3 2 7 1 10 3 5 9
## [351] 1 4 6 8 7 3 1 4 10 10 9 7 9 5 4 1 3 5 3 3 2 3 1 2 2
## [376] 2 4 3 5 7 2 8 10 6 5 9 10 3 4 8 5 9
#Number of data points in each fold is roughly equal
table(fold)
## fold
## 1 2 3 4 5 6 7 8 9 10
## 38 39 40 40 39 40 40 37 39 40
Now, we run a loop: at each iteration we train the model using all the data points except those in one fold, and calculate the test MSE estimate on that held-out fold. In the end, we take the average of the test MSE estimates across folds.
#initialize error to 0
kCV_err1=0
kCV_err2=0
for(i in 1:nfolds)
{
m1=lm(mpg~horsepower,data=Auto[fold!=i,])
m2=lm(mpg~poly(horsepower,3),data=Auto[fold!=i,])
pred1=predict(m1,Auto[fold==i,])
pred2=predict(m2,Auto[fold==i,])
kCV_err1=kCV_err1 + mean((Auto$mpg[fold==i]-pred1)^2)
kCV_err2=kCV_err2 + mean((Auto$mpg[fold==i]-pred2)^2)
}
(kCV_err1=kCV_err1/nfolds)
## [1] 24.15139
(kCV_err2=kCV_err2/nfolds)
## [1] 19.19001
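The same loop generalizes to comparing more than two models. As a sketch, reusing the fold assignment from above (the degree grid 1 to 5 is our choice for illustration, not part of the lab), we can compute the 10-fold CV error for each polynomial degree:

```r
# Sketch: reuse the fold vector above to compare polynomial degrees 1-5.
kCV_err <- numeric(5)
for (d in 1:5) {
  for (i in 1:nfolds) {
    m <- lm(mpg ~ poly(horsepower, d), data = Auto[fold != i, ])
    pred <- predict(m, Auto[fold == i, ])
    kCV_err[d] <- kCV_err[d] + mean((Auto$mpg[fold == i] - pred)^2)
  }
}
(kCV_err <- kCV_err / nfolds)
```

The degree with the smallest CV error would then be the preferred model under this criterion.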
Leave-One-Out Cross Validation
For Leave-One-Out Cross Validation, we do not need to create folds as before. We simply loop over all the data points, leaving one point out of the training set at each step and calculating the squared error on that held-out point. In the end, we take the average of all the estimates.
#Initialize error to 0
LOO_err1=0
LOO_err2=0
for (i in 1:392)
{
m1=lm(mpg~horsepower,data=Auto[-i,])
m2=lm(mpg~poly(horsepower,3),data=Auto[-i,])
pred1=predict(m1,Auto[i,])
pred2=predict(m2,Auto[i,])
LOO_err1=LOO_err1 + (Auto$mpg[i]-pred1)^2
LOO_err2=LOO_err2 + (Auto$mpg[i]-pred2)^2
}
(LOO_err1=LOO_err1/392)
## 1
## 24.23151
(LOO_err2=LOO_err2/392)
## 1
## 19.33498
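For least-squares linear models, LOOCV does not actually require fitting the model n times. The identity CV(n) = (1/n) Σ ((y_i − ŷ_i)/(1 − h_i))², where h_i is the leverage of observation i (ISL Eq. 5.2), gives the same answer from a single fit. A sketch for M1:

```r
# Sketch: the leverage shortcut (ISL Eq. 5.2) computes LOOCV from one fit,
# using hatvalues() to extract the leverages h_i.
fit <- lm(mpg ~ horsepower, data = Auto)
h <- hatvalues(fit)
mean(((Auto$mpg - fitted(fit)) / (1 - h))^2)  # matches LOO_err1 above
```

This shortcut is exact for least squares, which is why LOOCV is cheap for linear models despite naively looking like n separate fits.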
Using cv.glm for K-fold Cross Validation
For regression models, the function cv.glm in the package boot allows us to calculate the K-fold cross validation error without manually writing a loop. It can be used as follows:
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
glm.fit1=glm(mpg~horsepower, data=Auto)
cv.err1=cv.glm(Auto,glm.fit1,K=10)
cv.err1$delta[1]
## [1] 24.21377
glm.fit2=glm(mpg~poly(horsepower,3), data=Auto)
cv.err2=cv.glm(Auto,glm.fit2,K=10)
cv.err2$delta[1]
## [1] 19.57233
Note that the function cv.glm takes three arguments: the dataset, an object of class glm (a model fitted using the function glm, not lm), and K. The function returns a list; the CV error is the first element of the component called delta. Since Leave-One-Out Cross Validation is a special case of K-fold Cross Validation, the LOO error can also be obtained using cv.glm.
# default value of K is equal to the number of data points in the dataset
LOO.err1=cv.glm(Auto,glm.fit1,K=392)
LOO.err1$delta[1]
## [1] 24.23151
LOO.err2=cv.glm(Auto,glm.fit2)
LOO.err2$delta[1]
## [1] 19.33498
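The delta component actually contains two numbers: the raw cross-validation estimate and an adjusted (bias-corrected) version (see ?cv.glm). For LOOCV the two are essentially identical; for K < n they can differ slightly. A sketch:

```r
# Sketch: inspect both components of delta -- the raw K-fold CV estimate
# and the bias-corrected estimate (they are usually very close).
set.seed(4052)
fit <- glm(mpg ~ horsepower, data = Auto)
cv.glm(Auto, fit, K = 10)$delta
```

In practice the first component, as used throughout this lab, is the standard K-fold CV error estimate.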
2024-01-27