关键词 > STAT3064/STAT5061

STAT3064/STAT5061 Statistical Learning/Statistical Data Science Computer Lab 3

发布时间：2022-08-22

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

STAT3064/STAT5061 Statistical Learning/Statistical Data Science

Computer Lab 3, Semester 2 2022

Things you may need to know/do.

● Relates to lecture 3.

● Libraries ggplot2, MASS, tidyverse, GGally may be useful. Others may be mentioned below in the hints.

● You might like to set up a project for each lab if you are using RStudio. Then you can copy a .Rmd ﬁle into that directory and write your answers in that ﬁle.

Before you run this code, remove eval = FALSE everywhere appropriate.

Start R

library (MASS)

library(tidyverse)

library(ggplot2)

library(GGally)

library( plot3D )

library( broom )

library( pls)

We consider the HDLSS Breast Tumour Gene Expression data of van’t Veer et.al. which we met in Lecture 3. We observed that one of the 78 observations appears to be an outlier. By restricting the number of variables to smaller subsets, we want to work out which groups of variables show up this observation as an outlier. We treat the genes as variables as in the lecture slides.

Use prcomp with the raw data for Q1. Note that this data set has no column names and is not a csv ﬁle despite the name. It seems to be tab-separated!

(a) Which observation number corresponds to the outlier? What did you compute to ﬁnd it? (There is no unique or correct way of doing it!)

(b) We consider the ﬁve (sub)sets of the variables: all 4751 variables as in the analysis in the lecture, the

ﬁrst 1024, the ﬁrst 512, the ﬁrst 256, and ﬁnally the ﬁrst 128 variables only. You may want to create separate data frames for each subset of variables.

(c) For each of the ﬁve groups of variables and all 78 observations, conduct a PCA. Using the outcome of the PCA, ﬁnd the rank of the sample covariance matrix, display the eigenvalue plots, the cumulative contribution to total variance plots and the PC score plots analogous to those shown on p19 and p20 respectively of the lecture slides.

(d) Compare the results of the diﬀerent groups with the results shown in the lecture slides and comment.

What happened to the outlier? Did it move closer to the centre of the PC1/PC2 data?

(e) What do you expect to change if you leave out the outlying observations from the calculations of part (c)?

Now work with the full data of Q1 (we do not use the subsets of variables here). Scale the data and repeat part (c) of Q1 on the scaled data. Compare the results with those of Q1, with particular reference to the outlier.

You will need to use scale . = TRUE in prcomp.

We want to compare the performance of linear regression and PCR for the abalone data. We do this by

calculating the residual standard deviation (mis-named the Residual standard error in the lm summary)

and we study how this depends on the number of predictors for linear regression and the number of PC scores for PCR. Regard the variable Rings as the response, and all other variables apart from Sex as predictors.

For p = 1 to 7 calculate the residual standard deviation in the following cases:

(a) Based on the best p predictors in linear regression. You can do this with copying and pasting. Use the

add1 function and lm to do forward selection using the F-test to choose the next variable.

(b) Based on the ﬁrst p PC scores, that is, the p-dimensional PC data, as predictors in PCR.

Plot residual standard deviation against number of predictors for the results of parts (a) and (b). Compare the results and comment.

Hints: We describe some of the regression commands and how we may use glance to gather and simplify the information we obtain.

For this question you require the library broom.

How to get data

Use col_names for ease of interpretation.

# Copied from the info file

coln = c (

"Sex" , # nominal

"Length" , # continuous

"Diameter" , # continuous mm

"Height" , # continuous mm

"Whole_weight" , # continuous

"Shucked_weight" , # continuous

"Viscera_weight" , # continuous

"Shell_weight" , # continuous

"Rings" # integer

)

abalone = read_csv ( file = "abalone/abalone .csv" , col_names = coln )

summary ( abalone )

summary ( factor( abalone$Sex ) )

Regression

# paste( coln, collapse = " + " )

# [1] "Sex + Length + Diameter + Height + Whole_ weight + Shucked_ weight + Viscera_ weight + Shell_ weight

big .lm = lm( Rings ~ Length + Diameter + Height + Whole_weight + Shucked_weight + Viscera_weight + She summary ( big .lm )

Forward selection

In forward selection, we start with a model that has no regressor variables (the null model) and add one regressor at the time. The null model is easily ﬁtted:

fm .fwd <- fm .null <- lm(Rings ~ 1 , abalone)

The ﬁrst argument to add1() is the model to which you want to add a term, the second argument (scope) is a formula that contains all terms that we are willing to add to our model. This scope is easily obtained from the model fm .full that we ﬁtted above. If we want to use signiﬁcant tests to guide us to decide which term to include, we have to specify that we want add1() to perform such signiﬁcant tests:

add1 (fm .fwd, big .lm, test= "F")

summary ( lm( fm .fwd ) )

All p-values are extremely small.. We therefore look at which F statistics is the largest. This would suggest that we should ﬁrst add the term Shell_weight to our model:

fm .fwd <- update(fm .fwd, . ~ . + Shell_weight)

summary ( lm( fm .fwd ) )

Start over

Now we know what is going on, we can start again with the null(intercept only) model and accumulate the results in glancerows as below.

glancerows = data .frame()

fm .fwd <- fm .null <- lm(Rings ~ 1 , abalone)

summary ( lm( fm .fwd ) )

row1 = data .frame(modelno = 0 , sigma = glance( lm( fm .fwd ) )$sigma )

row1

glancerows = rbind( glancerows, row1 )

Now we can use add1() command again to see which term to add next:

add1 (fm .fwd, big .lm, test= "F")

fm .fwd <- update(fm .fwd, . ~ . + Shell_weight)

row1 = data .frame(modelno = 1 , sigma = glance( lm( fm .fwd ) )$sigma )

row1

glancerows = rbind( glancerows, row1 )

add1 (fm .fwd, big .lm, test= "F")

For part (b) you could do something similar, copying and pasting, but more adventurous, and also feasible because you know the order to add variables (and therefore don’t have to look at the previous results in order to work out which to add next), is to write a loop to do this.

Recall from the simulation exercise of Lab 2 how to add successive rows to a data frame in a loop.