关键词 > STAT3064/STAT5061
STAT3064/STAT5061 Statistical Learning/Statistical Data Science Computer Lab 3
发布时间:2022-08-22
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
STAT3064/STAT5061 Statistical Learning/Statistical Data Science
Computer Lab 3, Semester 2 2022
Things you may need to know/do.
● Relates to lecture 3.
● Libraries ggplot2, MASS, tidyverse, GGally may be useful. Others may be mentioned below in the hints.
● You might like to set up a project for each lab if you are using RStudio. Then you can copy a .Rmd file into that directory and write your answers in that file.
Before you run this code, remove eval = FALSE everywhere appropriate.
Start R
library (MASS)
library(tidyverse)
library(ggplot2)
library(GGally)
library( plot3D )
library( broom )
library( pls)
Q1
We consider the HDLSS Breast Tumour Gene Expression data of van’t Veer et.al. which we met in Lecture 3. We observed that one of the 78 observations appears to be an outlier. By restricting the number of variables to smaller subsets, we want to work out which groups of variables show up this observation as an outlier. We treat the genes as variables as in the lecture slides.
Use prcomp with the raw data for Q1. Note that this data set has no column names and is not a csv file despite the name. It seems to be tab-separated!
(a) Which observation number corresponds to the outlier? What did you compute to find it? (There is no unique or correct way of doing it!)
(b) We consider the five (sub)sets of the variables: all 4751 variables as in the analysis in the lecture, the
first 1024, the first 512, the first 256, and finally the first 128 variables only. You may want to create separate data frames for each subset of variables.
(c) For each of the five groups of variables and all 78 observations, conduct a PCA. Using the outcome of the PCA, find the rank of the sample covariance matrix, display the eigenvalue plots, the cumulative contribution to total variance plots and the PC score plots analogous to those shown on p19 and p20 respectively of the lecture slides.
(d) Compare the results of the different groups with the results shown in the lecture slides and comment.
What happened to the outlier? Did it move closer to the centre of the PC1/PC2 data?
(e) What do you expect to change if you leave out the outlying observations from the calculations of part (c)?
Q2
Now work with the full data of Q1 (we do not use the subsets of variables here). Scale the data and repeat part (c) of Q1 on the scaled data. Compare the results with those of Q1, with particular reference to the outlier.
You will need to use scale . = TRUE in prcomp.
Q3
We want to compare the performance of linear regression and PCR for the abalone data. We do this by
calculating the residual standard deviation (mis-named the Residual standard error in the lm summary)
and we study how this depends on the number of predictors for linear regression and the number of PC scores for PCR. Regard the variable Rings as the response, and all other variables apart from Sex as predictors.
For p = 1 to 7 calculate the residual standard deviation in the following cases:
(a) Based on the best p predictors in linear regression. You can do this with copying and pasting. Use the
add1 function and lm to do forward selection using the F-test to choose the next variable.
(b) Based on the first p PC scores, that is, the p-dimensional PC data, as predictors in PCR.
Plot residual standard deviation against number of predictors for the results of parts (a) and (b). Compare the results and comment.
Hints: We describe some of the regression commands and how we may use glance to gather and simplify the information we obtain.
For this question you require the library broom.
How to get data
Use col_names for ease of interpretation.
# Copied from the info file
coln = c (
"Sex" , # nominal
"Length" , # continuous
"Diameter" , # continuous mm
"Height" , # continuous mm
"Whole_weight" , # continuous
"Shucked_weight" , # continuous
"Viscera_weight" , # continuous
"Shell_weight" , # continuous
"Rings" # integer
)
abalone = read_csv ( file = "abalone/abalone .csv" , col_names = coln )
summary ( abalone )
summary ( factor( abalone$Sex ) )
Regression
# paste( coln, collapse = " + " )
# [1] "Sex + Length + Diameter + Height + Whole_ weight + Shucked_ weight + Viscera_ weight + Shell_ weight
big .lm = lm( Rings ~ Length + Diameter + Height + Whole_weight + Shucked_weight + Viscera_weight + She summary ( big .lm )
Forward selection
In forward selection, we start with a model that has no regressor variables (the null model) and add one regressor at the time. The null model is easily fitted:
fm .fwd <- fm .null <- lm(Rings ~ 1 , abalone)
The first argument to add1() is the model to which you want to add a term, the second argument (scope) is a formula that contains all terms that we are willing to add to our model. This scope is easily obtained from the model fm .full that we fitted above. If we want to use significant tests to guide us to decide which term to include, we have to specify that we want add1() to perform such significant tests:
add1 (fm .fwd, big .lm, test= "F")
summary ( lm( fm .fwd ) )
All p-values are extremely small.. We therefore look at which F statistics is the largest. This would suggest that we should first add the term Shell_weight to our model:
fm .fwd <- update(fm .fwd, . ~ . + Shell_weight)
summary ( lm( fm .fwd ) )
Start over
Now we know what is going on, we can start again with the null(intercept only) model and accumulate the results in glancerows as below.
glancerows = data .frame()
fm .fwd <- fm .null <- lm(Rings ~ 1 , abalone)
summary ( lm( fm .fwd ) )
row1 = data .frame(modelno = 0 , sigma = glance( lm( fm .fwd ) )$sigma )
row1
glancerows = rbind( glancerows, row1 )
Now we can use add1() command again to see which term to add next:
add1 (fm .fwd, big .lm, test= "F")
fm .fwd <- update(fm .fwd, . ~ . + Shell_weight)
row1 = data .frame(modelno = 1 , sigma = glance( lm( fm .fwd ) )$sigma )
row1
glancerows = rbind( glancerows, row1 )
add1 (fm .fwd, big .lm, test= "F")
For part (b) you could do something similar, copying and pasting, but more adventurous, and also feasible because you know the order to add variables (and therefore don’t have to look at the previous results in order to work out which to add next), is to write a loop to do this.
Recall from the simulation exercise of Lab 2 how to add successive rows to a data frame in a loop.