STAT3064/STAT5061 Statistical Learning/Statistical Data Science Computer Lab 4

Published: 2022-09-06

Computer Lab 4, Semester 2 2022

Things you may need to know/do.

● Relates to lecture 4.

● Libraries ggplot2, MASS, tidyverse, GGally may be useful. Others may be mentioned below in the hints.

● You might like to set up a project for each lab if you are using RStudio. Then you can copy a .Rmd file into that directory and write your answers in that file.

Question 1. Consider the aeroplane data. The data are available in the Data Sets folder. Use the six logged variables as in previous labs. Scale the logged variables and work with the six scaled logged variables in this lab.

a) Calculate the factor loadings based on the principal component factor analysis approach, and extract the loadings for the first two factors and for the first five factors. (Hint: see p14 of the lecture slides. Recall that the factor loadings are calculated from the eigenvectors and the eigenvalues of the covariance matrix, as shown on p14.)

b) Use varimax to calculate the VC optimal factor loadings A_{VC,2} and A_{VC,5} for the 2-factor model and the 5-factor model respectively.

c) Display the unrotated (plain) factor loadings A_2 of part (a), the VC optimal factor loadings A_{VC,2}, and the first two columns of A_{VC,5} in biplots.

d) List the factor loadings you have computed.

e) Compare the results of parts (b)–(d) and comment on similarities and differences.
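
The steps in parts (a) to (c) can be sketched as follows. This is only a sketch under stated assumptions: it uses six logged and scaled columns of the built-in mtcars data as a stand-in for the aeroplane data, the helper pc_loadings is a hypothetical name, and varimax is called with its default settings (normalize = TRUE), which may differ from the VC definition used in the lecture.

```r
# Stand-in data (assumption): six logged, scaled variables from mtcars,
# NOT the aeroplane data used in the lab.
X <- scale(log(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]))

# Eigen-decomposition of the sample covariance matrix of the scaled data
eig <- eigen(cov(X))

# PC-based loadings for k factors: A_k = Gamma_k %*% Lambda_k^(1/2)
# (pc_loadings is a hypothetical helper, not part of any package)
pc_loadings <- function(eig, k) {
  eig$vectors[, 1:k, drop = FALSE] %*% diag(sqrt(eig$values[1:k]), nrow = k)
}

A_2 <- pc_loadings(eig, 2)
A_5 <- pc_loadings(eig, 5)
rownames(A_2) <- rownames(A_5) <- colnames(X)

# Varimax rotation of each loading matrix (stats::varimax, default settings)
A_VC_2 <- varimax(A_2)$loadings
A_VC_5 <- varimax(A_5)$loadings

# Biplots: PC scores in white, loadings in blue
scores <- X %*% eig$vectors[, 1:2]
biplot(scores, A_2, col = c("white", "blue"))
biplot(scores, unclass(A_VC_2), col = c("white", "blue"))
biplot(scores, unclass(A_VC_5)[, 1:2], col = c("white", "blue"))
```

Because varimax applies an orthogonal rotation, the row sums of squared loadings (the communalities) are the same for A_2 and A_VC_2, which gives a quick check on the calculation.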

Question 2. Consider the 13-dimensional wine recognition data. The data are available in the Data Sets folder. Ignore the first column for the calculations in this lab, as it contains the labels, and note that the file is tsv (not csv).

Read the data into R.

a) Scale the data and work with the scaled data in this question.

b) Calculate a two-factor model based on the PC approach and show the result in a biplot.

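c) Calculate the sample covariance matrix of the scaled data and the eigenvalues of this matrix. For k = 2, calculate \hat{\sigma}^2 as in Box 6.7 and list this value.

We compute

\hat{\sigma}^2 = \frac{1}{d-k} \sum_{j>k} \hat{\lambda}_j .

(Here the \hat{\lambda}_j are the eigenvalues of the correlation matrix and d = 13 is the number of variables.)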

d) Calculate the sample equivalent of \Sigma_A (see p14, right branch) for the scaled data and list the matrix.

e) In this part you may follow the code chunk provided to see what happens when you use the particular value of \hat{\sigma}^2 calculated in part (c). This code calculates the factor loadings (A in the right-hand branch) for 2-factor principal axis factoring using that value of \hat{\sigma}^2. You don’t need to do anything apart from commenting on the difference between the results of parts (b) and (e). In particular, look at the eigenvalues of \Sigma_A and comment.
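
The calculation in parts (c) and (d) can be sketched as below, assuming that \hat{\sigma}^2 is the average of the d - k smallest eigenvalues of the correlation matrix. The data here are simulated as a stand-in (the name wine0_sim and the simulated loadings are hypothetical), so the numbers will not match those for the wine data.

```r
# Simulated stand-in for the scaled wine data (assumption: NOT the real data)
set.seed(1)
n <- 200; d <- 13; k <- 2
L_true <- matrix(rnorm(d * k), d, k)                  # hypothetical loadings
raw <- matrix(rnorm(n * k), n, k) %*% t(L_true) + matrix(rnorm(n * d), n, d)
wine0_sim <- scale(raw)

# Covariance matrix of the scaled data (equal to the correlation matrix of raw)
S <- cov(wine0_sim)
lambda <- eigen(S)$values

# Assumed form: average of the eigenvalues with index j > k
sigma_hat_sq <- sum(lambda[(k + 1):d]) / (d - k)
sigma_hat_sq

# Sample equivalent of Sigma_A and its eigenvalues
S_A <- S - diag(rep(sigma_hat_sq, d))
eig_S_A <- eigen(S_A)$values
eig_S_A
```

Subtracting a multiple of the identity shifts every eigenvalue of S down by sigma_hat_sq, so the last d - k eigenvalues of S_A sum to zero; this is worth keeping in mind when commenting on part (e).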

Code chunk for (d) and (e)

Notation: We compute A specifically, using the scaled data. Use the right branch of p14. Write S for the covariance matrix of the scaled data, corresponding to \Sigma, Om for \Omega and S_A for \Sigma_A.

The loadings will be in the matrix A. We set k = 2.

(Don’t forget to take out the eval = FALSE.)

S = cov( wine0 ) # wine0 is the scaled wine data

Om = diag( rep( sigma_hat_sq, 13 ) ) # sigma_hat_sq is the value calculated in (c)

S_A = S - Om

S_A

Use eigen on the matrix S_A.

eig_A = eigen ( S_A )

eig_A

Gamma_hat_2 = eig_A$vectors[ ,1:2]

Gamma_hat_2

Lambda_hat_2 = diag( eig_A$values[ 1:2 ]^(1/2) )

Lambda_hat_2

Ahat = Gamma_hat_2 %*% Lambda_hat_2

Ahat

biplot( PCA.wine$x, Ahat, col = c ("white" , "blue") )

biplot( PCA.wine$x, A[,1:2], col = c ("white" , "blue") )

Question 3. Consider the scaled data of Q2. We calculate different ML-based 2-factor models which we want to compare. (Hint: use the factanal function.)

a) Calculate ‘plain’ ML factor loadings, the varimax optimal orthogonal factor loadings and the varimax optimal oblique factor loadings (using promax) and compare the results of the factor loadings and biplots. (You will need scores = "regression" in the factanal function in order to get the scores for the biplots. See below.) Comment on the results.

b) Carry out a sequence of hypothesis tests for the number of factors in the model, starting with the one-factor model.

c) For each k \le k_{max}, state the number of degrees of freedom of the chi-square distribution, which is the approximate distribution of -2 log LR_k, and the p-value.

d) What is the appropriate k for these data? Use a significance level of 0.01. Then repeat with a significance level of 0.10.
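
The sequence of tests in parts (b) to (d) could be organised along the following lines. This sketch uses the built-in ability.cov covariance matrix (six variables) as a stand-in, so factanal is called with covmat rather than with a data matrix as in the lab, and k_max = 2 is specific to that stand-in. The degrees of freedom are compared with the usual formula ((d - k)^2 - (d + k)) / 2.

```r
# Stand-in (assumption): built-in ability.cov, NOT the wine data
d <- nrow(ability.cov$cov)   # 6 variables
k_max <- 2                   # largest k with positive degrees of freedom here

tests <- t(sapply(1:k_max, function(k) {
  fa <- factanal(factors = k, covmat = ability.cov)
  c(k = k,
    dof_formula = ((d - k)^2 - (d + k)) / 2,  # degrees of freedom by formula
    dof = fa$dof,                             # degrees of freedom from factanal
    statistic = unname(fa$STATISTIC),         # chi-square test statistic
    p_value = unname(fa$PVAL))                # p-value of the test
}))
tests

# Smallest k for which "k factors are sufficient" is not rejected
alpha <- 0.01
k_chosen <- tests[tests[, "p_value"] > alpha, "k"][1]
k_chosen
```

Changing alpha to 0.10 repeats the decision at the second significance level; the table of tests itself does not change.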

Hint:

fa.02 = factanal( wine0, 2, rotation = "none", scores = "regression" ) # this is the k=2 case, plain option

biplot( fa.02$scores, fa.02$loadings, col = c("white", "blue") )
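
The same pattern extends to the rotated models of part (a). The sketch below fits the plain, varimax and promax versions side by side; it uses simulated six-variable data with two latent factors as a stand-in (the loadings in L_true and all object names are hypothetical), so it illustrates the calls rather than the wine results.

```r
# Simulated stand-in data with two latent factors (assumption: NOT the wine data)
set.seed(2022)
n <- 300
f <- matrix(rnorm(n * 2), n, 2)
L_true <- cbind(c(0.9, 0.8, 0.7, 0.1, 0.2, 0.1),
                c(0.1, 0.2, 0.1, 0.8, 0.9, 0.7))   # hypothetical loadings
X <- scale(f %*% t(L_true) + matrix(rnorm(n * 6, sd = 0.5), n, 6))
colnames(X) <- paste0("V", 1:6)

# Plain ML, varimax (orthogonal) and promax (oblique) 2-factor models
fa_none    <- factanal(X, 2, rotation = "none",    scores = "regression")
fa_varimax <- factanal(X, 2, rotation = "varimax", scores = "regression")
fa_promax  <- factanal(X, 2, rotation = "promax",  scores = "regression")

# Loadings side by side for comparison
print(fa_none$loadings)
print(fa_varimax$loadings)
print(fa_promax$loadings)

# One biplot per model: scores in white, loadings in blue
par(mfrow = c(1, 3))
for (fa in list(fa_none, fa_varimax, fa_promax)) {
  biplot(fa$scores, fa$loadings, col = c("white", "blue"))
}
```

The rotation is applied after the model is fitted, so the uniquenesses are the same for all three fits; only the loadings and the scores change.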