
MAST6029/MA6529 STATISTICAL LEARNING 2021

Published: 2022-05-18



SECTION A

These questions will each be marked out of 10. Candidates may attempt all SIX questions but are advised that they cannot obtain a total of more than FIFTY MARKS on this section. Candidates must show their working on their scripts.

1. A study was conducted on exam marks achieved by school students in five subjects: Mathematics, Physics, Economics, History and Geography. Factor analysis models with one and two factors were used to analyse the data, and part of the output is given below.

Table 1: Factor loadings of the one-factor and two-factor models.

Columns: Variable; One Factor Model (Factor 1); Two Factor Model (Factor 1, Factor 2)
Variables: Mathematics, Physics, Economics, History, Geography
Loadings: 0.53 0.37 0.82 0.92 0.99 0.27 0.11 0.70 0.99 0.89

Test of the hypothesis that 1 factor is sufficient.

The chi square statistic is 141.25 on 5 degrees of freedom.

The p-value is 9.68e-29

Test of the hypothesis that 2 factors are sufficient.

The chi square statistic is 1.3 on 1 degree of freedom.

The p-value is 0.254
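The degrees of freedom and p-values reported above can be checked by hand. The sketch below assumes the usual likelihood-ratio test for a k-factor model on p variables, with ((p − k)² − (p + k))/2 degrees of freedom, and uses closed-form chi-square tail probabilities that are valid only for 1 and 5 degrees of freedom. The function names are illustrative and not part of the exam output.

```python
import math

def factor_test_df(p, k):
    """Degrees of freedom of the test that k factors suffice for p variables."""
    return ((p - k) ** 2 - (p + k)) // 2

def chi2_sf_df1(x):
    """Upper tail P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

def chi2_sf_df5(x):
    """Upper tail P(X > x) for a chi-square variable with 5 degrees of freedom."""
    extra = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * (1 + x / 3)
    return math.erfc(math.sqrt(x / 2)) + extra

# Five observed variables, one-factor and two-factor models.
df_one = factor_test_df(5, 1)
df_two = factor_test_df(5, 2)

# Tail probabilities for the reported statistics.
p_one = chi2_sf_df5(141.25)
p_two = chi2_sf_df1(1.3)

print(df_one, df_two, p_one, p_two)
```

A very small p-value rejects the hypothesis that the stated number of factors is sufficient, whereas a large one gives no evidence against it.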

(a) Interpret the model with two factors. [ 4 marks ]

(b) State the null and the alternative hypotheses of the tests reported in the R output. [ 2 marks ]

(c) Explain what the R output indicates about the fit of the one-factor and the two-factor models and how you would use this output to choose between a one-factor and a two-factor model for these data. [ 4 marks ]

2. Marks were recorded for a class of students in three tests: French (X1), German (X2) and Chemistry (X3). The sample correlation matrix for these three variables is

      X1     X2     X3
X1    1      0.87   0.49
X2    0.87   1      0.41
X3    0.49   0.41   1

(a) Calculate the partial correlation coefficient between X1 and X3 given X2, and comment on the result. [ 5 marks ]

(b) Calculate the multiple correlation coefficient between X3 and the other two variables (X1, X2) and comment on the result. You may use the fact that

\− 1 = \ .
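For three variables, both quantities asked for in this question have closed forms in terms of the pairwise correlations. The sketch below is a minimal illustration that assumes the matrix entries read r12 = 0.87, r13 = 0.49 and r23 = 0.41; this reading of the layout is an assumption, and the function names are illustrative.

```python
import math

def partial_corr_13_given_2(r12, r13, r23):
    """Partial correlation between X1 and X3 after adjusting for X2."""
    return (r13 - r12 * r23) / math.sqrt((1 - r12 ** 2) * (1 - r23 ** 2))

def multiple_corr_3_on_12(r12, r13, r23):
    """Multiple correlation between X3 and the pair (X1, X2)."""
    r_squared = (r13 ** 2 + r23 ** 2 - 2 * r12 * r13 * r23) / (1 - r12 ** 2)
    return math.sqrt(r_squared)

# Assumed reading of the sample correlation matrix (see the lead-in).
r12, r13, r23 = 0.87, 0.49, 0.41

partial = partial_corr_13_given_2(r12, r13, r23)
multiple = multiple_corr_3_on_12(r12, r13, r23)
print(round(partial, 3), round(multiple, 3))
```

Comparing the partial correlation with the raw r13 shows how much of the association between X1 and X3 is carried by X2.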

3. A study on n = 37 children was designed to determine how well a set of paired-associate (PA) tasks predicted performance on the Raven Progressive Matrices test (RPMT), a student achievement test (SAT) and the Peabody Picture Vocabulary test (PPVT). The PA tasks varied in how the stimuli were presented, and are called “named” (n), “still” (s), “named still” (ns), “named action” (na), and “sentence still” (ss).

A canonical correlation analysis, studying the correlation between the PA tasks and the performance on the three tests, was performed on the data. The canonical correlation coefficients obtained are 0.679, 0.362 and 0.224, respectively. The corresponding canonical correlation vectors are

$xcoef
       [,1]    [,2]    [,3]
n    -0.189   0.827   0.369
s     0.159   0.142  -0.399
ns    0.445  -1.140   0.600
na   -0.890  -0.319  -1.057
ss   -0.448   0.409   0.767

$ycoef
       [,1]    [,2]    [,3]
RPMT -0.301  -0.877  -0.869
SAT  -0.325   0.831  -0.695
PPVT -0.961  -0.979   0.160

(a) What is the purpose of canonical correlation analysis? [ 2 marks ]

(b) What does this analysis indicate about the relationship between the PA tasks and the test performance of the children? [ 8 marks ]

4. Given the following graph:

Answer the following questions:

(a) Is the graph complete? Justify your answer. [ 2 marks ]

(b) List the maximal cliques of the graph and decompose the joint probability distribution of (X1, . . . , X5 ). [ 4 marks ]

(c) The random variables (X1, . . . , X5 ) are distributed according to a Gaussian graphical model with covariance matrix Σ and Markov structure encoded by the above graph. Explain what can be said about the elements of the inverse of the covariance matrix Σ. [ 4 marks ]

5. (a) Write down the probability density function of a Gaussian mixture model with three components with parameters α1 = 0.3, α2 = 0.5, α3 = 0.2, µ1 = 1, µ2 = −2, µ3 = −5, σ1 = 1, σ2 = 0.5, σ3 = 1, where αi, µi and σi are the weight, mean and standard deviation of the i-th component, i = 1, 2, 3, respectively. [ 4 marks ]

(b) The distribution of a set of observations (y1, y2, y3) = (1, 2, −4) is assumed to be a Gaussian mixture model with two components, with model parameters (α, µ1, σ1, µ2, σ2), where α is the weight of the first component and µi and σi are the mean and standard deviation of the i-th component, i = 1, 2, respectively.

(i) Write down the likelihood f(y1 , y2 , y3 ) of the observations as a function of the model parameters. [ 3 marks ]

(ii) We now introduce the variables (z1, z2, z3), such that zi = 1 if observation yi, i = 1, 2, 3, belongs to the first component and zi = 0 if it belongs to the second. Given the values (z1, z2, z3) = (1, 0, 1), write down the likelihood f(y1, z1, y2, z2, y3, z3).
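The quantities in Question 5 can be sketched numerically. The code below evaluates the three-component density from part (a) with the parameters given there, then contrasts the observed-data and complete-data likelihoods from part (b). The two-component parameter values in `theta` are hypothetical, chosen only for illustration since the question leaves them symbolic, and all function names are illustrative.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(y, weights, means, sds):
    """Weighted sum of component densities."""
    return sum(a * normal_pdf(y, m, s) for a, m, s in zip(weights, means, sds))

def observed_likelihood(ys, alpha, mu1, sigma1, mu2, sigma2):
    """Product over observations of the two-component mixture density."""
    like = 1.0
    for y in ys:
        like *= alpha * normal_pdf(y, mu1, sigma1) + (1 - alpha) * normal_pdf(y, mu2, sigma2)
    return like

def complete_likelihood(ys, zs, alpha, mu1, sigma1, mu2, sigma2):
    """Joint likelihood of observations and labels (z = 1 means first component)."""
    like = 1.0
    for y, z in zip(ys, zs):
        if z == 1:
            like *= alpha * normal_pdf(y, mu1, sigma1)
        else:
            like *= (1 - alpha) * normal_pdf(y, mu2, sigma2)
    return like

# Part (a): parameters exactly as given in the question.
weights, means, sds = [0.3, 0.5, 0.2], [1, -2, -5], [1, 0.5, 1]
density_at_zero = mixture_pdf(0.0, weights, means, sds)

# Part (b): HYPOTHETICAL parameter values, for illustration only.
theta = dict(alpha=0.6, mu1=1.0, sigma1=1.0, mu2=-3.0, sigma2=1.5)
ys, zs = [1, 2, -4], [1, 0, 1]
obs = observed_likelihood(ys, **theta)
comp = complete_likelihood(ys, zs, **theta)
print(density_at_zero, obs, comp)
```

Once the labels are known, each observation contributes only the term of its own component, so the complete-data likelihood is a product of single terms rather than a product of sums.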

6. (a) Convert the following similarity matrix to a dissimilarity matrix. [ 2 marks ]

5.5

(b) Cluster variables {1, 2, 3, 4, 5} into the optimal configuration of three clusters, using the cluster dendrogram shown in Figure 1. [ 3 marks ]

Figure 1: Cluster dendrogram

(c) Assume that a set of observations has been allocated into three clusters using a model-based approach, with cluster j, j = 1, 2, 3, described by a Gaussian distribution with mean µj and variance σj². The estimates of the cluster parameters are (µˆ1, µˆ2, µˆ3, σˆ1, σˆ2, σˆ3) = (1, 3, 5, 1, 2, 1). Allocate observations y = 1 and y = 4 to one of the three clusters and justify your answer. [ 5 marks ]
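One way to approach the allocation in part (c) is to assign each observation to the cluster whose fitted Gaussian density is largest at that point. The sketch below assumes equal mixing weights for the three clusters (the question does not state them) and treats the σˆ values as standard deviations; `allocate` is an illustrative name.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Estimates from the question; SDS are assumed to be standard deviations.
MEANS = [1, 3, 5]
SDS = [1, 2, 1]

def cluster_densities(y):
    """Fitted density of each cluster evaluated at y."""
    return [normal_pdf(y, m, s) for m, s in zip(MEANS, SDS)]

def allocate(y):
    """Return the 1-based index of the cluster with the highest density at y.

    Assumes equal mixing weights, so comparing densities is equivalent to
    comparing posterior membership probabilities.
    """
    densities = cluster_densities(y)
    return densities.index(max(densities)) + 1

for y in (1.0, 4.0):
    print(y, allocate(y), [round(d, 4) for d in cluster_densities(y)])
```

Under these assumptions y = 1 is allocated to cluster 1 and y = 4 to cluster 3. Although y = 4 is equally far from the means of clusters 2 and 3, the wider spread of cluster 2 lowers its density at that point.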

SECTION B

These questions will each be marked out of 25. Candidates may not attempt more than TWO of the THREE questions. Candidates must show their working on their scripts.

7. Measurements were taken on Egyptian skulls from two different epochs: c4000 BC and c200 BC. Samples of 30 skulls from each epoch were investigated and some statistics about the maximum breadth and the nasal height of the skulls are provided. The means of the two measurements, the within-group sample covariance matrices S1 (c4000 BC) and S2 (c200 BC), and the pooled sample covariance matrix S are presented below.

Epoch mean         c4000 BC   c200 BC
Maximum Breadth    131.4
Nasal Height       50.5

S1 = \ , S2 = \ , S = \ ,


(a) Test the null hypothesis that the vector of means of the c4000 BC epoch is equal to (130, 50) using a 5% significance level and assuming an unknown variance-covariance matrix. [ 6 marks ]

(b) Test the null hypothesis of equal vectors of means in the two epochs and clearly state your assumptions and conclusion. [ 6 marks ]

(c) Assuming the same covariance matrix for both epochs, use the above information to derive a linear discriminant function to distinguish between skulls from the two epochs. [ 13 marks ]
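The calculations behind Question 7 follow a standard pattern: Hotelling's T² for the one-sample and two-sample tests, and a linear discriminant built from the pooled covariance matrix. The sketch below is a minimal illustration for two variables. The covariance matrix `S_HYPOTHETICAL` and the second mean vector `XBAR2_HYPOTHETICAL` are invented placeholders, because those values are not legible in this copy of the paper; the function names are illustrative as well.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad_form(v, m):
    """Quadratic form v' m v for a 2-vector v and a 2x2 matrix m."""
    return sum(v[i] * m[i][j] * v[j] for i in range(2) for j in range(2))

def hotelling_one_sample(xbar, mu0, s, n):
    """T^2 and its F transform; F has (p, n - p) degrees of freedom under H0."""
    p = len(xbar)
    diff = [x - m for x, m in zip(xbar, mu0)]
    t2 = n * quad_form(diff, inverse_2x2(s))
    return t2, (n - p) / (p * (n - 1)) * t2

def hotelling_two_sample(xbar1, xbar2, s_pooled, n1, n2):
    """T^2 and its F transform; F has (p, n1 + n2 - p - 1) degrees of freedom."""
    p = len(xbar1)
    diff = [u - v for u, v in zip(xbar1, xbar2)]
    t2 = n1 * n2 / (n1 + n2) * quad_form(diff, inverse_2x2(s_pooled))
    return t2, (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2

def linear_discriminant(xbar1, xbar2, s_pooled):
    """Coefficients a = S^{-1}(xbar1 - xbar2) and the midpoint cutoff."""
    s_inv = inverse_2x2(s_pooled)
    diff = [u - v for u, v in zip(xbar1, xbar2)]
    a = [sum(s_inv[i][j] * diff[j] for j in range(2)) for i in range(2)]
    cutoff = sum(ai * (u + v) / 2 for ai, u, v in zip(a, xbar1, xbar2))
    return a, cutoff

XBAR1 = [131.4, 50.5]                       # c4000 BC means from the question
MU0 = [130.0, 50.0]                         # hypothesised mean vector, part (a)
S_HYPOTHETICAL = [[26.0, 4.0], [4.0, 10.0]]  # placeholder, NOT from the paper
XBAR2_HYPOTHETICAL = [135.0, 51.0]           # placeholder, NOT from the paper

t2_one, f_one = hotelling_one_sample(XBAR1, MU0, S_HYPOTHETICAL, 30)
t2_two, f_two = hotelling_two_sample(XBAR1, XBAR2_HYPOTHETICAL, S_HYPOTHETICAL, 30, 30)
a, cutoff = linear_discriminant(XBAR1, XBAR2_HYPOTHETICAL, S_HYPOTHETICAL)
print(t2_one, f_one, t2_two, f_two, a, cutoff)
```

With equal priors and misclassification costs, a skull with measurements x would be allocated to the first epoch when a'x exceeds the cutoff. Each F statistic is compared with the relevant critical value from F tables.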

8. Data on 300 types of pizza were collected, measuring the following variables:

• mois: water per 100 grams in the sample

• prot: protein per 100 grams in the sample

• fat: fat per 100 grams in the sample

• sodium: sodium per 100 grams in the sample

• carb: carbohydrates per 100 grams in the sample

The sample variances of the variables are

mois prot fat sodium carb

91.260 41.401 80.562 0.137 325.071

The correlation between the variables is plotted in Figure 2.

Figure 2: Correlation plot of the five pizza variables (mois, prot, fat, sodium, carb).

A principal component analysis was performed on the correlation matrix of the data set and the standard deviations of the principal components were obtained as

Standard deviations (1, .., p=5):

1.747 1.219 0.634 0.242 0.013
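The standard deviations above can be turned into proportions of variance explained, which is the usual basis for deciding how many components to retain. The sketch below squares each standard deviation and normalises by the total; because the analysis uses the correlation matrix of five variables, the variances should sum to roughly 5, up to rounding. Variable names are illustrative.

```python
# Standard deviations of the principal components, as reported in the output.
sds = [1.747, 1.219, 0.634, 0.242, 0.013]

# Component variances are the eigenvalues of the correlation matrix.
variances = [s ** 2 for s in sds]
total = sum(variances)

# Proportion of total variance explained by each component.
proportions = [v / total for v in variances]

# Running total of the proportions.
cumulative = []
running = 0.0
for prop in proportions:
    running += prop
    cumulative.append(running)

for i, (prop, cum) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{i}: proportion = {prop:.4f}, cumulative = {cum:.4f}")
```

The first two components together account for more than 90% of the total variance, and the last component has almost none, which points to a near-linear dependence among the five variables.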

(a) Comment on the correlations between the variables shown in Figure 2 and in particular in relation to the aim of principal component analysis for this data set. [ 3 marks ]

(b) Explain why it is more appropriate to use the correlation matrix instead of the covariance matrix to perform the analysis in this case. What would you expect to happen in terms of the construction of the first principal component if the covariance matrix were used instead? [ 4 marks ]

(d) Interpret the biplot, shown in Figure 3, in terms of the constructed principal components and of the individual pizza scores. [ 10 marks ]