STAT3017/6017 - Big Data Statistics - Assessment 5 2023
Published: 2023-10-27
Assessment 5
Due by Thursday 3 November 2023 09:00
[Total Marks: 26 (STAT6017) / 20 (STAT3017)]
Question 1 [12 marks]
Consider two p-dimensional populations with covariance matrices I_p and (I_p + ∆), where
∆ := diag(δ_1, δ_2, 0, …, 0)
with δ_1, δ_2 ∈ R. Suppose we have p-dimensional random samples x_1, …, x_{m+1} ∼ N_p(0, I_p) from the first population and p-dimensional random samples z_1, …, z_{n+1} ∼ N_p(0, I_p + ∆) from the second. We stack these random samples to obtain the data matrices X and Z and the sample covariance matrices
S_1 := (1/m) X X^T,  S_2 := (1/n) Z Z^T,  S := S_2^{-1} S_1.
[2] (a) Assume n, m, p → ∞ such that y_p := p/n → y ∈ (0, 1) and c_p := p/m → c > 0.
Take δ_1 = δ_2 = 0, y = 1/4, and c = 3/4. What are the lower bound a and the upper bound b of the support of the limiting spectral distribution of S? For each, give a formula in terms of c and y, and also give a numerical value.
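As a hedged illustration of the kind of formula part (a) is after, the sketch below evaluates the classical support edges of the limiting spectral distribution of a Fisher matrix, a, b = ((1 ∓ h)/(1 − y))^2 with h = √(c + y − cy). It assumes that y is the limiting ratio attached to the sample covariance matrix that gets inverted and c the ratio attached to the other one; the function name `fisher_lsd_edges` is ours, not part of the assignment.

```python
import math


def fisher_lsd_edges(c, y):
    """Support edges (a, b) of the limiting spectral distribution of a Fisher matrix.

    Assumption: y in (0, 1) is the limiting dimension-to-sample-size ratio of
    the inverted sample covariance matrix, c > 0 the ratio of the other one.
    """
    if not (0.0 < y < 1.0) or c <= 0.0:
        raise ValueError("need 0 < y < 1 and c > 0")
    h = math.sqrt(c + y - c * y)
    a = ((1.0 - h) / (1.0 - y)) ** 2
    b = ((1.0 + h) / (1.0 - y)) ** 2
    return a, b


a, b = fisher_lsd_edges(c=0.75, y=0.25)
print(f"a = {a:.6f}, b = {b:.6f}")
```

A quick sanity check on any implementation is that √(ab) = (1 − c)/(1 − y) whenever c < 1, which follows from h^2 = 1 − (1 − c)(1 − y).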
[2] (b) Suppose that δ1 = −ε and δ2 = +ε for ε = 1/10. Would you expect S to have
[2] (c) In the paper [A] (see also [B]) it is suggested that the largest eigenvalue λ_1 of S, suitably centred and scaled, behaves like a Tracy-Widom distribution of order 1. Show this using a simulation in the case n = 400, y_n = 1/4 and c_n = 3/4. Plot the histogram and compare it against the Tracy-Widom distribution of order 1.
[2] (d) Considering [B], suppose that δ_1 < ℓ and δ_2 > κ for some choice of ℓ and κ. What would be the critical values of ℓ and κ that ensure you have a large fundamental spike and a small fundamental spike? Give a formula for ℓ and κ, and also give a numerical value in the case y = 1/4 and c = 3/4.
you found in (d), then give a formula for each of the two locations around which you think the spike eigenvalues will cluster, and also a numerical value for each. [1 mark] Also, perform a simulation experiment to illustrate this phenomenon. That is, sample data and plot a histogram of the eigenvalues of S, compare it to the theoretical density expected if δ_1 = δ_2 = 0, and mark the locations around which you expect the spike eigenvalues to cluster. Take n = 400, y_n = 1/4, and c_n = 3/4. [1 mark]
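The phase transition behind parts (d) and (e) can be sketched numerically. The code below follows the convention of [B] as we understand it: the Fisher matrix is (inverted matrix)^{-1} × (spiked matrix), y is the ratio of the inverted matrix, c the ratio of the spiked one, and α denotes a population spike eigenvalue of the non-inverted population. How α relates to δ_1 and δ_2 depends on which of S_1, S_2 is inverted, so that mapping is left to the reader; the function names are hypothetical.

```python
import math


def spike_thresholds(c, y):
    """Critical spike values (lower, upper) in the convention described above.

    A population spike alpha produces an outlier sample eigenvalue only if
    alpha < lower or alpha > upper.
    """
    h = math.sqrt(c + y - c * y)
    gamma = 1.0 / (1.0 - y)
    return gamma * (1.0 - h), gamma * (1.0 + h)


def spike_limit(alpha, c, y):
    """Limit phi(alpha) around which the corresponding sample eigenvalue clusters.

    Only meaningful for alpha outside the two thresholds; the denominator
    vanishes at alpha = 1 / (1 - y), which lies between them.
    """
    return alpha * (alpha - 1.0 + c) / (alpha - 1.0 - alpha * y)


c, y = 0.75, 0.25
lower, upper = spike_thresholds(c, y)
print("critical spike values:", lower, upper)
for alpha in (0.05, 4.0):  # one spike below and one above the thresholds
    print("alpha =", alpha, "-> limit", spike_limit(alpha, c, y))
```

At the two critical values the limit φ(α) coincides with the edges of the bulk, which gives a convenient consistency check between this part and part (a).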
[2] (f) Consider the signal detection problem where we are trying to determine the number of signals in observations of the form
x_i = U s_i + ε_i, i = 1, …, m, (SD)
where the x_i's are p-dimensional observations, s_i is a k × 1 low-dimensional signal (k ≪ p) with covariance I_k, U is a p × k mixing matrix, and (ε_i) is an i.i.d. noise with covariance matrix Σ_2. None of the quantities on the right-hand side of (SD) are observed. In Section 7.2 of [B], they propose to estimate the number of signals k by
k̂ := max{i : λ_i ≥ β + log(p/p^{2/3})},
where (λ_i) are the eigenvalues of S. Reproduce Case 1 in Table 1 of [B] for the Gaussian case for values p = 25, 75, 125, 175, 225, 275. Fix y = 1/10 and c = 9/10; further parameters and setup can be found at the bottom of p. 436 and on p. 437.
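The counting step in the estimator of k can be sketched as follows. The threshold (the right-hand side of the inequality defining the estimator) is passed in as a plain number rather than computed, since its exact form should be taken from Section 7.2 of [B]; the function name `estimate_num_signals` is our own.

```python
def estimate_num_signals(eigenvalues, threshold):
    """Return max{i : lambda_i >= threshold}, with eigenvalues in decreasing order.

    Because the eigenvalues are sorted in decreasing order, the largest such
    index equals the number of eigenvalues at or above the threshold.
    Returns 0 when no eigenvalue reaches the threshold.
    """
    ordered = sorted(eigenvalues, reverse=True)
    return sum(1 for lam in ordered if lam >= threshold)


# Toy example with made-up numbers: two eigenvalues exceed the threshold.
print(estimate_num_signals([0.5, 9.1, 2.0, 7.3, 1.1], threshold=6.5))
```

In a replication study this function would be applied to the eigenvalues of S in each simulated data set, and the frequencies of the resulting estimates tabulated.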
Question 2 [8 marks]
In this question, we shall consider high-dimensional sample covariance matrices of data sampled from an elliptical distribution. We say that a random vector x with zero mean follows an elliptical distribution if (and only if) it has the stochastic representation
x = ξ A u, (⋆)
where the matrix A ∈ R^{p×p} is nonrandom with rank(A) = p, ξ ≥ 0 is a random variable representing the radius of x, and u ∈ R^p is the random direction, which is independent of ξ and uniformly distributed on the unit sphere S^{p−1} in R^p, denoted by u ∼ Unif(S^{p−1}). The class of elliptical distributions is a natural generalization of the multivariate normal distribution, and contains many widely used distributions as special cases, including the multivariate t-distribution, the symmetric multivariate Laplace distribution and the symmetric multivariate stable distribution.
[2] (a) Write a function runifsphere(n,p) that samples n observations from the distribution Unif(S^{p−1}) using the fact that if z ∼ N_p(0, I_p) then z/‖z‖ ∼ Unif(S^{p−1}). Check your results by: (1) setting p = 25, n = 50 and showing that the (Euclidean) norm of each observation is equal to 1; (2) generating a scatter plot in the case p = 2, n = 500 to show that the samples lie on a circle. [1 mark]
Show that you can simulate a multivariate t-distribution t_ν(0, I_p) by setting ξ ∼ √(ν/C) in (⋆) with A = I_p and C ∼ χ²_ν. Do this by sampling observations x_1, …, x_n and comparing the two marginal histograms of the observations against the density of the univariate t_ν distribution. Take p = 2, n = 1000, ν = 2. [1 mark]
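The two sampling steps in part (a) can be sketched as below. The assignment asks for R; this is a Python sketch using only the standard library, so it illustrates the logic rather than the expected submission. For the t-distribution we assume the standard scale-mixture construction x = z/√(C/ν) with z ∼ N_p(0, I_p), which in the notation of the elliptical representation corresponds to a radius ξ = ‖z‖·√(ν/C) that includes the chi-distributed length of z; whether this is exactly the radius intended by the question is an assumption. The name `rmvt` is hypothetical.

```python
import math
import random


def runifsphere(n, p, rng=random):
    """Return n points drawn uniformly from the unit sphere in R^p (list of lists)."""
    points = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]  # z ~ N_p(0, I_p)
        norm = math.sqrt(sum(v * v for v in z))
        points.append([v / norm for v in z])  # z / ||z|| is uniform on the sphere
    return points


def rmvt(n, p, nu, rng=random):
    """Sketch: n draws of x = xi * u with xi = ||z|| * sqrt(nu / C) (assumed radius)."""
    samples = []
    for u in runifsphere(n, p, rng):
        z_norm = math.sqrt(rng.gammavariate(p / 2.0, 2.0))  # chi with p degrees of freedom
        c = rng.gammavariate(nu / 2.0, 2.0)  # C ~ chi-square with nu degrees of freedom
        xi = z_norm * math.sqrt(nu / c)
        samples.append([xi * v for v in u])
    return samples


pts = runifsphere(50, 25)
print(max(abs(math.sqrt(sum(v * v for v in x)) - 1.0) for x in pts))  # close to 0
xs = rmvt(1000, 2, 2)
print(len(xs), len(xs[0]))
```

The histogram comparison against the univariate t_ν density would then be carried out on the two coordinates of `xs` with a plotting library of choice.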
[2] (b) Suppose that x_1, x_2, …, x_n are p-dimensional observations sampled from an elliptical distribution (⋆). We stack these observations into the data matrix X and calculate the sample covariance matrix S_n := XX^T/n. Theorem 2.2 of the recent paper [C] is a central limit theorem for linear spectral statistics (LSS) of S_n. For example, Eq. (2.10) in [C] provides the joint distribution of the LSS for φ_1(x) = x and φ_2(x) = x^2. Following the notation used there (for all the following terms in this question), perform a simulation experiment to examine the fluctuations of β̂_{n1} and β̂_{n2}. In the experiment, take H_p =
ξ ∼ k_1 Gamma(p, 1) with k_1 = 1/√(p + 1). Set the dimensions to be p = 200 and n = 400. Choose the number of simulations based on the computational power of your machine. Similar to Figure 1 in [C], use a QQ-plot to show normality.
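A short worked equation may help explain the scaling constant. Reading k_1 as 1/√(p + 1) and Gamma(p, 1) as the gamma distribution with shape p and unit scale (both are our reading of the text), the radius is normalised so that its second moment equals p:

```latex
\mathbb{E}[\xi^2] = k_1^2\,\mathbb{E}[G^2]
  = \frac{\operatorname{Var}(G) + (\mathbb{E}\,G)^2}{p+1}
  = \frac{p + p^2}{p+1} = p,
\qquad G \sim \mathrm{Gamma}(p, 1).
```

This matches the Gaussian case, where the squared radius is chi-square with p degrees of freedom and therefore also has mean p.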
[2] (c) In the recent paper [E], it is shown that if x_1, x_2, …, x_n are p-dimensional observations sampled from an elliptical distribution (⋆), then the largest eigenvalue λ_1 of the sample covariance matrix S_n (appropriately scaled) converges to the Tracy-Widom distribution as long as a certain condition on the tail of the distribution holds (Condition 2.7 in the paper). Perform a simulation to show that this holds in the case of a double exponential distribution but not in the case of a multivariate Student-t distribution. That is, simulate the largest eigenvalue of the sample covariance matrix and compare it to the Tracy-Widom distribution in each case. In the first case (double exponential) it should match, and in the second case (multivariate Student-t) it should not.
[2] (d) A nice property of elliptical distributions (⋆) is that the mixture coefficient ξ can feature heteroskedasticity and the overall distribution of x can exhibit heavy tails. Both are properties that are widely observed in financial and economic data, for example. In the recent paper [F], a more general setting is proposed in which the observations
x_i = ξ_i A u_i, i = 1, …, n,
may be such that
• the ξ_i's can depend on each other and on {u_i : i = 1, …, n} in an arbitrary way, and
• the ξ_i's do not need to be stationary.
The trick to dealing with this kind of observation is to self-normalise them. That is, we consider the new observations x̃_1, …, x̃_n, where each x̃_i is x_i rescaled by its own Euclidean norm ‖x_i‖.
The paper introduces two tests (LR-SN and JHN-SN) to consider the sphericity test
H_0 : Σ ∝ I_p  vs.  Σ ∝̸ I_p,
where ∝ means "proportional to". Reproduce the simulation experiment shown in Table 5 of [F] for the case p/n = 0.5 and only for LR-SN and JHN-SN for p = 100, 200, 500. Do this in the case of 1,000 replications.
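Why self-normalisation helps can be seen in a few lines: dividing x_i = ξ_i A u_i by its own norm removes ξ_i entirely, so arbitrary dependence or non-stationarity in the ξ_i's cannot affect the normalised data. The sketch below normalises to unit length; the paper [F] may use a different scaling constant, and the helper names `mat_vec` and `self_normalise` are ours.

```python
import math


def mat_vec(A, u):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, u)) for row in A]


def self_normalise(x):
    """Rescale x to unit Euclidean norm (assumed scaling; see the note above)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


A = [[2.0, 0.0], [1.0, 1.0]]  # arbitrary full-rank matrix for illustration
u = [0.6, 0.8]                # a point on the unit circle
Au = mat_vec(A, u)

# Whatever positive radius multiplies A u, the self-normalised vector is the same.
for xi in (0.1, 1.0, 50.0):
    x = [xi * v for v in Au]
    print(xi, self_normalise(x))
```

The test statistics LR-SN and JHN-SN themselves are defined in [F] and are not reproduced here.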
Question 3 [6 marks]
We will consider some additional tasks relating to the above questions. These are for STAT6017 students only.
[2] (a) Unfortunately, the results of [C] do not cover all elliptical distributions due to a moment condition on the distribution, see Table 1 in [C]. The results in [D] extend their results to more general elliptical distributions such as multivariate Gaussian mixtures. A p-dimensional vector x ∈ R^p is a multivariate Gaussian mixture with k subpopulations if its density function has the form
f(x) = Σ_{j=1}^{k} p_j φ(x; µ_j, Σ_j),
where (p_j) are the k mixing weights and φ( · ; µ_j, Σ_j) denotes the density function of the jth subpopulation with mean vector µ_j and covariance Σ_j. Consider the case where µ_1 = µ_2 = ⋯ = µ_k = 0 ∈ R^p and Σ_j = v_j Σ for some v_j > 0, j = 1, …, k. Write an R function to sample from such a distribution using the representation from Eq. (11) in [D].
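One straightforward way to sample from such a zero-mean scale mixture is to pick a component and then scale a Gaussian draw, as sketched below in Python with the standard library only (the assignment itself asks for R). This is direct component sampling and is not claimed to be the representation of Eq. (11) in [D]; the function name and the argument `A`, a matrix assumed to satisfy A A^T = Σ, are our own choices.

```python
import math
import random


def rgauss_scale_mixture(n, weights, scales, A, rng=random):
    """Draw n vectors from the mixture sum_j weights[j] * N_p(0, scales[j] * A A^T).

    weights: mixing weights p_j (non-negative, not all zero)
    scales:  positive variance multipliers v_j
    A:       list of rows of a matrix with A A^T = Sigma (assumption)
    """
    p, q = len(A), len(A[0])
    out = []
    for _ in range(n):
        j = rng.choices(range(len(weights)), weights=weights, k=1)[0]  # component label
        z = [rng.gauss(0.0, 1.0) for _ in range(q)]                    # z ~ N_q(0, I_q)
        sd = math.sqrt(scales[j])
        out.append([sd * sum(A[r][k] * z[k] for k in range(q)) for r in range(p)])
    return out


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
draws = rgauss_scale_mixture(5, weights=[0.8, 0.2], scales=[1.0, 4.0], A=identity)
print(len(draws), len(draws[0]))
```

Writing each draw as √(v_J)·A z with a random label J makes the link to the elliptical form explicit: the radius is √(v_J)·‖z‖ and the direction is z/‖z‖.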
[2] (b) Using your code from (a), perform a simulation experiment to simulate fluctuations of β̂_2 := ∫ x^2 dF^{S_n}(x) under a Gaussian scale mixture model where the variable ξ has a discrete distribution with two mass points P(ξ = 1.8√p) = 0.8 and P(ξ = 1.5√p) = 0.2. Consider the cases: (i) p = 100, n = 150, (ii) p = 600, n = 900. In each case, plot a histogram of the distribution of β̂_2 against the theoretical limiting density and also a QQ-plot similar to Figure 1 in [D]. Note: this is the experiment just above Section 3 in [D].
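A single replicate of this experiment can be sketched without an eigenvalue routine, because the spectral moments are traces: the first moment is tr(S_n)/p and the second is tr(S_n^2)/p. The sketch below assumes A = I_p, so that x_i = ξ_i u_i, and uses small dimensions to keep the pure-Python loops fast; the centring and scaling needed for the limiting density must be taken from [D]. Function names are hypothetical.

```python
import math
import random


def sample_two_point_elliptical(n, p, rng):
    """n draws of x = xi * u, u uniform on the sphere, xi two-point as in the text."""
    rows = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(v * v for v in z))
        xi = (1.8 if rng.random() < 0.8 else 1.5) * math.sqrt(p)
        rows.append([xi * v / norm for v in z])
    return rows


def spectral_moments(rows):
    """Return (tr(S)/p, tr(S^2)/p) for S = (1/n) * sum_i x_i x_i^T.

    Uses tr(S) = sum_i |x_i|^2 / n and tr(S^2) = sum_{i,j} (x_i . x_j)^2 / n^2,
    so no p x p matrix or eigen-decomposition is needed.
    """
    n, p = len(rows), len(rows[0])
    tr1 = sum(sum(v * v for v in x) for x in rows) / n
    tr2 = 0.0
    for xi_row in rows:
        for xj_row in rows:
            dot = sum(a * b for a, b in zip(xi_row, xj_row))
            tr2 += dot * dot
    tr2 /= n * n
    return tr1 / p, tr2 / p


rng = random.Random(2023)
beta1, beta2 = spectral_moments(sample_two_point_elliptical(n=30, p=20, rng=rng))
print(beta1, beta2)
```

Repeating the last two lines many times at the dimensions given in the question yields the sample of second-moment values whose histogram and QQ-plot are requested.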
[2] (c) In addition to Question 2 (d), also reproduce the simulation experiment shown in Table 7 of [F] for the case p/n = 0.5 and only for LR-SN and JHN-SN for p = 100, 200, 500. Do this in the case of 1,000 replications.
[A] Han, Pan, Zhang (2016). The Tracy-Widom law for the largest eigenvalue of F-type matrices. Annals of Statistics, Vol. 44.
[B] Wang, Yao (2017). Extreme eigenvalues of large-dimensional spiked Fisher matrices with application. Annals of Statistics, Vol. 45, No. 1.
[C] Hu, Li, Liu, Zhou (2019). High-dimensional covariance matrices in elliptical distributions with application to spherical test. Annals of Statistics.
[D] Zhang, Hu, Li (2022). CLT for linear spectral statistics of high-dimensional sample covariance matrices in elliptical distributions. Journal of Multivariate Analysis.
[E] Wen, Xie, Yu, Zhou (2022). Tracy-Widom limit for the largest eigenvalue of high-dimensional covariance matrices in elliptical distributions. Bernoulli.
[F] Yang, Zheng, Chen (2021). Testing high-dimensional covariance matrices under the elliptical distribution and beyond. Journal of Econometrics.