
STAT3602 Statistical Inference / STAT6008 Advanced Statistical Inference Mini-project

Published: 2022-12-12


DEPARTMENT OF STATISTICS AND ACTUARIAL SCIENCE


(assessment weighting: 15%)

Due date: 16 December, 2022

Complete Q1–Q4 with the help of any convenient computer software of your choice.

Preliminary work

Let $(X_1, \ldots, X_n)$ be an i.i.d. sample of observations drawn from a discrete distribution on $\{1, \ldots, J\}$ such that, for each $i = 1, \ldots, n$,

$$P(X_i = j) = p_j, \quad j = 1, \ldots, J,$$

for some unknown probabilities $p_1, \ldots, p_J$ satisfying $\sum_{j=1}^{J} p_j = 1$. Define, for $j = 1, \ldots, J$,

$$N_j = \sum_{i=1}^{n} \mathbf{1}\{X_i = j\} = \text{number of observations in the sample which are equal to } j.$$

You may find the following fact useful for completing this project:

Let $\alpha_1, \ldots, \alpha_r \ge 0$ and $B > 0$ be given constants. Subject to the constraints $\beta_1, \ldots, \beta_r \ge 0$ and $\sum_{i=1}^{r} \beta_i = B$, the product $\prod_{i=1}^{r} \beta_i^{\alpha_i}$ is maximised by setting $\beta_i = B\alpha_i / \sum_{j=1}^{r} \alpha_j$, for $i = 1, \ldots, r$.
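The fact above can be checked numerically before it is used. The following is a minimal sketch, not a proof: the values of `alpha` and `B` are arbitrary illustrative choices, and the helper names are hypothetical. It compares the closed-form maximiser against randomly drawn feasible points on the log scale.

```python
import math
import random

def log_product(beta, alpha):
    """Return sum_i alpha_i * ln(beta_i), the log of prod_i beta_i ** alpha_i."""
    total = 0.0
    for b, a in zip(beta, alpha):
        if a == 0:
            continue              # beta_i ** 0 = 1 contributes nothing
        if b <= 0:
            return float("-inf")  # the product is zero
        total += a * math.log(b)
    return total

def maximiser(alpha, B):
    """Closed-form maximiser beta_i = B * alpha_i / sum_j alpha_j."""
    s = sum(alpha)
    return [B * a / s for a in alpha]

def random_feasible(r, B, rng):
    """Draw a random point with beta_i >= 0 and sum_i beta_i = B."""
    w = [rng.expovariate(1.0) for _ in range(r)]
    s = sum(w)
    return [B * x / s for x in w]

alpha = [3.0, 1.0, 6.0]  # illustrative values only (assumption)
B = 2.0
best = maximiser(alpha, B)

rng = random.Random(0)
best_random = max(log_product(random_feasible(len(alpha), B, rng), alpha)
                  for _ in range(2000))
# A non-positive gap means no random feasible point beat the closed form.
gap = best_random - log_product(best, alpha)
```

In Q1 the same fact is applied with the exponents given by word counts and with $B = 1$.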

Q1. (2.5%)

Let $a_1, \ldots, a_K$ ($K < J$) be fixed positive constants. Suppose that the probabilities $p_j$ are assumed to satisfy the constraint

$$p_1/a_1 = \cdots = p_K/a_K.$$

Show that, subject to the above constraint, the likelihood function $\prod_{i=1}^{n} p_{X_i}$ is maximised by setting

$$p_j = \begin{cases} \dfrac{a_j \sum_{i=1}^{K} N_i}{n \sum_{i=1}^{K} a_i}, & j = 1, \ldots, K, \\[2ex] N_j / n, & j = K+1, \ldots, J. \end{cases}$$

Hint: Let $p_1/a_1 = \cdots = p_K/a_K = c$. Show that the problem reduces to maximisation of $c^{\sum_{i=1}^{K} N_i} \prod_{j=K+1}^{J} p_j^{N_j}$ with respect to $p_{K+1}, \ldots, p_J$, $c > 0$, subject to the constraint $\sum_{i=K+1}^{J} p_i = 1 - c \sum_{i=1}^{K} a_i$.
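Once the Q1 result is established, evaluating it is a one-line computation. The sketch below is illustrative only: the function name and the toy counts are assumptions, not part of the assignment. It returns the fitted probabilities of the $K$ constrained categories together with the common ratio $c$.

```python
def constrained_mle_q1(counts_top, a, n):
    """Constrained MLE under p_1/a_1 = ... = p_K/a_K.

    counts_top : counts N_1..N_K of the K constrained categories
    a          : the fixed positive constants a_1..a_K
    n          : total sample size over all J categories
    Returns (p_top, c), where p_top[j] = c * a[j] and
    c = sum(N_1..N_K) / (n * sum(a_1..a_K)).
    For j > K the estimate is simply N_j / n and is not returned here.
    """
    if len(counts_top) != len(a):
        raise ValueError("counts_top and a must have the same length")
    c = sum(counts_top) / (n * sum(a))
    return [c * aj for aj in a], c

# Toy illustration (made-up numbers): K = 2, n = 100.
p_top, c = constrained_mle_q1([30, 10], [0.2, 0.1], 100)
```

Note that the fitted probabilities of the constrained categories always add up to their observed share of the sample, here 40/100.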

Q2. (2.5%)

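Let $a_1, \ldots, a_K$ ($K < J$) be fixed positive constants satisfying $\sum_{i=1}^{K} a_i < 1$. Suppose that the probabilities $p_j$ are assumed to satisfy the constraints

$$p_i \ge a_i \quad \text{for } i = 1, \ldots, K.$$

Show that, subject to the above constraints, the likelihood function $\prod_{i=1}^{n} p_{X_i}$ is maximised by setting

$$p_j = \begin{cases} \max\left\{ \dfrac{Z N_j}{\sum_{i=K+1}^{J} N_i},\; a_j \right\}, & j = 1, \ldots, K, \\[2ex] \dfrac{Z N_j}{\sum_{i=K+1}^{J} N_i}, & j = K+1, \ldots, J, \end{cases}$$

where $Z$ solves the equation

$$\sum_{j=1}^{K} \max\left\{ \frac{Z N_j}{\sum_{i=K+1}^{J} N_i},\; a_j \right\} + Z - 1 = 0.$$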

Hint: First show that, with $(p_1, \ldots, p_K)$ fixed, the likelihood is maximised by setting

$$p_j = \frac{N_j}{\sum_{i=K+1}^{J} N_i} \left( 1 - \sum_{i=1}^{K} p_i \right), \quad j = K+1, \ldots, J.$$

Next, proceed to maximise

$$G(p_1, \ldots, p_K) \triangleq \sum_{j=1}^{K} N_j \ln p_j + \left( \sum_{j=K+1}^{J} N_j \right) \ln\left( 1 - \sum_{j=1}^{K} p_j \right)$$

w.r.t. $p_1, \ldots, p_K$, using partial differentiation, taking into consideration the constraints $p_j \ge a_j$ for $j = 1, \ldots, K$.
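As a rough guide to the second step of the hint, the partial derivatives of $G$ can be written out as follows. This is only a sketch of one possible route; the shorthand $M$ and the case analysis are not part of the original hint and should be justified in full in a written solution.

$$\frac{\partial G}{\partial p_j} = \frac{N_j}{p_j} - \frac{M}{1 - \sum_{k=1}^{K} p_k}, \qquad M = \sum_{i=K+1}^{J} N_i, \qquad j = 1, \ldots, K.$$

Writing $Z = 1 - \sum_{k=1}^{K} p_k$ and noting that $G$ is concave, each coordinate of the maximiser is either interior, with $\partial G / \partial p_j = 0$ and hence $p_j = Z N_j / M \ge a_j$, or sits on its bound $p_j = a_j$ with $\partial G / \partial p_j \le 0$, that is, $Z N_j / M \le a_j$. Both cases are covered by $p_j = \max\{ Z N_j / M,\; a_j \}$, and substituting this into the definition of $Z$ gives the equation that $Z$ must solve.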

Real data problem

The Word file text.docx contains two texts:

• (in Chinese, pp. 1–5) an extract from “倚天屠龍記” (The Heaven Sword and Dragon Saber), written by 金庸 (Jin Yong) in 1961, consisting of 8046 characters, with all punctuation removed;

• (in English, pp. 6–18) an extract from “The Daughter of Time”, written by Josephine Tey in 1951, consisting of 9413 words.

For this project, you may choose to work on either the Chinese text or the English text.

For simplicity, in what follows the term “word” refers to either a Chinese character or an English word. The following tables list the top ten most commonly used Chinese and English words, together with their usage rates:

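(source: Chinese Character Frequency, https://humanum.arts.cuhk.edu.hk/Lexis/chifreq)

| Word | Usage rate (%) |
|------|----------------|
| 的 | 3.6800 |
| 一 | 1.6830 |
| 是 | 1.4020 |
| 不 | 1.3850 |
| 人 | 1.1490 |
| 有 | 1.1020 |
| 在 | 0.9324 |
| 了 | 0.7592 |
| 我 | 0.7482 |
| 中 | 0.6201 |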

(source: English Word Frequency, https://www.kaggle.com/datasets/rtatman/english-word-frequency)

| Word | Usage rate (%) |
|------|----------------|
| the | 3.9338 |
| of | 2.2363 |
| and | 2.2100 |
| to | 2.0637 |
| a | 1.5441 |
| in | 1.4401 |
| for | 1.0089 |
| is | 0.8001 |
| on | 0.6377 |
| that | 0.5781 |

Based on the sample text in text.docx, we wish to compare the author’s usage of the above 10 most commonly used words against the “benchmark” usage rates given in the above tables. Word frequencies of a text can be easily extracted online from the websites

https://www.browserling.com/tools/letter-frequency (for characters)

https://www.browserling.com/tools/word-frequency (for words).
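The counts can also be obtained offline. The sketch below is one possible approach, assuming the chosen text has already been pasted into a plain Python string. The tokenisation rules (lower-casing, treatment of apostrophes, the CJK character range) are assumptions and may not match the online tools exactly, so the resulting counts should be cross-checked.

```python
import re
from collections import Counter

def word_counts(text, chinese=False):
    """Count 'words' in text and return (Counter, total number of words).

    chinese=True  : every CJK character counts as one word
    chinese=False : lower-cased runs of letters, allowing one inner apostrophe
    """
    if chinese:
        tokens = re.findall(r"[\u4e00-\u9fff]", text)
    else:
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens), len(tokens)

# Toy illustration on a made-up sentence, not on the project text.
counts, n = word_counts("The cat and the hat of the king.")
top_ten = ["the", "of", "and", "to", "a", "in", "for", "is", "on", "that"]
counts_top = [counts[w] for w in top_ten]  # Counter returns 0 for absent words
```

Only the ten counts in `counts_top` and the total `n` are needed for the tests that follow.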

Assume that all the words in the sample text are independently and identically distributed over the entire vocabulary, which can be coded as a finite set {1, 2, . . . , J} (it is not necessary to specify J). Without loss of generality, we may take {1, 2, . . . , 10} to be the set of the 10 most commonly used words. For j = 1, . . . , J, denote by pj the probability of the author’s using the word j.

Q3. (5%)

Let a1 , . . . , a10 denote the benchmark usage rates of the 10 most commonly used words. We wish to test

$$H_0 : p_1/a_1 = \cdots = p_{10}/a_{10} \quad \text{vs} \quad H_1 : \text{no restriction}.$$

(a) Give a layman's interpretation of the null hypothesis H0.

(b) Using the results obtained in Q1, conduct a generalised likelihood ratio (GLR) test and report a p-value. You may approximate the null distribution of the GLR test statistic by a chi-square distribution with an appropriate number of degrees of freedom.

Does the sample text show evidence against H0 ?
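The GLR statistic in (b) can be computed directly from the ten counts. The following is a minimal sketch under the assumption that the statistic is defined as twice the difference between the unrestricted and the constrained maximised log-likelihoods. The unrestricted fit is $N_j/n$ for every $j$ and the constrained fit is the Q1 result, so the terms for $j > 10$ cancel. The function name and toy numbers are hypothetical.

```python
import math

def glr_statistic_q3(counts_top, a):
    """GLR statistic for H0: p_1/a_1 = ... = p_K/a_K.

    Equals 2 * sum_j N_j * ln( (N_j / S) / (a_j / A) ),
    with S = sum of the K counts and A = sum of the K benchmark rates.
    Terms with N_j = 0 are skipped (convention 0 * ln 0 = 0).
    """
    S = sum(counts_top)
    A = sum(a)
    stat = 0.0
    for Nj, aj in zip(counts_top, a):
        if Nj > 0:
            stat += Nj * math.log((Nj / S) / (aj / A))
    return 2.0 * stat

# Toy illustration with made-up counts (K = 2), not the project data.
t_proportional = glr_statistic_q3([20, 10], [0.2, 0.1])  # counts match the ratio
t_other = glr_statistic_q3([30, 10], [0.2, 0.1])
```

The chi-square p-value is then the upper-tail probability at the observed statistic. If SciPy is available, `scipy.stats.chi2.sf(t, df)` returns this tail probability for a chosen `df`; the choice of degrees of freedom is left to the argument in the report.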

(c) Instead of the chi-square distribution, the bootstrap method may be used to provide an alternative approximation to the null distribution of the GLR test statistic. For this purpose the bootstrap samples must be drawn in a way which respects the null hypothesis $H_0$. Thus, each bootstrap sample should be generated by (weighted) sampling with replacement from the sample text, where word $j$ should be drawn with an estimated probability $\hat{p}_j$, which can be taken to be the constrained maximum likelihood estimate of $p_j$ under $H_0$, $j = 1, \ldots, J$.

Calculate the constrained maximum likelihood estimates $\hat{p}_j$. Based on these estimates, apply the bootstrap method to estimate the null distribution of the GLR test statistic, using 10000 bootstrap samples. Conduct the bootstrap test and report a p-value.

Hint: The GLR test statistic depends only on the counts of the 10 most commonly used words {1, 2, . . . , 10} and the length of the text. Thus, it is not necessary to really generate a full bootstrap sample of the same length as the sample text, which is unnecessarily time-consuming. Try to exploit the relationship between the 10 word counts and a multinomial distribution to simplify your computing process.
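One way to act on this hint is to collapse the vocabulary into 11 cells, the ten common words plus a single "all other words" cell, and to draw the cell counts of each bootstrap sample from a multinomial distribution with the constrained estimates as cell probabilities. The sketch below uses only the standard library; the function names, the toy inputs, and the p-value convention (proportion of bootstrap statistics at least as large as the observed one) are assumptions.

```python
import math
import random

def glr_q3(counts_top, a):
    """GLR statistic for H0: p_1/a_1 = ... = p_K/a_K (0 * ln 0 taken as 0)."""
    S, A = sum(counts_top), sum(a)
    return 2.0 * sum(N * math.log((N / S) / (aj / A))
                     for N, aj in zip(counts_top, a) if N > 0)

def bootstrap_null(p_top, n, a, B, seed=0):
    """Simulate B null values of the statistic.

    Each replicate draws n words over K + 1 cells with probabilities
    (p_top, 1 - sum(p_top)); the last cell stands for every other word.
    """
    rng = random.Random(seed)
    K = len(p_top)
    weights = list(p_top) + [1.0 - sum(p_top)]
    stats = []
    for _ in range(B):
        counts = [0] * (K + 1)
        for cell in rng.choices(range(K + 1), weights=weights, k=n):
            counts[cell] += 1
        stats.append(glr_q3(counts[:K], a))
    return stats

def bootstrap_p_value(stats, t_obs):
    """Proportion of bootstrap statistics at least as large as t_obs."""
    return sum(t >= t_obs for t in stats) / len(stats)

# Toy illustration with made-up numbers (K = 2, n = 100, B = 500).
a = [0.2, 0.1]
counts_obs, n = [30, 10], 100
c = sum(counts_obs) / (n * sum(a))   # constrained MLE from Q1
p_top = [c * aj for aj in a]
t_obs = glr_q3(counts_obs, a)
stats = bootstrap_null(p_top, n, a, B=500, seed=1)
p_value = bootstrap_p_value(stats, t_obs)
```

With the real text and 10000 replicates this pure-Python loop may take a while. A vectorised multinomial sampler, for example `numpy.random.Generator.multinomial`, would be a faster alternative if NumPy is available.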

(d) Plot the cumulative distribution functions (cdf) of the bootstrap distribution obtained in

Q4. (5%)

As in Q3, let a1 , . . . , a10 denote the benchmark usage rates of the 10 most commonly used words. We wish to test

$$H_0 : p_1 \ge a_1, \ldots, p_{10} \ge a_{10} \quad \text{vs} \quad H_1 : \text{no restriction}.$$

(a) Give a layman's interpretation of the null hypothesis H0.

(b) Using the results obtained in Q2, calculate the observed value of the GLR test statistic.

Hint: The function

$$f(Z) \triangleq \sum_{j=1}^{10} \max\left\{ \frac{Z N_j}{\sum_{i=11}^{J} N_i},\; a_j \right\} + Z - 1$$

is a piecewise linear increasing function in $Z$, with $f(0) < 0$ and $f(Z) \to \infty$ as $Z \to \infty$. Thus, there exists a unique solution to the equation $f(Z) = 0$, which can be found numerically by any convenient equation solver.
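Because $f$ is increasing with a sign change on $[0, 1]$, plain bisection is enough to locate the root. The sketch below assumes the Q2 form of the constrained estimates, writes $M$ for the number of words outside the top ten, and takes the GLR statistic to be twice the difference in maximised log-likelihoods. Function names and the toy inputs are hypothetical.

```python
import math

def solve_z(counts_top, a, n, tol=1e-12):
    """Bisection for f(Z) = sum_j max(Z * N_j / M, a_j) + Z - 1 = 0,
    where M = n - sum_j N_j is assumed to be positive."""
    M = n - sum(counts_top)

    def f(z):
        return sum(max(z * N / M, aj) for N, aj in zip(counts_top, a)) + z - 1.0

    lo, hi = 0.0, 1.0  # f(0) = sum(a) - 1 < 0 and f(1) > 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def constrained_fit_q4(counts_top, a, n):
    """Return (Z, p_top, stat) under H0: p_j >= a_j for the K listed words."""
    M = n - sum(counts_top)
    Z = solve_z(counts_top, a, n)
    p_top = [max(Z * N / M, aj) for N, aj in zip(counts_top, a)]
    # Log-likelihood ratio: the K listed words, plus the pooled remaining
    # words whose fitted probabilities are Z * N_j / M versus N_j / n.
    top_part = sum(N * math.log((N / n) / p)
                   for N, p in zip(counts_top, p_top) if N > 0)
    rest_part = M * math.log((M / n) / Z)
    return Z, p_top, 2.0 * (top_part + rest_part)

# Toy illustration with made-up numbers (K = 2, n = 100).
Z, p_top, stat = constrained_fit_q4([30, 10], [0.2, 0.15], 100)
# When the unrestricted estimates already satisfy H0, the statistic is zero.
Z2, p_top2, stat2 = constrained_fit_q4([30, 10], [0.2, 0.05], 100)
```

In the first toy call the second word's observed rate (0.10) falls below its benchmark (0.15), so its estimate is held at the bound and the remaining probability is redistributed in proportion to the counts.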

(c) We do not expect the chi-square distribution to be a valid approximation to the null distribution of the GLR test statistic for this problem. Why?

(d) As in Q3(c), calculate the constrained maximum likelihood estimates $\hat{p}_j$ under $H_0$. Based on these estimates, apply the bootstrap method to estimate the null distribution of the GLR test statistic, using 10000 bootstrap samples. Plot the cdf of the bootstrap distribution. Conduct the bootstrap test and report a p-value.

Does the sample text show evidence against H0 ?

* Points to note *

• In the main text of your report, show and explain your steps, and display formulae in their conventional mathematical form. Do not explain anything using computer code.

• Attach your computer code to your report as an appendix. Include brief comments on lines which involve complicated operations.