
Stat 135

Midterm Review of General Results*

March 2, 2023

*Partially attributed to Prof. Ani Adhikari

Abstract

We have studied two main aspects of inference: estimation and the testing of hypotheses.

1    Estimation

1.1    Estimating the mean of a finite population.

The population consists of N numbers with mean μ and standard deviation σ. In case the population is divided into two categories, the numbers are 0's and 1's, the mean is the proportion p of 1's, and the SD is √(pq). 7.1, 7.2.

Random sampling (with or without replacement) results in observations X1, X2, ..., Xn. If the sampling is done with replacement, the Xi's are i.i.d. If it is done without replacement, the sample is called a simple random sample. Its elements all have the same distribution and indeed are exchangeable, but they are not independent. 7.3

The sample average X̄ is an unbiased estimate of μ, whether the sampling is done with or without replacement. The standard error of X̄ is σ/√n if the sampling is done with replacement. This standard error must be multiplied by the finite population correction factor if the sampling is done without replacement. 7.3.1, 7.3.2.

If the sampling is done with replacement, the CLT implies that for large n the sampling distribution of X̄ will be approximately normal, no matter what the distribution of the population. This can be used to construct confidence intervals for μ. The same holds in the case of simple random sampling, provided n is large but small compared to N. 7.3.3.

If σ is unknown and n is large, the estimates σ̂ or S can be used in place of σ in the calculation of the confidence interval. This is an example of bootstrapping. The estimate S² is unbiased for σ². 7.3.2. When the sample size is large, σ̂ and S are almost equal.
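As a quick sketch of this large-n interval (the simulated data, the 95% level, and the use of Python/scipy are illustrative choices, not part of the sheet):

```python
import numpy as np
from scipy.stats import norm

# Approximate 95% confidence interval for mu when n is large:
# xbar +/- z * S/sqrt(n), with the sample SD S standing in for the unknown sigma.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=400)   # illustrative non-normal population

xbar = x.mean()
s = x.std(ddof=1)                          # S (so that S^2 is unbiased for sigma^2)
z = norm.ppf(0.975)                        # normal curve cutoff from the CLT
half_width = z * s / np.sqrt(len(x))
print(xbar - half_width, xbar + half_width)
```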

If σ is unknown and you are sampling without replacement with a sample size that is large relative to the population, then you have to be careful about corrections. The table at the end of 7.3.2 (page 214) has a summary, but you don't have to memorize them for the midterm.

We discussed what the "confidence" in confidence intervals means. 7.3.2, 7.3.3.

1.2    Estimating parameters of an underlying distribution.

Now the model is that we have n i.i.d.  observations from a distribution which has parameters. There are two main techniques for estimating these parameters. 8.3

1.2.1    The Method of Moments (8.4)

The population moments are computed from the underlying distribution, and are functions of the parameters. The sample moments are averages of powers of the sample. The sample moments are unbiased estimates of the corresponding population moments, and converge in probability to the corresponding population moments.

To compute the method of moments estimates,

(i) Calculate the population moments in terms of the parameters.

(ii) Re-write the results of (i) so that the parameters are expressed in terms of the population moments.

(iii) The method of moments estimate of a parameter is its expression in (ii) with the population moments replaced by the corresponding sample moments.
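As a worked illustration of steps (i)-(iii) (the gamma model and the simulated data are my own example, not from the sheet): for a gamma distribution with shape α and rate λ, step (i) gives mean α/λ and variance α/λ²; step (ii) inverts these to α = mean²/variance and λ = mean/variance; step (iii) plugs in the sample moments.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1/3.0, size=1000)   # illustrative data: alpha = 2, lambda = 3

m1 = x.mean()                       # first sample moment
var_hat = np.mean((x - m1) ** 2)    # second central sample moment

alpha_mom = m1**2 / var_hat         # method of moments estimate of alpha
lambda_mom = m1 / var_hat           # method of moments estimate of lambda
print(alpha_mom, lambda_mom)
```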

If the distribution of the method of moments estimate is known or can be approximated, it may be possible to construct confidence intervals for the parameter.

Tools. Population moments can be computed directly from the population distribution, or by using moment generating functions, 4.5. [You are not expected to work with m.g.f.s on the midterm.] If the expectation and variance of the method of moments estimates are not easy to compute directly, they can be approximated. The δ-method is one way to do this. 4.6.
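For reference, the first-order statements behind the δ-method (standard results consistent with 4.6, written out here for convenience): if X has mean μ and variance σ², and Y = g(X) for a smooth function g, then

E(Y) ≈ g(μ) + (σ²/2) g″(μ)   and   Var(Y) ≈ σ² [g′(μ)]².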

1.2.2    Maximum Likelihood Estimator (MLE) (8.5,8.7)

The likelihood of the data is the joint density, or the joint probability function in the discrete case, of the data. The data may be i.i.d. from a distribution, or they may be dependent observations.

To compute the maximum likelihood estimate of a parameter, treat the data as fixed and maximize the likelihood as a function of the parameter. The maximizing value of the parameter is the estimate. How you maximize the likelihood depends on the complexity of the likelihood function. Don't compute the log right away - first look at the likelihood itself and see if it's easy to maximize directly (e.g. in a couple of your homework problems the likelihood was monotone). If not, and if it's a product, take the log and now stare at the log likelihood. Perhaps that's easy to maximize directly. If not, take the derivative, set equal to 0, etc.
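A minimal sketch of this recipe (the Exponential(λ) model and the simulated data are illustrative choices, not from the sheet): the log likelihood is ℓ(λ) = n log λ − λ Σxᵢ, setting its derivative to zero gives λ̂ = 1/x̄, and the code checks this against a numerical maximization.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=1/2.5, size=500)       # illustrative data, true lambda = 2.5

def neg_log_lik(lam):
    # negative log likelihood of i.i.d. Exponential(lambda) data
    return -(len(x) * np.log(lam) - lam * x.sum())

lam_closed = 1.0 / x.mean()                      # closed form from setting the derivative to zero

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100), method="bounded")
print(lam_closed, res.x)                         # the two should agree closely
```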

If you have n i.i.d. observations from an underlying density (or probability function) that is smooth and well-behaved, the maximum likelihood estimate of the parameter θ has nice properties when the sample size is large. It is consistent, it is asymptotically unbiased, its approximate variance is 1/nI(θ) where I(θ) is the Fisher information, and it is asymptotically normal. All this can be used to construct approximate confidence intervals for θ. 8.5.2, 8.5.3.
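Continuing the same illustrative exponential example: I(λ) = 1/λ², so the approximate standard error of the MLE is 1/√(nI(λ̂)) = λ̂/√n, which gives an approximate 95% confidence interval.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.exponential(scale=1/2.5, size=500)   # illustrative data, true lambda = 2.5
n = len(x)

lam_hat = 1.0 / x.mean()                     # MLE of lambda
se = lam_hat / np.sqrt(n)                    # 1/sqrt(n * I(lam_hat)) with I(lam) = 1/lam^2
z = norm.ppf(0.975)
print(lam_hat - z * se, lam_hat + z * se)    # approximate 95% CI for lambda
```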

The Cramér-Rao bound says that the variance of every unbiased estimate of θ based on X1, X2, ..., Xn is at least as large as 1/nI(θ). Just read the statement of Theorem A in 8.7. So the MLE is asymptotically efficient. And for a fixed n, if the MLE is unbiased and has a variance equal to 1/nI(θ), then no other unbiased estimator can beat it (however, there may be a biased estimator with smaller variance).

1.2.3    Sufficiency and a Minimal Sufficient Statistic (8.8)

The likelihood f(x1, ..., xn | θ) contains all of the information needed to find the MLE of θ. When we find the MLE we find the value of θ that maximizes the likelihood function. A sufficient statistic is a summary of our data that still contains all of the information about θ that the likelihood function contains. To make this clear, you can think of it a couple of different ways. The likelihood function partitions the outcome space into equivalence classes in exactly the same way as a minimal sufficient statistic does. Any statistic that results in a finer partitioning of the outcome space is still a sufficient statistic. On one end of the spectrum you have the minimal sufficient statistic, which gives the coarsest partitioning of the outcome space. On the other extreme you have the full data set, also a sufficient statistic, which is the finest partitioning of the outcome space. Alternatively, by the factorization theorem on page 306 in Rice, a statistic T is sufficient if the likelihood can be factorized as

f(x1, ..., xn | θ) = g(T(x1, ..., xn), θ) h(x1, ..., xn).

It follows that to find the θ that maximizes the likelihood function you only need to find the θ that maximizes the function g(T(x1, ..., xn), θ), whose only dependence on x1, ..., xn is through T.
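As a concrete illustration (the Poisson case is my own example, not worked in this sheet): for i.i.d. Poisson(θ) observations,

f(x1, ..., xn | θ) = ∏ e^(−θ) θ^(xi) / xi! = [e^(−nθ) θ^(Σxi)] · [1 / ∏ xi!],

so taking g(T, θ) = e^(−nθ) θ^T with T = Σxi, and h(x1, ..., xn) = 1/∏ xi!, the factorization theorem shows that the sample sum is sufficient for θ.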

A sufficient statistic can be used via the Rao-Blackwell theorem to make a better estimator (i.e. one with a lower mean squared error). Distributions whose parameter space has the same dimension as the dimension of the minimal sufficient statistic are said to belong to the exponential family. For example, the normal distribution has a two-dimensional parameter space (μ, σ) and a two-dimensional minimal sufficient statistic (X̄, S²). Almost all of the named distributions belong to the exponential family.

1.2.4    MLE and MSS (8.8)

Both the MLE and the MSS are functions of every sufficient statistic:

By the factorization theorem the MLE is the θ that maximizes the function g(T(x1, ..., xn), θ). Taking the derivative of g(T(x1, ..., xn), θ) with respect to θ, setting it equal to zero, and solving for θ shows that the MLE is a function of the sufficient statistic T(x1, ..., xn). It follows that the MLE is a function of any sufficient statistic (including the MSS). The MSS by definition is a function of every sufficient statistic (recall it is the coarsest possible sufficient partition, and a function makes a partition coarser).

So the MSS and the MLE both have coarser partitions than every other sufficient partition; however, the MLE need not be sufficient (its partition may not be equal to or finer than the likelihood partition). If, however, the MLE is a 1-1 function of a sufficient statistic, then the MLE is a MSS.

2    Testing

The set-up: X1 ,X2 , . . . ,Xn i.i.d. from some distribution which has an unknown parameter (maybe more than one). There is a claim about the value of an unknown parameter. How to evaluate the claim?

Terminology. Null and alternative hypotheses, which may be simple or composite. Composite hypotheses are often either one sided or two sided. A test has errors of Type I and Type II. It also has a significance level, a rejection region whose boundary is often specified by critical values, and an observed significance level which is also called the P-value. It has power against each fixed value of the alternative. 9.1, 9.2.

General facts. These apply to likelihoods based on well-behaved density or probability functions.

1. If both the hypotheses are simple, and if there is a test of level α that rejects the null when the likelihood ratio is small, then the Neyman-Pearson Lemma says that the likelihood ratio test is at least as powerful as any other test of the same or lower level. 9.2.

2. The generalized likelihood ratio is used to test composite hypotheses. As before, the null hypothesis is rejected when the ratio is small. This can be used to set up rejection regions to achieve a specified level. 9.4.

3. There is an obvious duality between confidence intervals and tests. Thus you may be able to examine a confidence interval to decide whether or not a particular null hypothesis will be rejected. 9.3.

4. The power of a test against a particular value of the alternative is the chance of rejecting the null hypothesis when that particular value of the alternative happens to be the truth. To calculate power you have to first figure out the rejection region (this will be based on the significance level and the null distribution of the natural statistic) and then calculate the probability of the rejection region under the particular value of the alternative. You have seen on homework the rough shape of the power function of certain standard tests: if it is a two sided test the power curve looks like a well, and if it is one sided it looks like one side of the well. (A small sketch of a power calculation appears after this list.)

5. A test is uniformly most powerful against a composite hypothesis if for each fixed value of the alternative it is the most powerful test of its level.
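Here is the sketch promised in item 4 (the one-sided z-test with known σ and made-up numbers is my own illustrative choice):

```python
import numpy as np
from scipy.stats import norm

# Power of a one-sided z-test of H0: mu = mu0 vs H1: mu > mu0, sigma known.
mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.05
z_alpha = norm.ppf(1 - alpha)
cutoff = mu0 + z_alpha * sigma / np.sqrt(n)      # rejection region: xbar > cutoff

def power(mu1):
    # chance of landing in the rejection region when mu1 is the truth
    return 1 - norm.cdf((cutoff - mu1) / (sigma / np.sqrt(n)))

print(power(0.5))                                # power against the alternative mu = 0.5
```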

Applications. All the tests described below are likelihood ratio tests. We have covered z and t tests for the mean of a population, and for the difference between the means of two populations.

Note. Remember the duality between confidence intervals and tests. In the situations below, if you can test hypotheses about a parameter then you should be able to construct confidence intervals for the parameter.

Tests for the population mean μ. Your sample is i.i.d. from some distribution with mean μ and variance σ². An important special case is that of the dichotomous variable, where μ = p and σ² = pq. The natural statistic is the sample mean X̄, which has expectation μ and standard error σ/√n.

Large n. By the CLT, the distribution of X̄ is roughly normal no matter what the underlying distribution of the population. Also, if σ is unknown it can be estimated by either σ̂ or S because the two will be almost equal. The normal curve can be used to set up rejection regions and compute approximate p-values. This is called a "one-sample z-test". The power of this test against a particular μ in the alternative can be computed by using the normal curve.

This applies also when the sample is a large SRS from a finite population, provided the sample size is small relative to the population size. In such a case the sample is essentially i.i.d.
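A rough sketch of the one-sample z-test (the simulated data, the two-sided alternative, and the null value μ0 = 0 are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=0.3, scale=1.0, size=200)     # illustrative large sample

mu0 = 0.0
z = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))   # S in place of sigma
p_value = 2 * (1 - norm.cdf(abs(z)))             # two-sided p-value from the normal curve
print(z, p_value)
```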

Small n. Now assumptions about the underlying distribution become important.

(i) If the population distribution is normal and σ is known, then the distribution of X̄ is normal with mean μ and standard error σ/√n, so the z-test works but as an exact test this time, not an approximation.

(ii) If the population distribution is normal with an unknown σ, then (X̄ - μ)/(S/√n) has the t distribution with n - 1 degrees of freedom. This can be used in place of the normal curve and is called the t-test. There is no approximation here either. (A small sketch appears after this list.)

(iii) In the dichotomous case the exact distribution of the number of successes is binomial (or hypergeometric, for simple random sampling).  This can be used to get exact p-values etc. Again, no approximation.

(iv) If the population distribution is neither normal nor dichotomous you may be able to look at the likelihood ratio directly and come up with a rejection region. You may have to use facts about the distribution in question, such as "sums of independent Poissons are Poisson."

(v) If all else fails, use the bootstrap. Resample, and use the observed distribution of sample means instead of the normal, t, etc. All your p-values etc. will most definitely be approximations.
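For case (ii) above, a minimal sketch using scipy's one-sample t-test (the small simulated normal sample and the null value are illustrative):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(4)
x = rng.normal(loc=10.5, scale=2.0, size=12)     # illustrative small normal sample

res = ttest_1samp(x, popmean=10.0)               # statistic is (xbar - mu0)/(S/sqrt(n)), n-1 df
print(res.statistic, res.pvalue)
```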

END OF MIDTERM REVIEW SHEET.

****************************************************************************

Two-sample tests from Chapter 11, discussed below, are important likelihood ratio tests you will need to know for the final and are included here for completeness.

Tests for the difference between two population means. If you have independent samples from the two populations, then just use the fact that X̄ - Ȳ has expectation μX - μY and standard error √(σX²/nX + σY²/nY).

The discussion is much as before, with the CLT etc. kicking in to make the large sample case very easy. That is called the "two-sample z-test". In some situations, however, there is the new element of pooling.

(i) Large n, dichotomous case. If you are testing for the equality of two population proportions, then you should use the natural pooled estimate of this common proportion when you estimate standard errors under the null hypothesis.

(ii) If you have two small independent samples from normal populations with the same unknown variance, then use the pooled estimate of the common variance, and use the t-test with nX + nY - 2 (in class we said n + m - 2) degrees of freedom.
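A minimal sketch of the pooled two-sample t-test in (ii) (the simulated samples, which share a common variance by construction, are an illustrative assumption):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=12)      # sample from population X
y = rng.normal(loc=6.0, scale=2.0, size=15)      # sample from population Y, same sigma

# equal_var=True gives the pooled-variance t-test with nX + nY - 2 degrees of freedom
res = ttest_ind(x, y, equal_var=True)
print(res.statistic, res.pvalue)
```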

The "paired" case. If you have one set of individuals and you want to compare their mean responses to two different treatments (e.g. you want to compare the mean pre-treatment blood pressure and the mean post-treatment blood pressure of patients in a population) then your observations will be pairs (Xi, Yi) where both elements of the pair are measurements on the same individual. So if you think of your data as two samples X1, X2, ..., Xn and Y1, Y2, ..., Yn, then the two samples will be dependent and so will X̄ and Ȳ. Therefore you can no longer use the formula σX²/nX + σY²/nY for the variance of X̄ - Ȳ. You have to account for the dependence. There are two main ways of doing this.

The first way is to compute the individual differences Xi - Yi (thus combining the two samples into one), and then do a one-sample test for the mean of the differences. The SE calculation in this method is equivalent to including the covariance term in the calculation of Var(X̄ - Ȳ).
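A minimal sketch of this first approach (the simulated pre/post measurements are an illustrative assumption): form the individual differences and run a one-sample t-test on them.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(5)
pre = rng.normal(loc=140.0, scale=10.0, size=20)
post = pre - rng.normal(loc=5.0, scale=4.0, size=20)   # dependent on pre by construction

d = pre - post                                   # individual differences X_i - Y_i
res = ttest_1samp(d, popmean=0.0)                # H0: the mean difference is 0
print(res.statistic, res.pvalue)
```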

The second way of dealing with this is the sign test. This just computes the proportion of positive differences Xi - Yi, and compares that to 1/2 using the binomial/normal distributions.
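And a minimal sketch of the sign test (same kind of illustrative paired data; the exact binomial p-value uses scipy.stats.binomtest, available in scipy 1.7+):

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(6)
pre = rng.normal(loc=140.0, scale=10.0, size=20)
post = pre - rng.normal(loc=5.0, scale=4.0, size=20)

d = pre - post
n_pos = int(np.sum(d > 0))                       # number of positive differences
n = int(np.sum(d != 0))                          # ties (exact zeros) dropped
result = binomtest(n_pos, n, p=0.5)              # compare the proportion of + signs to 1/2
print(n_pos, n, result.pvalue)
```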