ST5213 Categorical Data Analysis II
Department of Statistics and Data Science
Revision
1. Duchenne Muscular Dystrophy (DMD) is a genetically transmitted disease passed from a mother to her children. Boys with the disease usually die at a young age, but affected girls usually do not suffer symptoms, may unknowingly carry the disease, and may pass it to their offspring. It is believed that about 1 in 3,300 women are DMD carriers. A woman might suspect she is a carrier when a related male child develops the disease. Doctors must rely on some kind of test to detect the presence of the disease. This data frame contains data on two enzymes in the blood, creatine kinase (CK) and hemopexin (H), for 38 known DMD carriers and 82 women who are not carriers. It is desired to use these data to obtain an equation for indicating whether a woman is a likely carrier. Specifically, the following data were collected:
Variable | Description
Group | Indicator whether the woman has DMD (Case) or not (Control)
CK | Creatine kinase reading
H | Hemopexin reading
R was used to explore these data and to fit various models to them. Using the R output given on pages 5–9 in the appendix, answer the following questions:
(a) Can you think of a reason why the analysis looks at CK and log(CK) as possible explanatory variables? Comment briefly.
(b) Do you think it would be necessary to also consider log(H) as a possible regressor? Comment briefly.
(c) Which of these models is your preferred model? Justify your answer.
(d) Except for the estimate of the intercept, interpret each estimate in your preferred model in a sentence or two.
2. Schoener (1968) collected information on the distribution of two Anolis lizard species (A. opalinus and A. grahamii) to see if their ecological niches were different in terms of where and when they perched to prey on insects. Perches were classified by twig diameter, their height in the bush, whether the perch was in sun or shade when the lizard was counted, and the time of day at which the lizards were foraging. The observed data are given in the following contingency table:
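Counts by lizard species:
Height | Diameter | Sun | Time | A. grahamii | A. opalinus
Low | Thin | Sun | Early | 20 | 2
Low | Thin | Sun | MidDay | 8 | 1
Low | Thin | Sun | Late | 4 | 4
Low | Thin | Shade | Early | 34 | 11
Low | Thin | Shade | MidDay | 69 | 20
Low | Thin | Shade | Late | 18 | 10
Low | Thick | Sun | Early | 8 | 3
Low | Thick | Sun | MidDay | 4 | 1
Low | Thick | Sun | Late | 5 | 3
Low | Thick | Shade | Early | 17 | 15
Low | Thick | Shade | MidDay | 60 | 32
Low | Thick | Shade | Late | 8 | 8
High | Thin | Sun | Early | 13 | 0
High | Thin | Sun | MidDay | 8 | 0
High | Thin | Sun | Late | 12 | 0
High | Thin | Shade | Early | 31 | 5
High | Thin | Shade | MidDay | 55 | 4
High | Thin | Shade | Late | 13 | 3
High | Thick | Sun | Early | 6 | 0
High | Thick | Sun | MidDay | 0 | 0
High | Thick | Sun | Late | 1 | 1
High | Thick | Shade | Early | 12 | 1
High | Thick | Shade | MidDay | 21 | 5
High | Thick | Shade | Late | 4 | 4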
Assume we want to analyze this contingency table using loglinear models. Further assume we have a dataframe, say lizards, with variables H (height of perch), D (diameter of perch), S (whether perch is in the sun), T (time of day), Sp (observed species) and Count (number of observations in each category).
(a) State which variables are explanatory variables and which are response variables for the purpose of this analysis.
(b) State the minimal model. Is the minimal model a graphical model?
(c) This kind of analysis starts with the saturated model
glm(Count ~ Sp * H * D * S * T, data=lizards, family=poisson)
and then successively removes certain interaction terms. List all interaction terms that we (potentially) have to test.
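As a starting point, the data frame described above could be set up roughly as follows. This is only a sketch: the factor level labels ("Low", "Thin", "grahamii", and so on) and the object name fit_sat are assumptions chosen for illustration; only the variable names H, D, S, T, Sp and Count and the glm call come from the question. The counts are entered in the reading order of the contingency table.

```r
# Build the 2 x 2 x 2 x 3 x 2 layout; expand.grid varies the first
# factor (T) fastest and the last factor (H) slowest.
# Level labels are assumptions, not prescribed by the question.
lizards <- expand.grid(
  T  = c("Early", "MidDay", "Late"),
  Sp = c("grahamii", "opalinus"),
  S  = c("Sun", "Shade"),
  D  = c("Thin", "Thick"),
  H  = c("Low", "High")
)

# Counts in table order: within each Height/Diameter/Sun block,
# three A. grahamii counts followed by three A. opalinus counts.
lizards$Count <- c(
  20,  8,  4,   2,  1,  4,   34, 69, 18,   11, 20, 10,  # Low,  Thin
   8,  4,  5,   3,  1,  3,   17, 60,  8,   15, 32,  8,  # Low,  Thick
  13,  8, 12,   0,  0,  0,   31, 55, 13,    5,  4,  3,  # High, Thin
   6,  0,  1,   0,  0,  1,   12, 21,  4,    1,  5,  4   # High, Thick
)

# Saturated loglinear model from part (c). Because some cells are zero,
# R may warn about fitted rates that are numerically 0.
fit_sat <- glm(Count ~ Sp * H * D * S * T, data = lizards, family = poisson)

# Quick check of the data entry: marginal totals by species.
xtabs(Count ~ Sp, data = lizards)
```

Reduced models can then be compared with the saturated model, for example via update() and anova(..., test = "Chisq").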
3. This question concerns the effect of sex and race on the political party identification of U.S. voters. The data are given in the following table.
Party identification counts:
Sex | Race | Democrat | Republican | Independent
male | white | 132 | 176 | 127
male | black | 42 | 6 | 12
female | white | 172 | 129 | 130
female | black | 56 | 4 | 15
On pages 10–11 of the appendix, you will find the output of five models, named fm0, fm1, fm2, fm3 and fm4, that were fitted to these data using R. Use this output to answer the following questions.
(a) Draw a model lattice that shows how these five models are nested within each other.
(b) Which of these models is your preferred model? Justify your answer using likelihood-ratio tests.
(c) For your preferred model, write down the fitted equation for the log odds of a person preferring ‘Democrat’ instead of ‘Independent’. Take care to define all the symbols that you use.
(d) For your preferred model, find the fitted equation for the log odds of a person preferring ‘Democrat’ instead of ‘Republican’. Take care to define all the symbols that you use. Except for the estimate of the intercept, interpret each estimate in this fitted equation in a sentence or two.
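For orientation, the table can be entered in R and the observed (sample) log odds computed for each sex-by-race group. This is a sketch only: the data frame name party and its column names are assumptions, and the quantities it computes are raw sample log odds, not the fitted values from fm0 to fm4. It also illustrates the identity that links parts (c) and (d): the log odds of ‘Democrat’ versus ‘Republican’ equal the log odds of ‘Democrat’ versus ‘Independent’ minus the log odds of ‘Republican’ versus ‘Independent’.

```r
# Party identification counts by sex and race (wide format).
# The object and column names are assumptions made for this sketch.
party <- data.frame(
  Sex         = c("male", "male", "female", "female"),
  Race        = c("white", "black", "white", "black"),
  Democrat    = c(132, 42, 172, 56),
  Republican  = c(176, 6, 129, 4),
  Independent = c(127, 12, 130, 15)
)

# Observed log odds against the 'Independent' category.
party$logodds_D_vs_I <- log(party$Democrat / party$Independent)
party$logodds_R_vs_I <- log(party$Republican / party$Independent)

# Log odds of 'Democrat' versus 'Republican', computed directly ...
party$logodds_D_vs_R <- log(party$Democrat / party$Republican)

# ... and as a difference of the two logits against 'Independent'.
party$logodds_D_vs_R_diff <- party$logodds_D_vs_I - party$logodds_R_vs_I

print(party)
```

The same subtraction applies to fitted equations: any two equations that share a reference category can be differenced term by term.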
4. Consider a random variable X with a binomial distribution with parameters n and π, i.e. X ~ Bin(n, π). Let x denote an observed value of X. The maximum likelihood estimator of π is $\hat{\pi} = X/n$. In this context, other parameters that are often of interest are the odds $\theta = \pi/(1-\pi)$ and the log-odds $\psi = \log\theta = \log\{\pi/(1-\pi)\}$.
(a) Write down expressions for E(X) and Var(X).
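(b) Verify, using the delta method, that the approximate mean and variance of $\hat{\psi} = \log\{\hat{\pi}/(1-\hat{\pi})\}$ are $E(\hat{\psi}) \approx \psi$ and $\mathrm{Var}(\hat{\psi}) \approx \frac{1}{n\pi(1-\pi)}$, respectively.
(c) Using the result from the previous part, conclude that the (estimated) standard error of $\hat{\psi}$ is given by $\sqrt{\frac{1}{x} + \frac{1}{n-x}}$.
(d) Use the delta method to find expressions in terms of x for the approximate mean and variance of $\hat{\theta} = \hat{\pi}/(1-\hat{\pi})$.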
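The delta method invoked in parts (b) and (d) is the standard first-order Taylor approximation. Stated generally, for a differentiable function g and a random variable Y with mean μ and variance σ² (g, Y, μ and σ² are notation introduced here for illustration, not taken from the question):

```latex
g(Y) \approx g(\mu) + g'(\mu)\,(Y - \mu)
\quad\Longrightarrow\quad
\mathrm{E}\{g(Y)\} \approx g(\mu),
\qquad
\mathrm{Var}\{g(Y)\} \approx \{g'(\mu)\}^{2}\,\sigma^{2}.
```

In this question one would take Y = X/n, so that μ = π and σ² follows from part (a), with g chosen as the log-odds or the odds transformation.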
(a) Why are there four intercepts? Explain how they determine the estimated response distribution for males in urban areas wearing seat belts.
(b) Construct a confidence interval for the effect of gender, given seat-belt use and location. Interpret.
(c) Find the estimated cumulative odds ratio between the response and seat-belt use for those in rural locations and for those in urban locations, given gender. Based on this, explain how the effect of seat-belt use varies by region, and explain how to interpret the interaction estimate, −0.1244.
2022-11-23