STAC51: Assignment 2
Deadline to hand in: Feb. 26, 2023 (Sunday), 10:00 pm
Total: 100 points
Please submit three files: the R Markdown file, the knitted Word or PDF file generated from it, and a scanned hand-written solution (if you have one).
Note: Whenever you use R to generate random numbers, set the seed to your student number. This can be done by adding the command set.seed(your student number) before generating the random numbers.
Q. 1 (25 pts) In this question we will do a simulation study to investigate some basic properties of the confidence intervals for odds ratios for contingency tables based on multinomial sampling.
(a) (10 pts) Use R to generate ten (n = 10) 2 × 2 contingency tables with total count (i.e., grand total) N = 100 and known cell probabilities (π11, π12, π21, π22) = (0.2, 0.3, 0.3, 0.2) from a multinomial distribution:
(n11, n12, n21, n22) ~ multinomial(N, π11, π12, π21, π22)
Please don’t forget to use the command set.seed(your student number) right before the command generating random data.
i. Print out your results (the 10 tables you generated).
ii. What is the true odds ratio θ (i.e. population odds ratio) for these tables?
iii. For each of these generated tables, calculate the odds ratio and a 95% large-sample confidence interval for the true odds ratio. Print all your table cell counts (i.e. for the 10 tables), estimated odds ratios (i.e. θ̂) and the confidence intervals (lower and upper limits).
iv. How many of the 10 intervals contain the true odds ratio, θ?
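A minimal sketch of how part (a) might be set up in R, assuming the cells are stored in the order (n11, n12, n21, n22) and using the usual large-sample (Wald) interval on the log odds ratio scale. The seed 123 and all object names are placeholders, not part of the assignment.

```r
# Hypothetical seed; the assignment asks for your own student number here.
set.seed(123)

probs <- c(0.2, 0.3, 0.3, 0.2)   # (pi11, pi12, pi21, pi22) from part (a)
N <- 100                          # grand total of each table
n_tables <- 10                    # number of tables to simulate

# rmultinom() returns a 4 x n_tables matrix; each column holds one table's
# cell counts, here taken in the order (n11, n12, n21, n22).
counts <- rmultinom(n_tables, size = N, prob = probs)
rownames(counts) <- c("n11", "n12", "n21", "n22")

# Sample odds ratio and Wald interval computed on the log scale.
theta_hat <- (counts["n11", ] * counts["n22", ]) /
  (counts["n12", ] * counts["n21", ])
se_log <- sqrt(colSums(1 / counts))
z <- qnorm(0.975)
lower <- exp(log(theta_hat) - z * se_log)
upper <- exp(log(theta_hat) + z * se_log)

# True odds ratio implied by the cell probabilities.
theta_true <- (probs[1] * probs[4]) / (probs[2] * probs[3])
covered <- lower <= theta_true & theta_true <= upper

# One row per simulated table: counts, estimate, interval, coverage flag.
results <- data.frame(t(counts), theta_hat, lower, upper, covered)
print(results)
```

Keeping the tables as columns of one matrix means the same code scales to many tables without a loop.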
(b) (10 pts) Repeat part (a) but this time with n = 1000000. Do not print the tables etc this time, but instead
i. Calculate the proportion of the intervals containing θ .
ii. Comment on your value.
Note: Any table with a zero cell count has a sample odds ratio equal to 0 or ∞ (or undefined). Replace any zero cell counts by 0.5 (this is often done when dealing with zero cell counts).
(c) (5 pts) Repeat part (b), but this time with N = 15. (i.e. still a million tables but each table with grand total 15), and comment about your result.
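For parts (b) and (c), one possible way to organise the simulation is a small helper that applies the zero-cell replacement from the note and returns the proportion of intervals covering θ. The helper name coverage(), the seed, and the reduced number of tables (10,000 rather than one million) are assumptions made to keep the sketch quick to run.

```r
# coverage() is a hypothetical helper, not part of the assignment.
coverage <- function(n_tables, N, probs, level = 0.95) {
  # Each column is one table in the order (n11, n12, n21, n22).
  counts <- rmultinom(n_tables, size = N, prob = probs)
  counts[counts == 0] <- 0.5          # zero-cell adjustment from the note

  # Log odds ratio and its large-sample standard error for every table.
  log_or <- log(counts[1, ]) + log(counts[4, ]) -
    log(counts[2, ]) - log(counts[3, ])
  se_log <- sqrt(colSums(1 / counts))

  z <- qnorm(1 - (1 - level) / 2)
  log_true <- log(probs[1] * probs[4] / (probs[2] * probs[3]))

  # Proportion of Wald intervals that contain the true log odds ratio.
  mean(abs(log_or - log_true) <= z * se_log)
}

set.seed(123)  # hypothetical seed; use your student number
probs <- c(0.2, 0.3, 0.3, 0.2)
cov_large <- coverage(n_tables = 1e4, N = 100, probs = probs)
cov_small <- coverage(n_tables = 1e4, N = 15,  probs = probs)
c(N100 = cov_large, N15 = cov_small)
```

Checking coverage on the log scale is equivalent to checking whether θ lies between the exponentiated limits.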
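Q. 2 (10 pts) In this question, we will prove the formula $SE(\log\hat{\theta}) = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}$ using the delta method. The delta method is a useful method for deriving the asymptotic variance of a statistic. Suppose that $\hat{\theta} = \frac{p_{11}\,p_{22}}{p_{12}\,p_{21}}$, where $p_{ij}$ is defined as below.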
              Column
Row       1       2       Total
1         p11     p12     p1+
2         p21     p22     p2+
Total     p+1     p+2     N = n++
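We want to derive the variance of $\log(\hat{\theta})$. The multivariate version of the delta method is
$\mathrm{Var}\big(f(p_{11}, p_{12}, p_{21}, p_{22})\big) \approx \nabla f(p_{11}, p_{12}, p_{21}, p_{22}) \, \mathrm{Cov}(p_{11}, p_{12}, p_{21}, p_{22}) \, \nabla f(p_{11}, p_{12}, p_{21}, p_{22})^T$
where $f(p_{11}, p_{12}, p_{21}, p_{22}) = \log(\hat{\theta})$ and $\nabla f$ is the gradient vector. That is,
$\nabla f(p_{11}, p_{12}, p_{21}, p_{22}) = \left( \frac{\partial f}{\partial p_{11}}, \ldots, \frac{\partial f}{\partial p_{22}} \right)$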
We assume the multinomial sampling since the total number of observations is fixed.
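As a possible starting point for the derivation (not a complete solution), the two ingredients of the delta method can be written out as follows. This assumes that $p_{ij} = n_{ij}/N$ are the sample proportions and that $\pi_{ij}$ denotes the true cell probabilities, as in Q. 1.

```latex
% Multinomial variances and covariances of the sample proportions p_{ij} = n_{ij}/N
\operatorname{Var}(p_{ij}) = \frac{\pi_{ij}(1 - \pi_{ij})}{N}, \qquad
\operatorname{Cov}(p_{ij}, p_{kl}) = -\frac{\pi_{ij}\,\pi_{kl}}{N}
\quad \text{for } (i,j) \neq (k,l).

% Gradient of f = \log\hat{\theta} = \log p_{11} + \log p_{22} - \log p_{12} - \log p_{21}
\nabla f(p_{11}, p_{12}, p_{21}, p_{22})
  = \left( \frac{1}{p_{11}},\; -\frac{1}{p_{12}},\; -\frac{1}{p_{21}},\; \frac{1}{p_{22}} \right).
```

Multiplying these out and replacing $\pi_{ij}$ by $p_{ij}$ is the remaining step of the proof.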
Q. 3 (10 pts) The data below contain results of a study comparing radiation therapy with surgery in treating cancer of the larynx. Do not use the R function fisher.test. However, you may use the R function dhyper to evaluate the expression.
                     Cancer Controlled    Cancer Not Controlled
Surgery              n11 = 21             n12 = 2
Radiation therapy    n21 = 15             n22 = 3
(a) (5 pts) Test against the directional alternative that surgery is better than radiation therapy in controlling the cancer of the larynx using Fisher’s exact test. Find the p-value. What’s your conclusion of the test?
(b) (5 pts) Test against the two-sided alternative that surgery and radiation therapy differ in controlling the cancer of the larynx using Fisher’s exact test. Find the p-value. What’s your conclusion of the test?
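The sketch below shows one way dhyper could be used here, conditioning on the table margins so that n11 follows a hypergeometric distribution. The two-sided rule used (sum the probabilities of all tables no more likely than the observed one) is one common convention and is an assumption here, as are the object names.

```r
# Observed table and the margins it fixes.
n11 <- 21; n12 <- 2
n21 <- 15; n22 <- 3
row1 <- n11 + n12          # surgery total
col1 <- n11 + n21          # cancer controlled total
col2 <- n12 + n22          # cancer not controlled total

# Possible values of n11 given the fixed margins.
support <- max(0, row1 - col2):min(row1, col1)

# dhyper(x, m, n, k): probability of x "successes" in k draws
# from a population of m successes and n failures.
probs <- dhyper(support, m = col1, n = col2, k = row1)

# One-sided: tables at least as favourable to surgery as the observed one.
p_one_sided <- sum(probs[support >= n11])

# Two-sided: tables no more probable than the observed one
# (the small tolerance guards against floating-point ties).
p_obs <- dhyper(n11, m = col1, n = col2, k = row1)
p_two_sided <- sum(probs[probs <= p_obs * (1 + 1e-7)])

c(one_sided = p_one_sided, two_sided = p_two_sided)
```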
Q. 4 (15 pts) A 2010 survey asked 827 randomly sampled registered voters in California: “Do you support? Or do you oppose? Drilling for oil and natural gas off the Coast of California? Or do you not know enough to say?” Below is the distribution of responses, separated based on whether or not the respondent is a college graduate.
College Grad
Yes No
Support 154 132
Oppose 180 126
Do not know 104 131
Total 438 389
(a) (6 pts) Test whether two variables are independent or not using
i. Pearson’s X² test
ii. The likelihood ratio G² test of independence.
Please write down every term in the test statistic explicitly before evaluating them. For each test, report the degrees of freedom and the P-value. Interpret the results.
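For part (a), the individual terms of X² and G² could be laid out in R along the following lines. This is a sketch that assumes the usual expected counts under independence (row total × column total / n); the object names are illustrative.

```r
# Observed counts: rows are opinions, columns are college graduate Yes/No.
obs <- matrix(c(154, 132,
                180, 126,
                104, 131),
              nrow = 3, byrow = TRUE,
              dimnames = list(Opinion = c("Support", "Oppose", "Do not know"),
                              CollegeGrad = c("Yes", "No")))

# Expected counts under independence: (row total * column total) / n.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Cell-by-cell terms of each statistic, then their sums.
x2_terms <- (obs - expected)^2 / expected
g2_terms <- 2 * obs * log(obs / expected)
X2 <- sum(x2_terms)
G2 <- sum(g2_terms)

# Both statistics are referred to a chi-square with (I - 1)(J - 1) df.
df <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_X2 <- pchisq(X2, df, lower.tail = FALSE)
p_G2 <- pchisq(G2, df, lower.tail = FALSE)

print(x2_terms)
print(g2_terms)
c(X2 = X2, G2 = G2, df = df, p_X2 = p_X2, p_G2 = p_G2)
```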
(b) (3 pts) Test whether the proportion of college graduates supporting offshore drilling equals the proportion of non-college graduates supporting offshore drilling using a two-sample test of proportions. Obtain the P-value.
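A two-sample z test for part (b) might look like the sketch below. It assumes the pooled-proportion standard error under the null hypothesis of equal proportions and a two-sided alternative; other conventions exist.

```r
x <- c(154, 132)     # supporters among college graduates, non-graduates
n <- c(438, 389)     # group sizes (column totals of the table)

p_hat  <- x / n                 # sample proportions in each group
p_pool <- sum(x) / sum(n)       # pooled proportion under H0

# z statistic with the pooled standard error.
z <- (p_hat[1] - p_hat[2]) / sqrt(p_pool * (1 - p_pool) * sum(1 / n))
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

c(z = z, p_value = p_value)
```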
(c) (3 pts) Do the conclusions of the tests in part (a) and (b) agree? Is it surprising or possible or is there anything wrong? Explain.
(d) (3 pts) Obtain the standardized residuals for the chi-square test in (a) and describe the association pattern between the two variables.
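For part (d), the standardized residuals can be computed directly from the observed and expected counts. The sketch assumes the standard definition (n_ij − μ̂_ij) / sqrt(μ̂_ij (1 − p_i+)(1 − p_+j)); object names are illustrative.

```r
obs <- matrix(c(154, 132,
                180, 126,
                104, 131),
              nrow = 3, byrow = TRUE,
              dimnames = list(Opinion = c("Support", "Oppose", "Do not know"),
                              CollegeGrad = c("Yes", "No")))

n <- sum(obs)
expected <- outer(rowSums(obs), colSums(obs)) / n
row_p <- rowSums(obs) / n     # marginal row proportions p_i+
col_p <- colSums(obs) / n     # marginal column proportions p_+j

# Standardized residual for every cell of the table.
std_res <- (obs - expected) /
  sqrt(expected * outer(1 - row_p, 1 - col_p))
print(round(std_res, 2))
```

Cells whose residuals are large in absolute value (roughly beyond 2 or 3) indicate where the data depart most from independence.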
Q. 5 (15 pts) The table below shows results of an eight-center clinical trial to compare a drug to placebo for curing an infection. At each center, subjects were randomly assigned to groups.
                      Response
Center  Treatment  Success  Failure
1       Drug          11       25
        Control       10       27
2       Drug          16        4
        Control       22       10
3       Drug          14        5
        Control        7       12
4       Drug           2       14
        Control        1       16
5       Drug           6       11
        Control        0       12
6       Drug           1       10
        Control        0       10
7       Drug           1        4
        Control        1        8
8       Drug           4        2
        Control        6        1
(a) Find the marginal table for the Treatment (drug, placebo) and Response (success, failure). Calculate and interpret the (sample) marginal odds ratios of the marginal table.
(b) Explain why it’s not a good idea to test the independence of Treatment and Response using the marginal table of Treatment and Response and ignore Center.
(c) Please test the conditional independence of Treatment and Response given Center using the Cochran-Mantel-Haenszel test. Please calculate the expected count and the variance for the cell (Drug, Success) for each of the 8 centers, write down every term in the numerator and the denominator of the CMH statistic explicitly before evaluating them, and find the P-value.
(d) Calculate and interpret Mantel-Haenszel’s estimate of the common odds ratio between Treatment (drug vs. placebo) and Response (success, failure). Please write down every term in the numerator and the denominator of the estimate explicitly before evaluating it.
(e) Verify your calculation in the previous two parts using the R command mantelhaen.test.
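One way to organise parts (c) to (e) in R is to store the eight 2 × 2 tables in a 2 × 2 × 8 array, compute the per-center terms by hand, and then compare with mantelhaen.test. The sketch assumes no continuity correction (correct = FALSE) so that the hand calculation and the R output are comparable; object names are illustrative.

```r
# Cell counts per center in the order (Drug success, Control success,
# Drug failure, Control failure), so each 2 x 2 slice fills column by column.
counts <- c(11, 10, 25, 27,
            16, 22,  4, 10,
            14,  7,  5, 12,
             2,  1, 14, 16,
             6,  0, 11, 12,
             1,  0, 10, 10,
             1,  1,  4,  8,
             4,  6,  2,  1)
tab <- array(counts, dim = c(2, 2, 8),
             dimnames = list(Treatment = c("Drug", "Control"),
                             Response  = c("Success", "Failure"),
                             Center    = as.character(1:8)))

# Margins of each partial table.
n11  <- tab["Drug", "Success", ]
row1 <- tab[1, 1, ] + tab[1, 2, ]
row2 <- tab[2, 1, ] + tab[2, 2, ]
col1 <- tab[1, 1, ] + tab[2, 1, ]
col2 <- tab[1, 2, ] + tab[2, 2, ]
n    <- row1 + row2

# Expected count and variance of the (Drug, Success) cell in each center.
expected <- row1 * col1 / n
variance <- row1 * row2 * col1 * col2 / (n^2 * (n - 1))

# CMH statistic (no continuity correction) and its p-value on 1 df.
cmh <- sum(n11 - expected)^2 / sum(variance)
p_value <- pchisq(cmh, df = 1, lower.tail = FALSE)

# Mantel-Haenszel estimate of the common odds ratio.
or_mh <- sum(tab[1, 1, ] * tab[2, 2, ] / n) /
  sum(tab[1, 2, ] * tab[2, 1, ] / n)

# Built-in check for part (e).
check <- mantelhaen.test(tab, correct = FALSE)
print(data.frame(center = 1:8, n11, expected, variance))
c(CMH = cmh, p_value = p_value, OR_MH = or_mh)
```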
Q. 6 (25 pts) Refer to “Alcohol Use and Infant Malformation” and the data in Table 2.6 on page 44 of [ICDA3] (our textbook).
Let X = mother’s alcohol consumption and Y = whether a baby has sex organ malformation. For the five levels of alcohol consumption (0, < 1, 1-2, 3-5, ≥ 6 drinks per day), use the midpoints (0, 0.5, 1.5, 4.0, 7.0) as the mother’s true alcohol consumption X. You can get the data into R as follows:
mydata = data.frame(drinks = c(0, 0.5, 1.5, 4, 7),
                    absent = c(17066, 14464, 788, 126, 37),
                    present = c(48, 38, 5, 1, 1))
mydata$total = with(mydata, absent + present)
mydata$proportion = with(mydata, present/total)
(a) (2 pts) Let π(x) be the probability of a baby having sex organ malformation if the mother’s alcohol consumption during pregnancy was x. We want to fit a linear probability model, π(x) = α + βx. Obtain the maximum likelihood (ML) fit of the linear probability model with the glm() function.
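A possible call for part (a) is sketched below, assuming the linear probability model is specified as a binomial GLM with an identity link and the response given as a two-column matrix of (present, absent) counts. The object name fit_lin is illustrative.

```r
mydata <- data.frame(drinks  = c(0, 0.5, 1.5, 4, 7),
                     absent  = c(17066, 14464, 788, 126, 37),
                     present = c(48, 38, 5, 1, 1))

# Binomial GLM with identity link: pi(x) = alpha + beta * x.
# cbind(present, absent) gives the (success, failure) counts per row.
fit_lin <- glm(cbind(present, absent) ~ drinks,
               family = binomial(link = "identity"),
               data = mydata)
summary(fit_lin)
```

With an identity link the fitted values are not constrained to lie in (0, 1), so glm() can fail for some data sets; its start argument allows initial coefficient values to be supplied if that happens.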
(b) (8 pts) From the R summary output obtained above,
i. Write down the fitted regression equation for the model π(x) = α + βx.
ii. Interpret the intercept and slope in the context of the data.
iii. Estimate the probabilities of malformation for the lowest and highest alcohol levels: π(0) and π(7).
iv. Estimate and interpret the relative risk comparing the two levels in part (iii).
I suggest finding the estimated probabilities “by hand”, without using the R function predict.
(c) (2 pts) From the summary() output in part (a), get a 90% Wald confidence interval for the coefficient β. Again, I suggest finding this by hand.
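The Wald interval in part (c) is the estimate plus or minus a normal quantile times the standard error. The helper wald_ci() below is hypothetical, and the numbers passed to it are purely illustrative rather than values from the fitted model.

```r
# wald_ci() is a hypothetical helper; est and se would be read from the
# "Estimate" and "Std. Error" columns of the summary() output.
wald_ci <- function(est, se, level = 0.90) {
  z <- qnorm(1 - (1 - level) / 2)   # about 1.645 for a 90% interval
  c(lower = est - z * se, upper = est + z * se)
}

# Illustrative numbers only, not taken from the assignment data.
wald_ci(est = 0.5, se = 0.1)
```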
(d) (10 pts) Fit a logistic regression model
π(x) = exp(α + βx) / (1 + exp(α + βx))
i. Write down the fitted regression equation π̂(x) for the model.
ii. Interpret the intercept and slope in the context of the data.
iii. Estimate the probabilities of malformation for the lowest and highest alcohol levels: π(0) and π(7).
iv. Estimate the relative risk comparing the two levels in part (iii).
v. Calculate the odds ratio of malformations for alcohol levels 7 vs. 0.
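For part (d), a sketch of the logistic fit and the by-hand quantities, assuming the default logit link of the binomial family. The names fit_logit and pi_hat() are illustrative.

```r
mydata <- data.frame(drinks  = c(0, 0.5, 1.5, 4, 7),
                     absent  = c(17066, 14464, 788, 126, 37),
                     present = c(48, 38, 5, 1, 1))

# Logistic regression: logit(pi(x)) = alpha + beta * x.
fit_logit <- glm(cbind(present, absent) ~ drinks,
                 family = binomial, data = mydata)

a <- coef(fit_logit)[["(Intercept)"]]
b <- coef(fit_logit)[["drinks"]]

# Estimated probability "by hand" from the fitted equation.
pi_hat <- function(x) exp(a + b * x) / (1 + exp(a + b * x))

p0 <- pi_hat(0)
p7 <- pi_hat(7)
rel_risk   <- p7 / p0        # relative risk, 7 vs. 0 drinks
odds_ratio <- exp(7 * b)     # odds ratio, 7 vs. 0 drinks

c(p0 = p0, p7 = p7, rel_risk = rel_risk, odds_ratio = odds_ratio)
```

Under the logistic model the odds ratio for a difference of 7 drinks depends only on the slope, which is why it can be computed as exp(7β̂).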
(e) (3 pts) Graph a scatterplot of the sample proportions of malformation vs. the level of alcohol consumption (0, 0.5, 1.5, 4, 7). On the same graph, show
i. the fitted line by the ML method from part (a)
ii. the fitted line by the least squares method.
iii. the fitted logistic curve by the ML method from part (d).
Why do the slopes of the two straight lines differ so much?
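The plot in part (e) could be drawn with base R graphics roughly as follows. The sketch assumes the least squares line is an unweighted lm() fit to the five sample proportions; line types, labels, and object names are arbitrary choices.

```r
mydata <- data.frame(drinks  = c(0, 0.5, 1.5, 4, 7),
                     absent  = c(17066, 14464, 788, 126, 37),
                     present = c(48, 38, 5, 1, 1))
mydata$total <- with(mydata, absent + present)
mydata$proportion <- with(mydata, present / total)

# The three fits to be compared.
fit_lin   <- glm(cbind(present, absent) ~ drinks,
                 family = binomial(link = "identity"), data = mydata)
fit_ls    <- lm(proportion ~ drinks, data = mydata)   # unweighted
fit_logit <- glm(cbind(present, absent) ~ drinks,
                 family = binomial, data = mydata)

# Scatterplot of sample proportions with the three fitted curves.
plot(proportion ~ drinks, data = mydata,
     xlab = "Alcohol consumption (drinks per day)",
     ylab = "Proportion with malformation")
abline(coef = coef(fit_lin), lty = 1)                 # ML linear fit
abline(fit_ls, lty = 2)                               # least squares fit
a <- coef(fit_logit)[1]
b <- coef(fit_logit)[2]
curve(plogis(a + b * x), add = TRUE, lty = 3)         # ML logistic fit
legend("topleft", lty = 1:3,
       legend = c("ML linear", "Least squares", "ML logistic"))
```

Note that lm() gives each of the five proportions equal weight, whereas the ML fits account for the very different group sizes.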
2023-02-21