Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Courses: STA305H1S Sections L0101 & L0201

Homework: HW2

Due date: 17 February 2023

1. Submit your R/Rmd codes in a separate file in Crowdmark but in Box Qx R/Rmd. This file will be downloaded and evaluated but marks will be assigned to your answer in PDF file (Box Qx).

2. If mentioned specifically, R/Rmd codes may not be required for some questions or its parts.

3. Submit files separately for individual questions.

4. Computations using a language other than R will not be accepted.

5. PENALTY: Upload/submit your files in time to avoid penalty. Also consult the syllabus.

You may take help of the codes and concepts in Prof. Nathan’s textbook:

Link: http://designexptr.org/index.html/

Q1. [Total Marks for Q1=10]Follow the instructions in each part, a - d, of this question.

However, the questions in parts (a - c) are NOT GRADED. But I suggest you answer them to build your own understanding, but DO NOT upload on the Crowdmark.

Only Part (d) will be graded.

(a) Use R/Rmd codes to create a set of 25 patients (say n = 25), label them as Patient1,

Patient2, ..., Patient25, and create a data frame with a column named Subject.  Call this data frame as dfCausal” .  Keep the random generator seed value xed in the begining of the program, say set.seed(nnnn), choose your nnnn”, possibly as a part of your ID number.  In case you already know that some one else has used that number for this question, then use a different number here.

Ungraded: a1) Do you have any treatment assignment mechanism here? a2) Is that mechanism a randomised assignment? a3) Is the assignment ignorable?

(b) Extend the R codes in (a) further and create a column of treatment factor, Trt, with

two levels,  say ”1” and ”0” . Assign the treatments randomly to the units in the Subject column, such that proportion p = 0.7 of the subjects (i.e., patients) receive Trt =1 (i.e., level 1), labels=”New” (for Trt 1) and the remaining subjects receive Trt =0 (i.e., level 0) or labels =”Control” .  Update the data frame dfCausal by including Trt column vector.

Ungraded: b1) Do you have a randomized assignment here?  b2) Is the assignment ignorable?

(c) Extend the codes in (b) further and create a column of covariates, Gender (two levels:

1 for Female (”F”) and 0 for Male (”M”)) and Age (round to whole numbers), which is normally distributed with mean 60 years and standard deviation 8. Assign the Gender randomly to the units in the Subject column, such that proportion p = 0.6 of the subjects/patients receive Gender =1 (Females).  Update the data frame dfCausal by including the Gender and Age columns or variables. We will not work with Age in this question but I want it be in the data frame.

Ungraded: c1) Do you have a random assignment here, considering that you have a covariate Gender? c2) Is the assignment ignorable?

(d) Graded: Extend the codes in (c) further and create two columns of potential outcomes (i.e., the response variables), Y1 and Y0, measured by an omniscientist, or a perfect doctor.  As you know that in reality [practice], one can assign only either of the two levels of the Trt to the units and observe the response to this one level” of the Trt for that subject. However, the omniscientist knows the responses under both the treatment levels for each subject. For example, as discussed during the lectures, a perfect doctor knows the longevity of a patient under each of the two (surgical) treatments. Generate randomly Y1 as normal random variable N(mean=37, sd=5) and Y0 as N(mean=30, sd=6).  Round these Y1 and Y0 to integer values (i.e., no decimal points).  Consider the situation where the omniscientist assigns the treatment that will result into a higher response (longevity), and in case of equality, he will assign the treatment (New or Control) with ip of a balanced coin.  Create two columns for the omniscientist’s [perefect doctor’s] choice:  name the treatment column as TrtD and response column as YD where YD= higher of Y1 and Y0 subject-wise.

Combine all the previous steps with a larger number, say n = 1000 [You are free to use a different number]. This you can do by assigning the code n<- 1000 in the first r chunk and run all the chunks, or create a new r chunk with all the codes put together, along with n<- 1000 in the beginning, and create the data frame once in the last step.

At this step you might notice there are too many different values of Y1 ( or y1 here below) and Y0. It would be worth reducing these large number of different values into a fewer number of distinct values. You may have your own option to resolve or ignore this issue.  But one way is to use quantities (say, 土1.5σ).  Thus, you may replace by LimL = µ - 1.5σ, all those y1 s which are less than or equal to LimL, and so on. You may like to review the following codes to help you:

LimL<- mean(y1) - 1 .5 * sd(y1) ; LimU = mean(y1) - 1 .5 * sd(y1) y1<- mean(y1[y1<= LimL]) * (y1<= LimL)

+   mean(y1[y1<= LimU & y1 >  LimL])  *(y1<= LimU & y1 >  LimL) +   mean(y1[y1  >  LimU])  *(y1  >  LimU)

y1<- round(y1,0)

This way, y1 will have only 3 distinct values and you will nd multiple units resulting into various probabilities strictly between 0 and 1(0 < P (T l...) < 1) needed for checking the ignorability.

Follow similar steps to generate Y0.

Update the data frame dfCausal by keeping the columns of vectors: Subjects, Gender, Trt, Y1, Y0, TrtD, and YD.

Then answer the following.

(i) [Marks=2]What is the true average causal effect of the treatment (Trt) in the popu- lation of n units? Which treatment is better on average? Explain your answer.

(ii) [Marks=2]If we only observed the outcomes under the treatment that the ominscient

[perfect doctor] assigned then what is the average observed treatment effect?   Is it different compared to the true average causal treatment effect?

(iii) [Marks=3] Is it plausible to believe that these data, under the assignment of treat-

ment Trt, came from a randomized experiment? Defend your answer.

(iv) [Marks=3]For these data, i.e., using TrtD/Perfect doctor’s assignment, is it plausible

to believe that treatment assignment is strongly ignorable given the covariate Gender. Defend your answer.

Q2. [Total Marks for Q2= 16] Propensity scores and matched units for assess-

ment of treatment dierences on the response in an observational study:

Familiarize yourself with NT2022 Ch7 Causal Inference.Rmd” and the associated output posted on Quercus.  This question requires you to analyze a slightly different dataset with similar names of its columns.

Read the data columns from the file ”DataHW2Q2W2023.csv” uploaded on the Crowd- mark in addition to the question le. Recognize the following as the variables of inter- est:

Covariates: age, sex, race, educn (for university education), smkIntensity (for smoking intensity), smkYrs (years of smoking), exer (little/no exercise), active (inactive daily

life), wt75 (weight at base line of the study in 1975)

The treatment factor: qsmk (quit smoking by 1986)

The response variable: wt86 75 (change in weight = weight in 1986 - weight in 1975)

(a) [Marks= 2+2+2=6] To explain the treatment factor qsmk, fit a logistic model (with

binomial errors) in terms of all the covariates listed in the question Q2 (this includes continuous and categorical).  Use the R function glm() .  To check the significance of various covariates, use the function summary() on this tted model and comment on significance of the covariates in determing the propensity score.  Print the propensity scores for the first 25 units [1:25], with a column for qsmk.

(b) [Marks= 6] We now move to do matching of units for propensity scores between

groups for qsmk =0, 1 so that comparisons of the response could be done relatively less biasedly.  For this purpose, use the library(Matching) and functions Match( ), MatchBalance().  The output from the Match(), say rr < - Match(..) can be used to extract the units that appear in groups for treatment (i.e., qsmk ==1) and control (i.e., qsmk == 0). Suppose you wish to compare distribution of age after the matching has been done based on propensity score; you may extract the vectors of the age values in the two groups as:

age[rr$index .control]  ,  age[rr$index .treated]

Compare the distributions before matching (i.e., the units in groups qsmk = 0 vs 1) and after matching (i.e., using the units in index.control” vs. ”index.treated” groups) for the covariate age,  and the propensity scores, using their quantiles  [R function: qqplot().]

(c) For the response of interest, wt86 75, do the following comparisons and comment on each case:

[Marks= 1] Quantile - quantile plots of wt86 75 for qsmk =0 vs.  qsmk = 1.  Here you made no adjustment.

[Marks= 1] Quantile - quantile plots of wt86 75 for index.control” vs. ”index.treated” . Here you made the adjustment via matching the units based on propensity scores.

[Marks= 1] Using t-test based on assumption of equal variances and normal distri- butions, test the equality of means of wt86 75 for qsmk =0 vs qsmk = 1 groups. Here you made no adjustment.

[Marks= 1] Using t-test based on assumption of equal variances and normal distri- butions, test the equality of means of wt86 75 for index.control” vs. ”index.treated” . Here you made the adjustment via matching the units based on propensity scores.

Q3. [Total Marks for Q3= 12] Completely randomized design (CRD)

(a) Read the le DataHW2Q3W2023.csv” which contains the data obtained from a CRD

in 6 treatments (”Trt”) on a response variable (”Response”). Answer the following:

i [Marks = 2+2+( 0.5 × 4)=6 The researcher is interested in testing whether the

treatment affects the response.  What conclusion do you draw about the treatments? Use α =  0.05.   Based on this design, write model and its assumptions, obtain an ANOVA table for the data on Response”, and write:  the null and alternative hy- potheses, test statistic (TS) and the null distribution of the TS, test and conclusion.

ii [Marks = 1+1=2] Analyze the residuals to examine the departure from the model

you assumed to analyze this data, in particular: examine constancy of error variance, and normality of residuals.  Comment on any observed departures from the assump- tions.

(b) Assuming that the model used in part (a) is appropriate,

i [Marks= 1] using Fisher LSD method at α = 0.05 for pairwise comparison of means,

identify the treatment pairs with statistically significantly different means.

ii [Marks= 1] using Tukey’s HSD method with an overall experiment-wise error rate of

α = 0.05, identify the treatment pair with statistically significantly different means.

(c) [Marks = 1+1=2] For the CRD experiment in part  (a), suppose it makes sense to justify the following random effects model for the data yij   on response variable (”Response”) in le DataHW2Q3W2023.csv” with a = 6 treatments with a common n replications:

yij  = µ + τi + ∈ij

where µ is an unknown constant, τi i. . N(0, στ(2)) and ∈ij  i. . N(0, σ2 ); τi  s and ∈ij  s are uncorrelated (i = 1, 2, . . . , a; j = 1, 2, . . . , n).

Estimate στ(2)  and σ 2 .