School of Mathematics and Statistics

MATH5885 Longitudinal Data Analysis

Term 2, 2021

Project

Due 23:59, Sunday, 1st August (end of Week 9) via Moodle.


        The project should be submitted via the Assignment tool. This tool is accessible via a clearly indicated link in the Assessments subfolder on Moodle. Please add a cover page containing a copy of your ID card, and write in your own handwriting:

        “I declare that this project is my own work, except where acknowledged and I have read and understood the University rules regarding Academic Misconduct”, and sign it.

        You must upload ONE pdf file containing all your working, with all the R material at the back of the project’s pdf file under the title “Appendix”. Please include sufficient working, computer code (adequately documented and commented) and output (adequately explained) so that I can follow what you have done. As has been known since George Box, “all models are wrong but some are useful”, so I do not expect any two submitted projects to be identical.

        Please note that there are page limitations for the MAIN PART of the report:

        maximum of 12 pages typed in minimum 12 pt font, single line spacing with minimum 2.5cm margins, single sided. These pages should include mathematical summaries of the models fitted, essential R code and output only, any essential tabular and graphical output with a narrative about how you arrived at key modelling decisions, and your summary of findings or conclusions. You should also describe any model deficiencies and suggest possible remedies. Further details below.

        There are no page limitations for the appendix part of the report, which should contain the complete R code and any additional graphs and tables, properly labelled so that the main report can cross-reference them and so that I can quickly locate the relevant R code and additional tables and graphs should that be needed. This is NOT a de facto extension to your report. The main part of your report should stand on its own and be readable without reference to the Appendix.

        If you are not skilled at producing typeset reports, then neatly handwritten reports are acceptable provided the specifications on font size, margins, line spacing etc. described above are reasonably conformed to.


1 Project Background and Data

The project uses the CD4 dataset from DHLZ, introduced in Week 2. Please download the attached text file cd4data.txt and use it for your analysis. Any of the explanatory variables included in the dataset may be considered for inclusion in your model, as well as functions of time. The response variable is CD4+ cell count but you may also wish to consider transformations of the response. Basic background information is available in the following documents:

1. DHLZ-CD4-BasicDataAnalysis.pdf, which contains some basic data analysis from Diggle et al.

2. ZegerDiggle-1994-Biometrics.pdf, which gives a published journal article using this dataset and explains the variables observed in the study — see in particular their Section 5 for details.

        The dataset consists of longitudinally collected observations on 369 subjects, resulting in a total of 2376 observations of CD4 cell counts denoted CD4 in the dataset. Other variables collected are:

1. Time: the time (in years) since seroconversion, where a negative time denotes time before seroconversion.

2. Age: age at seroconversion (a baseline measurement), centred at 30 years of age, so that negative ages denote years younger than 30.

3. Packs: the number of packets of cigarettes smoked per day at time of measurement.

4. Drugs: a binary variable taking the values 1 or 0 to denote if the respondent takes recreational drugs or not respectively, measured at each time point.

5. Sex: number of sexual partners reported at each time point. This variable appears to have been centred and truncated at ±5.

6. Cesd: an index of depression measured at each time point, with time trends removed. Higher scores indicate greater depressive symptoms.

        Zeger and Diggle (1994) suggest (Section 5):

        “The first objective of this analysis is to characterize the population average time course of CD4 decay while accounting for the following additional predictor variables: smoking (packs per day); recreational drug use (yes or no); numbers of sexual partners; and depression symptoms as measured by the CESD scale (larger values indicate increased depressive symptoms). The analysis was conducted on square-root-transformed CD4 numbers whose distribution is more nearly Gaussian”

Later they state:

        “The linear regression coefficients (standard errors in parentheses) for the covariates age at seroconversion (years), packs of cigarettes, recreational drug use (0: no, 1: yes), number of sexual partners, and depression score are: .037 (.18), .27 (.15), .37 (.31), .10 (.038), and -.058 (.015), respectively. Age plays little role. Smoking, recreational drug use, and increased numbers of sexual partners are associated with higher CD4 cell numbers. This may reflect immune response stimulation or simply selection bias whereby healthier men choose to continue these practices. Increased depressive symptoms are significantly associated with decreased CD4 levels. Again, a causal direction cannot be inferred from this analysis.”

        These estimated regression coefficients seem to be those obtained by least squares in a model in which (page 694): “µ(t) was approximated by a knotted cubic spline with seven equally spaced knots.”

        Note that the model of Zeger and Diggle uses the square root of the CD4 cell counts as the response variable, with the other available variables as covariates. However, as they rightly point out, these other variables cannot be inferred to cause the level of CD4 cell counts.
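        A least squares fit of this general form can be sketched in R as follows. This is only an illustration under stated assumptions: the data are simulated stand-ins (not cd4data.txt), the column names are assumed to match the variable list above, and a cubic B-spline from `splines::bs` with seven equally spaced interior knots is one possible reading of “knotted cubic spline”, not necessarily the basis Zeger and Diggle used.

```r
library(splines)

# Simulated stand-in for cd4data.txt; column names and values are assumptions.
set.seed(1)
n <- 400
sim <- data.frame(
  Time  = runif(n, -3, 5),
  Age   = rnorm(n, 0, 7),
  Packs = sample(0:4, n, replace = TRUE),
  Drugs = rbinom(n, 1, 0.7),
  Sex   = sample(-5:5, n, replace = TRUE),
  Cesd  = rnorm(n, 0, 9)
)
sim$CD4 <- pmax(0, rnorm(n, 900 - 80 * pmax(sim$Time, 0), 250))

# Seven equally spaced interior knots across the observed time range.
knots <- seq(min(sim$Time), max(sim$Time), length.out = 9)[2:8]

# Ordinary least squares on square-root CD4 with a cubic B-spline in time.
fit_ols <- lm(sqrt(CD4) ~ bs(Time, knots = knots) + Age + Packs + Drugs + Sex + Cesd,
              data = sim)

# Estimates and standard errors for the non-time covariates.
round(coef(summary(fit_ols))[c("Age", "Packs", "Drugs", "Sex", "Cesd"), 1:2], 3)
```

        Ordinary least squares ignores the within-subject correlation, so the standard errors from such a fit should be treated as provisional until a covariance model has been chosen.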

        Available on Moodle is a document CD4InitialAnalysis.pdf. There is also an accompanying R Script file called CD4InitialAnalysis.R. These provide some preliminary exploratory data analysis and an attempt to reproduce various results reported in Zeger and Diggle. As is often the case with scientific papers, insufficient detail is available to allow exact reproduction of the findings. In particular, the point estimates and standard errors reported by Zeger and Diggle cannot be reproduced despite best efforts to do so.

        As a starting point, you should work through the R Script file CD4InitialAnalysis.R to ensure you understand what each part of it does. Then you should undertake your own analysis for the project as described in the next section.


2 Project Aims

The aim of the project is to determine a suitable model for the square root of CD4 cell counts as the response variable with covariates time (suitably modelled), age, cigarettes, CESD score, drug use and partners.

        You should proceed as follows:

1. Using and adapting the techniques introduced in the course and in the above R script, perform exploratory data analysis for the dataset in order to explore the mean structure, including the impacts of the various covariates on the mean response and to explore the covariance structure for the model randomness.

For example, this will include plots of individual and average profiles across time (possibly stratified by levels of the other covariates), investigation of covariance structure, and any other analyses you feel are relevant. Choose two or three preliminary fixed effects structures based on this analysis. In particular you might want to model the response to time as a combination of linear or other functions over segments of time. The model based on natural splines is provided as a starting point to flexibly model the temporal trend in mean response. It may be possible to simplify this; that is up to you.

2. Fit these preliminary models using linear regression, comment on the significance of the regression coefficients and obtain the residuals from these models.

3. You should consider possible components in the models for the covariance structure including compound symmetry, unequal variances, random error, exponential or Gaussian autocorrelation decay. Use correlation and variogram analysis to propose possible models for the covariance of the residuals and any random effects components you may wish to include in the regression specification. Compare your alternative models using appropriate statistical model fit criteria and hypothesis tests. Select the best covariance model based on your analysis.

4. Consider whether your preliminary fixed effects structure needs to be adjusted in light of the chosen covariance model and refit the adjusted model. Make your conclusions.

5. Obtain the estimated covariance and correlation matrices for a selected patient with 7 or 8 measurements spanning (roughly evenly) time 0. Discuss how the variances vary with time, and how the correlations vary with time between measurements.

6. Select four patients with 7 or 8 measurements spanning time 0. Try to select a range of patients responding “high”, “medium” and “low” initially and over time. Use BLUPs to estimate the individual trajectories for these patients and plot them on the same graph, along with their observed levels of CD4 cell counts.
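        As a point of reference for item 3, one common way to write the covariance components listed there is sketched below. The notation is an assumption introduced here for illustration (it does not appear in this handout): $U_i$ is a subject-level random intercept, $W_i(t)$ a stationary serially correlated process, and $Z_{ij}$ pure measurement noise, all mutually independent.

```latex
% Hypothetical notation: i indexes subjects, j indexes measurement occasions.
Y_{ij} = \mu_{ij} + U_i + W_i(t_{ij}) + Z_{ij}, \qquad
U_i \sim N(0, \nu^2), \quad
\operatorname{Var}\{W_i(t)\} = \sigma^2, \quad
Z_{ij} \sim N(0, \tau^2)

% Implied variance and covariance within a subject (j \neq k):
\operatorname{Var}(Y_{ij}) = \nu^2 + \sigma^2 + \tau^2, \qquad
\operatorname{Cov}(Y_{ij}, Y_{ik}) = \nu^2 + \sigma^2 \rho(|t_{ij} - t_{ik}|)

% Variogram of the residuals at time lag u:
\gamma(u) = \tau^2 + \sigma^2 \{ 1 - \rho(u) \}

% Two candidate serial correlation functions, with range parameter \phi > 0:
\rho(u) = \exp(-u/\phi) \ \text{(exponential)}, \qquad
\rho(u) = \exp\{-(u/\phi)^2\} \ \text{(Gaussian)}
```

        Under this sketch, the intercept of the empirical variogram at lag zero estimates $\tau^2$, its rise reflects $\sigma^2$ and the shape of $\rho$, and the gap between its plateau and the total variance reflects $\nu^2$. Compound symmetry corresponds to dropping $W_i$, and unequal variances would require letting the variance terms depend on time.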


3 Your report

Write up a detailed report on your analysis. You should include:

Section 1: Introduction A very brief summary of the situation, the data and the objectives of your analysis and report.

Section 2: Exploratory data analysis Briefly describe the exploratory data analysis and summarize its results, including relevant graphical output.

Section 3: Model formulation This is the major section summarizing the steps taken and models tried in arriving at your final model.

● Describe and justify your model selection procedure, saying why you chose to fit the models you did.

● Explain why you prefer the model for fixed effects and error structure you ended up choosing.

● Formulate a model for the random errors in terms of random effects, serial dependence and pure noise.

● Write down the final fitted model for the mean response including standard errors and discussion of significance of covariates.

● Discuss the effect of the explanatory variables on the response.

● Discuss the main features of the covariance structure.

● Discuss the properties of the residuals in the model and any impact these may have on inferences you make about model fit and significance of model terms.

Section 4: Application to individual trajectories Include the results of the analyses specified in items 5 and 6 of the Project Aims.

Section 5: Discussion of modelling Discuss the difficulties you encountered with the analysis, and the limitations of your model (if any).
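        For Section 4, the computations behind items 5 and 6 of the Project Aims might be sketched as follows. This is a minimal illustration, not a recommended final model: the data are simulated, the names `sim`, `id`, `y` and `blup_traj` are hypothetical, and a random intercept and slope model with independent errors is chosen only because it is simple and makes the variance change with time.

```r
library(nlme)

# Simulated stand-in data: 60 subjects, 8 visits each (all values hypothetical).
set.seed(42)
n_id  <- 60
times <- seq(-2, 4, length.out = 8)
sim <- data.frame(id   = factor(rep(seq_len(n_id), each = length(times))),
                  Time = rep(times, n_id))
b0  <- rnorm(n_id, 0, 3)   # subject-specific intercept deviations
b1  <- rnorm(n_id, 0, 1)   # subject-specific slope deviations
idx <- as.integer(sim$id)
sim$y <- 28 + b0[idx] + (-1.5 + b1[idx]) * sim$Time + rnorm(nrow(sim), 0, 1.5)

# Linear mixed model with random intercept and slope (a deliberately simple choice).
fit <- lme(y ~ Time, random = ~ Time | id, data = sim)

# Item 5: implied marginal covariance for one subject, V = Z G Z' + sigma^2 I.
G <- matrix(as.numeric(getVarCov(fit)), 2, 2)   # estimated random-effects covariance
Z <- cbind(1, times)                            # random-effects design for this subject
V <- Z %*% G %*% t(Z) + fit$sigma^2 * diag(length(times))
R <- cov2cor(V)
round(diag(V), 2)   # variances at each visit time
round(R[1, ], 2)    # correlations between the first visit and each visit

# Item 6: BLUP-based trajectories for four subjects.
ids <- c("1", "2", "3", "4")
cf  <- coef(fit)[ids, ]   # fixed effects plus predicted random effects, per subject
blup_traj <- sapply(ids, function(i) cf[i, "(Intercept)"] + cf[i, "Time"] * times)
```

        The columns of `blup_traj` can then be drawn against `times` and overlaid on each subject's observed values. The nlme package also provides correlation structures such as `corCompSymm`, `corExp` and `corGaus`, and `anova()` for comparing nested fits, should a serial correlation component be needed.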

        The report’s quality will be assessed as if the report is for a decision maker who only wants the key details in the main report but may want to easily access further detail in the Appendices.