STAT6108 Analysis of Hierarchical Data Assignment, 2022–23
● This assignment is worth 100% of the overall mark for STAT6108.
● The deadline for submission is 16.00 on Thursday 11 May 2023.
● Standard University policies and procedures will be followed for late submission, extensions and academic integrity (see the Module Outline for details).
● Please submit your answers to the three tasks below using the Turnitin link on Blackboard (see Module Outline for details) in a single file called report-ID.pdf, where ID is your student ID number, for example report-12345678.pdf. In the Assignments folder, click on Assignment submission to submit your report. Please enter this file name as the Submission Title.
● Remember that the University places the highest importance on maintaining academic integrity and expects all students to do the same. Please make sure you are familiar with the Regulations Governing Academic Integrity, which are available at http://www.calendar.soton.ac.uk/sectionIV/academic-integrity-regs.html.
● The page limits for each task are strict and are easily sufficient to receive full credit. Any pages beyond the limits will not be marked.
Task 1 [60 marks]
Maximum 9 pages plus short appendices
For this task, you can use MLwiN, R or STATA or any combination of the three to perform your analyses.
The dataset for this task contains the points score on A-level Chemistry, a qualification usually taken at age 18, of 31 022 students from 2410 schools. Each school is contained in one of 131 Local Education Authorities (LEAs). The dataset also contains five other variables which may explain differences in the students’ A-level Chemistry scores, including the average GCSE score of the student, which can be considered as a summary of their academic ability prior to studying for the A-level.
The dataset is contained in the file chemistry.csv (available on Blackboard) and contains the following variables:
lea: Identifier for the Local Education Authority (LEA) a school belongs to
school: Identifier for the school a student attends
student: Identifier for the student
score: Point score of the student on A-level Chemistry
gender: Gender of the student: 0 = female, 1 = male
age: Centred age of the student in months
gcsescore: Average GCSE score of the student
gcsecent: A centred version of the variable gcsescore
Use exploratory data analysis and multilevel modelling to investigate the variability in A-level Chemistry scores across students, schools and LEAs, and how this varies by gender, age and the GCSE score of the students. Things to consider in your analysis include:
● Which of the potential covariates are required as fixed effects?
● How many levels are required?
● Which if any of the potential covariates require a random slope or coefficient?
● Whether school mean GCSE score should be included in the model as a contextual variable, and if so, whether there should also be a cross-level interaction between it and student GCSE score.
● Whether, and if so how, gender might be included as a contextual variable.
● To what extent the assumptions for the selected model hold.
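The contextual-variable questions in the list above rest on splitting a student-level covariate into a school mean and a within-school deviation. The following is a minimal sketch in Python using only the standard library; the toy rows and the derived names `gcse_school_mean` and `gcse_within` are illustrative assumptions rather than columns of the dataset, and the same step could equally be carried out in MLwiN, R or STATA.

```python
from collections import defaultdict
from statistics import mean


def add_school_mean(rows, group_key="school", value_key="gcsescore"):
    """Return copies of rows with the group mean and within-group deviation added."""
    # Collect the covariate values observed in each school.
    values_by_group = defaultdict(list)
    for row in rows:
        values_by_group[row[group_key]].append(row[value_key])
    group_means = {g: mean(v) for g, v in values_by_group.items()}

    augmented_rows = []
    for row in rows:
        school_mean = group_means[row[group_key]]
        new_row = dict(row)  # copy so the input rows are left unchanged
        new_row["gcse_school_mean"] = school_mean              # contextual variable
        new_row["gcse_within"] = row[value_key] - school_mean  # within-school deviation
        augmented_rows.append(new_row)
    return augmented_rows


# Toy data with hypothetical values, not taken from the real dataset.
toy_rows = [
    {"school": 1, "gcsescore": 5.0},
    {"school": 1, "gcsescore": 7.0},
    {"school": 2, "gcsescore": 4.0},
    {"school": 2, "gcsescore": 6.0},
    {"school": 2, "gcsescore": 8.0},
]
augmented = add_school_mean(toy_rows)
```

Because gender is coded 0/1, calling the same function with `value_key="gender"` would give the proportion of male students in each school, which is one possible way of treating gender as a contextual variable (the output keys would then need renaming).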
The results of your work should be presented in a report of at most 9 pages. The report should contain a few important tables and figures that are discussed in the text. Short appendices containing additional figures and tables, but few words, may be included provided the need for each appendix is justified in the text.
Below is an outline of the marking scheme so that you can assess the important elements for your report:
● Introduction — 5 marks
● Outline of the methods used — 5 marks
● Data description and exploratory data analysis — 10 marks
● Model selection and assessment — 20 marks
● Presentation and interpretation of results — 10 marks
● Conclusions — 5 marks
● General presentation of the report — 5 marks
Task 2 [25 marks]
Maximum 4 pages plus short appendices
For this task, you are not required to carry out a data analysis using statistical software. You are free to use any tool you like for the calculations where necessary.
Brief information about the dataset for this task is given below.
Discovery Day is a day set aside by the United States Naval Postgraduate School in Monterey, California, to invite the general public into its laboratories. On Discovery Day, 21 October 1995, data on reaction time and hand-eye coordination were collected on 108 members of the public who visited the Human Systems Integration Laboratory. The age and sex of each subject were also recorded. One experiment which demonstrates motor learning and hand-eye coordination is rotary pursuit tracking. The equipment used has a rotating disk with a 3/4” target spot. The subject’s task is to maintain contact with the target spot with a metal wand. Trials were conducted for 15 seconds at a time, and the total contact time during the 15 seconds was recorded. Four trials were recorded for each of 108 subjects. The target spot on the Circle tracker keeps constant speed in a circular path. The target spot on the Box tracker has varying speeds as it traverses the box, making the task potentially more difficult.
The variables in this dataset that are relevant for this task are listed below:
time: Measurement occasion taking values in {0, 1, 2, 3}
gender: 0 if Male, and 1 if Female
cage: Age of subject centred to the overall average age
shape: 0 if Circle, and 1 if Box
score_sqrt: Square root of score (outcome variable)
time_sq: Square of time
cage_sq: Square of cage
Some outputs from an exploratory data analysis (EDA) based on this dataset are provided below.
Table 1: Sample means and standard deviations (within parentheses) of the square root of score
Group   | time=0      | time=1      | time=2      | time=3
Overall | 1.61 (0.67) | 1.77 (0.68) | 1.84 (0.70) | 1.90 (0.73)
Male    | 1.68 (0.69) | 1.88 (0.67) | 1.97 (0.71) | 2.03 (0.75)
Female  | 1.48 (0.63) | 1.58 (0.66) | 1.62 (0.65) | 1.68 (0.63)
Circle  | 1.78 (0.81) | 1.81 (0.80) | 1.90 (0.85) | 1.96 (0.85)
Box     | 1.51 (0.56) | 1.74 (0.61) | 1.81 (0.61) | 1.86 (0.66)
Figure 1: Score sqrt versus centred age (on the left); individual profiles (on the right)
Answer the following questions.
Q.1. Using the information about the dataset as well as the results from the EDA provided above (Table 1 and Figure 1), explain what kind of data you have, including the specification of the hierarchical levels, comment on potential covariates for the outcome variable, and justify the method of analysis that you think may be suitable for these data. [4 marks]
Q.2. The outputs from two empty models are presented in Table 2. Specify the covariance structures assumed under these models. Comment on which model among the two might be more reasonable for this particular data, and explain why by discussing the merits of the model you have chosen over the other one. [4 marks]
Table 2: Outputs from two empty models. Standard errors within the parentheses
Model                                      | Parameter                          | Estimate
Linear regression model                    | intercept                          | 1.777 (0.034)
Linear regression model                    | residual variance: $\sigma_e^2$    | 0.492
Marginal model with exchangeable structure | intercept                          | 1.777 (0.064)
Marginal model with exchangeable structure | variance parameter: $\sigma_e^2$   | 0.490
Marginal model with exchangeable structure | variance parameter: $\rho$         | 0.882
N.B. The variance function for the marginal model is re-parametrised as $(\sigma_e^2, \rho)$, with $\sigma_e^2 \equiv \sigma^2 + \sigma_1^2$, where $\sigma^2$ and $\sigma_1^2$ are the notations used in Lecture slide 24 of Section 6.
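To make this re-parametrisation concrete, the sketch below builds the covariance matrix implied by a total variance and a common correlation, and the standard error of the intercept that follows in an empty model. It assumes a balanced design with all 108 × 4 observations present, and it reads the second variance parameter as the common within-subject correlation; the function names are illustrative only.

```python
import math


def exchangeable_cov(total_var, rho, n_times):
    """Covariance matrix with total_var on the diagonal and rho * total_var elsewhere."""
    return [
        [total_var if i == j else rho * total_var for j in range(n_times)]
        for i in range(n_times)
    ]


def se_intercept(total_var, rho, n_subjects, n_times):
    """Standard error of the grand mean in a balanced design with exchangeable errors.

    Var(grand mean) = total_var * (1 + (n_times - 1) * rho) / (n_subjects * n_times).
    """
    n_obs = n_subjects * n_times
    return math.sqrt(total_var * (1 + (n_times - 1) * rho) / n_obs)


# Estimates copied from Table 2; 108 subjects with 4 trials each.
cov = exchangeable_cov(0.490, 0.882, 4)
se_marginal = se_intercept(0.490, 0.882, 108, 4)
se_independent = se_intercept(0.492, 0.0, 108, 4)  # rho = 0: independent observations
```

Under these assumptions, `se_marginal` and `se_independent` round to 0.064 and 0.034 respectively, in line with the standard errors reported for the intercept in Table 2.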
Q.3. Two multilevel models and marginal models with several correlation structures are fitted to the data. All the outputs from model fitting can be found in Tables 3–5. Choose a model among them, and justify your choice. [5 marks]
Q.4. Write the regression equation and the model assumptions for the model you have chosen in Question 3 (Q.3.). [3 marks]
Q.5. Calculate the variance-covariance and correlation matrices under the model you have chosen in Q.3. Comment on the covariance structure. [4 marks]
Q.6. Using the results in Table 4, interpret all the fitted regression coefficients of the model you have chosen in Q.3. [5 marks]
Table 3: Goodness-of-fit statistics and the number of parameters for multilevel models and marginal models with different covariance structures
Model/Structure                  | -2*LogLik | AIC    | BIC    | No. of parameters
Random intercept (RI)            | 191.93    | 215.93 | 264.47 | 12
Random slope (RS)                | 188.18    | 216.18 | 272.81 | 14
Compound symmetry (CS)           | 191.93    | 215.93 | 264.47 | 12
Heterogeneous CS                 | 190.61    | 220.61 | 281.28 | 15
First-order autoregressive (AR1) | 213.70    | 237.70 | 286.24 | 12
Heterogeneous AR1                | 212.87    | 242.87 | 303.54 | 15
Unstructured (Unstr.)            | 182.90    | 222.90 | 303.80 | 20
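The AIC column of Table 3 can be cross-checked from the other columns, since AIC = -2 log-likelihood + 2 × (number of parameters). The sketch below recomputes it from the tabulated values; the dictionary layout and the function name are illustrative choices. BIC is not recomputed here because it also depends on a sample-size term that the table does not report.

```python
# (-2 * log-likelihood, reported AIC, number of parameters), copied from Table 3.
TABLE3 = {
    "Random intercept (RI)": (191.93, 215.93, 12),
    "Random slope (RS)": (188.18, 216.18, 14),
    "Compound symmetry (CS)": (191.93, 215.93, 12),
    "Heterogeneous CS": (190.61, 220.61, 15),
    "First-order autoregressive (AR1)": (213.70, 237.70, 12),
    "Heterogeneous AR1": (212.87, 242.87, 15),
    "Unstructured (Unstr.)": (182.90, 222.90, 20),
}


def aic(minus2_loglik, n_params):
    """Akaike information criterion from -2 log-likelihood and the parameter count."""
    return minus2_loglik + 2 * n_params


recomputed = {name: aic(m2ll, k) for name, (m2ll, _, k) in TABLE3.items()}

# Largest absolute gap between the recomputed and the reported AIC values.
max_abs_diff = max(abs(recomputed[name] - TABLE3[name][1]) for name in TABLE3)
```

A `max_abs_diff` close to zero indicates that the reported AIC values are consistent with the tabulated -2 log-likelihoods and parameter counts.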
2023-04-14