
PUBH2005 Assessment 1: Data analysis (assessing weeks 2-4)

March 2022

Information about this assessment:  

· This assessment is worth 30% of your overall grade. There are a total of 30 marks available, so each 1 mark corresponds to 1% of your overall grade. The number of marks each question is worth is shown, and each question is given a difficulty rating (easy, medium or hard), so that you know how difficult it should be. I have ensured that 18 of the 30 marks (60% of the total marks) are for questions which very closely resemble what we have already done in class (rated easy), so that passing this assessment is very possible for everyone so long as you have kept up with the class activities and lectures.

· The assessment is structured just like your class activities and uses the same Synthea data sets we have been using in class. The assessment is structured like an analysis, in which some sections are done for you (and you simply cut and paste the code into your own R code), while other sections are for you to complete. For each question, a space is provided in which you must include ALL your R code used to answer the question, all output requested (including plots) and any interpretation which you must type yourself (IN THAT ORDER). Your answer should fit inside the space provided.

· Part marks are available in the case that you are unable to produce the correct answer, but you can demonstrate that you understood what was being asked and did some things correctly. Thus, it is worth attempting every question. The questions are related to the lectures and exercises of weeks 2, 3 and 4. The assessment is broken into three sections (parts), each worth 10 marks.

· Questions regarding the assessment will be permitted ONLY if they are not directly related to marks awarded in a question. Questions must be asked on Blackboard so that all students have access to the answer. Other students should not answer questions on Blackboard as you may give away an answer.

· One last thing to note: you will notice that in the assessment we frequently change the data set (e.g., "now let's load the conditions data again"). Some people may find this annoying, but it is necessary to ensure that if people make mistakes at some point, those mistakes do not follow through the entire assessment. Let's get started.

· The assessment was released 1 pm Thursday 23rd March and you have exactly two weeks to complete it. It must be submitted by 1 pm Thursday 6th April.

· Assessments handed in after this time will receive late penalties as detailed in the unit outline. If you require an extension you must apply NOW and have a legitimate reason.

The use of generative AI (e.g., ChatGPT) in this assessment:  

· The use of these technologies is allowed, as they will play a central role in programming in your employment, and to some degree they simply replace the methods we previously used to find code (i.e., internet forums such as Stack Exchange).

· I have written this assessment in a way which is aimed at ensuring ChatGPT can’t simply answer the questions without you knowing what you’re doing (i.e., to prevent you simply pasting the questions into ChatGPT and getting a ‘full marks’ answer). Hence, it is intended that ChatGPT may play an assistive role. This has been done primarily in four ways:

1. Questions ask you to use methods from “week #” to solve a problem, so that if you simply paste this question into ChatGPT it will likely use a different method and you will lose part marks.

2. ChatGPT can give you code, but it can't analyse data, nor can it interpret the results from a table or plot; thus you will still need to be able to produce the results and interpret them yourself.

3. Some of the explanation in the questions has been removed/reduced, because it was too informative to ChatGPT.

4. Where I have asked you to explain code, I have requested you use your own words. So if you just cut-and-paste an explanation from ChatGPT you will lose marks.

· Although I can't stop you using ChatGPT, if you are hoping to continue in the data specialization, you should be focused on learning the R code and concepts for yourself. Thus, I would encourage you to not use ChatGPT, or to only use it to confirm your solution once you have solved it yourself.

Introduction

In this assessment we pretend that you are a data analyst at the Department of Health during an outbreak of a novel viral infection (COVID-19). The manager of your unit has asked that you apply your skills to health administrative data to provide information on this novel disease. The results will be used to assist clinicians in treating patients and help public health officials set priorities.

Part one: Data management and missing data

The first step in any analysis is to explore the data, check for missing values and clean or reformat as necessary. Let's start by loading the patient data, storing it in an R object called patient.data, and removing patients who did not have COVID-19. Don't forget to set your working directory before running the code below.

patient.data<-read.csv("Synthea_patient_covid.csv", na.strings = "")
patient.data<-patient.data[which(patient.data$covid_status==1),]

Question 1 (easy - 2 marks)

The first thing you want to know is how many episodes of care (remember, a row represents an episode of care, not a patient) ended in the patient's death. Use the table() function to find and report this number (paste your code and output, and write a statement such as "In total XX patients died").

Question 2 (easy - 2 marks)

Next you want to know how many missing values each variable has. Use the same method and code from week 3 to obtain these results (note the code is complex, but all you need to do is change the name of the data object). Underneath your output, list the variables which do have some missing data:

Question 3 (easy - 2 marks)

One of the variables which had missing data was DEATHDATE. However, this is expected, because only patients who died will have a value on this variable. There could still be genuine missing values, though, if any patients who died don't have a value for DEATHDATE. Make a table showing death status by missingness on DEATHDATE and report how many (if any) patients who died were missing a DEATHDATE (remember is.na() will return the values TRUE or FALSE for missingness).

Question 4 (easy - 3 marks)

Our data does not include a variable for age, but we can make one by running the following code (run it now):

patient.data$dob<-as.Date(patient.data$BIRTHDATE, tryFormats = "%d/%m/%Y")
patient.data$enc.date<-as.Date(patient.data$DATE, tryFormats = "%d/%m/%Y")
patient.data$age<-as.numeric((patient.data$enc.date-patient.data$dob)/365.25)
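To see what this date arithmetic is doing, here is a small stand-alone sketch using made-up dates (the dates below are hypothetical, not taken from the data):

```r
# Hypothetical dates, purely to illustrate the arithmetic used above
dob      <- as.Date("15/06/1950", tryFormats = "%d/%m/%Y")
enc.date <- as.Date("01/03/2020", tryFormats = "%d/%m/%Y")
enc.date - dob                         # a difftime, measured in days
as.numeric((enc.date - dob) / 365.25)  # age in years (about 69.7)
```

Dividing by 365.25 converts days into years while approximately accounting for leap years.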

Now, use the hist() and summary() functions to plot and summarise the age variable (include your plot and code below). Report the mean and median for the age distribution, and the age group (in groups of ten years) which occurs most frequently in the data.

Question 5 (medium - 2 marks)

As the pandemic has progressed it has become clear that males are at increased risk of death compared with females. But before investigating whether that is true in our data, we first need to see if age is distributed differently by gender (i.e., is one group older than the other?). To answer this question, make a box plot of age by gender and interpret the plot with regard to the distribution of age by gender.

Question 6 (hard - 2 marks)

Your manager also wants to know the COVID death rate (percentage of patients who died) in your sample. Do this calculation now and report the percentage who died (hint: you will find the unique() function to be helpful).

Part two: Prediction models

In this section, we will start again. The code below first removes any data from R, then loads the three Synthea data sets and the rpart and rpart.plot packages. NOTE: if you have closed R since the last activity you will need to set your working directory again.

rm(list=ls())
patient.data<-read.csv("Synthea_patient_covid.csv", na.strings = "")
obs.data<-read.csv("Synthea_observations_covid.csv", na.strings = "")
con.data<-read.csv("Synthea_conditions_covid.csv", na.strings = "")
library(rpart)

library(rpart.plot)

Now, you want to build a prediction model of death among patients with COVID-19. This will help clinicians know which patients are at greatest risk. The code below reformats our three data sets into one new data set, pred.data, which is ready to use for prediction modelling (use this code to prepare your data):

full.data<-Reduce(merge, list(patient.data,obs.data,con.data))
full.data<-full.data[which(full.data$covid_status==1),]
pred.data<-full.data[,c(5,8, 10:11,30,35,43,45,62,64:66,73,75:76,98,108,
                    115,117,122,134,146,148,173,176,125,194:195,209,217,218,
                    221:222,225:227,235,236)]
pred.data$GENDER<-as.factor(pred.data$GENDER)
pred.data$MARITAL<-as.factor(pred.data$MARITAL)
pred.data$ETHNICITY<-as.factor(pred.data$ETHNICITY)

Question 7 (easy - 3 marks) 

Explain what each line of the code above is doing and why we took these steps to produce our pred.data object. Explain it line-by-line, in your own words (e.g., Line 1 does A, B and C; Line 2 does X, Y and Z; and so on…).

This data is very similar to the data we prepared in our week 4 exercises, with a few important differences. Here we are predicting death among the COVID+ sample. We have also removed the RACE variable to save time (i.e., we don't have to dummy code). Before we can run our model we have one last step: we must separate our data into a training and a testing set (run the following code).

set.seed(36457)
pred.data$random<-runif(nrow(pred.data))
train.data<-pred.data[which(pred.data$random<=.7),]
test.data<-pred.data[which(pred.data$random>.7),]
train.data<-train.data[,-39]
test.data<-test.data[,-39]

Next we can run a simple classification tree by running the rpart() function with the default settings as below, saving the model in a new object called tree.mod.

set.seed(73525)
tree.mod<-rpart(formula=dead~.,data=train.data)

Question 8 (medium - 2 marks)

Plot the tree diagram using the package rpart.plot, and underneath the plot write a statement describing which patients are at greatest risk of dying from COVID-19 and what their probability of death is. Which patients are at the least risk of dying from COVID-19, and what is their risk of dying?

Next, we have to obtain the predicted probability of death for each row in our data using the predict() function. We then have to choose a cut-off value for this probability, above which we say a patient is predicted to die from COVID-19. You decide to use a cut-off value of 0.5, so if any patient has a predicted probability above 0.5 then you will say your model predicts that they will die. Run the code below, which does these steps for you:

train.data$pred.prob<-predict(tree.mod, train.data)
train.data$pred.out<-0 
train.data$pred.out[train.data$pred.prob>=0.5]<-1

Lastly, as you will recall from the week 4 notes, we can calculate the sensitivity and specificity of our model from the cross-classification table.

table(predicted=train.data$pred.out,actual=train.data$dead)

##          actual
## predicted    0    1
##         0 8301   68
##         1   32  142

Remember the sensitivity is simply the number of deceased patients who were correctly predicted to pass away (142) divided by the total number who passed away (142+68). Likewise, the specificity is the number of patients who were correctly predicted to not die (8301) divided by the total number who did not die (8301+32).

This gives us: Sensitivity = 142/(142+68) = 0.68, and Specificity = 8301/(8301+32) = 0.996.
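If you prefer to let R do the arithmetic, the same two quantities can be computed directly from the counts in the table above:

```r
# Sensitivity and specificity from the cross-classification table above
sens <- 142 / (142 + 68)    # correctly predicted deaths / all deaths
spec <- 8301 / (8301 + 32)  # correctly predicted survivors / all survivors
round(sens, 2)   # 0.68
round(spec, 3)   # 0.996
```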

Question 9 (easy - 3 marks) 

Repeat the above process, but this time in the test.data, to obtain the sensitivity and specificity in the test data. By comparing the sensitivity and specificity between the train and test sets, what do you conclude with regards to 'model overfit'? Your first line of code is given below (remember you don't need to run the classification tree model again; start by generating the predicted probabilities in the test.data, which is done for you in the line of code below).

test.data$pred.prob<-predict(tree.mod, test.data)

Question 10 (easy - 1 mark) 

Your manager is not happy with your model's low sensitivity. S/he tells you to increase the probability threshold, so that only patients with a predicted probability above 0.9 will be classified as predicted to die. Why is this a bad idea in your model? (Note: this is an interpretation question - you don't need to run/present any R code - but experimenting with this cut-off may help you discover the answer).

Question 11 (hard - 2 marks) 

Your manager accepts your answer above, and now suggests you make a more complex model to improve the sensitivity. As discussed in the week 4 exercise, the cp argument (short for complexity parameter) controls how many splits our classification tree will have, by requiring each split to achieve a minimum improvement in classification of the outcome before that split is accepted. Large values of cp mean we require larger improvements to accept a new split, and will lead to a less complex model. Conversely, small values of cp mean we require smaller improvements to accept a new split, and will lead to a more complex model. The default value for cp is 0.01.
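As a syntax-only sketch (using the small kyphosis data set that ships with rpart, rather than our Synthea data), changing cp is just a matter of passing the argument to rpart():

```r
library(rpart)
# Default complexity parameter (cp = 0.01) versus a smaller value,
# which accepts splits with smaller improvements (a more complex tree)
mod.default <- rpart(Kyphosis ~ ., data = kyphosis)
mod.smallcp <- rpart(Kyphosis ~ ., data = kyphosis, cp = 0.005)
printcp(mod.smallcp)  # shows the splits each value of cp permitted
```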

Run a new classification tree in which you reduce the value of cp to one half the default value, and interpret the plot (do not change the default values for minsplit and minbucket as we did in the exercise). Using the same probability threshold as above (0.5), calculate the sensitivity and specificity for this new model in the test and train data. Explain whether the model is better or worse by considering both changes in model performance (i.e., sensitivity and specificity) and over-fit.

IMPORTANT: first you will need to remove the pred.prob and pred.out variables from the train.data and test.data. The plot should work OK if you increase the plot panel window - but if it doesn't, just include it anyway and I will know from your code whether you did it correctly.

Part three: Imputation of missing data

In this section we're going to bring together the ideas of weeks 3 and 4 to impute missing data using regression trees. Importantly, this time we'll do something we didn't yet attempt in class: use trees to predict a continuous outcome (thus we will use a regression tree, not a classification tree). Let's start by loading and preparing our data. We'll use the same data as in part two, except we'll keep the full sample of COVID positive and negative patients:

rm(list=ls())
library(rpart)
library(rpart.plot)
patient.data<-read.csv("Synthea_patient_covid.csv", na.strings = "")
obs.data<-read.csv("Synthea_observations_covid.csv", na.strings = "")
con.data<-read.csv("Synthea_conditions_covid.csv", na.strings = "")
full.data<-Reduce(merge, list(patient.data,obs.data,con.data))
full.data<-full.data[,c(4,5,8, 10:11,30,35,43,45,62,64:66,73,75:76,98,108,
                    115,117,122,134,146,148,173,176,125,194:195,209,217,218,
                    221:222,225:227,235,236)]
full.data$GENDER<-as.factor(full.data$GENDER)
full.data$MARITAL<-as.factor(full.data$MARITAL)
full.data$ETHNICITY<-as.factor(full.data$ETHNICITY)

Now, our goal is to provide an estimate of the increased heart rate in COVID patients compared with non-COVID patients. However, we note that our outcome variable Heart.rate has quite a lot of missing data [confirm this: table(is.na(full.data$Heart.rate))]. Considering how much other patient information we have that is related to Heart.rate, we think we can predict the values for those with missing data (also known as imputation). Furthermore, we can do this using our supervised machine learning technique from week 4, this time as a regression tree, which may build a good model for us.

Just as you did for part two, we now use rpart to build a prediction model (save it again as tree.mod), but this time you need to predict the continuous variable Heart.rate (also include cp=0.001 in the rpart function to make a more complex model). Further, just as before, use predict() to get the predicted Heart.rate according to the model, and save it in a new variable called pred.heart.rate. This is done for you in the code below:

set.seed(73525)
tree.mod<-rpart(formula=Heart.rate~.,data=full.data, cp=0.001)
full.data$pred.heart.rate<-predict(tree.mod, full.data)

Question 13 (easy - 2 marks)

Plot the tree using rpart.plot and in a few lines explain some interesting features. You don’t need to explain everything about the plot or all the subgroups, just explain some interesting features which describe how the tree predicted Heart.rate and what the numbers in the terminal nodes represent (remember the outcome is continuous now, not binary, so think carefully about what these numbers represent). You may need to make your plot window larger or save the image as a PDF to see it properly.

Question 14 (medium - 2 marks)

Produce a scatter plot comparing Heart.rate (x-axis) with pred.heart.rate (y-axis) and comment on what you see. Why does the plot look the way it does? Do you think the method we chose to create the imputations (regression trees) was a good one? Why or why not?

Question 15 (hard - 2 marks)

Last of all, you need to make a final variable impute.heart.rate which contains the actual value of Heart.rate for those without missing values and the predicted value pred.heart.rate for those with missing values (thus imputing the missing values). After you've done this, estimate the increased heart rate among COVID patients using t.test(), using both the original variable Heart.rate (i.e., listwise deletion) and the imputed variable impute.heart.rate. Are there any differences between the two estimates, and if so, is the difference important or trivial? Do you think doing imputation was necessary? Why?