


Research Data Management

Final Project

 

NOTE: Please submit your answers to questions 1-3 in a single SAS code file and your answer to question 4 in a Word document to the appropriate drop box on ICON. You will need to create a sub-folder in “P:Classes\Research Data Management\Final Project” and label it using your first and last name (e.g., my folder is “Scott Cleven”). This is where you will save the PDF created in questions 2 & 3 and your subset datasets from question 3. Do NOT include any output in the PDF that has not been explicitly requested. Make sure your folder is empty when you submit your code! Also, do not share the project dataset with anyone! This is a private dataset, not for public use.

 

 

Background:

For any disease or condition, developing appropriate biomarkers is an essential step in the process of creating a treatment or cure.  Developing biomarkers for neurological diseases is particularly difficult. Huntington’s Disease is one such neurological disorder. It’s a genetic disease, caused by having 36 or more repeats of the Cytosine, Adenine, and Guanine (CAG) nucleotide bases in the DNA of a prodromal individual. Naturally, since the disease is neurological, it might be interesting to see which regions of the brain are more heavily affected by Huntington’s Disease. Also, the classification of Huntington’s Disease is somewhat vague: does a prodromal with 36 CAG repeats have the same severity of symptoms as a prodromal with, say, 50 CAG repeats? This project aims to tackle these questions using methods learned from class.

The dataset given is under “P:Classes\Research Data Management\Final Project\HDData.csv”. The dataset has 2,474 observations. However, you’ll notice after reading it in that the dataset only has 1,075 unique subjects and that each subject may have been measured multiple times. In this project we only focus on three sections of the brain. The variables in the dataset are as follows:

Variable        Description
SUBJID          Subject’s unique ID number
AGE             Subject’s age at time of visit
CAG             Subject’s CAG repeat length
putamen         Volume of the subject’s putamen portion of the brain
caudate         Volume of the subject’s caudate nucleus portion of the brain
hippocampus     Volume of the subject’s hippocampus portion of the brain

 

The goal of this project is to see specifically how Huntington’s Disease affects brain volume in each of these three regions. To speak in more statistical terms: our null hypothesis is that the brain volume of a control is equal to the brain volume of a prodromal patient. What we actually believe (our alternative hypothesis) is that neurodegeneration affects prodromals more than controls (having Huntington’s Disease shrinks or degrades your brain more than not having it).

 

Question 1:

A) Read in the dataset. (I advise you to proc print it and compare the output to the csv file I have on ICON, NOT the one in the shared folder.)
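One possible way to read the file is PROC IMPORT. The sketch below is only an illustration: the output dataset name hd is a hypothetical choice, the path is copied as written in this handout, and it assumes the first row of the CSV holds the variable names.

```sas
/* Path copied verbatim from the handout; adjust it if your drive mapping differs. */
/* The output dataset name "hd" is a hypothetical choice.                          */
proc import datafile="P:Classes\Research Data Management\Final Project\HDData.csv"
            out=hd
            dbms=csv
            replace;
    getnames=yes;      /* assumes the first CSV row contains the variable names */
    guessingrows=max;  /* scan every row before choosing variable types         */
run;

/* Print the first rows for a visual check against the CSV file */
proc print data=hd(obs=20);
run;
```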

B) You’ll notice the dataset is in the long format. To keep this consistent with the long format datasets we’ve used in the past, create a new variable called visitNum that numbers each patient’s visits in order (1 for the first time their brain data was recorded, 2 for the second, and so on). Assume the CSV dataset is pre-sorted by date of visit.
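A within-subject counter is commonly built with BY-group processing. The sketch below uses a tiny made-up dataset (the SUBJID and AGE values are hypothetical, not taken from HDData.csv), and the dataset names hd, hd_sorted, and hd_visits are also hypothetical. It relies on PROC SORT's default EQUALS behavior, which keeps rows in their original relative order within each SUBJID, so the visit order from the file is preserved.

```sas
/* Hypothetical toy data standing in for the imported file */
data hd;
    input SUBJID AGE;
    datalines;
101 40
102 35
101 41
102 36
101 42
;
run;

/* Group rows by subject; relative row order within a subject is kept */
proc sort data=hd out=hd_sorted;
    by SUBJID;
run;

data hd_visits;
    set hd_sorted;
    by SUBJID;
    if first.SUBJID then visitNum = 0;  /* reset at each new subject        */
    visitNum + 1;                       /* sum statement: retained counter  */
run;

proc print data=hd_visits;
run;
```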

C) Create a new variable called prodromal that is 1 when the patient is prodromal and 0 when the patient is a control.
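As a minimal sketch, an indicator like this can be derived from a logical comparison. The cutoff of 36 repeats is taken from the Background section; the toy CAG values, the dataset names, and the choice to leave prodromal missing when CAG is missing are all assumptions made for illustration.

```sas
/* Hypothetical toy data; CAG values are made up */
data hd;
    input SUBJID CAG;
    datalines;
101 42
102 20
103 36
104 .
;
run;

data hd_flagged;
    set hd;
    /* A logical expression evaluates to 1 (true) or 0 (false) in SAS */
    if missing(CAG) then prodromal = .;   /* assumption: keep unknowns missing */
    else prodromal = (CAG >= 36);
run;

proc print data=hd_flagged;
run;
```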

D) Create another new variable called age_group that sections people into one of three age groups using the 33% and 67% quantiles as cutoffs (less than the 33% quantile is group 1, greater than or equal to the 33% quantile but less than the 67% quantile is group 2, and the rest are group 3).

a. This part can be kind of tricky. Try using this formula to create age_group:

b. age_group = floor(rank*k/(n+1)); where:

i. n is the total number of observations (You can use 2474 because we know it, but good coders will find another, generic, way of finding n)

ii. rank is the ordered position of each observation’s age (rank of the minimum age’s observation is 1 and rank of the maximum age’s observation is n)

iii. k is the number of groups we want to create.

iv. *Note this formula is slightly different than what we want but a simple fix to it will get the results we desire.
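To get a feel for how the formula behaves, the data step below evaluates it as written for a handful of ranks, assuming n = 2474 and k = 3. The variable name raw_group and the particular ranks are illustrative choices. The results are written to the log.

```sas
data _null_;
    n = 2474;   /* total number of observations, hard-coded for the demo */
    k = 3;      /* number of groups */
    /* ranks near the start, the two cut points, and the end */
    do rank = 1, 824, 825, 1649, 1650, 2474;
        raw_group = floor(rank*k/(n+1));
        put rank= raw_group=;
    end;
run;
```

The formula as written produces values from 0 to 2, while the groups described in part D are numbered 1 to 3. In the real data the rank variable would still have to be computed first, for example with PROC RANK.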

E) In order to make our future proc reports look nicer, we’re going to have to create a new variable called region which has values “putamen”, “caudate”, and “hippocampus” and another new variable called volume which is the volume value of the respective region for that subject. This means: every subject will have 3 rows for every one of their visitNum or visit counts. So, if someone gets tested 5 times (visitNum = 1 – 5) they will have 15 rows all with their unique SUBJID.

a. You will then need to drop the old variables putamen, caudate, and hippocampus.
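One common pattern for this kind of reshaping is an array with an explicit OUTPUT inside a loop. The sketch below uses made-up volumes and hypothetical dataset names (hd_visits, hd_long); it is one possible approach, not necessarily the one intended for the course.

```sas
/* Hypothetical toy data: one row per visit; volumes are made up */
data hd_visits;
    input SUBJID visitNum putamen caudate hippocampus;
    datalines;
101 1 4.1 3.2 3.9
101 2 4.0 3.1 3.9
;
run;

data hd_long(drop=putamen caudate hippocampus i);
    set hd_visits;
    length region $11;                      /* "hippocampus" has 11 characters */
    array vols{3} putamen caudate hippocampus;
    array names{3} $11 _temporary_ ('putamen' 'caudate' 'hippocampus');
    do i = 1 to 3;
        region = names{i};
        volume = vols{i};
        output;                             /* one output row per region */
    end;
run;

proc print data=hd_long;
run;
```

Each input row becomes three output rows, so the two toy visits yield six rows.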

 

Question 2:

A) First off, copy and paste this code for use for questions 2 and 3:

 

PROC FORMAT;
    VALUE age_group 1 = "1Young"
                    2 = "2Middle"
                    3 = "3Old";

    VALUE prodromal 1 = "Prodromals"
                    0 = "Controls";

    VALUE CAG_group 1 = "1Low"
                    2 = "2Medium"
                    3 = "3High";
RUN;


a. What we’re doing here is creating three new formats to use in our proc reports later. You’ll eventually end up using them like this:

DEFINE prodromal / FORMAT = prodromal. …;

DEFINE age_group / FORMAT = age_group. …;

B) Use a proc report to compare the average brain size for each section of the brain between controls and prodromals within each age group. Also be sure to include the number of observations and the standard deviation for use in part C.

C) Create three new columns called “Young z-stat”, “Middle z-stat”, and “Old z-stat” that compare the controls’ average brain volume to the prodromals’ for each respective age group. They are calculated as follows, where c denotes controls and p denotes prodromals. Note that you’ll need to change every value in the formula for each column, since the calculation is done once for each of the three age groups:
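The formula itself does not appear in this copy of the handout. A standard two-sample z-statistic that fits the description (it uses each group's mean, standard deviation, and number of observations) is given here as an assumption, not as the instructor's confirmed formula. The symbols \bar{x} (mean), s (standard deviation), and n (number of observations) are notation introduced for illustration:

```latex
z = \frac{\bar{x}_c - \bar{x}_p}{\sqrt{\dfrac{s_c^2}{n_c} + \dfrac{s_p^2}{n_p}}}
```

With controls first in the subtraction, a positive z means the controls have the larger mean volume in that age group.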

 

D) Add a different-shaded final row for the overall mean and standard deviation for all brain locations.

E) Make sure all labels read nicely and that you have an appropriate title for your report. You can format the mean and standard deviation variables as 8.7 for the whole assignment.

 

Question 3:

A) Convert our dataset from long format to wide format, but there’s a catch. This is going to be a little different from our examples from class, since our class examples had minimal visit numbers and had only one changing variable between those visit numbers. That’s not the case for this dataset, and instead of creating 20 new variables (5 for each brain area and 5 for age, since there are at most 5 visits), I’d rather create 4 new summary variables about those brain areas and age. So, convert from long to wide format by calculating each subject’s average age, average putamen volume, average caudate volume, and average hippocampus volume.

a. You can call these variables ave_age, ave_put, ave_cau, and ave_hipp.

b. Also create a variable called totalVisits. I won’t actually have you use this in any future parts, but it will probably help for this part of question 3.

c. In addition to the variables listed above, you’ll only want to keep the SUBJID, CAG, and prodromal variables.

d. *Note, I’m calling this wide format, but the dataset will probably end up being thinner than your updated long format dataset.
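Per-subject averages like these can be computed in a single PROC SQL step. The sketch below is a minimal illustration on made-up values; it assumes a one-row-per-visit dataset (hypothetically named hd_visits) that still holds the three volume columns and the prodromal flag, and it assumes CAG and prodromal are constant within a subject.

```sas
/* Hypothetical toy data: one row per visit; all values are made up */
data hd_visits;
    input SUBJID AGE CAG prodromal putamen caudate hippocampus;
    datalines;
101 40 42 1 4.1 3.2 3.9
101 41 42 1 4.0 3.1 3.9
102 35 20 0 4.6 3.8 4.0
;
run;

proc sql;
    create table hd_wide as
    select SUBJID,
           max(CAG)          as CAG,        /* assumed constant per subject */
           max(prodromal)    as prodromal,  /* assumed constant per subject */
           count(*)          as totalVisits,
           mean(AGE)         as ave_age,
           mean(putamen)     as ave_put,
           mean(caudate)     as ave_cau,
           mean(hippocampus) as ave_hipp
    from hd_visits
    group by SUBJID;
quit;

proc print data=hd_wide;
run;
```

The result has one row per subject. PROC MEANS with a CLASS or BY statement would be an alternative route to the same summary.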

B) Re-create the age_group variable using ave_age. (This should be fairly simple if you figured it out earlier).

C) Create two new smaller datasets from your wide dataset called controls and prodromals containing only their respective observations. (You should drop the prodromal variable from each of them)
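A single data step can write to two datasets at once. The sketch below assumes a wide dataset hypothetically named hd_wide with prodromal coded 1/0; the toy values are made up.

```sas
/* Hypothetical toy data standing in for the wide dataset */
data hd_wide;
    input SUBJID CAG prodromal ave_age;
    datalines;
101 42 1 40.5
102 20 0 35.0
103 45 1 52.0
;
run;

/* Route each row to the matching dataset and drop the flag from both */
data controls(drop=prodromal) prodromals(drop=prodromal);
    set hd_wide;
    if prodromal = 1 then output prodromals;
    else if prodromal = 0 then output controls;
run;
```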

D) In the prodromals dataset, create a new variable called CAG_group that will be created similarly to age_group. There will be three groups using the 33% and 67% quantiles as cutoffs (less than the 33% quantile is group 1, greater than or equal to the 33% quantile but less than the 67% quantile is group 2, and the rest are group 3).

E) Re-create the region and volume variables from the long dataset for both datasets.

F) Create a proc report comparing the mean brain sizes for each age group in the controls dataset with the shaded row for overall sample size, mean, and standard deviation at the bottom. Use proper formatting and give me a good title for a good looking report.

G) Create a proc report comparing the mean brain sizes for each age group within each cag group for prodromals. Include the overall row at the bottom as well. Use proper formatting and give me a good title for a good looking report.

H) Create another proc report of only the 9 columns of z-stats comparing each age group between each CAG group. I ONLY want this report to have the z-stats columns and region rows (the overall row at the bottom is fine too), nothing else. Clearly label your z-stats so I know which comparisons they correspond to. Use proper formatting and give me a good title for a good looking report.

 

Question 4:

Complete all written parts of question 4 in a Word document and upload it with your SAS code on ICON.  

Question 4 is about drawing the conclusions talked about in the background of the assignment. You will need to know that any z-statistic larger than 1.96 in absolute value (that is, greater than 1.96 or less than -1.96) is considered statistically significant at the .05 alpha level. This means that if the z-statistic you calculated between any 2 groups falls outside that range, those two groups are considered significantly different from one another; in more specific terms, one group has a significantly larger brain volume than the other group. The group with the larger brain is the one you labeled first in your subtraction if your z-statistic is positive, and the one you labeled second if it’s negative. Use this to help you answer the following:

A) Based on your proc report from question 2, what do you notice about the overall comparison of brain volume in prodromals compared to controls?

B) Does your conclusion to part (A) hold true for each part of the brain? Which, if any, parts of the brain stand out as different?

C) Based on your prodromal proc reports from question 3, what do you notice about the overall comparison of brain volume in prodromals with smaller CAG repeat lengths to those with longer CAG repeat lengths?

D) Does your conclusion to part (C) hold true for each part of the brain? Which, if any, parts of the brain stand out as different?

E) Compare the control columns of the long dataset proc report to the control-wide dataset proc report (Just a simple visual comparison I’m not asking you to calculate z-statistics for this one). Do they look different? Does your answer make sense?

F) Check your answers to parts (B) and (D) with the internet. In other words, look for the symptoms (or lack thereof) of Huntington’s Disease on the internet and compare them to the parts of the brain I gave you. You will also need to research those parts of the brain to see what their main function(s) are. Then conclude, should Huntington’s Disease significantly affect those parts of the brain ultimately creating the symptoms that it does?

Question 5:

A) Go back and save all proc reports (4 of them) to one single PDF file in your folder. Don’t include anything else in your PDF.

B) Go back and save the controls and prodromals datasets in your folder as well.

C) Be sure to delete everything by the due date so I can run your code myself and see if you save them properly.
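Saving report output to a PDF and datasets to a folder is typically done with ODS PDF and a LIBNAME statement. In the sketch below, the folder name "First Last", the PDF file name, the macro variable outdir, and the libref proj are all placeholders chosen for illustration, and the datasets controls and prodromals are assumed to exist already in WORK.

```sas
/* Placeholder folder: replace "First Last" with your own sub-folder name */
%let outdir = P:Classes\Research Data Management\Final Project\First Last;

/* Everything printed between these two ODS statements goes into the PDF */
ods pdf file="&outdir\final_project_reports.pdf";

    /* ... the PROC REPORT steps go here, and nothing else ... */

ods pdf close;

/* Permanent copies of the two subset datasets in the same folder */
libname proj "&outdir";

data proj.controls;
    set controls;
run;

data proj.prodromals;
    set prodromals;
run;
```

Keeping only the PROC REPORT steps between the ODS PDF statements is one way to make sure no other output lands in the file.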