Statistics for Data Analysis Coursework 2
2021/2022
1. The dataset used for this question comes from 97 men with prostate cancer who were due to receive a radical prostatectomy. Before proceeding with the operations, the doctors wanted to find out how prostate specific antigen is affected by the other variables measured in the study. Table 1 lists each variable together with a detailed description. Table 2 displays the variance inflation factor for each predictor.
Acronym | Description
lcavol | log(cancer volume)
lweight | log(prostate weight)
age | age of patient
lbph | log(benign prostatic hyperplasia amount)
svi | seminal vesicle invasion (1 for having the invasion, 0 otherwise)
lcp | log(capsular penetration)
gleason | Gleason rank (6, 7, 8, 9)
pgg45 | combined percentage of Gleason patterns 4 or 5
lpsa | log(prostate specific antigen)
Table 1: Prostate data: Description of the variables
lcavol | lweight | age | lbph | svi | lcp | gleason | pgg45
2.05 | 1.36 | 1.32 | 1.37 | 1.95 | 3.10 | 2.47 | 2.97
Table 2: Prostate Data - Variance Inflation Factors (VIF)
(a) Why do you think the doctors considered the logarithms of most continuous variables instead of the actual measurements?
[2 marks]
(b) What is a Variance Inflation Factor (VIF)? What is it used for, and what is the cut-off value that causes concern? Is there anything to worry about based on the VIF values in Table 2?
[4 marks]
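As a reading aid for Table 2, the sketch below uses the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors, and inverts it to show how much of each predictor's variance the others would explain. The values are copied from Table 2 in the order printed. The threshold of 10 is a common rule of thumb (some texts use 5) and is an assumption here, not something the coursework states.

```python
# VIF values copied from Table 2, in the order they are printed.
table2_vifs = [2.05, 1.36, 1.32, 1.37, 1.95, 3.10, 2.47, 2.97]

# Assumed rule-of-thumb cut-off; the coursework does not state one.
RULE_OF_THUMB = 10.0


def implied_r_squared(vif):
    """Invert VIF = 1 / (1 - R^2) to recover the auxiliary-regression R^2."""
    return 1.0 - 1.0 / vif


implied = [round(implied_r_squared(v), 3) for v in table2_vifs]
largest = max(table2_vifs)
exceeds = [v for v in table2_vifs if v > RULE_OF_THUMB]

for vif, r2 in zip(table2_vifs, implied):
    print(f"VIF = {vif:.2f}  ->  implied R^2 = {r2:.3f}")
print(f"largest VIF: {largest}, values above the assumed cut-off: {exceeds}")
```

The square root of a VIF gives the factor by which the standard error of that coefficient is inflated relative to the uncorrelated case.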
A medical student "fitted" the multiple linear regression model to this data, without taking into account the fact that the variables svi and gleason are not really quantitative. Table 3 displays the least squares estimates of this model (m1).
(c) Now focus on the variables svi and gleason. Do the estimates make sense? Can you say with certainty how a change from ranking 6 to ranking 7 affects prostate specific antigens and how that differs from comparing ranking 6 to ranking 8? Explain your answer.
[4 marks]
Variable | Estimates | St.Error | pvalue
Intercept | 0.66 | 1.29 | 0.60
lcavol | 0.58 | 0.088 | 0.02
lweight | 0.45 | 0.17 | 0.01
age | -0.02 | 0.01 | 0.08
lbph | 0.11 | 0.06 | 0.07
svi | 0.77 | 0.24 | 0.01
lcp | -0.10 | 0.09 | 0.25
gleason | 0.05 | 0.16 | 0.77
pgg45 | -0.005 | 0.004 | 0.31
Table 3: Estimated Coefficients of m1
(d) The 3rd column of Table 3 is titled "St.Error". What is the relevance of these values? What do these values mean? What are they used for?
[3 marks]
(e) Calculate the t-statistics for the hypotheses about age and lcp. State the null hypothesis (H0) and the alternative hypothesis (Ha), and clearly state your conclusion.
[5 marks]
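The arithmetic behind part (e) can be sketched as follows, assuming the usual two-sided test of H0: beta_j = 0 against Ha: beta_j != 0, with t = estimate / standard error on n - p - 1 residual degrees of freedom. The inputs are the rounded entries of Table 3, so the ratios may not reproduce the reported p-values exactly. The variable names are illustrative.

```python
# Rounded estimates and standard errors copied from Table 3 (model m1).
table3 = {
    "age": {"estimate": -0.02, "std_error": 0.01},
    "lcp": {"estimate": -0.10, "std_error": 0.09},
}

n_obs = 97         # men in the study
n_predictors = 8   # lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45
residual_df = n_obs - n_predictors - 1  # one more is lost to the intercept


def t_statistic(estimate, std_error, null_value=0.0):
    """t = (estimate - hypothesised value) / standard error."""
    return (estimate - null_value) / std_error


t_stats = {name: t_statistic(**row) for name, row in table3.items()}

for name, t in t_stats.items():
    print(f"{name}: t = {t:.2f} on {residual_df} degrees of freedom")
```

Each statistic would then be compared with the critical value of a t distribution on the residual degrees of freedom at the chosen significance level.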
The lead investigator of this study intervened and fitted a new model, m2, in which the variables svi and gleason were treated differently from model m1. The estimated coefficients are displayed in Table 4.
Variable | Estimates | St.Error | pvalue
Intercept | 0.91 | 0.84 | 0.28
lcavol | 0.56 | 0.09 | 0.01
lweight | 0.47 | 0.17 | 0.01
age | -0.02 | 0.01 | 0.06
lbph | 0.10 | 0.06 | 0.09
svi-1 | 0.75 | 0.25 | 0.03
lcp | -0.13 | 0.10 | 0.19
gleason-7 | 0.27 | 0.21 | 0.22
gleason-8 | 0.50 | 0.77 | 0.52
gleason-9 | -0.06 | 0.50 | 0.91
pgg45 | -0.005 | 0.004 | 0.29
Table 4: Estimated Coefficients of m2
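The row labels gleason-7, gleason-8 and gleason-9 in Table 4 suggest that m2 treats gleason as a factor with indicator (dummy) variables. The sketch below assumes treatment coding with rank 6 as the reference level, which the coursework does not state explicitly, and contrasts the shift in predicted lpsa relative to rank 6 under m1 (a single slope from Table 3) and under m2 (one coefficient per level from Table 4). The function names are illustrative.

```python
GLEASON_LEVELS = [6, 7, 8, 9]  # ranks listed in Table 1
BASELINE = 6                   # assumed reference level (no gleason-6 row in Table 4)

M1_GLEASON_SLOPE = 0.05        # Table 3: gleason entered as a number
M2_GLEASON_EFFECTS = {         # Table 4: gleason entered as a factor
    "gleason-7": 0.27,
    "gleason-8": 0.50,
    "gleason-9": -0.06,
}


def treatment_code(rank):
    """Return the 0/1 indicator columns for one Gleason rank."""
    if rank not in GLEASON_LEVELS:
        raise ValueError(f"unknown Gleason rank: {rank}")
    return {
        f"gleason-{lvl}": int(rank == lvl)
        for lvl in GLEASON_LEVELS
        if lvl != BASELINE
    }


def m1_shift(rank):
    """Change in predicted lpsa versus rank 6 when gleason is numeric."""
    return M1_GLEASON_SLOPE * (rank - BASELINE)


def m2_shift(rank):
    """Change in predicted lpsa versus rank 6 when gleason is a factor."""
    return sum(
        M2_GLEASON_EFFECTS[name] * flag
        for name, flag in treatment_code(rank).items()
    )


for rank in GLEASON_LEVELS:
    print(rank, treatment_code(rank), round(m1_shift(rank), 2), round(m2_shift(rank), 2))
```

Under m1 every one-step increase in rank is forced to have the same effect, whereas under m2 each rank has its own estimated difference from the reference level.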
2021-12-16