


Coursework 2

Statistics for Data Analysis

2021/2022



1. The dataset used for this question comes from 97 men with prostate cancer who were due to receive a radical prostatectomy. Before proceeding with the operations, the doctors wanted to find out how prostate specific antigen is affected by the other variables measured in the study. Table 1 displays each variable together with a detailed description. Table 2 displays the variance inflation factor for each predictor.



Acronym    Description
lcavol     log(cancer volume)
lweight    log(prostate weight)
age        age of patient
lbph       log(benign prostatic hyperplasia amount)
svi        seminal vesicle invasion (1 for having the invasion, 0 otherwise)
lcp        log(capsular penetration)
gleason    Gleason rank (6, 7, 8, 9)
pgg45      combined percentage of Gleason patterns 4 or 5
lpsa       log(prostate specific antigen)

Table 1: Prostate data: Description of the variables



Variable   lcavol   lweight   age    lbph   svi    lcp    gleason   pgg45
VIF        2.05     1.36      1.32   1.37   1.95   3.10   2.47      2.97

Table 2: Prostate Data - Variance Inflation Factors (VIF)


(a) Why do you think the doctors considered the logarithms of most continuous variables instead of the actual measurements?

[2 marks]

(b) What is a Variance Inflation Factor (VIF)? What is it used for, and what is the cut-off value that causes concern? Is there anything to worry about based on the VIF values in Table 2?

[4 marks]
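
For reference, the standard definition of the VIF for predictor j is 1 / (1 - R_j^2), where R_j^2 is the coefficient of determination obtained by regressing predictor j on the remaining predictors. The sketch below is a minimal illustration of that formula, not a prescribed solution: it inverts the definition to show the R_j^2 implied by each value in Table 2. The function and variable names are hypothetical; only the VIF values come from Table 2.

```python
# Hypothetical helper names; only the VIF values below are taken from Table 2.

def vif_from_r2(r2):
    """Return the VIF for a predictor whose auxiliary regression has R^2 = r2."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif):
    """Invert the definition: the auxiliary R^2 implied by a given VIF."""
    if vif < 1.0:
        raise ValueError("A VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


# VIF values as reported in Table 2.
TABLE_2_VIF = {
    "lcavol": 2.05, "lweight": 1.36, "age": 1.32, "lbph": 1.37,
    "svi": 1.95, "lcp": 3.10, "gleason": 2.47, "pgg45": 2.97,
}

# Share of each predictor's variance explained by the other predictors.
implied_r2 = {name: r2_from_vif(v) for name, v in TABLE_2_VIF.items()}

for name, r2 in implied_r2.items():
    print(f"{name:8s} VIF = {TABLE_2_VIF[name]:.2f}   implied R^2 = {r2:.3f}")
```

Reading a VIF as an implied R_j^2 makes it easier to judge how strongly a predictor is explained by the others.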

A medical student "fitted" a multiple linear regression model to these data, without taking into account the fact that the variables svi and gleason are not really quantitative. Table 3 displays the least squares estimates of this model (m1).


(c) Now focus on the variables svi and gleason. Do the estimates make sense? Can you say with certainty how a change from ranking 6 to ranking 7 affects prostate specific antigen, and how that differs from comparing ranking 6 to ranking 8? Explain your answer.

[4 marks]



Variable    Estimates   St.Error   p-value
Intercept    0.66       1.29       0.60
lcavol       0.58       0.088      0.02
lweight      0.45       0.17       0.01
age         -0.02       0.01       0.08
lbph         0.11       0.06       0.07
svi          0.77       0.24       0.01
lcp         -0.10       0.09       0.25
gleason      0.05       0.16       0.77
pgg45       -0.005      0.004      0.31

Table 3: Estimated Coefficients of m1


(d) The 3rd column of Table 3 is titled "St.Error". What is the relevance of these values? What do these values mean? What are they used for?

[3 marks]

(e) Calculate the t-statistic for the hypotheses about age and lcp. State the null (H0) and the alternative (Ha) hypotheses, and clearly state your conclusion.

[5 marks]
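
The t-statistic for testing whether a single coefficient equals zero is the estimate divided by its standard error. The sketch below is a minimal illustration using the age and lcp rows of Table 3; the function and constant names are hypothetical, and the residual degrees of freedom assume the model has an intercept and the 8 predictors listed in Table 2.

```python
# Hypothetical names; estimates and standard errors are the rounded
# values reported in Table 3 for model m1.

N_OBS = 97            # number of patients in the study
N_PREDICTORS = 8      # lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45
DF_RESIDUAL = N_OBS - N_PREDICTORS - 1   # one more parameter for the intercept

# (estimate, standard error) pairs from Table 3.
TABLE_3 = {
    "age": (-0.02, 0.01),
    "lcp": (-0.10, 0.09),
}


def t_statistic(estimate, std_error, null_value=0.0):
    """Return (estimate - null_value) / std_error for a single coefficient."""
    if std_error <= 0:
        raise ValueError("standard error must be positive")
    return (estimate - null_value) / std_error


t_stats = {name: t_statistic(est, se) for name, (est, se) in TABLE_3.items()}

for name, t in t_stats.items():
    print(f"{name:4s} t = {t:.2f} on {DF_RESIDUAL} residual degrees of freedom")
```

Because Table 3 reports rounded values, these ratios are approximate and need not reproduce the tabulated p-values exactly.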

The lead investigator of this study intervened and fitted a new model, m2, in which the variables svi and gleason were treated differently from model m1. The estimated coefficients are displayed in Table 4.



Variable    Estimates   St.Error   p-value
Intercept    0.91       0.84       0.28
lcavol       0.56       0.09       0.01
lweight      0.47       0.17       0.01
age         -0.02       0.01       0.06
lbph         0.10       0.06       0.09
svi-1        0.75       0.25       0.03
lcp         -0.13       0.10       0.19
gleason-7    0.27       0.21       0.22
gleason-8    0.50       0.77       0.52
gleason-9   -0.06       0.50       0.91
pgg45       -0.005      0.004      0.29

Table 4: Estimated Coefficients of m2
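
The row labels svi-1, gleason-7, gleason-8 and gleason-9 suggest that m2 uses indicator (dummy) coding, with svi = 0 and gleason = 6 as the reference levels. That reading is an assumption, since the coding is not stated explicitly. Under it, the sketch below builds the gleason indicators for a given rank and contrasts the shift in lpsa relative to rank 6 implied by m1 (one slope per unit of rank) with the shift implied by m2 (one coefficient per level). All function and variable names are hypothetical; the coefficients are the rounded values from Tables 3 and 4.

```python
# Assumption: gleason = 6 is the reference level, so it gets no indicator.
GLEASON_LEVELS = (6, 7, 8, 9)


def dummy_code(value, levels, prefix):
    """Return 0/1 indicators for every level except the first (the reference)."""
    if value not in levels:
        raise ValueError(f"{value} is not one of {levels}")
    return {f"{prefix}-{lvl}": int(value == lvl) for lvl in levels[1:]}


M1_GLEASON_SLOPE = 0.05                      # Table 3, gleason row
M2_GLEASON = {"gleason-7": 0.27,             # Table 4, gleason rows
              "gleason-8": 0.50,
              "gleason-9": -0.06}


def m1_shift(rank):
    """Shift in fitted lpsa versus rank 6 when gleason enters as a number."""
    return M1_GLEASON_SLOPE * (rank - GLEASON_LEVELS[0])


def m2_shift(rank):
    """Shift in fitted lpsa versus rank 6 when gleason enters as indicators."""
    indicators = dummy_code(rank, GLEASON_LEVELS, "gleason")
    return sum(M2_GLEASON[name] * flag for name, flag in indicators.items())


for rank in GLEASON_LEVELS:
    print(f"rank {rank}: m1 shift = {m1_shift(rank):+.2f}, "
          f"m2 shift = {m2_shift(rank):+.2f}")
```

In m1 the shift is forced to grow by the same amount for every unit of rank, whereas m2 estimates each level's shift separately.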