Statistical Learning Assignment 2 - Semester 1, 2021


• INSTRUCTIONS:

1. The assignment must be typed (not handwritten). You may use either Microsoft Word (or similar) or R markdown in RStudio for the assignment. Note that the final project will require the use of R markdown (or Sweave). Your submission should be no longer than 10 A4 pages [single sided] with a font size no smaller than 11 point.

2. The assignment due date is listed on the Wattle (Turnitin) site. Upload the assignment through Wattle using Turnitin. You should submit your assignment in two parts; note that there are two tabs for Turnitin. If you are using R markdown:

(a) A pdf file [or HTML file] of your assignment (this should include important R code to highlight what you have done).

(b) A ‘.Rmd’ file [an R markdown file].

If you are using Microsoft Word (or similar):

(a) A Word file of your assignment (this should include important R code to highlight what you have done).

(b) A ‘.R’ file of your R code.

3. In answering the questions, write your answers clearly and succinctly. Use appropriate graphs and tables when you think they help to describe your point or thinking process. Do not just “print” a set of results. Every result should be discussed and have a reason for being presented. No points will be awarded unless you clearly discuss what you are doing.

4. No late assignments will be accepted.

5. You should not discuss the assignment (questions, solutions, code, etc.) with your classmates or other individuals. You can discuss these with me or your tutor (Dr. Ha Nguyen) during our consultation times. You must independently write your own solutions. This includes all computer code, English, and mathematics. University policies on academic integrity will be strictly enforced. See http://www.anu.edu.au/students/program-administration/assessments-exams/academic-honesty-plagiarism for more details.

6. Have fun with the exploration!


1. (75 points) We will explore some of the techniques you have learned thus far by examining heart disease data which are on Wattle. Some more information may be found at https://archive.ics.uci.edu/ml/datasets/Heart+Disease. The variables are:

• age: age in years

• sex: sex (1 = male; 0 = female)

• cp: chest pain type

– Value 0: typical angina

– Value 1: atypical angina

– Value 2: non-anginal pain

– Value 3: asymptomatic

• trestbps: resting blood pressure (in mm Hg on admission to the hospital)

• chol: serum cholesterol in mg/dl

• thalach: maximum heart rate achieved

• thal: 0 = normal; 1 = fixed defect; 2 = reversible defect

• (Y) condition: 0 = no disease, 1 = disease


(a) (10 points) Using the training data, conduct an exploratory data analysis. In doing your analysis, make sure to identify any unusual points and discuss why they are unusual. For this assignment, do not remove any unusual points; only comment on them (if they exist). You may also consider transformations of the covariates. If you believe the transformations are appropriate (provide justification, which can simply be a discussion), use those transformations for the rest of the assignment.

(b) Consider a logistic regression model to examine the relationship between whether a patient has heart disease (Y = 1) or not (Y = 0) and their covariate information (x).

i. (10 points) Use k-fold cross-validation to determine your “best” model based on the lowest misclassification rate. You can choose k to make the sample sizes in the folds equal, as long as k is at least 3. While you may use the glm() function in R, write your own code for the cross-validation. Provide the estimated cross-validation misclassification rate CV(k) and its standard error, presented as [CV(k), CV(k) − SE(CV(k)), CV(k) + SE(CV(k))]. Use a forward selection search process for your model search. Make sure the results for all the covariates are provided. You do not need to consider any interactions; however, you may consider non-linearities in the form of quadratic polynomials.

ii. (5 points) Using the test data, provide the confusion matrix based on the “best” model from 1(b)i. Compute the overall misclassification rate, as well as the false-positive and false-negative rates. Does changing the classification “threshold” improve any of these rates?

iii. (10 points) From your “best” model in 1(b)i, provide a full discussion of your model. Additionally, provide 95% confidence intervals for the regression coefficients.

iv. (5 points) Without using the boot() function, write your own R function which takes the training data set and the number of bootstrap samples as inputs, and returns bootstrap standard errors and 95% confidence intervals for the regression coefficients from your “best” model. Compare these results to the estimated asymptotic standard errors and confidence intervals produced from using glm().

v. (5 points) Using the bootstrap approach (using any R functions you believe will help), provide 95% confidence intervals for the predicted probability of heart disease for the first 10 patients in the testing data.
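
For orientation only, the general shape of the hand-written cross-validation in 1(b)i and the bootstrap in 1(b)iv might look like the sketch below. It is not a model solution. The data are simulated, and the names train, x1, x2, cv_misclass and boot_coef are hypothetical. Two conventions are assumed here and should be checked against the course notes: the standard error of CV(k) is taken as the standard deviation of the fold error rates divided by sqrt(k), and the bootstrap intervals are percentile intervals. The forward selection search is not shown.

```r
# Sketch only: simulated data stand in for the heart disease training set.
set.seed(1)
n <- 120
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$y <- rbinom(n, 1, plogis(0.5 + 1.2 * train$x1))

# k-fold CV misclassification rate for one logistic regression formula.
# Assumed convention: SE = sd(fold error rates) / sqrt(k).
cv_misclass <- function(formula, data, k = 5, threshold = 0.5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  resp <- all.vars(formula)[1]                               # response name
  rates <- numeric(k)
  for (j in seq_len(k)) {
    fit <- glm(formula, data = data[folds != j, ], family = binomial)
    p <- predict(fit, newdata = data[folds == j, ], type = "response")
    rates[j] <- mean((p > threshold) != data[[resp]][folds == j])
  }
  cv <- mean(rates)
  se <- sd(rates) / sqrt(k)
  c(CV = cv, lower = cv - se, upper = cv + se)
}

# Nonparametric bootstrap of the coefficients: resample rows with
# replacement, refit, and summarise the refitted coefficients.
boot_coef <- function(formula, data, B = 200) {
  est <- replicate(B, {
    idx <- sample(nrow(data), replace = TRUE)
    coef(glm(formula, data = data[idx, ], family = binomial))
  })
  # est is a (number of coefficients) x B matrix
  cbind(se = apply(est, 1, sd),
        lower = apply(est, 1, quantile, probs = 0.025),
        upper = apply(est, 1, quantile, probs = 0.975))
}

cv_res <- cv_misclass(y ~ x1 + x2, train, k = 5)
boot_res <- boot_coef(y ~ x1 + x2, train, B = 200)
print(cv_res)
print(boot_res)
```

The bootstrap columns can be set beside the output of summary() and confint.default() on the same glm() fit to make the comparison with the asymptotic results.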

(c) (15 points) Repeat 1(b)i, 1(b)ii, and 1(b)v using linear discriminant analysis instead of logistic regression. You may use any R functions that you believe will help. This means you can write your own functions or use those already in R.

(d) (15 points) Repeat 1(b)i, 1(b)ii, and 1(b)v, using quadratic discriminant analysis instead of logistic regression. You may use any R functions that you believe will help. This means you can write your own functions or use those already in R.
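
As a rough illustration of the confusion matrix and threshold question for the discriminant analysis parts, the sketch below fits lda() and qda() from the MASS package to simulated data. The data, the helper names make_data and error_rates, and the thresholds 0.5 and 0.3 are all hypothetical choices. The false-positive rate is assumed to mean false positives divided by the number of true negatives plus false positives, and the false-negative rate to mean false negatives divided by the number of actual positives; other definitions exist, so state the one you use.

```r
library(MASS)  # provides lda() and qda()

# Sketch only: simulated data stand in for the training and test sets.
set.seed(2)
make_data <- function(n) {
  d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  d$y <- factor(rbinom(n, 1, plogis(0.3 + 1.5 * d$x1 - d$x2)), levels = c(0, 1))
  d
}
train <- make_data(200)
test <- make_data(100)

# Confusion matrix and error rates, given posterior probabilities p of class 1.
error_rates <- function(p, truth, threshold = 0.5) {
  pred <- factor(as.integer(p > threshold), levels = c(0, 1))
  tab <- table(predicted = pred, actual = truth)
  list(confusion = tab,
       misclass = mean(pred != truth),
       false_pos = tab["1", "0"] / sum(tab[, "0"]),   # FP / actual negatives
       false_neg = tab["0", "1"] / sum(tab[, "1"]))   # FN / actual positives
}

lda_fit <- lda(y ~ x1 + x2, data = train)
qda_fit <- qda(y ~ x1 + x2, data = train)
p_lda <- predict(lda_fit, newdata = test)$posterior[, "1"]
p_qda <- predict(qda_fit, newdata = test)$posterior[, "1"]

res_lda_50 <- error_rates(p_lda, test$y, threshold = 0.5)
res_lda_30 <- error_rates(p_lda, test$y, threshold = 0.3)
res_qda_50 <- error_rates(p_qda, test$y, threshold = 0.5)
print(res_lda_50$confusion)
print(c(res_lda_50$false_neg, res_lda_30$false_neg))
```

Lowering the threshold can only move predictions from class 0 to class 1, so the false-negative rate cannot rise and the false-positive rate cannot fall; whether that trade is worthwhile depends on the relative cost of the two errors.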

2. (25 points) Reconsider the Ames, Iowa housing data. You may now use all of the covariates that are available. You do not need to consider any interactions or non-linearities.

(a) (10 points) Fit a ridge regression model. Present the results and key findings. Submit your prediction to Kaggle and report your error rate.

(b) (10 points) Fit a lasso regression model. Present the results and key findings. Submit your prediction to Kaggle and report your error rate.

(c) (5 points) Which model seems to fit the data better? What might this suggest?
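
To make the contrast between the two penalties in Question 2 concrete, the following base R sketch fits ridge and lasso to simulated data rather than the Ames data. The functions ridge_fit, lasso_fit and soft, the penalty value 0.5, and the scaling of the objective by 1/(2n) are illustrative assumptions; in practice a dedicated package (glmnet is a common choice) with cross-validated choice of the penalty would normally be used.

```r
# Sketch only: simulated data in which only the first two covariates matter.
set.seed(3)
n <- 100; p <- 6
X <- scale(matrix(rnorm(n * p), n, p))       # standardised covariates
beta_true <- c(3, -2, 0, 0, 0, 0)
y <- as.numeric(X %*% beta_true + rnorm(n))
y <- y - mean(y)                             # centred response, so no intercept

# Ridge: minimise (1/(2n)) * RSS + (lambda/2) * sum(beta^2); closed form.
ridge_fit <- function(X, y, lambda) {
  n <- nrow(X)
  drop(solve(crossprod(X) / n + lambda * diag(ncol(X)), crossprod(X, y) / n))
}

# Lasso: minimise (1/(2n)) * RSS + lambda * sum(|beta|) by coordinate descent.
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)   # soft-thresholding
lasso_fit <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X)
  beta <- numeric(ncol(X))
  for (it in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]    # partial residual
      beta[j] <- soft(sum(X[, j] * r_j) / n, lambda) / (sum(X[, j]^2) / n)
    }
  }
  beta
}

b_ols <- ridge_fit(X, y, 0)       # lambda = 0 reduces to least squares
b_ridge <- ridge_fit(X, y, 0.5)
b_lasso <- lasso_fit(X, y, 0.5)
print(round(cbind(ols = b_ols, ridge = b_ridge, lasso = b_lasso), 3))
```

Ridge shrinks every coefficient towards zero without removing any, whereas the lasso can set some coefficients exactly to zero, which is one angle from which to approach 2(c).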