
COSC 578 Statistical Machine Learning HW3 (Foundation)

Spring 2023

Introduction

• This is an individual homework. No form of cheating is allowed.

• If you need help, please schedule office hours with your TA. Start EARLY.

• This homework covers ESML Chapter 7.

• You will work on model selection and cross-validation.

• Our focus is on the practical use of these methods for machine learning.

• No coding is required for the solutions.

• All work must be typed or computer-generated, including the graphs. Do not write or draw by hand.

• If you would like to write code to obtain the solutions, feel free to do so. You may use any programming language you like.

• Due Date:  Wednesday April 26, 2023 11:59 PM Eastern Time

• Where to Submit: Canvas

• What to Submit: It is described at the end of this document.

Problem:  Cross-Validation

1. Download the data file from Piazza. It contains 96 rows of data entries. The first column is the entry index. The second through second-to-last columns are the input features. The last column is the ground-truth output value that you need to predict.

2. You will compare three linear models/predictors on this dataset:

2.1. Model 1: Ŷ = Xβ̂, with β̂ = [2, 1, 50, 1].

2.2. Model 2: Ŷ = Xβ̂, with β̂ = [1, 80, 3, −60].

2.3. Model 3: Randomly split the dataset into a training set (90%) and a test set (10%). Estimate a linear model on the training data that predicts the output (Y) given all other variables (X1, ..., X4). Call your model “Model 3”. Report what your model is.
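Since coding is optional, the three models can be set up in a few lines. The following is a minimal Python sketch, not a required solution. It assumes NumPy, uses synthetic data in place of the real data file, and fits Model 3 by ordinary least squares without an intercept so that its coefficient vector has the same shape as those of Models 1 and 2. The helper names (predict, split_indices, fit_least_squares) and the choice of estimator are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic stand-in for the real data file (96 rows, 4 features).
n, p = 96, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 2.0, 3.0, 4.0]) + rng.normal(scale=0.5, size=n)

# Models 1 and 2: fixed coefficient vectors taken from the problem statement.
beta_1 = np.array([2.0, 1.0, 50.0, 1.0])
beta_2 = np.array([1.0, 80.0, 3.0, -60.0])


def predict(X, beta):
    """Linear prediction: one predicted output per row of X."""
    return X @ beta


def split_indices(n, test_fraction, rng):
    """Return (train_idx, test_idx) for one random split of n rows."""
    perm = rng.permutation(n)
    n_test = max(1, int(round(test_fraction * n)))
    return perm[n_test:], perm[:n_test]


def fit_least_squares(X_train, y_train):
    """Ordinary least squares without an intercept (an assumption)."""
    beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return beta


# Model 3: estimated on the 90% training part of one random split.
train_idx, test_idx = split_indices(n, 0.10, rng)
beta_3 = fit_least_squares(X[train_idx], y[train_idx])
print("Model 3 coefficients:", beta_3)
```

If an intercept is wanted for Model 3, one option is to append a column of ones to X before fitting and before predicting.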

3. Plot 1: Mean Squared Error vs. Feature Subsets

3.1. Calculate the mean squared error (MSE) between the predicted values and the actual values on the test set. Repeat the analysis for M = 10 different random splits and report the average. Report the MSE for all three models.

3.2. Plot the MSE curves of all three models for feature subsets of different sizes (subset size p = 1, ..., 4). In your plot, the x-axis is the number of parameters, i.e. the subset size p. The y-axis is the MSE. Use different colors and marker shapes to distinguish the curves, and add a legend to the plot. (See ESML Textbook Figure 7.9 for an example.)

3.3. Write down your analysis. What do you observe from the curves in 3.2? Which model is the best? Why? Justify your answers.
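The computation behind Plot 1 can be sketched as follows. This is one possible reading, under stated assumptions: the data are synthetic, the subset of size p is taken to be the first p feature columns (a best-subset search, as in the textbook figure, is another valid reading), Models 1 and 2 use only the matching entries of their fixed coefficient vectors, and Model 3 is refit by least squares on each training split. The names mse, average_test_mse, and curves are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic stand-in for the real data file.
n, n_features, M = 96, 4, 10
X = rng.normal(size=(n, n_features))
y = X @ np.array([1.0, 2.0, 3.0, 4.0]) + rng.normal(scale=0.5, size=n)

fixed_betas = {
    "Model 1": np.array([2.0, 1.0, 50.0, 1.0]),
    "Model 2": np.array([1.0, 80.0, 3.0, -60.0]),
}


def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))


def average_test_mse(X, y, cols, M, rng, test_fraction=0.10):
    """Average test MSE per model over M random splits, using columns `cols`."""
    totals = {"Model 1": 0.0, "Model 2": 0.0, "Model 3": 0.0}
    n = len(y)
    n_test = max(1, int(round(test_fraction * n)))
    for _ in range(M):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        X_train, X_test = X[np.ix_(train, cols)], X[np.ix_(test, cols)]
        # Fixed models: keep only the coefficients of the selected features.
        for name, beta in fixed_betas.items():
            totals[name] += mse(y[test], X_test @ beta[cols])
        # Model 3: refit on the training part of this split.
        beta_3, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)
        totals["Model 3"] += mse(y[test], X_test @ beta_3)
    return {name: total / M for name, total in totals.items()}


# One curve per model: average test MSE for subset sizes p = 1, ..., 4.
curves = {name: [] for name in ("Model 1", "Model 2", "Model 3")}
for p in range(1, n_features + 1):
    cols = list(range(p))  # ASSUMPTION: subset = first p features
    result = average_test_mse(X, y, cols, M, rng)
    for name in curves:
        curves[name].append(result[name])

for name, values in curves.items():
    print(name, np.round(values, 3))
```

Each list in curves can then be drawn against p = 1, ..., 4 with any plotting tool, using a distinct color and marker per model and a legend.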

4. Plot 2: Mean Squared Error vs. Dataset Size

4.1. Draw random samples containing 20%, 40%, 60%, 80% and 100% of your entire dataset. For each of the following tasks, repeat the random sampling M = 10 times and report the average.

4.2. Plot the MSE curves of all three models for the different dataset sizes. In your plot, the x-axis is the dataset size. The y-axis is the MSE. Use different colors and marker shapes to distinguish the curves, and add a legend to the plot. (See ESML Textbook Figure 7.8 for an example.)

4.3. Write down your analysis. What do you observe from those curves? Which model is the best? Why? Justify your answers.
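The computation behind Plot 2 can be sketched in the same way. This is a hedged illustration, not a prescribed procedure: the data are synthetic, each run draws a random sample of the requested fraction without replacement and then splits that sample 90%/10%, and Model 3 is refit by least squares (no intercept) on the training part. The names one_run, fractions, and avg_mse are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic stand-in for the real data file.
n, M = 96, 10
X = rng.normal(size=(n, 4))
y = X @ np.array([1.0, 2.0, 3.0, 4.0]) + rng.normal(scale=0.5, size=n)

beta_1 = np.array([2.0, 1.0, 50.0, 1.0])
beta_2 = np.array([1.0, 80.0, 3.0, -60.0])


def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))


def one_run(X, y, fraction, rng, test_fraction=0.10):
    """Test MSE of Models 1-3 on one random sample of the given fraction."""
    n = len(y)
    n_sub = max(2, int(round(fraction * n)))
    # Random sample without replacement; the returned order is shuffled.
    sub = rng.choice(n, size=n_sub, replace=False)
    n_test = max(1, int(round(test_fraction * n_sub)))
    test, train = sub[:n_test], sub[n_test:]
    # Model 3 is re-estimated on the training part of this sample.
    beta_3, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    return np.array([mse(y[test], X[test] @ b) for b in (beta_1, beta_2, beta_3)])


fractions = [0.2, 0.4, 0.6, 0.8, 1.0]
# Rows: dataset fractions; columns: Model 1, Model 2, Model 3.
avg_mse = np.array([
    np.mean([one_run(X, y, f, rng) for _ in range(M)], axis=0)
    for f in fractions
])

for f, row in zip(fractions, avg_mse):
    print(f"{int(f * 100):3d}% of data:", np.round(row, 3))
```

Each column of avg_mse gives one curve for the plot, with the dataset size on the x-axis. Because Models 1 and 2 are never refit, only their test sample changes with dataset size, which is worth keeping in mind when interpreting their curves.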

What to Submit?

All work must be typed or computer-generated, including the graphs. Do not write or draw by hand. Compress the following into a single file and submit it to Canvas.

• Details of your Model 3.

• The two plots as required.

• Analysis of your plots as required.