
ETC3250/5250 Assignment 3

Published: 2022-05-10



1. (8pts) Classification trees

palmerpenguins is a new R data package with interesting measurements on penguins of three different species. Subset the data to contain just the Adelie and Chinstrap species, and only the variable species and the four physical size measurement variables.

a. (1pt) Write down the equation for the Gini measure of impurity for two groups, as a function of the parameter p, the proportion of observations in class 1. Specify the domain of the function, determine the value of p that gives the maximum value, and report that maximum value.
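As a numerical sanity check, here is a minimal sketch assuming the common textbook definition of the Gini index, G = sum over classes k of p_k(1 - p_k), which for two classes reduces to 2p(1 - p) on 0 <= p <= 1. The function name gini_two_class is illustrative and not part of the assignment.

```r
# Two-class Gini impurity, assuming G(p) = p(1 - p) + (1 - p)p = 2p(1 - p),
# where p is the proportion of observations in class 1 and 0 <= p <= 1.
gini_two_class <- function(p) {
  stopifnot(all(p >= 0 & p <= 1))
  2 * p * (1 - p)
}

# Locate the maximum numerically over the domain [0, 1].
opt <- optimize(gini_two_class, interval = c(0, 1), maximum = TRUE)
opt$maximum    # location of the maximum, close to 0.5
opt$objective  # maximum value, close to 0.5
```

Under this definition the impurity is zero at p = 0 and p = 1 (a pure node) and largest when the two classes are equally mixed.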

YOUR ANSWER

b. (1pt) For two groups, how would the impurity of a split be measured? Give the equation.

YOUR ANSWER

c. (1pt) Write an R function to compute the Gini impurity for a particular split on a single variable. Show the code of your function here. Make sure to include a minsplit parameter, which prevents splits that would leave fewer than the specified number of observations on either side.
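One possible shape for such a function is sketched below. It assumes the split impurity is the size-weighted average of the Gini impurities of the two child nodes, and that the variable has no missing values. The names gini_node and gini_split and the convention of returning NA for a disallowed split are illustrative choices, not requirements of the assignment.

```r
# Gini impurity of one node, assuming G = sum_k p_k * (1 - p_k).
gini_node <- function(cl) {
  if (length(cl) == 0) return(0)
  p <- as.numeric(table(cl)) / length(cl)
  sum(p * (1 - p))
}

# Impurity of the split x < split: child impurities weighted by child size.
# Returns NA when either side has fewer than minsplit observations.
gini_split <- function(x, cl, split, minsplit = 1) {
  left <- x < split
  n_left <- sum(left)
  n_right <- length(x) - n_left
  if (n_left < minsplit || n_right < minsplit) return(NA_real_)
  (n_left * gini_node(cl[left]) + n_right * gini_node(cl[!left])) / length(x)
}

# Toy illustration (made-up values, not the penguins data).
x  <- c(1, 2, 3, 4)
cl <- c("A", "A", "B", "B")
gini_split(x, cl, split = 2.5)                # 0: both sides are pure
gini_split(x, cl, split = 1.5, minsplit = 2)  # NA: only one point on the left
```

Returning NA rather than stopping with an error makes it easy to evaluate a whole grid of candidate splits and drop the disallowed ones afterwards.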

YOUR ANSWER

d. (2pts) Apply your function to compute the value for all possible splits for the body mass (bm), setting minsplit to be 1, so that all possible splits will be evaluated. Make a plot of these values vs the variable.
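A hedged sketch of how the scan over splits might look. It assumes candidate splits are the midpoints between consecutive distinct body mass values, that rows with missing measurements are dropped, and that the short names bl, bd, fl and bm stand for the four size variables (only bm appears in the assignment text). The helper functions repeat the illustrative weighted-Gini definition so the block runs on its own.

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)

# Adelie and Chinstrap only, with the species and four size variables.
p_sub <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>%
  select(species, bl = bill_length_mm, bd = bill_depth_mm,
         fl = flipper_length_mm, bm = body_mass_g) %>%
  na.omit() %>%
  mutate(species = droplevels(species))

# Illustrative helpers: node impurity and size-weighted split impurity.
gini_node <- function(cl) {
  if (length(cl) == 0) return(0)
  p <- as.numeric(table(cl)) / length(cl)
  sum(p * (1 - p))
}
gini_split <- function(x, cl, split, minsplit = 1) {
  left <- x < split
  n_left <- sum(left)
  n_right <- length(x) - n_left
  if (n_left < minsplit || n_right < minsplit) return(NA_real_)
  (n_left * gini_node(cl[left]) + n_right * gini_node(cl[!left])) / length(x)
}

# Candidate splits: midpoints between consecutive distinct sorted values.
v <- sort(unique(p_sub$bm))
splits <- (head(v, -1) + tail(v, -1)) / 2
impurity <- sapply(splits, function(s) {
  gini_split(p_sub$bm, p_sub$species, s, minsplit = 1)
})

# Impurity against split position; the best split minimises the curve.
ggplot(data.frame(split = splits, impurity = impurity),
       aes(x = split, y = impurity)) +
  geom_line() +
  geom_point() +
  labs(x = "Split on body mass (g)", y = "Weighted Gini impurity")
```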

YOUR ANSWER

e. (3pts) Use your function to compute the first two steps of a classification tree model for separating Adelie from Chinstrap penguins, after setting minsplit to be 5. Make a scatterplot of the two variables that would be used in the splits, with points coloured by species, and the splits as line segments. (If you are not confident about your function and coding, you can use the rpart library to determine the first two splits, for partial marks.)

YOUR ANSWER

2. (8pts) Random forests

In this question you will investigate the random forest model fitted to the penguins data, and examine variable importance, vote matrix, and boundary. You do not need to break the data into training and testing.

a. (1pt) Fit the default random forest model, using the randomForest package. Report the overall error rate and the confusion matrix. (Pay attention to the number of penguins that are misclassified in each class.)
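A minimal sketch of the fit, assuming the two-species subset from question 1 with missing rows dropped. The seed and the short variable names are assumptions added for reproducibility and brevity.

```r
library(palmerpenguins)
library(dplyr)
library(randomForest)

# Adelie and Chinstrap only, with the species and four size variables.
p_sub <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>%
  select(species, bl = bill_length_mm, bd = bill_depth_mm,
         fl = flipper_length_mm, bm = body_mass_g) %>%
  na.omit() %>%
  mutate(species = droplevels(species))

set.seed(2022)  # assumed seed; the forest is random, so results vary with it
rf <- randomForest(species ~ ., data = p_sub)

# Out-of-bag confusion matrix (rows are observed classes) with class errors.
rf$confusion

# Overall out-of-bag error rate after the final tree.
rf$err.rate[rf$ntree, "OOB"]
```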

YOUR ANSWER

b. (2pts) Extract the predicted values from the model fit, and list the row numbers of the penguins that are misclassified. Examine the vote matrix, find these penguins, and list the proportion of times each of these observations is classified to each class. Is this what you would expect? Write a couple of sentences on why or why not. (Limit 40 words.)
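The lookup could be sketched as follows, assuming the same two-species subset and an assumed seed. Row numbers here refer to rows of the subset, not of the full penguins data.

```r
library(palmerpenguins)
library(dplyr)
library(randomForest)

p_sub <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>%
  select(species, bl = bill_length_mm, bd = bill_depth_mm,
         fl = flipper_length_mm, bm = body_mass_g) %>%
  na.omit() %>%
  mutate(species = droplevels(species))

set.seed(2022)  # assumed seed
rf <- randomForest(species ~ ., data = p_sub)

# Predicted classes stored in the model object (based on out-of-bag samples).
pred_oob <- rf$predicted

# Rows of the subset where the stored prediction disagrees with the label.
wrong <- which(as.character(pred_oob) != as.character(p_sub$species))
wrong

# Proportion of out-of-bag votes for each class, for those rows only.
rf$votes[wrong, , drop = FALSE]
```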

YOUR ANSWER

c. (1pt) Use the predict() function to predict the values of the full set. Did you find that each observation is perfectly predicted? Why do these predictions differ from the predicted values in the model object? (Limit 40 words.)
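A sketch of the comparison, under the same assumptions about the subset and seed. In the randomForest package, the predicted component of the fitted object is based on out-of-bag samples, whereas predict() with newdata passes every observation through all trees.

```r
library(palmerpenguins)
library(dplyr)
library(randomForest)

p_sub <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>%
  select(species, bl = bill_length_mm, bd = bill_depth_mm,
         fl = flipper_length_mm, bm = body_mass_g) %>%
  na.omit() %>%
  mutate(species = droplevels(species))

set.seed(2022)  # assumed seed
rf <- randomForest(species ~ ., data = p_sub)

# Predictions using every tree, including trees that saw the observation.
pred_full <- predict(rf, newdata = p_sub)

# Predictions stored in the model object (out-of-bag).
pred_oob <- rf$predicted

# Compare the two sets of predictions against the observed species.
table(observed = p_sub$species, full = pred_full)
table(observed = p_sub$species, oob = pred_oob)
```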

YOUR ANSWER

d. (2pts) Examine the permutation variable importance. Which variable is the most important? Would it be reasonable to use just this variable to separate the two groups? Explain why or why not, including making a plot to support your argument. (Limit 40 words.)
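One way to pull out the permutation importance is sketched below. It assumes the forest is refitted with importance = TRUE, which the randomForest package needs in order to compute the permutation measure, and it uses a density plot of the top-ranked variable as one possible supporting plot. The seed and short variable names are assumptions.

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)
library(randomForest)

p_sub <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>%
  select(species, bl = bill_length_mm, bd = bill_depth_mm,
         fl = flipper_length_mm, bm = body_mass_g) %>%
  na.omit() %>%
  mutate(species = droplevels(species))

set.seed(2022)  # assumed seed
rf <- randomForest(species ~ ., data = p_sub, importance = TRUE)

# type = 1 requests the permutation measure (mean decrease in accuracy).
imp <- importance(rf, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]

# Distribution of the top-ranked variable by species, to judge the overlap.
top <- rownames(imp)[which.max(imp[, 1])]
ggplot(p_sub, aes(x = .data[[top]], fill = species)) +
  geom_density(alpha = 0.5) +
  labs(x = top)
```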

YOUR ANSWER

e. (2pts) Create a linear combination of the four variables based on the variable importance, to make a 1D projection of the data to examine the separation. Explain how you would do this, and then do it. Plot the species against your linear combination. Would you expect that there is a value on this linear combination where the species are separated with only the same number of misclassified penguins as in part a on the wrong side of the boundary? Why or why not? Justify your answer. (Hint: you need to think about standardising the variables for this question.) (Limit 40 words.)
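A possible construction, offered as a sketch rather than the intended solution: standardise the four variables, use the permutation importances as coefficients, and normalise the coefficient vector to unit length. Note that importance values carry no sign, so this projection ignores the direction in which each variable separates the species. The seed, variable names and normalisation are assumptions.

```r
library(palmerpenguins)
library(dplyr)
library(ggplot2)
library(randomForest)

p_sub <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap")) %>%
  select(species, bl = bill_length_mm, bd = bill_depth_mm,
         fl = flipper_length_mm, bm = body_mass_g) %>%
  na.omit() %>%
  mutate(species = droplevels(species))

set.seed(2022)  # assumed seed
rf <- randomForest(species ~ ., data = p_sub, importance = TRUE)
imp <- importance(rf, type = 1)

# Standardise so that the coefficients are comparable across variables.
X <- scale(as.matrix(p_sub[, c("bl", "bd", "fl", "bm")]))

# Coefficients proportional to permutation importance, scaled to unit length.
w <- imp[colnames(X), 1]
w <- w / sqrt(sum(w^2))

# 1D projection of each penguin onto the importance-weighted direction.
p_sub$proj <- as.numeric(X %*% w)

ggplot(p_sub, aes(x = proj, y = species, colour = species)) +
  geom_jitter(height = 0.2, width = 0) +
  labs(x = "Importance-weighted linear combination")
```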

YOUR ANSWER

3. (8pts) Support vector machines

Using the code below, simulate some data with which to answer the questions.

library(tidyverse)
library(knitr)
library(kableExtra)

set.seed(2022)
df <- tibble(id = as.character(1:12),
             class = factor(c(rep("A", 6), rep("B", 6))),
             x1 = c(sample(1:10, 12, replace = TRUE)),
             x2 = c(sample(1:3, 6, replace = TRUE),
                    sample(8:10, 6, replace = TRUE)))
df <- bind_rows(df, tibble(id = as.character(c(13, 14, 15)),
                           class = factor(c("A", "A", "B")),
                           x1 = c(4, 6, 5),
                           x2 = c(4, 4, 7)))

a. (1pt) We are given n = 15 observations in p = 2 dimensions. For each observation, there is an associated id and class label. Sketch the observations.
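If a computer-drawn sketch is acceptable, something like the following would show each observation labelled by its id and coloured by class. It repeats the simulation code from the question so that it runs on its own; the plotting choices (text labels, equal axis scaling) are assumptions.

```r
library(tidyverse)

# Simulated data, as specified in the question.
set.seed(2022)
df <- tibble(id = as.character(1:12),
             class = factor(c(rep("A", 6), rep("B", 6))),
             x1 = c(sample(1:10, 12, replace = TRUE)),
             x2 = c(sample(1:3, 6, replace = TRUE),
                    sample(8:10, 6, replace = TRUE)))
df <- bind_rows(df, tibble(id = as.character(c(13, 14, 15)),
                           class = factor(c("A", "A", "B")),
                           x1 = c(4, 6, 5),
                           x2 = c(4, 4, 7)))

# Each observation drawn as its id, coloured by class label.
ggplot(df, aes(x = x1, y = x2, colour = class, label = id)) +
  geom_text() +
  coord_equal()
```

By construction, every class A point has x2 of at most 4 and every class B point has x2 of at least 7, so the two classes are linearly separable.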

YOUR ANSWER

b. (2pts) Sketch what you think is the optimal separating hyperplane, and provide the equation for this hyperplane in the form of textbook equation 9.1. On your sketch, indicate the margin for the maximal margin hyperplane, and indicate what you think would be the support vectors.

YOUR ANSWER

c. (1pt) Sketch a separating hyperplane that is not the optimal separating hyperplane, and provide the equation for this hyperplane.

YOUR ANSWER

d. (2pts) Using the svm function of the e1071 package, fit the linear SVM model to this data. What cost value would yield three support vectors? Use this for your model fit. Using R code only (provide the code here), mark the support vectors, and compute and plot the equation of the separating hyperplane. Discuss how the result compares with your hand calculation. (Limit 40 words.)
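A sketch of the mechanics, with two loudly stated assumptions: scale = FALSE is used so that the coefficients are on the original axes, and cost = 10 is only a placeholder to be tuned until the number of support vectors is three. The coefficient vector is recovered from the coefs, SV and rho components of the fitted e1071 object.

```r
library(tidyverse)
library(e1071)

# Simulated data, as specified in the question.
set.seed(2022)
df <- tibble(id = as.character(1:12),
             class = factor(c(rep("A", 6), rep("B", 6))),
             x1 = c(sample(1:10, 12, replace = TRUE)),
             x2 = c(sample(1:3, 6, replace = TRUE),
                    sample(8:10, 6, replace = TRUE)))
df <- bind_rows(df, tibble(id = as.character(c(13, 14, 15)),
                           class = factor(c("A", "A", "B")),
                           x1 = c(4, 6, 5),
                           x2 = c(4, 4, 7)))

# Placeholder cost (assumption): adjust and re-check fit$tot.nSV.
fit <- svm(class ~ x1 + x2, data = as.data.frame(df),
           kernel = "linear", cost = 10, scale = FALSE)
fit$tot.nSV  # number of support vectors
fit$index    # row numbers of the support vectors

# Hyperplane w1 * x1 + w2 * x2 - rho = 0, rewritten as x2 = intercept + slope * x1.
w <- drop(t(fit$coefs) %*% fit$SV)
slope <- unname(-w[1] / w[2])
intercept <- unname(fit$rho / w[2])

ggplot(df, aes(x = x1, y = x2, colour = class)) +
  geom_point() +
  geom_point(data = df[fit$index, ], shape = 1, size = 5, colour = "black") +
  geom_abline(slope = slope, intercept = intercept)
```

The circled points are the support vectors reported by the fit, which can be compared directly with the ones chosen by hand in part b.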

YOUR ANSWER

e. (2pts) Add an additional 30 variables to the data, that are similar to x1 in that there is no difference between the two classes - what we would call noise variables. What happens to the coefficient for x2 when you fit the svm model? Explain why this happens. (Limit 75 words.)
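One way to set up the comparison is sketched here. It assumes the noise variables are drawn the same way as x1 (integers 1 to 10, sampled with replacement, independent of class), uses hypothetical column names n1 to n30, and keeps the same placeholder cost and scale = FALSE in both fits so the x2 coefficients are comparable.

```r
library(tidyverse)
library(e1071)

# Simulated data, as specified in the question.
set.seed(2022)
df <- tibble(id = as.character(1:12),
             class = factor(c(rep("A", 6), rep("B", 6))),
             x1 = c(sample(1:10, 12, replace = TRUE)),
             x2 = c(sample(1:3, 6, replace = TRUE),
                    sample(8:10, 6, replace = TRUE)))
df <- bind_rows(df, tibble(id = as.character(c(13, 14, 15)),
                           class = factor(c("A", "A", "B")),
                           x1 = c(4, 6, 5),
                           x2 = c(4, 4, 7)))

# 30 noise variables generated like x1 (hypothetical names n1, ..., n30).
noise <- matrix(sample(1:10, nrow(df) * 30, replace = TRUE), nrow = nrow(df),
                dimnames = list(NULL, paste0("n", 1:30)))
df_noise <- bind_cols(select(df, -id), as_tibble(noise))

# Helper (illustrative): coefficient vector of a linear svm fit.
svm_coefs <- function(fit) drop(t(fit$coefs) %*% fit$SV)

fit_small <- svm(class ~ x1 + x2, data = as.data.frame(df),
                 kernel = "linear", cost = 10, scale = FALSE)
fit_noise <- svm(class ~ ., data = as.data.frame(df_noise),
                 kernel = "linear", cost = 10, scale = FALSE)

# Coefficient on x2 without and with the noise variables.
svm_coefs(fit_small)["x2"]
svm_coefs(fit_noise)["x2"]
```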