INDIVIDUAL ASSIGNMENT - 2

This is an individual assignment. Submit your own code, output, and comments in a single PDF file. You may discuss the problems broadly with other students, but you may not copy anyone else's code or report. Write your own code and produce your report as a PDF file that shows the output of your code and gives explicit answers based on that code and output. Use R Markdown to produce the report. If you run into trouble with R Markdown, transfer your code, output, and comments to an HTML or Word document and convert that file to PDF. The report must identify all collaborators and the nature of the collaboration. Copying reports or code from other students, AI tools such as ChatGPT, solution manuals, or online resources will be considered a violation of the honor code. Students are prohibited from using solutions posted in prior years and from obtaining a copy of the solution from other students or TAs.

Exercise 6.8: Problem 8 (parts e & f)

For context, refer to Problem 8 (parts a, b, c, & d).

(e) Now fit a lasso model to the simulated data, again using X, X^2, ..., X^10 as predictors. Use cross-validation to select the optimal value of λ. Create plots of the cross-validation error as a function of λ. Report the resulting coefficient estimates, and discuss the results obtained.

(f) Now generate a response vector Y according to the model

Y = β_0 + β_7 X^7 + ϵ,

and perform best subset selection and the lasso. Discuss the results obtained.
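
As a starting point for part (f), the response can be simulated in a few lines of base R. This is only a sketch: the seed, the sample size n = 100, and the coefficient values beta0 and beta7 are illustrative assumptions, not values prescribed by the exercise (parts a and b define the actual X and ϵ).

```r
# Sketch only: seed, n, beta0 and beta7 are assumed values for illustration.
set.seed(1)
n <- 100
x <- rnorm(n)      # predictor X
eps <- rnorm(n)    # noise vector

beta0 <- 1         # hypothetical intercept
beta7 <- 2         # hypothetical coefficient on X^7

# Response depends on the seventh power of X only
y <- beta0 + beta7 * x^7 + eps

# Candidate predictors X, X^2, ..., X^10 as columns X1, ..., X10
powers <- sapply(1:10, function(k) x^k)
colnames(powers) <- paste0("X", 1:10)
dat <- data.frame(y = y, powers)

str(dat)
```

Best subset selection and the lasso would then be fit on dat with whichever packages you used in parts (c) and (e).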

Exercise 8.4: Problem 8 (parts a, b, & c)

Problem #8: In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

(a) Split the data set into a training set and a test set.

(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
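
Parts (a) and (b) rest on a train/test split and on the test MSE, the mean squared difference between held-out responses and their predictions. A minimal sketch of that mechanic follows. The mtcars data and the linear model are stand-ins chosen only because they ship with base R; the 50/50 split and the seed are likewise assumptions.

```r
# Sketch only: mtcars and lm() stand in for Carseats and a regression tree.
set.seed(1)
dat <- mtcars

# (a) Random split: half of the rows for training, the rest for testing
train_idx <- sample(nrow(dat), nrow(dat) / 2)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# (b) Fit on the training set only, then predict the held-out rows
fit <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)

# Test MSE: average squared prediction error on the test set
test_mse <- mean((test$mpg - pred)^2)
print(test_mse)
```

The same last two steps apply unchanged once fit is a regression tree and the data set is Carseats.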

Exercise 8.4: Problem 8 (parts d & e) and Problem 10

Problem #8 (continued): As in parts (a), (b), and (c), predict Sales in the Carseats data set with tree-based methods, treating the response as a quantitative variable.

(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

Problem #10: We now use boosting to predict Salary in the Hitters data set.

(a) Remove the observations for whom the salary information is unknown, and then log-transform the salaries.

(b) Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.

(c) Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter λ. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.

(d) Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.

(e) Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.

(f) Which variables appear to be the most important predictors in the boosted model?

(g) Now apply bagging to the training set. What is the test set MSE for this approach?

Problem 11.4

Galit's Book p. 294

Direct Mailing to Airline Customers. East-West Airlines has entered into a partnership with the wireless phone company Telcon to sell the latter’s service via direct mail. The file EastWestAirlinesNN.csv contains a subset of a data sample of customers who have already received a test offer. About 13% accepted.

You are asked to develop a model to classify East-West customers as to whether they purchase a wireless phone service contract (outcome variable Phone_Sale). This model will be used to classify additional customers.

1. Run a neural net model on these data, using a single hidden layer with 5 nodes. Remember to first convert categorical variables into dummies and scale numerical predictor variables to a 0-1 range (use the function preProcess() with method = "range"; see Chapter 7). Generate a decile-wise lift chart for the training and validation sets. Interpret the meaning (in business terms) of the leftmost bar of the validation decile-wise lift chart.

2. Comment on the difference between the training and validation lift charts.

3. Run a second neural net model on the data, this time setting the number of hidden nodes to 1. Comment now on the difference between this model and the model you ran earlier, and how overfitting might have affected results.

4. What sort of information, if any, is provided about the effects of the various variables?
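
Two mechanics in this problem can be sketched in base R: min-max scaling of a numeric predictor to the 0-1 range, and the computation behind a decile-wise lift chart. The data below are simulated and hypothetical (they are not the airline file), and the helper names range01 and decile_lift are made up for this illustration.

```r
# Sketch only: simulated scores and outcomes, not EastWestAirlinesNN.csv.

# Min-max scaling: maps the smallest value to 0 and the largest to 1
range01 <- function(x) (x - min(x)) / (max(x) - min(x))
print(range01(c(3, 7, 11)))

# Decile-wise lift: response rate in each score decile / overall response rate
decile_lift <- function(actual, prob, groups = 10) {
  ord <- order(prob, decreasing = TRUE)   # highest predicted scores first
  sorted <- actual[ord]
  # Assign sorted records to equal-size groups 1, ..., groups
  decile <- ceiling(seq_along(sorted) * groups / length(sorted))
  tapply(sorted, decile, mean) / mean(actual)
}

set.seed(1)
n <- 200
prob <- runif(n)                                    # hypothetical model scores
actual <- rbinom(n, size = 1, prob = 0.26 * prob)   # overall rate near 13%

lift <- decile_lift(actual, prob)
print(round(lift, 2))
```

Calling barplot(lift) draws the chart; the leftmost bar corresponds to the 10% of records with the highest predicted scores.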

Exercise 10.7: Problem 9

ISLR p. 417

9. Consider the USArrests data. We will now perform hierarchical clustering on the states.

(a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.

(b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?

(c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.

(d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
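
Part (d) turns on how Euclidean distance reacts to variables measured on very different scales. The toy example below clusters the same four made-up points (not USArrests; the column names big and small are hypothetical) with and without scale(), which centers each column and divides it by its standard deviation.

```r
# Sketch only: four invented points, chosen so the two variables disagree.
toy <- data.frame(
  big   = c(0, 700, 300, 1000),  # variable with a large numeric range
  small = c(0, 0, 1, 1),         # variable with a small numeric range
  row.names = c("P1", "P2", "P3", "P4")
)

# Complete linkage on raw Euclidean distances
hc_raw <- hclust(dist(toy), method = "complete")
cl_raw <- cutree(hc_raw, k = 2)

# Same procedure after scaling each column to standard deviation one
toy_scaled <- scale(toy)
hc_scaled <- hclust(dist(toy_scaled), method = "complete")
cl_scaled <- cutree(hc_scaled, k = 2)

print(rbind(raw = cl_raw, scaled = cl_scaled))
```

On the raw data the grouping follows big (P1 with P3, P2 with P4), because that column dominates the distances. After scaling, both columns contribute comparably and the grouping follows small (P1 with P2, P3 with P4).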