Machine Learning Assignment

Overview

In this assignment, you will perform classification and regression analysis using the machine learning methods covered in class.

The datasets are provided to you in two CSV files, “reviewer_withavg.csv” and “bank-full-shuffled.csv”.

Preparation

You will use R for this assignment. To complete the assignment successfully, please ensure you understand the R code that accompanies the class lectures. You should have the following packages installed to run the R code covered in class and/or for your project: class, klaR, e1071, rpart, randomForest, leaps, lars, MASS.
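
As a quick check before starting, the package list above can be compared against what is installed locally. This is only a sketch; the variable names `needed` and `not_installed` are illustrative, and nothing is installed automatically.

```r
# Packages named in the Preparation section.
needed <- c("class", "klaR", "e1071", "rpart",
            "randomForest", "leaps", "lars", "MASS")

# Compare against the packages available in the local library paths.
not_installed <- needed[!(needed %in% rownames(installed.packages()))]

if (length(not_installed) > 0) {
  message("Not installed yet: ", paste(not_installed, collapse = ", "))
} else {
  message("All listed packages are installed.")
}
```

Anything reported as missing can then be added with `install.packages()`.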

Part I: Classification using k-NN, Naïve Bayes, and SVM

In this part of the assignment, you will use the reviewer dataset. First, create a column “Popular” in the dataset. Any reviewer who is Top10, Top50, or Top100 is considered popular. That is, set Popular to 1 if any of Top10, Top50, or Top100 is 1, and 0 otherwise.
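
The rule for Popular can be sketched as follows. The toy data frame below is a stand-in invented for illustration; with the real data you would presumably start from `read.csv("reviewer_withavg.csv")`, assuming the three flag columns are coded 0/1 as the rule implies.

```r
# Toy stand-in for the reviewer data (illustrative values only).
reviewers <- data.frame(
  Top10  = c(1, 0, 0, 0),
  Top50  = c(0, 1, 0, 0),
  Top100 = c(0, 0, 0, 1)
)

# Popular is 1 when at least one of the three flags is 1, and 0 otherwise.
reviewers$Popular <- as.integer(
  reviewers$Top10 == 1 | reviewers$Top50 == 1 | reviewers$Top100 == 1
)

print(reviewers)
```

In this toy example only the third row, where all three flags are 0, gets Popular = 0.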

The task is to classify a reviewer as either Popular or not Popular, using the four characteristics of the reviewer as shown in class: avg_centrality, avg_content, avg_viewership, avg_enhconent. You will use the first 50 records as the testing dataset, and the last 149 records as the training dataset. You will evaluate three classification methods: k-NN, Naïve Bayes, and SVM. For k-NN, you need to test values for k between 1 and 10. For SVM, you will test both linear and polynomial kernels.
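
A minimal sketch of the split and the k-NN loop, using `knn()` from the class package. The data here are synthetic and the row count of 199 is an assumption (50 + 149); the code used in class may differ, for example in whether the features are scaled.

```r
library(class)  # provides knn()

set.seed(1)
# Synthetic stand-in with the four feature columns (values are made up).
n <- 199
reviewers <- data.frame(
  avg_centrality = rnorm(n),
  avg_content    = rnorm(n),
  avg_viewership = rnorm(n),
  avg_enhconent  = rnorm(n)
)
reviewers$Popular <- as.integer(reviewers$avg_centrality + rnorm(n) > 0.5)

features  <- c("avg_centrality", "avg_content", "avg_viewership", "avg_enhconent")
test_idx  <- 1:50                   # first 50 records
train_idx <- tail(seq_len(n), 149)  # last 149 records

train_x <- reviewers[train_idx, features]
test_x  <- reviewers[test_idx, features]
train_y <- factor(reviewers$Popular[train_idx])
test_y  <- factor(reviewers$Popular[test_idx], levels = levels(train_y))

# Error counts on the training and testing sets for k = 1..10.
errors <- t(sapply(1:10, function(k) {
  pred_train <- knn(train_x, train_x, cl = train_y, k = k)
  pred_test  <- knn(train_x, test_x,  cl = train_y, k = k)
  c(k = k,
    train_errors = sum(pred_train != train_y),
    test_errors  = sum(pred_test  != test_y))
}))
print(errors)
```

Each row of `errors` holds one value of k with its two error counts, so the best k for either dataset can be read off directly.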

Please answer the following questions:

•   For k-NN, which k value minimizes classification error in the training dataset? What is the error count?

•   For k-NN, which k value minimizes classification error in the testing dataset? What is the error count?

•   For Naïve Bayes, what is the error count for the testing dataset?

•   For SVM, what are the error counts for the training dataset using linear and polynomial kernels, respectively?

•   For SVM, what are the error counts for the testing dataset using linear and polynomial kernels, respectively?

•   Judging by error count, which of the above algorithms has the best performance?
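
The error counts asked for above could be computed with one small helper. The sketch below assumes the e1071 functions `naiveBayes()` and `svm()`; the helper name `error_count` and the synthetic data are illustrative, and the e1071 part runs only if that package is installed.

```r
# Count misclassified records: predicted labels vs. true labels.
error_count <- function(pred, truth) {
  sum(as.character(pred) != as.character(truth))
}

# Hand-made check: positions 2 and 4 disagree.
error_count(c("1", "0", "1", "1"), c("1", "1", "1", "0"))  # 2

if (requireNamespace("e1071", quietly = TRUE)) {
  set.seed(2)
  d <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
  d$Popular <- factor(as.integer(d$x1 + d$x2 > 0))
  test  <- d[1:20, ]    # toy split, not the assignment's 50/149
  train <- d[21:60, ]

  nb_fit   <- e1071::naiveBayes(Popular ~ ., data = train)
  svm_lin  <- e1071::svm(Popular ~ ., data = train, kernel = "linear")
  svm_poly <- e1071::svm(Popular ~ ., data = train, kernel = "polynomial")

  print(c(
    nb_test        = error_count(predict(nb_fit, test),    test$Popular),
    svm_lin_train  = error_count(predict(svm_lin, train),  train$Popular),
    svm_lin_test   = error_count(predict(svm_lin, test),   test$Popular),
    svm_poly_train = error_count(predict(svm_poly, train), train$Popular),
    svm_poly_test  = error_count(predict(svm_poly, test),  test$Popular)
  ))
}
```

Putting all counts in one named vector makes the final comparison across algorithms straightforward.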

Part II: Classification using Tree and Random Forest

In this part of the assignment, you will use a reshuffled bank dataset. The name of the data file is “bank-full-shuffled.csv”. You will use the first 80% of the records as the training dataset, and the remaining 20% as the testing dataset.
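
Because the file is already shuffled, the split can follow file order. A sketch on a toy data frame is shown below; rounding the 80% boundary down with `floor()` is an assumption, and with the real file you would check the field separator before calling `read.csv()`.

```r
# Toy stand-in for the bank data (illustrative values only).
bank <- data.frame(id = 1:10, y = rep(c("no", "yes"), 5))

n       <- nrow(bank)
n_train <- floor(0.8 * n)            # rounding rule is an assumption
train   <- bank[seq_len(n_train), ]  # first 80% of rows, in file order
test    <- bank[(n_train + 1):n, ]   # remaining 20%

c(train = nrow(train), test = nrow(test))
```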

The task is to predict the variable “y” (valued “yes” or “no”), using the following characteristics: duration, month, poutcome, job, education, and marital. Please perform the following tasks and answer the corresponding questions:

•   Fit a classification tree to predict y using the list of characteristics specified above. Use cp=0.0001. Please report the confusion matrices and the F1 scores for both the training dataset and the testing dataset.

•   Prune the tree fitted in the previous step. What complexity parameter should you set? Why? (Please show evidence to support your answer.) Please report the confusion matrices and the F1 scores of the pruned tree for the training and testing datasets. How does the predictive performance of the pruned tree compare with that of the original tree?

•   Fit a random forest to predict the same, using 50 trees. Please report the confusion matrices and the F1 scores for the training and testing datasets. How does the predictive performance compare with that of the original and the pruned tree? What are the two most important independent variables?

Important note: Please ensure that you use the shuffled dataset, “bank-full-shuffled.csv”, not the file “bank-full.csv” which we used in class. Using the wrong file will lead to incorrect answers.
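
The tree, pruning, and F1 steps can be sketched with rpart as below. Several things here are assumptions: “yes” is treated as the positive class, the data are synthetic with only two predictors, and the pruning cp is taken at the minimum cross-validated error in the cp table, which is one common choice among several (the one-standard-error rule is another). The helper name `f1_score` is illustrative.

```r
library(rpart)

# F1 score from a confusion matrix (rows = predicted, columns = actual).
f1_score <- function(cm, positive = "yes") {
  tp        <- cm[positive, positive]
  precision <- tp / sum(cm[positive, ])
  recall    <- tp / sum(cm[, positive])
  2 * precision * recall / (precision + recall)
}

set.seed(3)
n <- 500
bank <- data.frame(
  duration = rexp(n, 1 / 250),
  marital  = factor(sample(c("married", "single", "divorced"), n, replace = TRUE))
)
bank$y <- factor(ifelse(bank$duration + rnorm(n, sd = 150) > 400, "yes", "no"))
train <- bank[1:400, ]
test  <- bank[401:500, ]

# Deliberately large tree, as in the first task.
tree <- rpart(y ~ duration + marital, data = train, method = "class", cp = 0.0001)
printcp(tree)  # cp table: evidence for the choice of pruning parameter

# Prune at the cp with the smallest cross-validated error (xerror).
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)

cm_test <- table(predicted = predict(pruned, test, type = "class"),
                 actual    = test$y)
print(cm_test)
print(f1_score(cm_test))

# Random forest with 50 trees, run only if the package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- randomForest::randomForest(y ~ duration + marital, data = train, ntree = 50)
  print(randomForest::importance(rf))  # variable importance measures
}
```

The same `table()` and `f1_score()` calls apply to the unpruned tree and to training-set predictions by swapping the model and the data passed to `predict()`.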

Part III: Regression

In this part of the assignment, you will use the reviewer dataset to predict “avg_content” of a reviewer using the following characteristics of the reviewer: avg_centrality, avg_viewership, avg_enhconent, Top10, Top50, Top100, Advisor, Lead. We will use the entire dataset to fit the model.

The task is to find out which variables should be included in the regression model for predicting “avg_content”. Please use both full model search and forward-stepwise regression for this task.

The model selection criterion is Mallows' Cp.
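
For reference, Mallows' Cp for a candidate model with p parameters (including the intercept) is commonly written as Cp = RSS_p / s² − n + 2p, where s² is the residual variance of the full model. The base-R sketch below scores every subset of three made-up predictors this way; in practice `regsubsets()` from the leaps package would presumably be used instead, and the helper name `mallows_cp` is illustrative.

```r
# Mallows' Cp of a candidate fit, using the full model's residual variance.
mallows_cp <- function(fit, full) {
  n      <- nrow(model.frame(full))
  sigma2 <- sum(resid(full)^2) / df.residual(full)
  p      <- length(coef(fit))  # parameters, including the intercept
  sum(resid(fit)^2) / sigma2 - n + 2 * p
}

set.seed(4)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 2 * d$x1 - d$x2 + rnorm(100)

preds <- c("x1", "x2", "x3")
full  <- lm(y ~ x1 + x2 + x3, data = d)

# Enumerate every non-empty subset of predictors and score it.
subsets <- unlist(lapply(seq_along(preds), function(k)
  combn(preds, k, simplify = FALSE)), recursive = FALSE)
cp <- sapply(subsets, function(s)
  mallows_cp(lm(reformulate(s, response = "y"), data = d), full))
names(cp) <- sapply(subsets, paste, collapse = " + ")

print(sort(cp))  # lower Cp is better
```

By construction the full model always has Cp equal to its own parameter count (here 4), which is a useful sanity check on any Cp output.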

Please answer the following questions:

•   What is the best model according to full model search?

•   What is the best model according to forward-stepwise regression?

•   What variables should be included if we want to use only three independent variables? Is the set the same for full model search and for forward-stepwise regression? What variables should be included if we want to use five independent variables?
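
Forward-stepwise selection adds one variable at a time, so its best model of a given size can differ from the one found by full search. A base-R sketch of the greedy procedure is given below on synthetic data; the function name `forward_cp` is illustrative, it assumes no missing values, and the leaps call `regsubsets(..., method = "forward")` is the more likely tool for the actual assignment.

```r
# Greedy forward selection, reporting Mallows' Cp at each model size.
forward_cp <- function(data, response, preds) {
  full   <- lm(reformulate(preds, response = response), data = data)
  n      <- nrow(data)  # assumes no rows are dropped for missing values
  sigma2 <- sum(resid(full)^2) / df.residual(full)
  chosen <- character(0)
  steps  <- vector("list", length(preds))

  for (i in seq_along(preds)) {
    remaining <- setdiff(preds, chosen)
    # Residual sum of squares after adding each remaining variable.
    rss <- sapply(remaining, function(v) {
      fit <- lm(reformulate(c(chosen, v), response = response), data = data)
      sum(resid(fit)^2)
    })
    chosen <- c(chosen, remaining[which.min(rss)])
    p <- i + 1  # i predictors plus the intercept
    steps[[i]] <- data.frame(
      size  = i,
      model = paste(chosen, collapse = " + "),
      cp    = min(rss) / sigma2 - n + 2 * p
    )
  }
  do.call(rbind, steps)
}

set.seed(4)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 2 * d$x1 - d$x2 + rnorm(100)

path <- forward_cp(d, "y", c("x1", "x2", "x3"))
print(path)
```

Each row of `path` is the forward-stepwise model at that size, so the three-variable and five-variable questions amount to reading the corresponding rows and comparing them with the full-search result.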


Deliverable


Please submit your R code and answers to the questions in a text file (either plain text or a Word document is fine), putting the R code in the appendix.