DS4400: Machine Learning I Summer-II 2023 Homework-III
SVMs and K-Means
Due Date: Sunday 8/13/2023 by 11:59 PM
Introduction
In this assignment, we’ll work with support vector machines (SVMs) using different kernels. For the SVM part you will be using sklearn’s built-in methods for an end-to-end evaluation on a synthetic binary classification dataset.
Additionally, you will implement your own version of the k-means clustering algorithm and evaluate it on a synthetic dataset.
Instructions
1. Support Vector Machines (SVMs): To train an SVM with a given kernel, we need to find the optimal values for the model hyperparameters. The C-SVM implementation has one hyperparameter, C, which can be set using grid search. For each SVM-kernel combination, report the optimal hyperparameter(s) and the ten-fold cross-validation AUC.
Hyperparameter search should be done using only the training dataset, and is therefore nested inside the cross-validation loop. Provide hyperparameter estimates for all ten folds.
a. Load the dataset_1.csv file. The dataset has only two features and the target is binary, {-1, 1}.
b. Train an SVM with a Linear kernel (10 points)
c. Train an SVM with a Polynomial kernel (degree 2) (10 points)
d. Train an SVM with a Radial Basis Function (RBF) kernel (hyperparameter gamma/scale). For this implementation, the grid search should be done over both C and gamma jointly. (20 points)
e. How stable are the hyperparameter estimates and what causes the stability / variance for this dataset? (5 points)
f. Using the mean hyperparameter estimates, train the final SVM and plot the SVM decision boundaries and support vectors for all three kernels. (10 points)
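The nested cross-validation loop above can be sketched as follows. This is a minimal outline, not a complete solution: it uses a synthetic stand-in dataset generated with make_classification, since the column layout of dataset_1.csv is not specified here; swap in the real data and the other kernels (and add gamma to the grid for the RBF case) as required.

```python
# Sketch of nested cross-validation for the SVM tasks.
# ASSUMPTION: a synthetic stand-in dataset replaces dataset_1.csv here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
y = np.where(y == 0, -1, 1)  # match the {-1, 1} labels in the assignment

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}  # add "gamma" for the RBF kernel
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_aucs, fold_params = [], []
for train_idx, test_idx in outer_cv.split(X, y):
    # Hyperparameter search is nested: it sees only the training fold.
    search = GridSearchCV(SVC(kernel="linear"), param_grid,
                          scoring="roc_auc", cv=5)
    search.fit(X[train_idx], y[train_idx])
    # Evaluate the refit best model on the held-out outer fold.
    scores = search.decision_function(X[test_idx])
    fold_aucs.append(roc_auc_score(y[test_idx], scores))
    fold_params.append(search.best_params_)

print("mean 10-fold AUC:", np.mean(fold_aucs))
print("per-fold C estimates:", [p["C"] for p in fold_params])
```

The per-fold best_params_ list gives the ten hyperparameter estimates asked for in the write-up; their spread across folds is what part (e) asks you to discuss.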
2. K-Means Clustering:
a. Load the dataset_blobs.csv as a pandas dataframe. The dataset has only two features (x1, x2), ignore the third column which has the true cluster ids.
b. Implement the k-means clustering algorithm as a class, with the number of clusters as the required constructor parameter.
c. The fit() function should accept a numpy array with data instances in rows, i.e., the design matrix (N x M), where N is the number of data instances and M is the number of features. The function should store the estimated centers. (30 points)
d. The predict() function should output integer cluster ids for input data (in the same format as the fit() input). (20 points)
e. Varying the number of clusters from 2 to 10, find the optimal number of clusters for the blobs dataset using the elbow heuristic. (15 points)
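A class with the required interface might be sketched as below. This is an outline under assumptions (random-point initialization, a fixed iteration cap, and an inertia_ attribute added for the elbow plot, none of which are mandated by the assignment), demonstrated on synthetic blobs rather than dataset_blobs.csv.

```python
# Minimal k-means sketch: constructor takes the number of clusters,
# fit() stores the estimated centers, predict() returns integer cluster ids.
# ASSUMPTIONS: random-point init, max_iter cap, and inertia_ are design choices.
import numpy as np

class KMeans:
    def __init__(self, n_clusters, max_iter=100, random_state=0):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.random_state = random_state

    def fit(self, X):
        rng = np.random.default_rng(self.random_state)
        # Initialize centers as a random sample of the data points.
        self.centers_ = X[rng.choice(len(X), self.n_clusters, replace=False)]
        for _ in range(self.max_iter):
            labels = self.predict(X)
            # Recompute each center as the mean of its assigned points;
            # keep the old center if a cluster ends up empty.
            new_centers = np.array([
                X[labels == k].mean(axis=0) if np.any(labels == k)
                else self.centers_[k]
                for k in range(self.n_clusters)])
            if np.allclose(new_centers, self.centers_):
                break
            self.centers_ = new_centers
        # Within-cluster sum of squares, useful for the elbow heuristic.
        self.inertia_ = ((X - self.centers_[self.predict(X)]) ** 2).sum()
        return self

    def predict(self, X):
        # Distance from every row (N x M) to every center (K x M) -> (N x K).
        dists = np.linalg.norm(X[:, None, :] - self.centers_[None, :, :], axis=2)
        return dists.argmin(axis=1)

# Demo on two synthetic blobs (stand-in for dataset_blobs.csv).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
km = KMeans(2).fit(X)
labels = km.predict(X)
```

For part (e), one would fit the class for each k from 2 to 10, collect inertia_, and look for the elbow in the inertia-versus-k curve, e.g. `inertias = [KMeans(k).fit(X).inertia_ for k in range(2, 11)]`.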
Submit
• Code (python files or Jupyter notebooks). If submitting a Jupyter notebook, include a PDF version of your notebook to facilitate grading.