
Linear Classifiers: Perceptron and Logistic Regression

DS4400: Machine Learning I

Summer-II 2023

Homework-II (Part-I)

Due Date: Monday 7/31/2023 by 11:59 PM

Introduction

In this assignment, we’ll implement and evaluate the perceptron and logistic regression algorithms for binary classification, where the goal is to predict one of two possible outcomes for each data instance.  Your implementation should be in Python.  You may use any libraries you wish for visualizing results, but your algorithms should be hand-crafted from scratch and must not use scikit-learn or other ML libraries.

Instructions

1.   Read the Breast Cancer dataset into a pandas data frame (features) and a pandas series (targets). Use sklearn’s datasets module to load the dataset (link). Make sure that there are thirty features and that the target has two unique values.

2.   Prior to fitting your models to data, standardize all input features by using sklearn’s StandardScaler().
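
Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not a required layout: the variable names `X`, `y`, `scaler`, and `X_std` are assumptions, and wrapping the scaled array back into a data frame is a convenience choice rather than something the assignment asks for.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

# as_frame=True returns a pandas DataFrame (features) and Series (targets).
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Sanity checks requested in step 1.
assert X.shape[1] == 30, "expected thirty features"
assert y.nunique() == 2, "expected a binary target"

# StandardScaler returns a NumPy array; wrap it to keep the column names.
scaler = StandardScaler()
X_std = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
```

After scaling, each column of `X_std` should have mean close to 0 and unit standard deviation.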

3.   Implement a function metrics(y, ypred) which, given a series of actual labels (y), and a series of predicted outcomes (ypred), returns the model accuracy, sensitivity, specificity, precision, and f1-score. (10 points)
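
One possible shape for the metrics function in step 3 is sketched below. The `positive` parameter and the dictionary return type are assumptions made for illustration (the assignment only fixes the name `metrics(y, ypred)`), and returning 0.0 when a denominator is zero is one convention among several.

```python
import numpy as np


def metrics(y, ypred, positive=1):
    """Return accuracy, sensitivity, specificity, precision and F1 as a dict.

    `positive` (an assumption, not part of the assignment) names the positive
    label; every other label counts as negative, so 0/1 and -1/1 both work.
    """
    y = np.asarray(y)
    ypred = np.asarray(ypred)
    if y.shape != ypred.shape:
        raise ValueError("y and ypred must have the same shape")

    # Confusion-matrix counts.
    tp = int(np.sum((y == positive) & (ypred == positive)))
    tn = int(np.sum((y != positive) & (ypred != positive)))
    fp = int(np.sum((y != positive) & (ypred == positive)))
    fn = int(np.sum((y == positive) & (ypred != positive)))

    def ratio(num, den):
        # Report 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall on the positive class
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, y.size),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }
```

Because the positive label is a parameter, the same function can score the perceptron (labels 1 and -1) and logistic regression (labels 1 and 0).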

4. Perceptron: Implement the perceptron algorithm, as a class in Python. Your class should have a fit() and a predict() function. The signature for the two functions should be as follows:

a. Perceptron.fit(train_data, labels, learning_rate=0.0001, epochs=1000) fits to the provided training data using the specified learning rate and the maximum number of epochs. Once run, the class object will have a field that stores the weight vector. (20 points)

b. Perceptron.predict(test_data) returns binary labels, i.e., 1 and -1, for the test data instances. (20 points)

c.    Functions should have a complete docstring briefly describing the function contract, input parameters and their types and the return types where applicable. (5 points)

d.   The class should have error checking, e.g., if predict is called without calling fit first, then an error should be raised.  Think about what other errors are possible and include code to appropriately handle these errors. (5 points)

e.   Fit the model to the complete dataset and visualize the progress of your perceptron algorithm by plotting (10 points):

i.   Mean Perceptron Loss for each epoch.

ii.   Number of mistakes in each epoch

f.    Report the five metrics for your model when run against the Breast Cancer dataset. Use:

i.   10-fold random cross validation strategy using sklearn

ii.   10-fold stratified cross validation strategy using sklearn (link)

iii.   Compare the performance of the two cross-validation strategies.

iv.   Which of the two estimates of performance would you report / believe?
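
The interface described in item 4 could look roughly like the sketch below. Several details are assumptions rather than requirements: the bias is folded into the weight vector as a trailing entry, the attribute names `weights`, `loss_history`, and `mistake_history` are hypothetical, the perceptron loss is taken to be max(0, -y * w.x), and training stops early once an epoch produces no mistakes.

```python
import numpy as np


class Perceptron:
    """Minimal perceptron sketch for labels in {-1, +1}."""

    def __init__(self):
        self.weights = None        # set by fit(); last entry is the bias
        self.loss_history = []     # mean perceptron loss after each epoch
        self.mistake_history = []  # number of updates made in each epoch

    @staticmethod
    def _add_bias(data):
        data = np.asarray(data, dtype=float)
        if data.ndim != 2:
            raise ValueError("data must be 2-D (instances x features)")
        return np.hstack([data, np.ones((data.shape[0], 1))])

    def fit(self, train_data, labels, learning_rate=0.0001, epochs=1000):
        """Fit the weight vector with per-instance perceptron updates."""
        X = self._add_bias(train_data)
        y = np.asarray(labels, dtype=float)
        if X.shape[0] != y.shape[0]:
            raise ValueError("train_data and labels differ in length")
        if not np.all(np.isin(y, (-1, 1))):
            raise ValueError("labels must be -1 or 1")

        self.weights = np.zeros(X.shape[1])
        self.loss_history, self.mistake_history = [], []
        for _ in range(epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                # Misclassified (or exactly on the boundary): nudge the weights.
                if yi * (xi @ self.weights) <= 0:
                    self.weights += learning_rate * yi * xi
                    mistakes += 1
            margins = y * (X @ self.weights)
            self.loss_history.append(float(np.mean(np.maximum(0.0, -margins))))
            self.mistake_history.append(mistakes)
            if mistakes == 0:  # a full pass without updates: converged
                break
        return self

    def predict(self, test_data):
        """Return an array of 1 / -1 labels, one per row of test_data."""
        if self.weights is None:
            raise RuntimeError("call fit() before predict()")
        X = self._add_bias(test_data)
        if X.shape[1] != self.weights.shape[0]:
            raise ValueError("test_data has the wrong number of features")
        return np.where(X @ self.weights >= 0, 1, -1)
```

Since the Breast Cancer targets load as 0 and 1, they would need to be mapped to -1 and 1 (for example with `2 * y - 1`) before calling `fit`. The two history lists hold the values to plot for item 4e.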

5. Logistic Regression: Implement the logistic regression algorithm, as a class in Python. Your class should have a fit() and a predict() function. The signature for the two functions should be as follows:

a. LogisticRegression.fit(train_data, labels, learning_rate=0.0001, max_iter=1000, lamdba=0.0) fits to the provided training data using gradient descent with the specified learning rate and the maximum number of iterations. Once run, the class object will have a field that stores the weight vector. (20 points)

b. LogisticRegression.predict(test_data) returns binary labels (1 and 0) for the test_data (20 points)

c. LogisticRegression.predict_prob(test_data) returns P(y=1 | x_test). Explanation: the output of logistic regression is a number in the range [0,1] that can be interpreted as the probability that a data instance belongs to the positive class (y=1). For input test data that has N instances the function should return a numpy 1-d array with the probability values in order (10 points).

d.   Functions should have a complete docstring briefly describing the function contract, input parameters and their types and the return types where applicable. (5 points)

e.   The class should have error checking, e.g., if predict is called without calling fit first, then an error should be raised.  Think about what other errors are possible and include code to appropriately handle these errors. (5 points)

f.    Fit the model to the complete dataset and visualize the progress of your gradient descent by plotting (10 points):

i.   Mean Cross Entropy for each iteration.

g.   Report the five metrics for your logistic regression model when run against the Breast Cancer dataset. Use:

i.   10-fold random cross validation strategy using sklearn

ii.   10-fold stratified cross validation strategy using sklearn (link).

iii.   Compare the performance of the two cross-validation strategies.

iv.   Which of the two would you report / believe?
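
A rough sketch of the class described in item 5 follows, using full-batch gradient descent. It rests on assumptions the assignment does not spell out: `lamdba` (spelled as in the signature above) is treated as an L2 penalty strength applied to every weight except the bias, the bias is stored as the last weight, `loss_history` is a hypothetical attribute name, and the recorded loss is the unpenalized mean cross entropy.

```python
import numpy as np


class LogisticRegression:
    """Minimal logistic regression sketch for labels in {0, 1}."""

    def __init__(self):
        self.weights = None     # set by fit(); last entry is the bias
        self.loss_history = []  # mean cross entropy at each iteration

    @staticmethod
    def _add_bias(data):
        data = np.asarray(data, dtype=float)
        if data.ndim != 2:
            raise ValueError("data must be 2-D (instances x features)")
        return np.hstack([data, np.ones((data.shape[0], 1))])

    @staticmethod
    def _sigmoid(z):
        # Clip scores so exp() cannot overflow for extreme inputs.
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def fit(self, train_data, labels, learning_rate=0.0001,
            max_iter=1000, lamdba=0.0):
        """Fit the weight vector by batch gradient descent."""
        X = self._add_bias(train_data)
        y = np.asarray(labels, dtype=float)
        if X.shape[0] != y.shape[0]:
            raise ValueError("train_data and labels differ in length")
        if not np.all(np.isin(y, (0, 1))):
            raise ValueError("labels must be 0 or 1")

        n = X.shape[0]
        eps = 1e-12  # keeps log() away from zero
        self.weights = np.zeros(X.shape[1])
        self.loss_history = []
        for _ in range(max_iter):
            p = self._sigmoid(X @ self.weights)
            loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
            self.loss_history.append(float(loss))
            grad = X.T @ (p - y) / n
            grad[:-1] += lamdba * self.weights[:-1]  # assumed L2 term, bias excluded
            self.weights -= learning_rate * grad
        return self

    def predict_prob(self, test_data):
        """Return a 1-d array of P(y=1 | x), one value per row."""
        if self.weights is None:
            raise RuntimeError("call fit() before predicting")
        X = self._add_bias(test_data)
        if X.shape[1] != self.weights.shape[0]:
            raise ValueError("test_data has the wrong number of features")
        return self._sigmoid(X @ self.weights)

    def predict(self, test_data):
        """Return 1 where the predicted probability is at least 0.5, else 0."""
        return (self.predict_prob(test_data) >= 0.5).astype(int)
```

The 0.5 threshold in `predict` is the conventional default; `loss_history` holds the values to plot for item 5f.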

6.   Contrast the behavior of the two algorithms on the same dataset. Is there a difference in the loss profile during training? What causes the difference and which algorithm would you use in a real-world scenario? (10 points)
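
The two cross-validation strategies named in items 4f and 5g differ only in how the folds are drawn, which the sketch below makes visible on toy labels. The arrays `y_demo` and `X_demo` and the helper `positives_per_fold` are hypothetical stand-ins, and reading "random" cross validation as sklearn's `KFold` with shuffling is an assumption.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels standing in for the real targets.
y_demo = np.array([1] * 20 + [0] * 80)
X_demo = np.zeros((y_demo.size, 1))  # feature values do not affect the splits

random_cv = KFold(n_splits=10, shuffle=True, random_state=0)
stratified_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)


def positives_per_fold(splitter):
    """Count positive labels in each test fold produced by `splitter`."""
    return [int(y_demo[test_idx].sum())
            for _, test_idx in splitter.split(X_demo, y_demo)]


random_counts = positives_per_fold(random_cv)          # may vary fold to fold
stratified_counts = positives_per_fold(stratified_cv)  # class ratio preserved
```

With stratification, every test fold keeps the overall class ratio (two positives out of ten here), whereas the random folds can drift away from it.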

Submit

•   Code (Python files or Jupyter notebooks). If submitting a Jupyter notebook, include a PDF version of your notebook to facilitate grading.

•   All visualizations.  If you are using Jupyter, these can be embedded in your notebook.