
ECE 685D/COMPSCI 675D HW2


Note: To turn in your homework, save your Jupyter notebook with all of its output as a PDF and make sure this PDF contains all necessary derivations. Turn in both the PDF and the Jupyter notebook to Sakai. All necessary coding documentation can be found at https://pytorch.org/docs/stable/index.html. We highly encourage you to submit your derivations in LaTeX; if you choose to submit handwritten derivations, please make sure they are legible and add them to the Jupyter notebook template at the designated space.

Problem 1: Derive Back Propagation (20pt)

Given input features x from which we try to predict a continuous label y, we define f as a k-layer neural network with an activation function g applied coordinate-wise (we do not explicitly define the activation function here; denote the gradient of g(x) with respect to x by ∂g(x) wherever it is needed) and weights w as follows:

h_1 = g(w_1^T x + b_1)
h_2 = g(w_2^T h_1 + b_2)
⋮
ŷ = w_k^T h_{k−1} + b_k

Let k = 3, and let the loss function be the MSE, ℓ(f(x; w), y) = (y − ŷ)^2 = L. Derive the gradients for the weights, ∂L/∂w_3, ∂L/∂w_2, ∂L/∂w_1, and for the biases, ∂L/∂b_3, ∂L/∂b_2, ∂L/∂b_1.
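As a sanity check on the notation (not a full solution), applying the chain rule to the output layer alone, under the definitions above, gives

∂L/∂ŷ = −2(y − ŷ),
∂L/∂w_3 = (∂L/∂ŷ)(∂ŷ/∂w_3) = −2(y − ŷ) h_2,
∂L/∂b_3 = −2(y − ŷ).

The gradients for w_2, w_1, b_2, and b_1 follow by continuing the chain rule through the hidden layers, picking up a factor of ∂g at each activation.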

Problem 2: Implement Neural Network (20pt)

Use PyTorch to implement the neural network structure described above, letting the activation g be the ReLU activation. Define a training function that takes as input the model, the optimizer, the necessary dataloaders, and the number of epochs, and trains the neural network.
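A minimal sketch of what such a model and training function might look like is given below; the names (WineMLP, train_model) and the hidden width are illustrative placeholders, not a prescribed implementation.

    import torch
    import torch.nn as nn

    class WineMLP(nn.Module):
        # 3-layer network matching Problem 1: two hidden ReLU layers, linear output
        def __init__(self, in_dim, hidden_dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, x):
            return self.net(x).squeeze(-1)

    def train_model(model, optimizer, train_loader, val_loader, epochs):
        criterion = nn.MSELoss()
        train_hist, val_hist = [], []
        for epoch in range(epochs):
            model.train()
            running = 0.0
            for xb, yb in train_loader:
                optimizer.zero_grad()
                loss = criterion(model(xb), yb)
                loss.backward()
                optimizer.step()
                running += loss.item() * xb.size(0)
            train_hist.append(running / len(train_loader.dataset))

            model.eval()
            with torch.no_grad():
                val_loss = sum(criterion(model(xb), yb).item() * xb.size(0)
                               for xb, yb in val_loader) / len(val_loader.dataset)
            val_hist.append(val_loss)
        return train_hist, val_hist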

Load the data from the UCI Wine Quality dataset (https://archive.ics.uci.edu/ml/datasets/wine+quality), complete any necessary preprocessing as you see fit, and split your data into train, validation, and test sets with a 64-16-20 split.
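One possible way to load and split the data is sketched below; it assumes the red-wine file winequality-red.csv from the UCI page (which is semicolon-separated) and uses standardization as the preprocessing step, both of which are choices you may replace.

    import pandas as pd
    import torch
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from torch.utils.data import TensorDataset, DataLoader

    # Assumed file: the red-wine CSV from the UCI page (semicolon-separated)
    df = pd.read_csv("winequality-red.csv", sep=";")
    X = df.drop(columns="quality").values
    y = df["quality"].values.astype("float32")

    # 80/20 first, then 80/20 of the remainder -> 64/16/20 overall
    X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.20, random_state=0)

    # Standardize features using training statistics only
    scaler = StandardScaler().fit(X_train)
    to_ds = lambda X_, y_: TensorDataset(
        torch.tensor(scaler.transform(X_), dtype=torch.float32), torch.tensor(y_))
    train_loader = DataLoader(to_ds(X_train, y_train), batch_size=64, shuffle=True)
    val_loader = DataLoader(to_ds(X_val, y_val), batch_size=256)
    test_loader = DataLoader(to_ds(X_test, y_test), batch_size=256)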

Finally, train and validate your neural network using a simple SGD optimizer with learning rate α = 10⁻³. Plot the training loss and validation loss over the training epochs and comment on your model fit. Also comment on whether it makes sense to treat the wine quality as a continuous label.
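Putting the pieces together, a training run with plain SGD at α = 10⁻³ might look like the following sketch (it builds on the WineMLP, train_model, and dataloader sketches above; the epoch count is a placeholder).

    import torch.optim as optim
    import matplotlib.pyplot as plt

    model = WineMLP(in_dim=X_train.shape[1])
    optimizer = optim.SGD(model.parameters(), lr=1e-3)
    train_hist, val_hist = train_model(model, optimizer, train_loader, val_loader, epochs=200)

    plt.plot(train_hist, label="train MSE")
    plt.plot(val_hist, label="validation MSE")
    plt.xlabel("epoch"); plt.ylabel("MSE"); plt.legend(); plt.show()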

Problem 3: Studying the Optimizer, Overfitting, and Underfitting (20pt)

Repeat the above process and experiment with two other optimizers discussed during lectures. Explore the optimization with learning-rate decay factors 0.1, 0.001, and 0.0001 using a StepLR scheduler with step_size = 30, and with different learning rates α ∈ {10⁻², 10⁻³, 10⁻⁴}. Report the MSE of your experiments. Comment on which combination of optimization parameters is the best (show evidence through plots or tables), and comment on model convergence, convergence rate (heuristically), and overfitting and underfitting of your neural network.
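A sketch of wiring a StepLR scheduler into the training loop is shown below; Adam is used only as an example of another optimizer from lecture, and the decay factor and epoch count are placeholders to sweep over.

    import torch.nn as nn
    import torch.optim as optim
    from torch.optim.lr_scheduler import StepLR

    model = WineMLP(in_dim=X_train.shape[1])
    optimizer = optim.Adam(model.parameters(), lr=1e-3)     # one example alternative optimizer
    scheduler = StepLR(optimizer, step_size=30, gamma=0.1)  # multiply lr by the decay factor every 30 epochs

    criterion = nn.MSELoss()
    for epoch in range(200):
        for xb, yb in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            optimizer.step()
        scheduler.step()  # advance the learning-rate schedule once per epoch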


Problem 4: The Effect of Batch Size (10pt)

Using the best combination of optimizer, learning rate, and learning-rate decay, refit the model with different batch sizes ∈ {1, 256, 1024}. Plot the training loss and validation loss over the epochs and comment on why the results are different. Use the best model to report the test MSE.
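The batch-size sweep only needs the DataLoader's batch_size changed; a sketch (reusing the earlier helper sketches, with the optimizer line standing in for whatever settings won in Problem 3) is:

    from torch.utils.data import DataLoader

    for bs in (1, 256, 1024):
        loader = DataLoader(to_ds(X_train, y_train), batch_size=bs, shuffle=True)
        model = WineMLP(in_dim=X_train.shape[1])
        optimizer = optim.SGD(model.parameters(), lr=1e-3)  # substitute the best Problem 3 settings here
        train_hist, val_hist = train_model(model, optimizer, loader, val_loader, epochs=200)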

Problem 5: More on Optimizers (30pt)

In this problem, we want to fit a multinomial Logistic Regression (LR) model to the MNIST dataset using L2-regularized cross-entropy loss function.

Let K be the number of classes (for MNIST, K = 10) and let D = {(x_n, y_n)}, n = 1, . . . , N denote the dataset, where y_n is the label of the nth data point, represented as a K-dimensional vector whose kth entry is 1 if the data point belongs to class k and 0 otherwise (i.e., 1-of-K encoding), and x_n is the corresponding d-dimensional input vector (a flattened MNIST image). Then, the L2-regularized cross-entropy loss is given by:


L(D; W) = − Σ_{n=1}^{N} Σ_{k=1}^{K} y_{nk} log [ exp(w_k^T x_n) / Σ_{i=1}^{K} exp(w_i^T x_n) ] + λ ‖W‖_F^2    (1)

where W is the weight matrix whose columns are the per-class weight vectors w_k and ‖·‖_F denotes the Frobenius norm. Write a Python program that fits the model using the following optimization methods:

1. Momentum method with parameter β = 0.9 (5 pts)

2. Nesterov’s Accelerated Gradient (NAG) with parameter β = 0.95 (5 pts)

3. RMSprop with parameters β = 0.95, γ = 1 and ϵ = 10⁻⁸ (10 pts)

4. Adam with parameters β_1 = 0.9, β_2 = 0.999 and ϵ = 10⁻⁸ (10 pts)

with learning rate η = 0.001 (note that the learning rate in the lecture notes is denoted by α) and batch size 100. Report the classification accuracy on the test data set for λ ∈ {0.01, 0.1, 1}.

Note: You can use the Autograd package from PyTorch to compute the gradient. However, you are NOT allowed to use any built-in optimizers.
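Since only autograd (and not torch.optim) may be used, each method reduces to a hand-written parameter update. The sketch below shows the pattern for the momentum method; it assumes an MNIST train_loader with batch size 100 yielding images and integer class labels, uses torch's built-in cross-entropy only as a shortcut for the regularized loss above, and the other three methods follow by replacing the two update lines.

    import torch
    import torch.nn.functional as F

    # W: (784, K) weight matrix for multinomial LR; autograd tracks its gradient
    W = torch.zeros(784, 10, requires_grad=True)
    velocity = torch.zeros_like(W)
    beta, eta, lam = 0.9, 1e-3, 0.01   # momentum, learning rate, regularization strength

    for epoch in range(10):
        for xb, yb in train_loader:            # xb: a batch of MNIST images, yb: integer labels
            logits = xb.view(xb.size(0), -1) @ W   # flatten each image to 784 features
            loss = F.cross_entropy(logits, yb) + lam * (W ** 2).sum()
            loss.backward()                    # autograd supplies dL/dW
            with torch.no_grad():
                velocity = beta * velocity + W.grad  # momentum accumulation
                W -= eta * velocity                  # hand-written update, no built-in optimizer
                W.grad.zero_()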

 (Bonus) Problem 6: Logistic Regression using Newton’s Method (20pt)

In this problem, we are going to fit a Logistic Regression model on the Breast Cancer dataset (use sklearn.datasets to load the dataset) using Newton's method. For this purpose we will use the binary cross-entropy loss given by:

L(D; w) = − Σ_{n=1}^{N} [ y_n log σ(w^T x_n) + (1 − y_n) log(1 − σ(w^T x_n)) ],  where σ(a) = 1 / (1 + e^{−a})    (2)

where w = (w_1, . . . , w_d) and D = {(x_n, y_n)}, n = 1, . . . , N, where y_n is the label of the nth data point and x_n is the corresponding feature vector.

Compute the gradient ∂L(D; w)/∂w_i and the Hessian ∂²L(D; w)/(∂w_i ∂w_j).

Using the gradient and Hessian computed above, write a Python program that implements Newton's method to fit the Logistic Regression model on the Breast Cancer dataset, implementing:

1. Exact expression of the Hessian

2. Diagonal approximation of the Hessian

Plot and compare the loss and accuracy curves on the test set (learning rate = 0.1, test size = 0.3, and random state = 42) for both methods.
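As a rough sketch of the overall structure (it uses the standard closed forms X^T(σ − y) for the gradient and X^T diag(σ(1 − σ))X for the Hessian, which you should still derive yourself; the feature standardization and iteration count are assumptions), a damped Newton loop could look like this:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    scaler = StandardScaler().fit(X_train)   # scaling is a choice, not required by the problem
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    w = np.zeros(X_train.shape[1])
    lr = 0.1
    for it in range(50):
        p = sigmoid(X_train @ w)
        grad = X_train.T @ (p - y_train)                     # gradient of the BCE loss
        H = X_train.T @ (X_train * (p * (1 - p))[:, None])   # exact Hessian
        w -= lr * np.linalg.solve(H, grad)                   # Newton step with the exact Hessian
        # Diagonal approximation instead: w -= lr * grad / np.diag(H)

Track the test loss and accuracy at each iteration for both variants to produce the requested plots.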