
ECE 685D/COMPSCI 675D HW2


Note: To turn in your homework, save your Jupyter notebook with all of its output as a PDF and make sure this PDF contains all necessary derivations. Turn in both the PDF and the Jupyter notebook to Sakai. All necessary coding documentation can be found at https://pytorch.org/docs/stable/index.html. We highly encourage you to submit your derivations in LaTeX; if you choose to submit handwritten derivations, please make sure they are legible and add them to the Jupyter notebook template at the designated space.

Problem 1: Derive Back Propagation (20pt)

Given input features x from which we try to predict a continuous label y, we define f as a k-layer neural network with an activation function g applied coordinate-wise (we do not explicitly define the activation function here; denote the gradient of g(x) with respect to x by ∂g(x) wherever it is needed) and weights w as follows:

h_1 = g(w_1^T x + b_1)
h_2 = g(w_2^T h_1 + b_2)
⋮
ŷ = w_k^T h_{k−1} + b_k

Let k = 3, and let the loss function be the MSE, ℓ(f(x; w), y) = (y − ŷ)^2 = L. Derive the gradients for the weights, ∂L/∂w_3, ∂L/∂w_2, ∂L/∂w_1, and for the biases, ∂L/∂b_3, ∂L/∂b_2, ∂L/∂b_1.
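As a sanity check on the notation (not a full solution), applying the chain rule to the output layer alone, under the definitions above, gives

∂L/∂ŷ = −2(y − ŷ),
∂L/∂w_3 = (∂L/∂ŷ)(∂ŷ/∂w_3) = −2(y − ŷ) h_2,
∂L/∂b_3 = −2(y − ŷ).

The gradients for w_2, w_1, b_2, and b_1 follow by continuing the chain rule through the hidden layers, picking up a factor of ∂g at each activation.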

Problem 2: Implement Neural Network (20pt)

Use PyTorch to implement the neural network structure described above, letting the activation g be the ReLU activation. Define a training function that takes as input the model, the optimizer, the necessary dataloaders, and the number of epochs, and trains the neural network.
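A minimal sketch of what such a model and training function might look like is given below; the names (WineMLP, train_model) and the hidden width are illustrative placeholders, not a prescribed implementation.

    import torch
    import torch.nn as nn

    class WineMLP(nn.Module):
        # 3-layer network matching Problem 1: two hidden ReLU layers, linear output
        def __init__(self, in_dim, hidden_dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, x):
            return self.net(x).squeeze(-1)

    def train_model(model, optimizer, train_loader, val_loader, epochs):
        criterion = nn.MSELoss()
        train_hist, val_hist = [], []
        for epoch in range(epochs):
            model.train()
            running = 0.0
            for xb, yb in train_loader:
                optimizer.zero_grad()
                loss = criterion(model(xb), yb)
                loss.backward()
                optimizer.step()
                running += loss.item() * xb.size(0)
            train_hist.append(running / len(train_loader.dataset))

            model.eval()
            with torch.no_grad():
                val_loss = sum(criterion(model(xb), yb).item() * xb.size(0)
                               for xb, yb in val_loader) / len(val_loader.dataset)
            val_hist.append(val_loss)
        return train_hist, val_hist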

Load the data from the UCI Wine Quality dataset (https://archive.ics.uci.edu/ml/datasets/wine+quality), complete any necessary preprocessing as you see fit, and split your data into train, validation, and test sets with a 64-16-20 split.
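One possible way to load and split the data is sketched below; it assumes the red-wine file winequality-red.csv from the UCI page (which is semicolon-separated) and uses standardization as the preprocessing step, both of which are choices you may replace.

    import pandas as pd
    import torch
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from torch.utils.data import TensorDataset, DataLoader

    # Assumed file: the red-wine CSV from the UCI page (semicolon-separated)
    df = pd.read_csv("winequality-red.csv", sep=";")
    X = df.drop(columns="quality").values
    y = df["quality"].values.astype("float32")

    # 80/20 first, then 80/20 of the remainder -> 64/16/20 overall
    X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.20, random_state=0)

    # Standardize features using training statistics only
    scaler = StandardScaler().fit(X_train)
    to_ds = lambda X_, y_: TensorDataset(
        torch.tensor(scaler.transform(X_), dtype=torch.float32), torch.tensor(y_))
    train_loader = DataLoader(to_ds(X_train, y_train), batch_size=64, shuffle=True)
    val_loader = DataLoader(to_ds(X_val, y_val), batch_size=256)
    test_loader = DataLoader(to_ds(X_test, y_test), batch_size=256)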

Finally, train and validate your neural network using a simple SGD optimizer with learning rate α = 10⁻³. Plot the training loss and validation loss over the training epochs and comment on your model fit. Also comment on whether it makes sense to treat the wine quality as a continuous label.
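Putting the pieces together, a training run with plain SGD at α = 10⁻³ might look like the following sketch (it builds on the WineMLP, train_model, and dataloader sketches above; the epoch count is a placeholder).

    import torch.optim as optim
    import matplotlib.pyplot as plt

    model = WineMLP(in_dim=X_train.shape[1])
    optimizer = optim.SGD(model.parameters(), lr=1e-3)
    train_hist, val_hist = train_model(model, optimizer, train_loader, val_loader, epochs=200)

    plt.plot(train_hist, label="train MSE")
    plt.plot(val_hist, label="validation MSE")
    plt.xlabel("epoch"); plt.ylabel("MSE"); plt.legend(); plt.show()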

Problem 3: Studying the Optimizer, Overfitting, and Underfitting (20pt)

Repeat the above process and experiment with two other optimizers discussed during lectures. Explore the optimization with learning-rate decay factors 0.1, 0.001, and 0.0001 using a StepLR scheduler with step_size = 30, and with different learning rates α ∈ {10⁻², 10⁻³, 10⁻⁴}. Report the MSE of your experiments. Comment on which combination of optimization parameters is the best (show evidence through plots or tables), and comment on model convergence, convergence rate (heuristically), and overfitting and underfitting of your neural network.
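A sketch of wiring a StepLR scheduler into the training loop is shown below; Adam is used only as an example of another optimizer from lecture, and the decay factor and epoch count are placeholders to sweep over.

    import torch.nn as nn
    import torch.optim as optim
    from torch.optim.lr_scheduler import StepLR

    model = WineMLP(in_dim=X_train.shape[1])
    optimizer = optim.Adam(model.parameters(), lr=1e-3)     # one example alternative optimizer
    scheduler = StepLR(optimizer, step_size=30, gamma=0.1)  # multiply lr by the decay factor every 30 epochs

    criterion = nn.MSELoss()
    for epoch in range(200):
        for xb, yb in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            optimizer.step()
        scheduler.step()  # advance the learning-rate schedule once per epoch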


Problem 4: The Effect of Batch Size (10pt)

Using the best combination of optimizer, learning rate, and learning-rate decay, refit the model with different batch sizes ∈ {1, 256, 1024}. Plot the training loss and validation loss over the epochs and comment on why the results are different. Use the best model to report the test MSE.
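The batch-size sweep only needs the DataLoader's batch_size changed; a sketch (reusing the earlier helper sketches, with the optimizer line standing in for whatever settings won in Problem 3) is:

    from torch.utils.data import DataLoader

    for bs in (1, 256, 1024):
        loader = DataLoader(to_ds(X_train, y_train), batch_size=bs, shuffle=True)
        model = WineMLP(in_dim=X_train.shape[1])
        optimizer = optim.SGD(model.parameters(), lr=1e-3)  # substitute the best Problem 3 settings here
        train_hist, val_hist = train_model(model, optimizer, loader, val_loader, epochs=200)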

Problem 5: More on Optimizers (30pt)

In this problem, we want to fit a multinomial Logistic Regression (LR) model to the MNIST dataset using L2-regularized cross-entropy loss function.

Let K be the number of classes (for MNIST, K = 10) and let D = {(x_n, y_n)}, n = 1, . . . , N denote the dataset, where y_n is the label of the nth data point, represented as a K-dimensional vector whose kth entry is 1 if the data point belongs to class k and 0 otherwise (i.e., 1-of-K encoding), and x_n is the corresponding d-dimensional input vector (a flattened MNIST image). Then, the L2-regularized cross-entropy loss is given by:


L(D; W) = − Σ_{n=1}^{N} Σ_{k=1}^{K} y_{nk} log [ exp(w_k^T x_n) / Σ_{i=1}^{K} exp(w_i^T x_n) ] + λ ‖W‖_F^2    (1)

where W is the weight matrix whose columns are the per-class weight vectors w_k and ‖·‖_F denotes the Frobenius norm. Write a Python program that fits the model using the following optimization methods:

1. Momentum method with parameter β = 0.9 (5 pts)

2. Nesterov’s Accelerated Gradient (NAG) with parameter β = 0.95 (5 pts)

3. RMSprop with parameters β = 0.95, γ = 1 and ϵ = 10⁻⁸ (10 pts)

4. Adam with parameters β_1 = 0.9, β_2 = 0.999 and ϵ = 10⁻⁸ (10 pts)

with learning rate η = 0.001 (note that the learning rate in the lecture notes is denoted by α) and batch size 100. Report the classification accuracy on the test data set for λ ∈ {0.01, 0.1, 1}.

Note: You can use the Autograd package from PyTorch to compute the gradient. However, you are NOT allowed to use any built-in optimizers.
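Since only autograd (and not torch.optim) may be used, each method reduces to a hand-written parameter update. The sketch below shows the pattern for the momentum method; it assumes an MNIST train_loader with batch size 100 yielding images and integer class labels, uses torch's built-in cross-entropy only as a shortcut for the regularized loss above, and the other three methods follow by replacing the two update lines.

    import torch
    import torch.nn.functional as F

    # W: (784, K) weight matrix for multinomial LR; autograd tracks its gradient
    W = torch.zeros(784, 10, requires_grad=True)
    velocity = torch.zeros_like(W)
    beta, eta, lam = 0.9, 1e-3, 0.01   # momentum, learning rate, regularization strength

    for epoch in range(10):
        for xb, yb in train_loader:            # xb: a batch of MNIST images, yb: integer labels
            logits = xb.view(xb.size(0), -1) @ W   # flatten each image to 784 features
            loss = F.cross_entropy(logits, yb) + lam * (W ** 2).sum()
            loss.backward()                    # autograd supplies dL/dW
            with torch.no_grad():
                velocity = beta * velocity + W.grad  # momentum accumulation
                W -= eta * velocity                  # hand-written update, no built-in optimizer
                W.grad.zero_()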

 (Bonus) Problem 6: Logistic Regression using Newton’s Method (20pt)

In this problem, we are going to fit a Logistic Regression model on the Breast Cancer dataset (use sklearn.datasets to load the dataset) using Newton's method. For this purpose we will use the binary cross-entropy loss given by:

L(D; w) = − Σ_{n=1}^{N} [ y_n log σ(w^T x_n) + (1 − y_n) log(1 − σ(w^T x_n)) ],  where σ(a) = 1 / (1 + e^{−a})    (2)

where w = (w_1, . . . , w_d) and D = {(x_n, y_n)}, n = 1, . . . , N, where y_n is the label of the nth data point and x_n is the corresponding feature vector.

Compute the gradient ∂L(D; w)/∂w_i and the Hessian ∂²L(D; w)/(∂w_i ∂w_j).

Using the gradient and Hessian computed above, write a Python program that implements Newton's method to fit the Logistic Regression model on the Breast Cancer dataset, implementing:

1. Exact expression of the Hessian

2. Diagonal approximation of the Hessian

Plot and compare the loss and accuracy curves on the test set (learning rate = 0.1, test size = 0.3, and random state = 42) for both methods.
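As a rough sketch of the overall structure (it uses the standard closed forms X^T(σ − y) for the gradient and X^T diag(σ(1 − σ))X for the Hessian, which you should still derive yourself; the feature standardization and iteration count are assumptions), a damped Newton loop could look like this:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    scaler = StandardScaler().fit(X_train)   # scaling is a choice, not required by the problem
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    w = np.zeros(X_train.shape[1])
    lr = 0.1
    for it in range(50):
        p = sigmoid(X_train @ w)
        grad = X_train.T @ (p - y_train)                     # gradient of the BCE loss
        H = X_train.T @ (X_train * (p * (1 - p))[:, None])   # exact Hessian
        w -= lr * np.linalg.solve(H, grad)                   # Newton step with the exact Hessian
        # Diagonal approximation instead: w -= lr * grad / np.diag(H)

Track the test loss and accuracy at each iteration for both variants to produce the requested plots.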