Homework 3 for Math 173A - Fall 2021


1. Consider the function given by

(a) Use any method you like to show that the function is convex.

(b) Find the extreme points of the function, and determine whether they are maxima or minima.


2. Using the conditions for optimality, find the extreme points of the following functions and determine whether they are maxima or minima. You may use a computer to find eigenvalues if you like (though the questions are designed to be doable by hand).

(a)

(b)


3. Determine whether each of the functions below is Lipschitz, and if so, find its Lipschitz constant.

(a)

(b)

(c)

Remark: This function appears in many machine learning applications. In this class, it has already made one appearance, in HW2.

Hint: You may want to use the fact that a function is Lipschitz if its gradient’s norm is bounded (and the gradient in this case is the derivative since this is a single variable function).
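The hint can be sanity-checked numerically. The sketch below is only an illustration: it uses arctan as a stand-in example (an assumption, not one of the functions from this problem) and the helper name `estimate_lipschitz_1d` is hypothetical. It computes the largest slope between neighbouring grid points, which gives a lower bound on the Lipschitz constant over the sampled interval; a proof still requires bounding the derivative by hand.

```python
import numpy as np

def estimate_lipschitz_1d(g, lo, hi, n=100_001):
    """Largest slope between neighbouring grid points on [lo, hi].

    This is a numerical lower bound on the Lipschitz constant of g
    on the interval, not a proof of Lipschitz continuity.
    """
    x = np.linspace(lo, hi, n)
    slopes = np.abs(np.diff(g(x)) / np.diff(x))
    return slopes.max()

# Stand-in example: arctan has derivative 1 / (1 + x^2), which is at most 1,
# so the estimate should be very close to (and not above) 1.
L_est = estimate_lipschitz_1d(np.arctan, -10.0, 10.0)
print(L_est)
```

If the estimate keeps growing as the interval is widened, that suggests the derivative is unbounded and the function is not Lipschitz on the whole real line.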

(d)  and where is the function from part (c).

Hint: As above, you may want to use the fact that a function is Lipschitz if its gradient’s norm is bounded. You may also want to use the chain rule to calculate the gradient.


4. Let be a convex, differentiable, L-Lipschitz function, with Lipschitz constant 2. Let be a minimizer of and suppose that is such that

(a) Apply the relevant convergence rate theorem (from lecture 7) to determine how many steps t of the gradient descent algorithm need to be run to guarantee that

(b) What is the associated choice of the step-size µ?
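As a rough illustration of how such a calculation goes, the sketch below assumes the theorem has the commonly stated form for convex L-Lipschitz functions: with constant step-size µ = R / (L √t), the averaged iterate satisfies f(x̄) − f(x*) ≤ R L / √t, where R bounds the distance from the starting point to the minimizer. The statement in lecture 7 may differ, and the values of R and eps used here are placeholders, not the ones from this problem. The function name is hypothetical.

```python
import math

def gd_steps_and_step_size(R, L, eps):
    """Smallest t with R*L/sqrt(t) <= eps, and the matching step-size.

    Assumes the bound f(avg iterate) - f(x*) <= R*L/sqrt(t)
    with constant step-size mu = R / (L*sqrt(t)).
    """
    t = math.ceil((R * L / eps) ** 2)
    mu = R / (L * math.sqrt(t))
    return t, mu

# Placeholder values for R and eps (assumptions); L = 2 as in the problem.
t, mu = gd_steps_and_step_size(R=1.0, L=2.0, eps=0.5)
print(t, mu)
```

Under this bound the number of steps grows like 1/eps², so halving the target accuracy quadruples the required iterations.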


5. Computer Problem: For this problem, you will need to download the MNIST data set. It is a dataset of 28 × 28 pixel images, each showing a handwritten digit from 0 to 9. You can find it at the original source here:

http://yann.lecun.com/exdb/mnist/

or in convenient .csv file format here:

https://pjreddie.com/projects/mnist-in-csv/

In the last assignment we considered the setting where we are given data (xi, yi) ∈ × {−1, 1}, i = 1, ..., N. That is, each xi is associated with a class label yi, where yi is either 1 or −1. We assumed the model where zi are independent random variables drawn from a certain distribution, and we introduced the cost function

whose minimizer is the best w that fits your data. You also wrote down a Gradient Descent algorithm for minimizing this function.

In this assignment, you will apply what you learned in assignment 2 to the MNIST dataset of handwritten digit images. The machine learning goal, once you optimize for w, is to classify new images x ∈ , using the function which in some sense is the best you can hope to do given your model.


Questions:

(0) You must submit all your code as part of this assignment. In particular, for each question, present your code as part of your answer.

(1) Present a randomly selected representative image (from the training data) for each of the 10 handwritten digits. Provide the index number for each image you displayed.
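One possible way to pick the indices is sketched below. It assumes the training labels have already been loaded into a 1-D integer array; the loading step is omitted, and the synthetic `demo_labels` array only stands in for the real labels. The function name `pick_one_index_per_digit` is hypothetical.

```python
import numpy as np

def pick_one_index_per_digit(labels, seed=0):
    """Return {digit: randomly chosen index of an example with that label}."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)  # fixed seed so the reported indices are reproducible
    return {d: int(rng.choice(np.flatnonzero(labels == d))) for d in range(10)}

# Synthetic stand-in for the real training labels (assumption for the demo).
demo_labels = np.repeat(np.arange(10), 5)
chosen = pick_one_index_per_digit(demo_labels)
print(chosen)

# With real data, each chosen image could then be shown with matplotlib, e.g.
#   plt.imshow(X[idx].reshape(28, 28), cmap="gray")
# where X is assumed to hold one flattened 784-pixel image per row.
```

Fixing the seed makes it easy to report the same index numbers that produced the displayed images.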

(2) For this question, use the first 500 training data points for each of the digits 0 and 1 to form the pairs (xi, yi) ∈ ℝ^784 × {−1, 1}, i = 1, ..., 1000. Assign the label yi = 1 to the 1 digits, and the label yi = −1 to the 0 digits.

Remark: To get from images of size 28 × 28 pixels to vectors in ℝ^784, you just need to “vectorize” the image. This means you can concatenate the 28 columns of the original image into one long vector of length 784.
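The vectorization and the construction of the labeled pairs could be sketched as follows. This is a minimal illustration under assumptions: `X` is taken to be an array with one already-flattened image per row, `labels` the matching digit labels, and both helper names are hypothetical. The small arrays at the bottom are toy data, not MNIST.

```python
import numpy as np

def vectorize_columns(img):
    """Stack the columns of a 2-D image into one long vector."""
    return np.asarray(img).reshape(-1, order="F")  # order="F" reads column by column

def build_binary_pairs(X, labels, pos_digit=1, neg_digit=0, n_per_class=500):
    """First n_per_class examples of each digit, labeled +1 (pos) and -1 (neg)."""
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == pos_digit)[:n_per_class]
    neg = np.flatnonzero(labels == neg_digit)[:n_per_class]
    idx = np.concatenate([pos, neg])
    y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
    return np.asarray(X)[idx], y

# Toy check of the column stacking on a 2 x 3 "image".
img = np.arange(6).reshape(2, 3)
v = vectorize_columns(img)

# Toy check of the pair construction (6 examples, 2 features each).
X_demo = np.arange(12).reshape(6, 2)
labels_demo = np.array([0, 1, 1, 0, 2, 1])
Xb, yb = build_binary_pairs(X_demo, labels_demo, n_per_class=2)
```

If the data file already stores each image as a flat row of 784 values, the pixel order may be row-wise rather than column-wise; for the classifier this only permutes the coordinates, so what matters is using the same ordering for training and test images.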

(a) Implement and run a Gradient Descent algorithm, with step-size , to optimize the function (1) associated with this setup. You should run your algorithm for at least T = 200 iterations (but if your computer can handle it, do more, until a reasonable stopping criterion is satisfied), and provide a plot showing the value of F(w) at each iteration.
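A generic loop of this kind might look like the sketch below. It does not implement the cost function (1) itself: `F` and `grad_F` are passed in as callables, and the quadratic used in the demo is only a stand-in (an assumption) so that the snippet runs on its own. The function name and the step value 0.1 are likewise illustrative.

```python
import numpy as np

def gradient_descent(F, grad_F, w0, step, T=200):
    """Run T gradient descent steps and record F(w) at every iterate."""
    w = np.array(w0, dtype=float)
    history = [F(w)]
    for _ in range(T):
        w = w - step * grad_F(w)   # plain gradient step with constant step-size
        history.append(F(w))
    return w, np.array(history)

# Stand-in objective (assumption): F(w) = 0.5 * ||w||^2, with gradient w.
F_demo = lambda w: 0.5 * np.dot(w, w)
grad_demo = lambda w: w

w_final, values = gradient_descent(F_demo, grad_demo, w0=[3.0, -4.0], step=0.1, T=200)

# The recorded values can then be plotted against the iteration number, e.g.
#   plt.plot(values); plt.xlabel("iteration"); plt.ylabel("F(w)")
```

Recording F(w) inside the loop costs one extra function evaluation per step, but it gives exactly the sequence needed for the requested plot.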

(b) Comment on the resulting plot. In particular, does the value of F(w) decrease with every iteration? Does your algorithm seem to be converging to a fixed w? Explain whether your answers to these questions are consistent with the theory we discussed in class (and in the notes). Be specific, i.e., point to a specific theorem (or theorems) and indicate why it does or does not explain the behavior of the algorithm.

(c) Now, use the w you found from part (a) to classify the first 500 test data points associated to each of the 0 and 1 handwritten digits. Recall that you need to use the function  to classify. What was the classification error rate associated with the two digits on the test data (this should be a number between 0 and 1)? What was it on the training data?
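The error rate computation could be sketched as below. It assumes the classifier is the sign of the inner product of w and x, with ties at zero sent to +1; the exact classification function should be taken from the assignment. The function name `error_rate` and the toy data are hypothetical.

```python
import numpy as np

def error_rate(w, X, y):
    """Fraction of rows of X whose predicted label differs from y.

    Assumes the prediction is sign(w . x), with a score of exactly 0
    mapped to +1 (an arbitrary tie-breaking choice).
    """
    preds = np.where(X @ w >= 0, 1.0, -1.0)
    return float(np.mean(preds != y))

# Toy data: four points in the plane, one of them deliberately mislabeled.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
y_toy = np.array([1.0, 1.0, -1.0, 1.0])
w_toy = np.array([1.0, 1.0])

err = error_rate(w_toy, X_toy, y_toy)
print(err)
```

Calling the same function once on the training pairs and once on the test pairs gives the two numbers the question asks for.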

(3) Repeat 2(a) and 2(c) above with the digits 4 and 9. Comment on the difference between your results for the digits 0 and 1 versus the digits 4 and 9.