
DSC 40A -   Homework  2

Problem 1. Adding more Data Points

Suppose we have a dataset y1, y2, ..., yn and want to minimize absolute loss on it for the prediction h. As we've seen before, the corresponding empirical risk is mean absolute error,

R_abs(h) = \frac{1}{n} \sum_{i=1}^{n} |y_i - h|

Suppose that R_abs(α) = R_abs(α + 4) = M, where M is the minimum value of R_abs(h) and α is some constant. Suppose we add to the dataset three new data points y_{n+1}, y_{n+2}, y_{n+3} whose values are α + 1, α + 2, and α + 3. For this new, larger dataset, what is the minimum of R_abs(h), and at what value of h is it achieved? Your answers to both parts should only involve the variables n, M, α, and constants.
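The following is an optional numerical sketch (not part of the problem) using a made-up toy dataset. For the dataset below, R_abs is minimized everywhere between the two middle values, so α = 3 satisfies R_abs(α) = R_abs(α + 4) = M; the code then appends α + 1, α + 2, α + 3 and scans a grid of predictions to see how the minimum changes.

import numpy as np

# Toy dataset for a numerical sanity check (values are illustrative only).
# For this dataset, R_abs is minimized everywhere between the two middle
# values, so alpha = 3 satisfies R_abs(alpha) = R_abs(alpha + 4) = M.
y = np.array([1.0, 3.0, 7.0, 9.0])
alpha = 3.0

def r_abs(h, data):
    # Mean absolute error of the constant prediction h on the data.
    return np.mean(np.abs(data - h))

print(r_abs(alpha, y), r_abs(alpha + 4, y))  # both equal M for this dataset

# Append the three new points alpha + 1, alpha + 2, alpha + 3, then scan a
# grid of predictions to see where the new risk is smallest and what it is.
y_new = np.append(y, [alpha + 1, alpha + 2, alpha + 3])
h_grid = np.linspace(0, 10, 1001)
risks = np.array([r_abs(h, y_new) for h in h_grid])
print(h_grid[np.argmin(risks)], risks.min())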

Problem 2. Quadratus Lumborum

In class, we used calculus to prove that the mean minimizes mean squared error,

R_sq(h) = \frac{1}{n} \sum_{i=1}^{n} (h - y_i)^2.

a) Recall, the mean squared error R_sq(h) is a quadratic function of h. Any quadratic function of h can be written in the form Ah^2 + Bh + C, where A, B, and C are constants that do not depend on h.

Determine the values of A, B, and C such that R_sq(h) = Ah^2 + Bh + C. Your answers for A, B, and C should only be in terms of the variables n, y1, y2, ..., yn, and any constants.

Hint: Start by expanding R_sq(h).
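For reference, the first step suggested by the hint looks like this for a single term of the sum:

(h - y_i)^2 = h^2 - 2 y_i h + y_i^2

Summing these over i, dividing by n, and grouping by powers of h then puts R_sq(h) into the form Ah^2 + Bh + C.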

b) Let's simplify your expressions for A, B, and C.

The variance of our dataset is defined as

σ^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2,

where \bar{y} is the mean of our dataset (i.e. \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i). An alternative way of writing variance is as follows:

σ^2 = \left( \frac{1}{n} \sum_{i=1}^{n} y_i^2 \right) - \bar{y}^2

Using this fact, simplify your expressions for A, B, and C so that they only contain the variables n, σ, \bar{y}, and constants.

c) In earlier math classes, you may have learned that “completing the square” involves rewriting a quadratic of the form Ah^2 + Bh + C in the form

A\left(h + \frac{B}{2A}\right)^2 + C - \frac{B^2}{4A}

Such a quadratic is minimized when h = -\frac{B}{2A}, and its minimum value is C - \frac{B^2}{4A}.

Evaluate the above expressions (-\frac{B}{2A} and C - \frac{B^2}{4A}). Conclude that the mean minimizes mean squared error and that the variance is the minimum possible value of mean squared error.
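As an optional numerical sanity check of this conclusion (the dataset below is made up, and the helper name r_sq is illustrative), one can compare a grid minimizer of R_sq against the mean and the variance:

import numpy as np

# Made-up dataset used only to check the conclusion numerically.
y = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

def r_sq(h, data):
    # Mean squared error of the constant prediction h on the data.
    return np.mean((h - data) ** 2)

h_grid = np.linspace(y.min(), y.max(), 10001)
risks = np.array([r_sq(h, y) for h in h_grid])

print(h_grid[np.argmin(risks)], y.mean())   # grid minimizer vs. the mean
print(risks.min(), y.var())                 # minimum risk vs. the variance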

Problem 3. What Do You Mean?

As discussed in class (and in Problem 2), if we choose squared loss to be our loss function, i.e. L_sq(h, y) = (y - h)^2, the prediction h that minimizes empirical risk is the mean, i.e. h = Mean(y1, y2, ..., yn).

In this problem, we will look at a new loss function, the “relative squared loss function” L_rsq(h, y):

L_rsq(h, y) = \frac{(y - h)^2}{y}

Throughout this problem, assume that each of y1, y2, ..., yn is positive.

a) Determine \frac{\partial L_rsq}{\partial h}, that is, the partial derivative of the relative squared loss function with respect to h.

b) What value of h minimizes empirical risk for the relative squared loss function, i.e.

R_rsq(h) = \frac{1}{n} \sum_{i=1}^{n} \frac{(y_i - h)^2}{y_i} ?

Your answer should only be in terms of the variables n, y1, y2, ..., yn, and any constants.

Hint: You will need to use your answer from the previous part.

c) Let H(y1, y2, ..., yn) be your result from the previous part. (That is, for a particular dataset y1, y2, ..., yn, H(y1, y2, ..., yn) is the value of h that minimizes empirical risk for relative squared loss on that dataset.)

What is the value of lim_{y4 → ∞} H(1, 2, 4, y4)? Your answer should involve the function H and/or one or more constants.

Hint: To notice the pattern, evaluate H(1, 2, 4, 100), H(1, 2, 4, 10000), and H(1, 2, 4, 1000000).
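One way to follow the hint without knowing the closed form of H is to minimize R_rsq numerically. The sketch below assumes numpy and scipy are available; h_numeric is an illustrative helper name, not part of the assignment.

import numpy as np
from scipy.optimize import minimize_scalar

def r_rsq(h, data):
    # Empirical risk under relative squared loss (all values assumed positive).
    data = np.asarray(data, dtype=float)
    return np.mean((data - h) ** 2 / data)

def h_numeric(data):
    # Numerically approximate the risk-minimizing prediction for this dataset.
    data = np.asarray(data, dtype=float)
    result = minimize_scalar(r_rsq, args=(data,),
                             bounds=(data.min(), data.max()), method="bounded")
    return result.x

for y4 in [100, 10_000, 1_000_000]:
    print(y4, h_numeric([1, 2, 4, y4]))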

d) What is the value of lim_{y4 → 0} H(1, 2, 4, y4)? Again, your answer should involve the function H and/or one or more constants.

e)     Based on the results of part c) and part d), when is the prediction H(y1, y2, ..., yn) robust to outliers? When is it not robust to outliers?

Problem 4. Garage Sale (Continued!): Everything d Dollars!

In this scenario, let's assume that if an item is marked at a price that is less than or equal to its value, it will sell for that price, and if it is priced any more than its value, it will not sell (please note that this strategy is NOT the same as the one you used in Homework 1). Thus, for a fixed value x_i, consider the garage sale loss function,

L(d) = L(d; x_i) = \begin{cases} x_i - d, & \text{if } d \le x_i \\ x_i, & \text{otherwise} \end{cases}    (1)

Recall that if xi represents the dollar value of an item at a garage sale, L(d; xi) represents the amount of potential earnings lost if we price the item at d dollars. To reiterate, we assume that for each item, if the price of the item is more than its value, then the item will not sell, and otherwise, it will sell for the asking price.

For a set of n items with dollar values x1, x2, ..., xn, define the total loss as

T(d) = \sum_{i=1}^{n} L(d; x_i).

Our goal is to find the asking price d that minimizes this total loss T(d). (Note that total loss is similar to empirical risk, but does not have a factor of 1/n in front.)
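As an optional aid for the parts below, here is a direct Python translation of L(d; x_i) and T(d) as defined above. The names garage_loss and total_loss are illustrative, and the item values come from part e)'s first input list.

def garage_loss(d, x_i):
    # Potential earnings lost on an item of value x_i when it is priced at d dollars.
    return x_i - d if d <= x_i else x_i

def total_loss(d, xs):
    # Total loss T(d) over a list of item values xs.
    return sum(garage_loss(d, x_i) for x_i in xs)

xs = [1, 2, 4, 5, 10]
for d in [0.5, 1.5, 3.0, 4.5, 7.0, 12.0]:
    print(d, total_loss(d, xs))  # T(d) is piecewise linear between the x_i's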

a)    Can we use gradient descent to minimize T (d)? Explain why or why not.

b)    Notice that L(d; xi) is made up of linear functions, which means T (d) is also made up of linear functions. Suppose n = 3 and 0 < x1 < x2 < x3. For d in each of the intervals below,  what is  the slope of the graph of T (d)?

1. d < x1

2.  x1 < d < x2

3.  x2 < d < x3

4. d > x3

Hint: Refer back to Lecture 2, where we figured out the slope of Rabs(h), and apply a similar strategy.

c) Give a formula for the slope of T(d) at any value of d (besides d = x1, d = x2, ..., d = xn, where the slope is not defined). This formula should work for any number of data points n and any set of values x1, x2, ..., xn, even if some of them are the same.

Hint: Refer back to Lecture 2, where we figured out the slope of Rabs(h), and apply a similar strategy.

d) Use your formula to explain why T (d) must be minimized at one of {x1, x2, . . . , xn}.

e) In the supplemental Jupyter Notebook, which can be found at this link, complete the implementation of maximize_money, which takes in a list of xi values and returns the optimal selling price d. You can assume that the input list is sorted.

Turn in an explanation of the strategy you used to solve this problem, a screenshot of your code for the function, and the output of your function on each of the following inputs:

1. [1, 2, 4, 5, 10]

2. [4, 10, 13, 16, 20, 21, 22, 25, 32]

3. [3, 3, 3, 3, 9, 10, 12, 12, 15, 15]

Problem 5. Exploring Gradient Descent

In this class, we’ll primarily use gradient descent to minimize empirical risk. We’ll see a lot of this in lecture and in upcoming assignments. However, gradient descent can also be used to minimize functions that have nothing to do with empirical risk.

Suppose we want to minimize the polynomial function g(u) using gradient descent, where

g(u) = u^4 - 24u^3 + 154u^2 - 207u - 12
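As an optional aid, here is a quick Python sketch of g (assuming numpy and matplotlib are available, and using the coefficients as written above) that reproduces the kind of plot the supplemental notebook provides in part c):

import numpy as np
import matplotlib.pyplot as plt

def g(u):
    # The polynomial from above, written out term by term.
    return u**4 - 24 * u**3 + 154 * u**2 - 207 * u - 12

# Plot g over a range wide enough to show its local minimums.
u = np.linspace(-2, 16, 500)
plt.plot(u, g(u))
plt.xlabel("u")
plt.ylabel("g(u)")
plt.show()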

a) What is g'(u), the derivative of g with respect to u?

b) Gradient descent is guaranteed to find the global minimum of a function if the function is convex (also called concave up) and differentiable, given an appropriate choice of step size. Remember from Lecture 5 that if f : ℝ → ℝ is a function of a single variable and is twice differentiable, then f is convex if and only if

f''(x) = \frac{d^2 f}{dx^2}(x) \ge 0, \text{ for all } x.

Show that g(u) is not convex.

Hint: All you need to do is use the test for convexity above and show that there is at least one value of u for which it does not hold.

c)    Since g is not convex, gradient descent is not guaranteed to converge to a global minimum. However, given an appropriate choice of step size, gradient descent will converge to a local minimum, which may or may not be a global minimum.

For non-convex functions,  whether  or  not  gradient  descent  converges  to  a  global  minimum  depends  on where we initialize gradient descent, i.e. where our “first guess” for the minimizing value is. Let’s experiment with this idea.

In the supplemental Jupyter Notebook, which can be found at this link, complete the following tasks:

• Complete the implementation of g(u).

• Complete the implementation of dg(u) (i.e. the derivative of g(u)) using your answer from part a).

• Run the cell provided to create a plot of g(u). You don’t need to do anything to create this plot.

• Based on the above plot, answer the following questions in your write-up:

– How many local minimums does g(u) have?

– What are the values of u that correspond to local minimums? (Come up with approximate values based on the plot, your answers don’t need to be exact.)

– Which value of u from above corresponds to a global minimum, if any?

d) At the bottom of the supplementary notebook, linked above, you will see a call to the function visualize_gradient_descent. This function animates the execution of gradient descent on a single-variable function (in our case, g). It allows us to specify different initial guesses for the best u (through the argument h_0) and different step sizes/learning rates (through the argument alpha).

Your task in this subpart is to call visualize_gradient_descent with different pairs of initial guesses and step sizes and observe the resulting behavior. Start with an initial guess of 5 and step size of 0.01. (A minimal text-only gradient descent loop, for experimenting outside the notebook, is sketched after the questions below.)

In your write-up, all you are required to do is include answers to the following questions:

• Suppose we choose -2 as an initial guess. Find two different step sizes alpha such that gradient descent converges to a different local minimum for each step size. In your write-up, note the two step sizes and which local minimum each converged to, and give a brief explanation as to why you think this happened.

• If you set alpha to 0.1, you will get an error that says OverflowError: (34, ’Result too large’). Why do you think this happens, given that 0.1 is a relatively small step size? (Refer to the scale of the function g(u) in your answer.)
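For experimentation outside the notebook, here is a minimal, text-only gradient descent sketch. It is not the notebook's visualize_gradient_descent function; it uses the coefficients of g as written above and a central-difference approximation in place of the hand-derived derivative from part a).

def g(u):
    # The polynomial from Problem 5, using the coefficients as written above.
    return u**4 - 24 * u**3 + 154 * u**2 - 207 * u - 12

def approx_dg(u, eps=1e-6):
    # Central-difference approximation to g'(u), standing in for the
    # hand-derived derivative from part a).
    return (g(u + eps) - g(u - eps)) / (2 * eps)

def gradient_descent(h_0, alpha, n_iters=1000):
    # Repeatedly step opposite the (approximate) gradient of g.
    h = h_0
    for _ in range(n_iters):
        h = h - alpha * approx_dg(h)
    return h

# The starting point and step size suggested in part d); the result should
# land near one of the local minimums visible in the plot of g(u).
print(gradient_descent(h_0=5, alpha=0.01))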