
CS 189/289A   Introduction to Machine Learning

Spring 2023

HW2: I ♥ Math

Due Wednesday, February 8th at 11:59 pm

• Homework 2 is an entirely written assignment; no coding involved.

• We prefer that you typeset your answers using LaTeX or other word processing software. If you haven’t yet learned LaTeX, one of the crown jewels of computer science, now is a good time! Neatly handwritten and scanned solutions will also be accepted.

• In all of the questions, show your work, not just the final answer.

Start early. This is a long assignment. Most of the material is prerequisite material not covered in lecture; you are responsible for finding resources to understand it.

Deliverables:

1. Submit a PDF of your homework to the Gradescope assignment entitled “HW2 Write-Up”. You may typeset your homework in LaTeX or Word (submit PDF format, not .doc/.docx format) or submit neatly handwritten and scanned solutions. Please start each question on a new page. If there are graphs, include those graphs in the correct sections. Do not put them in an appendix. We need each solution to be self-contained on pages of its own.

• In your write-up, please state whom you had discussions with (not counting course staff) about the homework contents.

• In your write-up, please copy the following statement and sign your signature next to it.  (Mac Preview, PDF Expert, and FoxIt PDF Reader, among others, have tools to let you sign a PDF file.) We want to make it extra clear so that no one inadvertently cheats. “I certify that all solutions are entirely in my own words and that I have not looked at another student’s solutions. I have given credit to all external sources I consulted.”

1   Identities and Inequalities with Expectation

For this exercise, the following identity might be useful: for a probability event A, P(A) = E[1{A}], where 1{·} is the indicator function.

1. Let X be a random variable with density f(x) = λe^{−λx} 1{x > 0}. Show that E[X^k] = k!/λ^k for integer k ≥ 0. Hint: One way to attempt this problem is by using induction on k.
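(Optional, not part of the written solution.) Before proving the formula, it can be reassuring to sanity-check it numerically. The sketch below is a minimal Monte Carlo check, assuming numpy is available; the rate λ = 2 is an arbitrary choice.

    import math
    import numpy as np

    # Monte Carlo check of E[X^k] = k!/lambda^k for an exponential random variable.
    rng = np.random.default_rng(0)
    lam = 2.0                                       # arbitrary rate for the check
    samples = rng.exponential(scale=1.0 / lam, size=1_000_000)

    for k in range(5):
        empirical = np.mean(samples ** k)
        exact = math.factorial(k) / lam ** k
        print(f"k={k}: empirical {empirical:.4f} vs k!/lambda^k {exact:.4f}")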

2. For any non-negative random variable X and constant t > 0, show that P(X ≥ t) ≤ E[X]/t. This is known as Markov’s inequality.

Hint: show that for a, b > 0, 1{a ≥ b} ≤ a/b.

3. For any non-negative random variable X, prove the identity E[X] = ∫_0^∞ P(X ≥ t) dt.

You may assume that X admits a density function f(x); but we note that there is a proof of the identity that works even if X does not have a density function.
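(Optional.) The identity in Question 1.3 can also be checked numerically by integrating an empirical survival function. A minimal sketch, again assuming numpy and using an exponential example with rate 2:

    import numpy as np

    # Check E[X] = integral_0^inf P(X >= t) dt on exponential samples (rate 2).
    rng = np.random.default_rng(1)
    samples = rng.exponential(scale=0.5, size=200_000)

    t_grid = np.linspace(0.0, 10.0, 2_001)          # truncate the integral at t = 10
    survival = np.array([np.mean(samples >= t) for t in t_grid])
    integral = float(np.sum(survival[:-1] * np.diff(t_grid)))   # left Riemann sum

    print("E[X] from samples:       ", samples.mean())
    print("integral of P(X >= t) dt:", integral)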

4. For any non-negative random variable X with finite variance (i.e., E[X²] < ∞), prove that

P(X > 0) ≥ (E[X])² / E[X²].

Hint: Use the Cauchy–Schwarz inequality ⟨u, v⟩² ≤ ⟨u, u⟩⟨v, v⟩. You have most likely seen it applied when the inner product is the real dot product; however, it holds for arbitrary inner products. Without proof, use the fact that the expectation E[UV] is a valid inner product of random variables U and V.

(Note that by assumption we know P(X ≥ 0) = 1, so this inequality is indeed quite powerful.)

5. For a random variable X with finite variance and E[X] = 0, prove that P(X ≥ t) ≤ Var(X)/(Var(X) + t²) for any t ≥ 0.

Hint: Try using logic similar to Question 1.4 on t − X.
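(Optional.) A quick simulation can illustrate both inequalities from Questions 1.4 and 1.5 on a concrete example. The sketch below assumes numpy and uses X = E − 1 with E exponential of rate 1, so that E[X] = 0 and Var(X) = 1.

    import numpy as np

    # X = E - 1 where E ~ Exponential(1), so E[X] = 0 and Var(X) = 1.
    rng = np.random.default_rng(2)
    x = rng.exponential(scale=1.0, size=1_000_000) - 1.0

    # Question 1.4, applied to the non-negative variable Y = X + 1.
    y = x + 1.0
    print("P(Y > 0)         :", np.mean(y > 0))
    print("(E[Y])^2 / E[Y^2]:", y.mean() ** 2 / np.mean(y ** 2))

    # Question 1.5: P(X >= t) <= Var(X) / (Var(X) + t^2) for t >= 0.
    var = x.var()
    for t in [0.5, 1.0, 2.0]:
        print(f"t={t}: P(X >= t) = {np.mean(x >= t):.4f}, "
              f"bound = {var / (var + t ** 2):.4f}")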

2   Probability Potpourri

1. Recall that the covariance of two random variables X and Y is defined to be Cov(X, Y) = E[(X − E[X])(Y − E[Y])]. For a multivariate random variable Z (i.e., each component of Z is itself a random variable), the covariance matrix Σ is defined entrywise by Σij = Cov(Zi, Zj). Prove that the covariance matrix of Z is always positive semidefinite (PSD). You can use the definition of PSD matrices in Q4.2.

2. The probability that an archer hits her target when it is windy is 0.4; when it is not windy, her probability of hitting the target is 0.7. On any shot, the probability of a gust of wind is 0.3. Find the probability that

(i) on a given shot there is a gust of wind and she hits her target;

(ii) she hits the target with her first shot;

(iii) she hits the target exactly once in two shots;

(iv) on an occasion when she missed, there was no gust of wind.
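(Optional.) If you want to sanity-check your hand computations for parts (i)-(iv), one way is a small Monte Carlo simulation of the wind/hit model. This is a sketch assuming numpy, not a substitute for the exact calculation.

    import numpy as np

    # One shot: wind with probability 0.3; hit probability 0.4 if windy, 0.7 otherwise.
    rng = np.random.default_rng(3)
    n = 1_000_000
    wind = rng.random(n) < 0.3
    hit = np.where(wind, rng.random(n) < 0.4, rng.random(n) < 0.7)

    print("(i)   P(wind and hit):", np.mean(wind & hit))
    print("(ii)  P(hit)         :", np.mean(hit))

    # An independent second shot for part (iii).
    hit2 = np.where(rng.random(n) < 0.3, rng.random(n) < 0.4, rng.random(n) < 0.7)
    print("(iii) P(exactly one hit in two shots):", np.mean(hit ^ hit2))

    # Part (iv): conditional probability of no wind given a miss.
    miss = ~hit
    print("(iv)  P(no wind | miss):", np.mean(~wind & miss) / np.mean(miss))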

3. An archery target is made of 3 concentric circles of radii 1/√3, 1, and √3 feet. Arrows striking within the inner circle are awarded 4 points, arrows within the middle ring are awarded 3 points, and arrows within the outer ring are awarded 2 points. Shots outside the target are awarded 0 points.

Consider a random variable X, the distance of the strike from the center in feet, and let the probability density function of X be

f (x) = \

What is the expected value of the score of a single strike?

4. A random variable Z is said to be drawn from the Poisson distribution with parameter λ > 0 if it takes values in the nonnegative integers with probability P(Z = k) = λ^k e^{−λ} / k!. Let X and Y be two independent Poisson random variables with parameters λ > 0 and µ > 0 respectively. Derive an expression for P(X = k | X + Y = n). What well-known probability distribution is this? What are its parameters?
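(Optional.) To build intuition for the conditional distribution before deriving it, the sketch below (numpy assumed; λ = 2, µ = 3, n = 6 are arbitrary choices) estimates P(X = k | X + Y = n) empirically so you can compare against your closed-form answer.

    import numpy as np

    # Empirical estimate of P(X = k | X + Y = n) for independent Poisson X, Y.
    rng = np.random.default_rng(4)
    lam, mu, n = 2.0, 3.0, 6
    x = rng.poisson(lam, size=2_000_000)
    y = rng.poisson(mu, size=2_000_000)

    mask = (x + y) == n                     # condition on the event {X + Y = n}
    for k in range(n + 1):
        print(f"P(X = {k} | X + Y = {n}) ~ {np.mean(x[mask] == k):.4f}")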

3   Properties of the Normal Distribution (Gaussians)

1. Prove that E[e^{λX}] = e^{σ²λ²/2}, where λ ∈ R is a constant and X ∼ N(0, σ²). As a function of λ, E[e^{λX}] is also known as the moment-generating function.

2. Concentration inequalities are inequalities that place upper bounds on the likelihood that a random variable X is far away from its mean µ, written P(|X − µ| ≥ t), with a falling exponential function ae^{−bt²} having constants a, b > 0. Such inequalities imply that X is very likely to be close to its mean µ. To make a tight bound, we want a to be as small and b to be as large as possible.

For t > 0 and X ∼ N(0, σ²), prove that P(X ≥ t) ≤ exp(−t²/(2σ²)), then show that P(|X| ≥ t) ≤ 2 exp(−t²/(2σ²)).

Hint: Consider using Markov’s inequality and the result from Question 3.1.
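(Optional.) A quick numerical comparison of the Gaussian tail against the stated bound, assuming numpy and taking σ = 1 for concreteness:

    import numpy as np

    # Compare the empirical tail P(X >= t) with exp(-t^2 / (2 sigma^2)) for X ~ N(0, sigma^2).
    rng = np.random.default_rng(5)
    sigma = 1.0
    x = rng.normal(0.0, sigma, size=2_000_000)

    for t in [0.5, 1.0, 2.0, 3.0]:
        tail = np.mean(x >= t)
        bound = np.exp(-t ** 2 / (2 * sigma ** 2))
        print(f"t={t}: P(X >= t) = {tail:.5f}, bound = {bound:.5f}")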

3. Let X1, . . . , Xn ∼ N(0, σ²) be i.i.d. (independent and identically distributed). Find a concentration inequality, similar to Question 3.2, for the average of n Gaussians: P((1/n) Σ_{i=1}^n Xi ≥ t)? What happens as n → ∞?

Hint: Without proof, use the fact that linear combinations of i.i.d. Gaussian-distributed vari- ables are also Gaussian-distributed. Be warned that summing two Gaussian variables does not mean that you can sum their probability density functions (no no no!).

4. Let X ∈ Rn with X ∼ N(0, σ²In) be an n-dimensional Gaussian random variable, where In denotes the n × n identity matrix. You may interpret X as a (column) vector whose entries are i.i.d. real values drawn from the scalar Gaussian N(0, σ²). Given a constant (i.e., not random) matrix A ∈ Rn×n and a constant vector b ∈ Rn, derive the mean (which is a vector) and covariance matrix of Y = AX + b. Without proof, use the fact that any linear transformation of a Gaussian random variable is also a Gaussian random variable.

5. Let vectors u, v ∈ Rn be constant (i.e., not random) and orthogonal (i.e., ⟨u, v⟩ = u · v = 0). Let X = (X1, . . . , Xn) be a vector of n i.i.d. standard Gaussians, Xi ∼ N(0, 1), ∀i ∈ [n]. Let ux = ⟨u, X⟩ and vx = ⟨v, X⟩. Are ux and vx independent? Explain. If X1, . . . , Xn are independently but not identically distributed, say Xi ∼ N(0, i), does the answer change? Hint: Two Gaussian random variables are independent if and only if they are uncorrelated.
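(Optional.) An empirical correlation check for one concrete orthogonal pair can hint at the answer in both regimes. The vectors u and v below are arbitrary examples with ⟨u, v⟩ = 0, and numpy is assumed; remember that a single empirical estimate is evidence, not a proof.

    import numpy as np

    # Estimate the correlation of u_x = <u, X> and v_x = <v, X> for orthogonal u, v.
    rng = np.random.default_rng(6)
    u = np.array([1.0, 1.0, 0.0, 0.0])
    v = np.array([1.0, -1.0, 0.0, 2.0])             # <u, v> = 0
    trials = 500_000

    # Case 1: X_i i.i.d. standard normal.
    X = rng.standard_normal((trials, 4))
    print("i.i.d. N(0, 1) case, corr:", np.corrcoef(X @ u, X @ v)[0, 1])

    # Case 2: independent but not identical, X_i ~ N(0, i) (variance i).
    X = rng.standard_normal((trials, 4)) * np.sqrt(np.arange(1, 5))
    print("X_i ~ N(0, i) case, corr :", np.corrcoef(X @ u, X @ v)[0, 1])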

6. Prove that E[max1≤i≤n |Xi|] ≤ C √(log(2n)) σ for some constant C ∈ R, where X1, . . . , Xn ∼ N(0, σ²) are i.i.d. (A matching lower bound E[max1≤i≤n |Xi|] ≥ C′ √(log(2n)) σ holds for some C′; but you don’t need to prove the lower bound.)

Hint: Use Jensen’s inequality: f (E[Y]) ≤ E[f(Y)] for any convex function f .

4   Linear Algebra Review

1. First we review some basic concepts of rank and elementary matrix operations. Let A ∈ Rm×n and B ∈ Rn×p . Let In denote the n × n identity matrix.

(a) Perform elementary row and column operations to transform the block matrix [In 0; 0 AB] into [B In; 0 A], where the semicolon separates block rows.

(b) Use part (a) to prove that rank A + rank B − n ≤ rank(AB) ≤ min{rank A, rank B}.

(c) Using only the ideas of the row space and column space of A (and their relationship to matrix-vector multiplication), explain why rank(AᵀA) = rank A.
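(Optional.) The claims in this question are easy to spot-check numerically on random low-rank matrices before you write the proofs. A minimal numpy sketch (random examples only, not a proof):

    import numpy as np

    # Spot-check Sylvester's bounds and rank(A^T A) = rank(A) on random low-rank matrices.
    rng = np.random.default_rng(7)
    m, n, p = 6, 5, 7
    A = rng.standard_normal((m, 3)) @ rng.standard_normal((3, n))   # rank 3
    B = rng.standard_normal((n, 2)) @ rng.standard_normal((2, p))   # rank 2

    rA = np.linalg.matrix_rank(A)
    rB = np.linalg.matrix_rank(B)
    rAB = np.linalg.matrix_rank(A @ B)

    print("rank A, rank B, rank AB:", rA, rB, rAB)
    print("lower bound rank A + rank B - n:", rA + rB - n)
    print("rank(A^T A):", np.linalg.matrix_rank(A.T @ A), " rank A:", rA)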

2. Let A ∈ Rn×n be a symmetric matrix. Prove equivalence between these three different definitions of positive semidefiniteness (PSD). Note that when we talk about PSD matrices, they are defined to be symmetric matrices. There are nonsymmetric matrices that exhibit PSD properties, like the first definition below, but not all three.

(a) For all x ∈ Rn, xᵀAx ≥ 0.

(b) All the eigenvalues of A are nonnegative.

(c) There exists a matrix U ∈ Rn×n such that A = UUᵀ.

Positive semidefiniteness will be denoted as A ⪰ 0.

3. Recall that the (Frobenius) inner product between two matrices of the same dimensions A, B ∈ Rm×n is defined to be ⟨A, B⟩ = trace(AᵀB) (sometimes written ⟨A, B⟩F to be clear). The Frobenius norm of a matrix is defined to be ∥A∥F = √( Σ_{i=1}^m Σ_{j=1}^n |Aij|² ). Prove some of the following matrix identities. The Cauchy–Schwarz inequality and the identities above may be helpful to you.

(a) xᵀAy = ⟨A, xyᵀ⟩ for all x ∈ Rm, y ∈ Rn, A ∈ Rm×n.

(b) ⟨A, B⟩ ≤ ∥A∥F ∥B∥F for all A, B ∈ Rm×n.

(c) If A and B are symmetric PSD matrices, then trace (AB) ≥ 0, where trace M denotes the trace of M.

(d) If A, B ∈ Rn×n are real symmetric matrices with λmax(A) ≥ 0 and B being PSD, then ⟨A, B⟩ ≤ √n λmax(A) ∥B∥F.

Hint: Construct a PSD matrix using λmax (A)
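(Optional.) Identities (a)-(c) are easy to spot-check numerically on random matrices, which is a useful habit before attempting the proofs. A minimal numpy sketch:

    import numpy as np

    # Numerical spot-check of the Frobenius inner-product identities.
    rng = np.random.default_rng(8)
    m, n = 4, 3
    A = rng.standard_normal((m, n))
    B = rng.standard_normal((m, n))
    x = rng.standard_normal(m)
    y = rng.standard_normal(n)

    inner = np.trace(A.T @ B)                        # <A, B> = trace(A^T B)
    print("(a) x^T A y          :", x @ A @ y)
    print("    <A, x y^T>       :", np.trace(A.T @ np.outer(x, y)))
    print("(b) <A, B>           :", inner)
    print("    ||A||_F ||B||_F  :", np.linalg.norm(A) * np.linalg.norm(B))

    # (c): trace(PQ) >= 0 for symmetric PSD P and Q.
    P = A @ A.T
    Q = B @ B.T
    print("(c) trace(PQ) =", np.trace(P @ Q), "(should be >= 0)")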

4. If M ⪰ N ⪰ 0 and both M and N are positive definite, is N⁻¹ − M⁻¹ PSD? Show your work.

5. Let A ∈ Rm×n be an arbitrary matrix. Recall that the maximum singular value of A is defined by σmax(A)² = λmax(AᵀA) = λmax(AAᵀ). Please prove that:

σmax(A) = max_{u ∈ Rm, v ∈ Rn, ∥u∥2 = 1, ∥v∥2 = 1} uᵀAv.
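(Optional.) A quick numerical check that the largest singular value matches √λmax(AᵀA) and dominates uᵀAv over unit vectors, assuming numpy and a random A:

    import numpy as np

    # Check sigma_max(A)^2 = lambda_max(A^T A) and compare u^T A v over random unit u, v.
    rng = np.random.default_rng(9)
    A = rng.standard_normal((5, 3))

    sigma_max = np.linalg.svd(A, compute_uv=False)[0]     # largest singular value
    lam_max = np.linalg.eigvalsh(A.T @ A)[-1]             # eigenvalues in ascending order
    print("sigma_max(A)            :", sigma_max)
    print("sqrt(lambda_max(A^T A)) :", np.sqrt(lam_max))

    best = -np.inf
    for _ in range(50_000):
        u = rng.standard_normal(5)
        u /= np.linalg.norm(u)
        v = rng.standard_normal(3)
        v /= np.linalg.norm(v)
        best = max(best, u @ A @ v)
    print("max of u^T A v over random unit u, v:", best, "(never exceeds sigma_max)")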

5   Matrix/Vector Calculus and Norms

1. Consider a 2 × 2 matrix A, written in full as [A11 A12; A21 A22], and two arbitrary 2-dimensional vectors x, y. Calculate the derivative of

sin(A11² + e^{A11 + A22}) + xᵀAy,

with respect to the matrix A. Hint: The dimension of the derivative should match that of A, and use the chain rule.
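(Optional.) A finite-difference check is a convenient way to verify a hand-computed matrix derivative. The sketch below (numpy assumed, random A, x, y) prints a numerical gradient that your closed-form answer should match entrywise.

    import numpy as np

    # Finite-difference gradient of f(A) = sin(A11^2 + exp(A11 + A22)) + x^T A y w.r.t. A.
    def f(A, x, y):
        return np.sin(A[0, 0] ** 2 + np.exp(A[0, 0] + A[1, 1])) + x @ A @ y

    rng = np.random.default_rng(10)
    A = rng.standard_normal((2, 2))
    x = rng.standard_normal(2)
    y = rng.standard_normal(2)

    eps = 1e-6
    num_grad = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            E = np.zeros((2, 2))
            E[i, j] = eps
            num_grad[i, j] = (f(A + E, x, y) - f(A - E, x, y)) / (2 * eps)

    print("numerical gradient with respect to A:")
    print(num_grad)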

2. Aside from norms on vectors, we can also impose norms on matrices. Besides the Frobenius norm, the most common kind of norm on matrices is called the induced norm. Induced norms are defined to be

∥A∥p = max_{x ≠ 0} ∥Ax∥p / ∥x∥p,

where the notation ∥ · ∥p on the right-hand side denotes the vector ℓp -norm. Please give the closed-form (or the most simple) expressions for the following induced norms of A ∈ Rm×n .

(a) ∥A∥2 . (Hint: Similar to Question 4.5)

(b) ∥A∥∞ .

3.   (a) Let α = Σ_{i=1}^n yi ln βi for y, β ∈ Rn. What are the partial derivatives ∂α/∂βi?

(b) Let γ = Aρ + b for b ∈ Rn, ρ ∈ Rm, A ∈ Rn×m. What are the partial derivatives ∂γi/∂ρj?

(c) Given x ∈ Rn, y ∈ Rm, z ∈ Rk and y = f(x), f : Rn → Rm, z = g(y), g : Rm → Rk. Please write the Jacobian ∂z/∂x as the product of two other matrices. What are these matrices?

(d) Given x ∈ Rn, y, z ∈ Rm, and y = f(x), z = g(x). Write the gradient ∇x (yᵀz) in terms of y and z and some other terms.

4. Consider a differentiable function f : Rn → R. Suppose this function admits a unique global optimum x* ∈ Rn. Suppose that for some spherical region X = {x | ∥x − x*∥2 ≤ D} around x*, the maximum eigenvalue of the Hessian ∇²f(x) is at most 1 at every point x ∈ X. Prove that

f(x) − f(x*) ≤ D²/2

for every x ∈ X. Hint: Look up Taylor’s Theorem with Remainder. Use the Mean Value Theorem on the second-order term instead of the first-order term, which is what is usually done.

5. Let X ∈ Rn×d be a design matrix, consisting of n sample points (one per row of X), each of which has d features. Let y ∈ Rn be a vector of labels. We wish to find the best linear approximation, i.e., we want to find the w that minimizes the cost function L(w) = ∥y − Xw∥2². Derive the optimal w in closed form by computing the gradient ∇w L(w) and setting it to zero; you may assume that XᵀX is invertible.

6. (Optional bonus question worth 1 point. This question contains knowledge that goes beyond the scope of this course, and is intended as an exercise to really make you comfortable with matrix calculus.) Consider a differentiable function f : Rm×n → R that is Lipschitz continuous. You are given arbitrary matrices X, X̃ ∈ Rm×n where X̃ = X − η ∇X f(X) for some step size η > 0. Suppose that, for every t ∈ [0, 1], the interpolated point X(t) = X + t(X̃ − X) satisfies

∥∇X f(X(t)) − ∇X f(X)∥F ≤ t ∥∇X f(X)∥F .

Prove that

f(X̃) ≤ f(X) − (η/2) ∥∇X f(X)∥F² .

Hint: Invoke the fundamental theorem of calculus.

6   Gradient Descent

Consider the optimization problem min_{x ∈ Rn} (1/2) xᵀAx − bᵀx, where A ∈ Rn×n is a PSD matrix with 0 < λmin(A) ≤ λmax(A) < 1.

1. Find the optimizer x* (in closed form).

2. Solving a linear system directly using Gaussian elimination takes O(n³) time, which may be wasteful if the matrix A is sparse. For this reason, we will use gradient descent to compute an approximation to the optimal point x*. Write down the update rule for gradient descent with a step size of 1 (i.e., taking a step whose length is the length of the gradient).

3. Show that the iterates x(k) satisfy the recursion x(k) − x* = (I − A)(x(k−1) − x*).

4. Using Question 4.5, prove ∥Ax∥2  ≤ λmax(A) ∥x∥2 .

Hint: Use the fact that, if λ is an eigenvalue of A, then λ² is an eigenvalue of A².

5. Using the previous two parts, show that for some 0 < ρ < 1,

∥x(k) − x*∥2 ≤ ρ ∥x(k−1) − x*∥2 .

6. Let x(0) ∈ Rn be the starting value for our gradient descent iterations. If we want a solution x(k) that is ϵ > 0 close to x*, i.e., ∥x(k) − x*∥2 ≤ ϵ, then how many iterations of gradient descent should we expect to need? Express your answer in terms of ρ, ∥x(0) − x*∥2, and ϵ.
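(Optional.) To see the convergence behavior analyzed in this section on a concrete instance, the sketch below (numpy assumed) builds a random A with eigenvalues strictly between 0 and 1, runs gradient descent with step size 1, and prints the ratios ∥x(k) − x*∥2 / ∥x(k−1) − x*∥2, which should stay below some ρ < 1. It uses a linear solve for x* purely as a reference point.

    import numpy as np

    # Gradient descent with step size 1 on f(x) = (1/2) x^T A x - b^T x,
    # where the eigenvalues of A lie strictly between 0 and 1.
    rng = np.random.default_rng(11)
    n = 5
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    A = Q @ np.diag(rng.uniform(0.2, 0.9, size=n)) @ Q.T   # PSD, eigenvalues in (0.2, 0.9)
    b = rng.standard_normal(n)

    x_star = np.linalg.solve(A, b)                  # reference minimizer of the quadratic
    x = np.zeros(n)
    prev_err = np.linalg.norm(x - x_star)
    for k in range(1, 11):
        x = x - (A @ x - b)                         # step of length equal to the gradient's
        err = np.linalg.norm(x - x_star)
        print(f"k={k}: error = {err:.3e}, ratio = {err / prev_err:.3f}")
        prev_err = err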