553.662 Project 2; Gradient Descent. Spring 2023
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
553.662 Project 2; Gradient Descent.
Spring 2023
Due on Monday, March 13.
Please read carefully the directions below.
• You must type your answers (using, e.g., LateX or MS Word) and return them in a pdf file. A 5% penalty will be applied otherwise.
• A solution in the form of a program output, or a Jupyter notebook output, is not acceptable.
• You must provide your program sources. They must be added as a separate file (possibly zipped), sep- arated from your answers to the questions. They will not graded (so no direct credit for them), but not
providing them will result in a zero grade in questions that use them.
The program sources will be useful in order to understand why results are not correct (and decide whether partial credit can be given) and to ensure that your work is original.
You may use any programming language, although Python is strongly recommended. A Jupyter notebook is also acceptable as a program source.
• Late return policy. Papers will be accepted until March 13 without penalty. After this date, papers will be accepted with a 5% penalty on the grade until one week after the due date. No paper will be accepted after that extra week.
• Data files. The “.csv” files associated with this project are available on Canvas.
Problem 1
Note: This question must be solved without invoking any result on the diagonalization of symmetric matrices. Let A be an n x n symmetric matrix and Ω = Rn \ {0}. Define, for x ∈ Ω, the function
F (x) = xT Ax
(1) Using the fact that F(x) = F (x/|x|) for all x ∈ Ω, prove that argminΩ F and argmaxΩ F are not empty.
(2) Compute ⅤF(x) for x ∈ Ω and prove that xT ⅤF(x) = 0 for all x ∈ Ω.
(3) Prove that ⅤF(x) = 0 if and only if there exists λ ∈ R such that Ax = λx.
(4) Let
h(x) = Ax - x
Prove that, when ⅤF(x) ≠ 0, -h(x) is a direction of descent for F at x.
(5) Compute Ⅴ2F(x) at x ∈ Ω and show that xT Ⅴ2F(x)x = 0 for all x ∈ Ω.
(6) Let x ∈ Rn be such that Ax = λx for some λ ∈ R. Prove that x ∈ argminΩ F requires that A - λIdRn 2 0 and x ∈ argmaxΩ F that A - λIdRn s 0.
(7) For x ∈ Ω, let
v(x) = - F (x)2.
Prove that v(x) > 0 for all x and that v(x) = 0 if and only if ⅤF(x) = 0. (Hint: use Schwartz inequality and conditions under which it is an equality.)
(8) For α > 0 and x ∈ Ω, prove that x - αh(x) ∈ Ω.
(9) For α > 0 and x ∈ Ω, let
xα = |x - αh(x)| .
Prove that, when ⅤF(x) ≠ 0, F (xα) < F (x) for small enough α, and that α |→ F (xα) is, when |x| = 1, minimized at
α· (x) = -(w(x) - F (x)v(x)) + -(w(x) - F (x)v(x))2 + 4v(x)3
with w(x) = h(x)T Ah(x).
(10) Take e = 10-8 . Program an algorithm that takes as input a matrix A, an initial vector x0 with |x0 | = 1 and a maximal number of iterations, N, and iterates
xt+1 = xt - α· (xt)h(xt)
until t = N or |ⅤF(xt)| < e, whichever comes first.
Apply your algorithm to the matrix A in the file project2 A.csv, using N = 2000 and x0= ✶n/′n, where ✶n is the vector with all coordinates equal to 1. Return the number of iterations, tmax, needed by the algorithm and the final value of F (xt).
Plot the values of F(xt) as a function of t for t = 0, . . . , tmax . .
(11) Let x· ∈ argminΩ F . Assume that x 0(T)x· = 0 and show that x t(T)x· = 0 at each step of the preceding algorithm. Deduce from this that, under these assumptions, xt cannot converge to x· .
Problem 2
(1)
Let F : Rn → R be a C1 function. Prove that if x, u ∈ Rn are such that ⅤF(x)T u ≠ 0, then hu(x) = -(ⅤF(x)T u)u
is a direction of descent for F at x.
(2) Let e1, . . . , en be the canonical basis of Rn . Show that
hei (x) = -∂xi F (x)ei
Fix a small e > 0. Fix a sequence (it, t > 0) with it ∈ {1, . . . , N }. An algorithm that iterates
xt+1 =〈¥(╱¥¥)xt - αt∂xit F (xt)eit if |∂xit F (xt)| > e
is called a coordinate descent algorithm. This algorithm will be used in the next question.
Problem 3
(1) Let I c R be an interval. Prove that, if f : I c R is convex and non-decreasing, and ϕ : Rn → I is convex, then F = f o ϕ is convex.
(2) Prove that ψ : u |→ logcosh(|u|) is C1 and convex on Rn and give the expression of Ⅴψ (u).
(3) Assume that an integer d, and a set / of non-ordered pairs {i,j}, with 1 s i ≠ j s d are given. Let Ω be the vector space of all vectors indexed by /, i.e., the set of all
x = (x{i,j}, {i,j} ∈ /).
Alternatively, x ∈ Ω can be seen as a d x d symmetric matrix such that xij = 0 if {i,j} 廷 /. To lighten the notation, we write below xe = xij for e = {i,j} ∈ /.
Assume that a training set of vectors y1, . . . ,yN ∈ Rd is observed. Define, for x ∈ Ω, considered as a d x d matrix,
N n
F (x) = ψ (yk - xyk)
k=1 i =1
Prove that F is a convex function of x.
(4) Prove that
∂xij F (x) = - k (zi)yj) + z)yi))
with zk = yk - xyk .
(5) Write a program that reads the vectors y1, . . . ,yN in the file project2 Y.csv (with N = 50 and n = 100) and the locations of the non-zero entries of x in project C.csv and runs a gradient descent algorithm to minimize F .
Your program should define the sequence x(t) ∈ Ω satisfying
x(t + 1) = x(t) - αtⅤF(x(t))
initialized with x(0) = 0 (the zero matrix). The coefcient αt must be obtained using a backtracking line search, letting αt = ρrt where rt is the the smallest integer such that
F (x(t) - αtⅤF(x(t))) s F(x(t)) - c1 αt|ⅤF(x(t))|2.
You will take c1 = 0.01, = 0.1 and ρ = 0.9.
You will stop the program as soon as F(x(t - 1)) - F (x(t)) s 10-6 .
To describe the output of your program, provide the value of F at convergence, the largest element of x in absolute value and the number of iterations required.
(6) Order the elements of / as e1, . . . ,em as they are listed in the file project2 C.csv. Let qt = ej +1 if t = j(modm) (i.e., j is the remainder of the division of t by m, so that qt explores periodically all the pairs in /. For e = {i,j} ∈ /, let ξ(e) ∈ Ω be defined by ξ = 1 and ξ = 0 is {iJ,j J} ≠ e.
Program the coordinate descent algorithm in Question 2, taking e = 10-8 and
x(t + 1) =〈¥(╱¥¥)x(t) - αt∂xqt F (x(t))ξ(qt)
¥.x(t)
if |∂xqt F (x(t))| > e
otherwise
You will determine αt using the same method as in Question 3.5, taking = 1. You will stop the algorithm as soon as F (x(t - m)) - F (x(t)) s 10-6 (therefore using the di′erence between two full sweeps of coordinates).
Run your program using the data in project2 Y.csv. To describe its output, provide the value of F at convergence (hopefully the same as in Question 3.5), the largest element of x in absolute value and the number of iterations required.
2023-03-03