

Problem 1

We consider binary classification with linear logistic regression. Let x ∈ R^d and w ∈ R^d be a d-dimensional input vector and a parameter vector of the model, respectively. For the logistic regression, the classifier is represented by f(x) = 2·1[w⊤x > 0] − 1, where 1[c] denotes an indicator function that returns 1 if c is true and 0 otherwise. The optimization problem can be written as

\[
\hat{w} = \operatorname*{argmin}_{w} J(w), \qquad
J(w) := \sum_{i=1}^{n} \ln\bigl(1 + \exp(-y_i\, w^\top x_i)\bigr) + \lambda\, w^\top w .
\]

Here we assume that x_i contains a constant value 1 so that the classifier can handle an offset in the (d−1)-dimensional input space. We consider implementing some optimization methods in the following way.

1.  Implement the batch steepest gradient method.

2.  Implement a Newton-based method (a minimal sketch of both optimizers is given after this list).

3.  Compare the performance of the above two optimization methods by showing |J(w^{(t)}) − J(ŵ)| with respect to the iteration t, where ŵ is the optimal parameter that reaches the minimum of J obtained by (either of) the two methods, in a semi-log plot.

4.  Implement the Newton method and the simple steepest gradient method for the multiclass version of logistic regression (use Toy Dataset V) and run the same experiment as for the binary logistic regression mentioned above.
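As a starting point, here is a minimal NumPy sketch of the two binary-classification optimizers. The fixed step size, iteration counts, and function names are my own assumptions, not part of the assignment, and the multiclass extension of item 4 is left out.

import numpy as np

def grad_hess(w, X, y, lam):
    # Gradient and Hessian of J(w) = sum_i ln(1 + exp(-y_i w^T x_i)) + lam * w^T w
    z = y * (X @ w)                          # margins y_i w^T x_i
    s = 1.0 / (1.0 + np.exp(z))              # sigmoid(-z_i)
    grad = -(X.T @ (y * s)) + 2.0 * lam * w
    r = s * (1.0 - s)                        # per-sample curvature
    hess = X.T @ (X * r[:, None]) + 2.0 * lam * np.eye(X.shape[1])
    return grad, hess

def objective(w, X, y, lam):
    return np.sum(np.logaddexp(0.0, -y * (X @ w))) + lam * w @ w

def steepest_gradient(X, y, lam=0.01, eta=0.01, iters=2000):
    # Batch steepest gradient descent with a fixed step size (an assumption).
    w = np.zeros(X.shape[1])
    hist = []
    for _ in range(iters):
        g, _ = grad_hess(w, X, y, lam)
        w = w - eta * g
        hist.append(objective(w, X, y, lam))
    return w, hist

def newton(X, y, lam=0.01, iters=30):
    # Newton's method: w <- w - H^{-1} g.
    w = np.zeros(X.shape[1])
    hist = []
    for _ in range(iters):
        g, H = grad_hess(w, X, y, lam)
        w = w - np.linalg.solve(H, g)
        hist.append(objective(w, X, y, lam))
    return w, hist

For item 3, |J(w^{(t)}) − J(ŵ)| can then be plotted on a semi-log scale (e.g. with matplotlib's semilogy), using the better of the two final objective values as J(ŵ).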

Problem 2

We consider the lasso, where the square loss and the L1 regularization are employed for linear regression. In this problem, we employ the proximal gradient method (PG). To keep the discussion simple, we use the following objective:

\[
\hat{w} = \operatorname*{argmin}_{w} \Bigl( (w - \mu)^\top A\, (w - \mu) + \lambda \|w\|_1 \Bigr).
\]

Implement PG for the lasso and show the results under a couple of conditions (a minimal sketch appears at the end of this problem). In this question, use the constant learning rate η_t = 1/L, where L denotes the Lipschitz constant of the gradient of the smooth part of the objective, which is derived from the Hessian matrix 2A (i.e., use the maximum eigenvalue of 2A as the inverse of the learning rate, 1/η_t).

1.  Show the result of PG in terms of ‖w^{(t)} − ŵ‖ w.r.t. the number of iterations. Use a semi-log plot, with

\[
A = \begin{pmatrix} \,\cdot\, & \,\cdot\, \\ \,\cdot\, & \,\cdot\, \end{pmatrix}, \qquad
\mu = \begin{pmatrix} 1 \\ 2 \end{pmatrix}.
\]

To verify the property of the L1 regularization, run the experiment with λ = 2, 4, 6. Recall that the numerical result can be found in the slides used in the lecture. You can also verify the result by using cvx (MATLAB) / cvxopt (Python).
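A minimal proximal-gradient (soft-thresholding) sketch for this objective in Python/NumPy might look as follows. The matrix A used in the commented-out usage example is a placeholder assumption; substitute the matrix given in the problem.

import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1 (elementwise soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_lasso(A, mu, lam, iters=500):
    # Proximal gradient for min_w (w - mu)^T A (w - mu) + lam * ||w||_1,
    # with constant step size eta = 1/L, L = max eigenvalue of 2A.
    L = np.max(np.linalg.eigvalsh(2.0 * A))
    eta = 1.0 / L
    w = np.zeros_like(mu, dtype=float)
    traj = [w.copy()]
    for _ in range(iters):
        grad = 2.0 * A @ (w - mu)            # gradient of the smooth part
        w = soft_threshold(w - eta * grad, eta * lam)
        traj.append(w.copy())
    return w, np.array(traj)

# Usage (A below is a placeholder assumption; use the matrix from the problem sheet):
# A = np.array([[3.0, 0.5], [0.5, 1.0]])
# mu = np.array([1.0, 2.0])
# for lam in (2.0, 4.0, 6.0):
#     w_hat, traj = prox_grad_lasso(A, mu, lam)

Plotting ‖w^{(t)} − ŵ‖ on a semi-log scale, taking ŵ as the last iterate or the cvx/cvxopt solution, then visualizes the convergence required in item 1.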

Problem 3

We consider the dual of the support vector machine (an L2-regularized hinge-loss based binary classifier). The original optimization problem of this classification can be represented as


\[
\hat{w} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \Bigl( \sum_{i=1}^{n} \max(0,\, 1 - y_i\, w^\top x_i) + \lambda \|w\|_2^2 \Bigr), \tag{1}
\]


where x_i ∈ R^d, w ∈ R^d, y_i ∈ {±1}, and λ > 0 denote the i-th input vector, the parameter vector, the label of the i-th input, and the coefficient of the regularization term, respectively.


1.  Verify that the dual Lagrange function of this optimization can be written as

                                  (2)

where 1 and 0 denote the n-dimensional vectors whose elements are all 1 and 0, respectively, and K ∈ R^{n×n} denotes a symmetric square matrix whose i-th row, j-th column element is given by y_i y_j x_i^⊤ x_j.

2.  From the KKT conditions, verify that the optimal weight parameter ŵ given by α can be written as

\[
\hat{w} = \sum_{i=1}^{n} \alpha_i y_i x_i . \tag{3}
\]

3.  Implement minimization of the negative dual Lagrange function using projected gradient (for simplicity). To check validity, show the score of the dual Lagrange function (use (2)) and the sum of the hinge loss and the regularization term (use (1) and (3)), respectively, w.r.t. the number of iterations. Confirm that the duality gap reaches 0 at convergence. Here we assume that the projected gradient just computes

\[
\alpha^{(t)} = P_{[0,1]^n}\bigl(\alpha^{(t-1)} - \eta_t\,(K\alpha^{(t-1)} - \mathbf{1})\bigr),
\]

where η_t represents the learning rate at the t-th iteration, and P_{[0,1]^n} depicts a projection operator that casts each element of the input into [0, 1].
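A minimal projected-gradient sketch in Python/NumPy is shown below. The dual score it tracks is 1⊤α − (1/2)α⊤Kα, i.e., the objective whose gradient matches the update rule above, and w is recovered via (3); if your derivation of (2) and (3) carries additional constants such as 1/(2λ), rescale accordingly. The fixed learning rate is an assumption.

import numpy as np

def svm_dual_projected_gradient(X, y, lam, eta=1e-3, iters=2000):
    # Projected gradient on the negative dual, with projection onto [0, 1]^n.
    n = X.shape[0]
    Yx = y[:, None] * X
    K = Yx @ Yx.T                             # K_ij = y_i y_j x_i^T x_j
    alpha = np.zeros(n)
    dual_hist, primal_hist = [], []
    for _ in range(iters):
        grad = K @ alpha - np.ones(n)         # gradient of (1/2) a^T K a - 1^T a
        alpha = np.clip(alpha - eta * grad, 0.0, 1.0)
        dual_hist.append(alpha.sum() - 0.5 * alpha @ K @ alpha)   # dual score
        w = X.T @ (alpha * y)                 # primal weights via (3)
        primal = np.sum(np.maximum(0.0, 1.0 - y * (X @ w))) + lam * (w @ w)
        primal_hist.append(primal)            # hinge loss + regularization, as in (1)
    return alpha, w, dual_hist, primal_hist

Plotting primal_hist and dual_hist against the iteration count should show the duality gap shrinking toward 0, up to the scaling caveat above.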

Problem 4

We consider a binary classification problem, where the hinge loss function and L1 regularization are leveraged. Let x ∈ R^d be an input and w ∈ R^d be a parameter of the model, respectively. We consider a binary discriminant function with linear regression as f(x) = 2·1[w⊤x > 0] − 1, and the learning problem can be formalized as


 

\[
\hat{w} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \Bigl( \sum_{i=1}^{n} \max(0,\, 1 - y_i\, w^\top x_i) + \lambda \|w\|_1 \Bigr), \tag{4}
\]


where y_i ∈ {±1} denotes the binary label for the i-th training example.

1.  Derive a linear program from (4) (with auxiliary variables ξ_i ≥ max(0, 1 − y_i w⊤x_i) ≥ 0) of the form

\[
\operatorname*{minimize}_{z} \; c^\top z \qquad \text{subject to} \;\; A z \le b .
\]

(See LpBoost, which deals with an LP derived from an L1-regularized hinge-loss model.)

2.  Using an artificial dataset (see the Toy Dataset section, Dataset IV), implement this problem via cvx (in MATLAB) / cvxopt (in Python) (just for reference) and a (batch) proximal subgradient method. Then confirm that the parameter properly converges. The specification of the dataset should be described in the report.
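For the proximal subgradient part, a minimal Python/NumPy sketch under my own assumptions (a diminishing step size η_t = η_0/√t and zero initialization) could be:

import numpy as np

def soft_threshold(v, tau):
    # Prox of tau * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_subgrad_hinge_l1(X, y, lam, eta0=0.1, iters=2000):
    # Batch proximal subgradient for min_w sum_i max(0, 1 - y_i w^T x_i) + lam ||w||_1:
    # take a subgradient step on the hinge-loss sum, then apply the L1 prox.
    n, d = X.shape
    w = np.zeros(d)
    hist = [w.copy()]
    for t in range(1, iters + 1):
        margins = y * (X @ w)
        active = margins < 1.0                 # samples with nonzero hinge loss
        subgrad = -(X[active].T @ y[active])   # a subgradient of the hinge sum
        eta = eta0 / np.sqrt(t)
        w = soft_threshold(w - eta * subgrad, eta * lam)
        hist.append(w.copy())
    return w, np.array(hist)

Comparing the final w with the cvx/cvxopt LP solution on Dataset IV is a convenient convergence check.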


Dataset IV

n = 200;
x = 3 * (rand(n, 4) - 0.5);
y = (2 * x(:, 1) - 1 * x(:, 2) + 0.5 + 0.5 * randn(n, 1)) > 0;
y = 2 * y - 1;

Dataset V

n = 200;
x = 3 * (rand(n, 4) - 0.5);
W = [2, -1, 0.5;
     -3, 2, 1;
     1, 2, 3];
[maxlogit, y] = max([x(:, 1:2), ones(n, 1)] * W' + 0.5 * randn(n, 3), [], 2);
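For those working in Python, the following NumPy translation of the two generators (my own translation, with an arbitrary seed) may be convenient:

import numpy as np

rng = np.random.default_rng(0)
n = 200

# Dataset IV: binary labels from a noisy linear rule on the first two features.
x4 = 3 * (rng.random((n, 4)) - 0.5)
y4 = (2 * x4[:, 0] - 1 * x4[:, 1] + 0.5 + 0.5 * rng.standard_normal(n)) > 0
y4 = 2 * y4 - 1

# Dataset V: 3-class labels as the argmax of noisy linear logits.
x5 = 3 * (rng.random((n, 4)) - 0.5)
W = np.array([[2.0, -1.0, 0.5],
              [-3.0, 2.0, 1.0],
              [1.0, 2.0, 3.0]])
logits = np.hstack([x5[:, :2], np.ones((n, 1))]) @ W.T + 0.5 * rng.standard_normal((n, 3))
y5 = np.argmax(logits, axis=1)   # classes 0..2 (MATLAB's max returns indices 1..3)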