

Problem 1

We consider binary classification with linear logistic regression. Let x ∈ R^d and w ∈ R^d be a d-dimensional input vector and a parameter vector of the model, respectively. For the logistic regression, the classifier is represented by f(x) = 2·1[w⊤x > 0] − 1, where 1[c] denotes an indicator function that returns 1 if c is true and 0 otherwise. The optimization problem can be written as

\[
\hat{w} = \operatorname*{argmin}_{w} J(w), \qquad
J(w) := \sum_{i=1}^{n} \ln\bigl(1 + \exp(-y_i\, w^\top x_i)\bigr) + \lambda\, w^\top w .
\]

Here we assume that x_i contains a constant value 1 so that the classifier can handle an offset in the (d−1)-dimensional input space. We consider implementing some optimization methods in the following way.

1.  Implement the batch steepest gradient method.

2.  Implement a Newton-based method (a minimal sketch of both optimizers is given after this list).

3.  Compare the performance of the above two optimization methods by showing |J(w^{(t)}) − J(ŵ)| with respect to the iteration t, where ŵ is the optimal parameter that reaches the minimum of J obtained by (either of) the two methods, in a semi-log plot.

4.  Implement the Newton method and the simple steepest gradient method for the multiclass version of logistic regression (use Toy Dataset V) and run the same experiment as for the binary logistic regression mentioned above.
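As a starting point, here is a minimal NumPy sketch of the two binary-classification optimizers. The fixed step size, iteration counts, and function names are my own assumptions, not part of the assignment, and the multiclass extension of item 4 is left out.

import numpy as np

def grad_hess(w, X, y, lam):
    # Gradient and Hessian of J(w) = sum_i ln(1 + exp(-y_i w^T x_i)) + lam * w^T w
    z = y * (X @ w)                          # margins y_i w^T x_i
    s = 1.0 / (1.0 + np.exp(z))              # sigmoid(-z_i)
    grad = -(X.T @ (y * s)) + 2.0 * lam * w
    r = s * (1.0 - s)                        # per-sample curvature
    hess = X.T @ (X * r[:, None]) + 2.0 * lam * np.eye(X.shape[1])
    return grad, hess

def objective(w, X, y, lam):
    return np.sum(np.logaddexp(0.0, -y * (X @ w))) + lam * w @ w

def steepest_gradient(X, y, lam=0.01, eta=0.01, iters=2000):
    # Batch steepest gradient descent with a fixed step size (an assumption).
    w = np.zeros(X.shape[1])
    hist = []
    for _ in range(iters):
        g, _ = grad_hess(w, X, y, lam)
        w = w - eta * g
        hist.append(objective(w, X, y, lam))
    return w, hist

def newton(X, y, lam=0.01, iters=30):
    # Newton's method: w <- w - H^{-1} g.
    w = np.zeros(X.shape[1])
    hist = []
    for _ in range(iters):
        g, H = grad_hess(w, X, y, lam)
        w = w - np.linalg.solve(H, g)
        hist.append(objective(w, X, y, lam))
    return w, hist

For item 3, |J(w^{(t)}) − J(ŵ)| can then be plotted on a semi-log scale (e.g. with matplotlib's semilogy), using the better of the two final objective values as J(ŵ).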

Problem 2

We consider the lasso, where the square loss and the L1 regularization are employed for linear regression. In this problem, we employ the proximal gradient method (PG). To keep the discussion simple, we use the following objective:

\[
\hat{w} = \operatorname*{argmin}_{w} \Bigl( (w - \mu)^\top A\, (w - \mu) + \lambda \|w\|_1 \Bigr).
\]

Implement PG for the lasso and show the results under a couple of conditions (a minimal sketch appears at the end of this problem). In this question, use the constant learning rate η_t = 1/L, where L denotes the Lipschitz constant of the gradient of the smooth part of the objective, which is derived from the Hessian matrix 2A (i.e., use the maximum eigenvalue of 2A as the inverse of the learning rate, 1/η_t).

1.  Show the result of PG in terms of ‖w^{(t)} − ŵ‖ w.r.t. the number of iterations. Use a semi-log plot, with

\[
A = \begin{pmatrix} \,\cdot\, & \,\cdot\, \\ \,\cdot\, & \,\cdot\, \end{pmatrix}, \qquad
\mu = \begin{pmatrix} 1 \\ 2 \end{pmatrix}.
\]

To verify the property of the L1 regularization, run the experiment with λ = 2, 4, 6. Recall that the numerical result can be found in the slides used in the lecture. You can also verify the result by using cvx (MATLAB) / cvxopt (Python).
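A minimal proximal-gradient (soft-thresholding) sketch for this objective in Python/NumPy might look as follows. The matrix A used in the commented-out usage example is a placeholder assumption; substitute the matrix given in the problem.

import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1 (elementwise soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_lasso(A, mu, lam, iters=500):
    # Proximal gradient for min_w (w - mu)^T A (w - mu) + lam * ||w||_1,
    # with constant step size eta = 1/L, L = max eigenvalue of 2A.
    L = np.max(np.linalg.eigvalsh(2.0 * A))
    eta = 1.0 / L
    w = np.zeros_like(mu, dtype=float)
    traj = [w.copy()]
    for _ in range(iters):
        grad = 2.0 * A @ (w - mu)            # gradient of the smooth part
        w = soft_threshold(w - eta * grad, eta * lam)
        traj.append(w.copy())
    return w, np.array(traj)

# Usage (A below is a placeholder assumption; use the matrix from the problem sheet):
# A = np.array([[3.0, 0.5], [0.5, 1.0]])
# mu = np.array([1.0, 2.0])
# for lam in (2.0, 4.0, 6.0):
#     w_hat, traj = prox_grad_lasso(A, mu, lam)

Plotting ‖w^{(t)} − ŵ‖ on a semi-log scale, taking ŵ as the last iterate or the cvx/cvxopt solution, then visualizes the convergence required in item 1.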

Problem 3

We consider the dual of the support vector machine (an L2-regularized hinge-loss based binary classifier). The original optimization problem of this classification can be represented as


\[
\hat{w} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \Bigl( \sum_{i=1}^{n} \max(0,\, 1 - y_i\, w^\top x_i) + \lambda \|w\|_2^2 \Bigr), \tag{1}
\]


where x_i ∈ R^d, w ∈ R^d, y_i ∈ {±1}, and λ > 0 denote the i-th input vector, the parameter vector, the label of the i-th input, and the coefficient of the regularization term, respectively.


1.  Verify that the dual Lagrange function of this optimization can be written as

                                  (2)

where 1 and 0 denote the n-dimensional vectors whose elements are all 1 and 0, respectively, and K ∈ R^{n×n} denotes a symmetric square matrix whose i-th row, j-th column element is given by y_i y_j x_i^⊤ x_j.

2.  From the KKT conditions, verify that the optimal weight parameter ŵ given by α can be written as

\[
\hat{w} = \sum_{i=1}^{n} \alpha_i y_i x_i . \tag{3}
\]

3.  Implement minimization of the negative dual Lagrange function using projected gradient (for simplicity). To check validity, show the score of the dual Lagrange function (use (2)) and the sum of the hinge loss and the regularization term (use (1) and (3)), respectively, w.r.t. the number of iterations. Confirm that the duality gap reaches 0 at convergence. Here we assume that the projected gradient just computes

\[
\alpha^{(t)} = P_{[0,1]^n}\bigl(\alpha^{(t-1)} - \eta_t\,(K\alpha^{(t-1)} - \mathbf{1})\bigr),
\]

where η_t represents the learning rate at the t-th iteration, and P_{[0,1]^n} depicts a projection operator that casts each element of the input into [0, 1].
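A minimal projected-gradient sketch in Python/NumPy is shown below. The dual score it tracks is 1⊤α − (1/2)α⊤Kα, i.e., the objective whose gradient matches the update rule above, and w is recovered via (3); if your derivation of (2) and (3) carries additional constants such as 1/(2λ), rescale accordingly. The fixed learning rate is an assumption.

import numpy as np

def svm_dual_projected_gradient(X, y, lam, eta=1e-3, iters=2000):
    # Projected gradient on the negative dual, with projection onto [0, 1]^n.
    n = X.shape[0]
    Yx = y[:, None] * X
    K = Yx @ Yx.T                             # K_ij = y_i y_j x_i^T x_j
    alpha = np.zeros(n)
    dual_hist, primal_hist = [], []
    for _ in range(iters):
        grad = K @ alpha - np.ones(n)         # gradient of (1/2) a^T K a - 1^T a
        alpha = np.clip(alpha - eta * grad, 0.0, 1.0)
        dual_hist.append(alpha.sum() - 0.5 * alpha @ K @ alpha)   # dual score
        w = X.T @ (alpha * y)                 # primal weights via (3)
        primal = np.sum(np.maximum(0.0, 1.0 - y * (X @ w))) + lam * (w @ w)
        primal_hist.append(primal)            # hinge loss + regularization, as in (1)
    return alpha, w, dual_hist, primal_hist

Plotting primal_hist and dual_hist against the iteration count should show the duality gap shrinking toward 0, up to the scaling caveat above.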

Problem 4

We consider a binary classification problem, where the hinge loss function and L1 regularization are leveraged. Let x ∈ R^d be an input and w ∈ R^d be a parameter of the model, respectively. We consider a binary discriminant function with linear regression as f(x) = 2·1[w⊤x > 0] − 1, and the learning problem can be formalized as


 

\[
\hat{w} = \operatorname*{argmin}_{w \in \mathbb{R}^d} \Bigl( \sum_{i=1}^{n} \max(0,\, 1 - y_i\, w^\top x_i) + \lambda \|w\|_1 \Bigr), \tag{4}
\]


where y_i ∈ {±1} denotes the binary label for the i-th training example.

1.  Derive a linear program from (4) (with auxiliary variables ξ_i ≥ max(0, 1 − y_i w⊤x_i) ≥ 0) of the form

\[
\operatorname*{minimize}_{z} \; c^\top z \qquad \text{subject to} \;\; A z \le b .
\]

(See LpBoost, which deals with an LP derived from an L1-regularized hinge-loss model.)

2.  Using an artificial dataset (see the Toy Dataset section, Dataset IV), implement this problem via cvx (in MATLAB) / cvxopt (in Python) (just for reference) and a (batch) proximal subgradient method. Then confirm that the parameter properly converges. The specification of the dataset should be described in the report.
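For the proximal subgradient part, a minimal Python/NumPy sketch under my own assumptions (a diminishing step size η_t = η_0/√t and zero initialization) could be:

import numpy as np

def soft_threshold(v, tau):
    # Prox of tau * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_subgrad_hinge_l1(X, y, lam, eta0=0.1, iters=2000):
    # Batch proximal subgradient for min_w sum_i max(0, 1 - y_i w^T x_i) + lam ||w||_1:
    # take a subgradient step on the hinge-loss sum, then apply the L1 prox.
    n, d = X.shape
    w = np.zeros(d)
    hist = [w.copy()]
    for t in range(1, iters + 1):
        margins = y * (X @ w)
        active = margins < 1.0                 # samples with nonzero hinge loss
        subgrad = -(X[active].T @ y[active])   # a subgradient of the hinge sum
        eta = eta0 / np.sqrt(t)
        w = soft_threshold(w - eta * subgrad, eta * lam)
        hist.append(w.copy())
    return w, np.array(hist)

Comparing the final w with the cvx/cvxopt LP solution on Dataset IV is a convenient convergence check.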


Dataset IV

n = 200;
x = 3 * (rand(n, 4) - 0.5);
y = (2 * x(:, 1) - 1 * x(:, 2) + 0.5 + 0.5 * randn(n, 1)) > 0;
y = 2 * y - 1;

Dataset V

n = 200;
x = 3 * (rand(n, 4) - 0.5);
W = [2, -1, 0.5;
     -3, 2, 1;
     1, 2, 3];
[maxlogit, y] = max([x(:, 1:2), ones(n, 1)] * W' + 0.5 * randn(n, 3), [], 2);
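For those working in Python, the following NumPy translation of the two generators (my own translation, with an arbitrary seed) may be convenient:

import numpy as np

rng = np.random.default_rng(0)
n = 200

# Dataset IV: binary labels from a noisy linear rule on the first two features.
x4 = 3 * (rng.random((n, 4)) - 0.5)
y4 = (2 * x4[:, 0] - 1 * x4[:, 1] + 0.5 + 0.5 * rng.standard_normal(n)) > 0
y4 = 2 * y4 - 1

# Dataset V: 3-class labels as the argmax of noisy linear logits.
x5 = 3 * (rng.random((n, 4)) - 0.5)
W = np.array([[2.0, -1.0, 0.5],
              [-3.0, 2.0, 1.0],
              [1.0, 2.0, 3.0]])
logits = np.hstack([x5[:, :2], np.ones((n, 1))]) @ W.T + 0.5 * rng.standard_normal((n, 3))
y5 = np.argmax(logits, axis=1)   # classes 0..2 (MATLAB's max returns indices 1..3)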