
EE 559

Homework 6

2022

1.    This problem is to be done by hand.  In the parts below, you will set up the primal Lagrangian and derive the dual Lagrangian for a support vector machine classifier, for the case of data that is linearly separable in the expanded feature space.

For SVM learning, we stated the optimization problem as:

Minimize  $J(w) = \tfrac{1}{2}\,\|w\|^2$

s.t.  $z_i\,(w^T u_i + w_0) - 1 \ge 0 \quad \forall i$

Assume our data is linearly separable in the expanded feature space.  Also, nonaugmented notation is used throughout this problem.

(a)   If the above set of constraints (the second line of equations above) is satisfied, will all the training data be correctly classified?

(b)   Write the Lagrangian function L(w, w_0, λ) for the minimization problem stated above.  Use λ_i, i = 1, 2, …, N for the Lagrange multipliers.  Also state the KKT conditions.  (Hint: there are 3 KKT conditions.)
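For reference, the general pattern (not the answer specific to this problem): for minimization of f(x) subject to inequality constraints g_i(x) ≥ 0, the Lagrangian and the multiplier-related KKT conditions take the standard form

$L(x, \lambda) = f(x) - \sum_i \lambda_i\, g_i(x), \qquad \lambda_i \ge 0, \quad g_i(x) \ge 0, \quad \lambda_i\, g_i(x) = 0 \;\; \forall i.$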

(c)   Derive the dual Lagrangian L_D, by proceeding as follows:

(i)     Minimize L  w.r.t. the weights.

Hint: solve ∇_w L = 0 for the optimal weight w* (in terms of λ_i and other variables); also set ∂L/∂w_0 = 0.

(ii)    Substitute your expressions from part (i) into L, and use your expression from ∂L/∂w_0 = 0, to show that the result can be written:

$L_D(\lambda) = -\tfrac{1}{2}\sum_i \sum_j \lambda_i \lambda_j z_i z_j\, u_i^T u_j + \sum_i \lambda_i$

subject to the (new) constraint  $\sum_i \lambda_i z_i = 0$, which becomes a new KKT condition.

Also give the other two KKT conditions on λ_i, which carry over from the primal form.

 

2.    In this problem you will do SVM learning on a small set of given data points, using the result from Problem 1 above.  Problem 2 also uses nonaugmented notation.

Coding.  This problem involves some work by hand, and some solution by coding.  As in previous homework assignments, in this problem you are required to write the code yourself; you may use Python built-in functions, NumPy, and matplotlib; and you may use pandas only for reading and/or writing csv files.  For the plotting, we are not supplying plotting function(s) for you; you may use this as an opportunity to (learn, explore, and) use matplotlib functions as needed.  Alternatively, you may do the plots by hand if you prefer.

Throughout this problem, let the dataset have N = 3 points.  We will work entirely in the expanded feature space (u-space).

(a)   Derive by hand a set of equations, with λ_i and µ as variables, and u_i, z_i, i = 1, 2, 3 as given constants, that when solved will give the SVM decision boundary and regions.  To do this, start the Lagrangian process to optimize L_D w.r.t. λ and µ, subject to the equality constraint $\sum_i \lambda_i z_i = 0$.  Set all the derivatives equal to 0 to obtain the set of equations.

Hint: Start from

$L_D(\lambda, \mu) = \sum_i \lambda_i - \tfrac{1}{2}\sum_i \sum_j \lambda_i \lambda_j z_i z_j\, u_i^T u_j + \mu \sum_i z_i \lambda_i$,

in which the last term has been added to incorporate the equality constraint stated above.

Finally, formulate your set of equations as a matrix-vector equation of the form:

$A \rho = b$

in which ρ = [λ^T, µ]^T = [λ_1, λ_2, λ_3, µ]^T.  Give your expressions for the matrix A and the vector b.

Tip:  If you’d like a quick check to see if you’re on the right track, try plugging in (by hand) for this simple dataset:

u_1 = (0, 1)^T,  u_2 = (0, 0)^T ∈ class 1  (z_1 = z_2 = 1);     u_3 = (1, 1)^T ∈ class 2  (z_3 = −1).  The first row of A should be [1   0  −1  −1] and the first entry of b should be 1.
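If you would like to check this numerically as well, the short sketch below builds one candidate arrangement of A and b directly from the derivative equations of L_D(λ, µ) above, using the tip dataset as reconstructed here; the array names G, A, b, u, z are illustrative choices only, not something the assignment prescribes.

```python
import numpy as np

# Tip dataset (as reconstructed above): u_1, u_2 in class 1, u_3 in class 2.
u = np.array([[0.0, 1.0],
              [0.0, 0.0],
              [1.0, 1.0]])        # rows are u_i
z = np.array([1.0, 1.0, -1.0])    # labels z_i

# One possible arrangement of the part-(a) equations:
#   sum_j z_i z_j (u_i . u_j) lambda_j  -  z_i mu  =  1    (d/d lambda_i = 0)
#   sum_i z_i lambda_i                             =  0    (d/d mu = 0)
G = (z[:, None] * z[None, :]) * (u @ u.T)     # entries z_i z_j u_i^T u_j
A = np.block([[G, -z[:, None]],
              [z[None, :], np.zeros((1, 1))]])
b = np.array([1.0, 1.0, 1.0, 0.0])

print(A[0])   # compare with the tip:  [ 1.  0. -1. -1.]
print(b[0])   # compare with the tip:  1.0
```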

 

In the parts below you will use a computer to solve this matrix equation for λ and µ, then calculate, plot, and interpret the results, for 3 different datasets.

For parts (b)-(e) below, use the following dataset:

u_1 = (2, 1)^T,   u_2 = (1, 2)^T   ∈ class 1

u_3 = (0, 0)^T   ∈ class 2

For all parts below, consider class 1 to be the positive class (z_i = +1).

 

(b)   Write code in Python to solve for λ and µ, calculate w* and w_0*, and check the KKT conditions.  (A rough illustrative sketch follows the list of steps below.)

Tip: Use variables u_i, z_i, i = 1, 2, 3, such that you can input their values for different datasets.

Specifically, the code should:

(i)     Use NumPy to invert your matrix, and to calculate the resulting values for λ and µ.

(ii)    Check that the resulting λ satisfies the KKT conditions involving λ (but not involving w) that you stated in Problem 1(c)(ii).

(iii)   Calculate the optimal (nonaugmented) weight vector w* by using your result from Problem 1(c)(i).  And, find the optimal bias term w_0* using one of the KKT conditions from Problem 1(b).

(iv)   Check that the resulting w* and w_0* satisfy the KKT conditions on w and w_0 of Pr. 1(c).

(v)    Output the results of (i)-(iv) above.
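For orientation only, here is a minimal sketch of steps (i)-(v), written against the tip dataset from part (a) rather than the assigned datasets; the arrangement of A and b, the tolerances, and all variable names are my own assumptions, so adapt it to your own derivation.

```python
import numpy as np

# Replace with the u_i, z_i of whichever dataset you are running.
u = np.array([[0.0, 1.0], [0.0, 0.0], [1.0, 1.0]])   # u_i as rows
z = np.array([1.0, 1.0, -1.0])                        # labels z_i

# Build A and b as in part (a) (one possible arrangement; see the earlier sketch).
G = (z[:, None] * z[None, :]) * (u @ u.T)
A = np.block([[G, -z[:, None]], [z[None, :], np.zeros((1, 1))]])
b = np.array([1.0, 1.0, 1.0, 0.0])

# (i)  Invert the matrix and solve for lambda and mu.
rho = np.linalg.inv(A) @ b
lam, mu = rho[:3], rho[3]

# (ii) KKT checks involving lambda only.
print("lambda_i >= 0 :", np.all(lam >= -1e-9))
print("sum_i lambda_i z_i = 0 :", np.isclose(lam @ z, 0.0))

# (iii) Weight vector w* = sum_i lambda_i z_i u_i, and bias w0* from a
#       support vector s (lambda_s > 0), using z_s (w.u_s + w0) = 1.
w = (lam * z) @ u
s = int(np.argmax(lam))
w0 = z[s] - w @ u[s]

# (iv) KKT checks involving w and w0.
slack = z * (u @ w + w0) - 1.0
print("z_i (w.u_i + w0) - 1 >= 0 :", np.all(slack >= -1e-9))
print("lambda_i * slack_i = 0    :", np.allclose(lam * slack, 0.0))

# (v)  Report results.
print("lambda =", lam, " mu =", mu, " w* =", w, " w0* =", w0)
```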

(c)   Run your code on the given dataset;  give the resulting values for λ and µ , and the results of the KKT-conditions check on λ .

Also give the resulting w* and w_0*, and the results of the KKT-conditions check on w* and w_0*.

(d)   Plot in 2D nonaugmented feature (u) space:  the data points showing their class labels, the decision boundary defined by w* and w_0*, and an arrow showing which side of the boundary is class 1.  While you are encouraged to do (some or all of) the plot by computer, you can do some or all of it by hand if you prefer, as long as it is neat and reasonably accurate.
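If you choose to plot by computer, a rough matplotlib sketch along these lines may help you get started; the points shown are the tip dataset from part (a) and the weights are placeholders, so substitute your own dataset and the w*, w_0* computed in part (b).

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder inputs: substitute your dataset and the w*, w0* from part (b).
u = np.array([[0.0, 1.0], [0.0, 0.0], [1.0, 1.0]])
z = np.array([1, 1, -1])
w, w0 = np.array([1.0, 1.0]), -1.0     # NOT a solution; placeholder values only

# Data points, by class.
plt.scatter(*u[z == 1].T, marker='o', label='class 1 (z = +1)')
plt.scatter(*u[z == -1].T, marker='x', label='class 2 (z = -1)')

# Decision boundary w.x + w0 = 0 (assumes w[1] != 0).
x1 = np.linspace(-1.0, 2.0, 100)
x2 = -(w[0] * x1 + w0) / w[1]
plt.plot(x1, x2, 'k-', label='decision boundary')

# Arrow from a boundary point toward the positive (class 1) side, i.e. along w.
p = np.array([x1[50], x2[50]])
plt.annotate('class 1 side', xy=p + 0.5 * w / np.linalg.norm(w), xytext=p,
             arrowprops=dict(arrowstyle='->'))

plt.axis('equal')
plt.xlabel('u1'); plt.ylabel('u2'); plt.legend()
plt.show()
```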

(e)   Looking at the plot, does the decision boundary correctly classify the training data?

And if yes, does it look like the maximum-margin boundary that will correctly classify the data?   Briefly justify.

(f)(i)-(iii)   Repeat parts (c) – (e) except for the following dataset:

u_1 = (2, 1)^T,   u_2 = (1, 2)^T   ∈ class 1

u_3′ = (1, 1)^T   ∈ class 2

in which the third data point has been changed.

(iv)   Explain the change, or lack of change, in the decision boundary from (d) to (f).

(g)    (i)     How do you think the boundary will change (relative to (f)) if we instead use the following data:

u_1 = (2, 1)^T,   u_2 = (1, 2)^T   ∈ class 1

u_3″ = (10, 5)^T   ∈ class 2

in which the third data point has (yet again) been changed?

(ii)-(iv)  Try it by repeating (c)-(e) except with these data points.

(v)    Explain any change or lack of change in the decision boundary as compared with (d) and (f).

 

Comment:  This method of implementation (Pr. 2 using matrix inverse) is useful for working with algebraic expressions, theory, and seeing/verifying how things work.  For larger datasets, other implementation methods are typically used, such as quadratic programming or variations on gradient descent designed for this type of optimization.
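As a side illustration of the kind of alternative alluded to here (not something this assignment asks for), a plain subgradient descent on the soft-margin objective (1/2)‖w‖² + C Σ_i max(0, 1 − z_i (w·u_i + w_0)) might look like the sketch below; the dataset, C, step size, and iteration count are arbitrary choices.

```python
import numpy as np

# Arbitrary small dataset for illustration.
u = np.array([[0.0, 1.0], [0.0, 0.0], [1.0, 1.0]])
z = np.array([1.0, 1.0, -1.0])
C, eta, iters = 10.0, 0.005, 20000

w, w0 = np.zeros(2), 0.0
for _ in range(iters):
    margins = z * (u @ w + w0)
    viol = margins < 1.0                      # points violating the margin
    grad_w = w - C * (z[viol][:, None] * u[viol]).sum(axis=0)
    grad_w0 = -C * z[viol].sum()
    w -= eta * grad_w
    w0 -= eta * grad_w0

print("approximate w, w0:", w, w0)
```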