闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

EE 559 Sample Final

1. For the following classiﬁcation problem, design a single-layer perceptron that has zero error on training set.

C1 = {21 = ┌ 1(1) ┐ , 22 = ┌ 2(1) ┐}

C2 = {23 = ┌ 2_1 ┐ , 24 = ┌ 0(2) ┐}

C3 = {25 = ┌ _21 ┐ , 26 = ┌ _12 ┐}

C4 = {23 = ┌ _(_)1(1) ┐ , 24 = ┌ _(_)2(2) ┐}

so|ution: The patterns are shown in the following ﬁgure:

And tentative decision boundaries can be those in the following ﬁgure:

It is left to the student to verify that a single classify the vectors given that the four classes

C1 - t1 = ┌ C2 - t2 = ┌

┌

layer perceptron with two neurons can are modeled using vectors

_(_)1(1) ┐

_1 ┐

1_1 ┐

1(1) ┐

and the weight vectors are columns of the following matrix

疝 = ┐

and the biases in the vector:

b = ┌ 0(1) ┐

The architecture of the Perceptron is:

x2 x 1

f(WTx+b) = sign(WTx+b)

2. Consider the Mean-Squared Error criterion for determining Linear Discriminants.

(a) Reformulate the problem so that the weight vector w is regularized with a Ridge

(/2 ) penalty. In your formulation, se x, the matrix of reﬂected patterns, whose rows are reﬂected patterns zi2i and remember that the patterns are represented in the augmented space, but for simplicity the prime symbol was dropped. Also assume that the elements of the vector of target values for zi2i ’s are denoted as bi ’s.

(b) Provide a Gradient Descent solution to the optimization problem, so as to formu-

late the /2-regularized Widrow-Hoﬀ algorithm.

(c) Solve the optimization problem for determining w algebraically, to obtain an /2-regularized pseudo-inverse algorithm. You can assume nonsingularity of any matrices that you need to invert.

(d) Apply the Kernel trick to your /2-regularized pseudo-inverse algorithm, i.e. rep- resent your training vectors in an expanded feature space via the transformation u = ϕ(2), and show that you can predict the discriminant score of each data point in the new feature space without explicitly knowing ϕ(.), using the Kernel function κ(2i , 2j ) = [ϕ(2i )]T [ϕ(2j )].

Hint: You may ﬁnd the following identity useful:

(λ| + 扫A)_1 扫 = 扫(λ| + A扫)_1

Hint: You may have compact notations if you use the signed Kernel matrix Q = [Qij ], where Qij = zi zj κ(2i , 2j )

良n|utinl:

(a) The regularized objective function is

J(x) = |kx _ b|2 + λ|x|2

(b) Calculating the gradient of the objective function:

Vw J = Vw ┌ (kx _ b)T (kx _ b) + λxT x] = Vw ┌xT kT kx _ 2xT kT b + λxT x┐

= 2kT kx _ 2kT b + 2λx

= 2(k k + λI)xT _ 2kT b

The regularized Widrow-Hoﬀ algorithm becomes:

x(i + 1) = x(i) _ η(i) ╱kT k + λ|)x(i) _ kT b┐ = [1 _ η(i)λ] x(i) _ η(i)kT (kx(i) _ b)

which can be seen as a Widrow-Hoﬀ algorithm with a decay (forgetting) factor 1 _ η(i)λ on x(i).

(c)

Vw J = 0 ÷ 2(kT k + λ|)x _ 2kT b = 0 ÷ x = (kT k + λ|)_1 kT b

(d) Using the identity,

(kT k + λ|)_1 kT = kT (λ| + kkT )_1

Remember that the kernel trick uses a representation of features u = ϕ(2). Let us call the data matrix whose rows are zi ui(T) = zi [ϕ(2i )]T , 齿. Then the weight vector of the linear discriminant the new feature space is:

x = (齿T 齿 + λ|)_1 齿T b = 齿T (λ| + 齿齿T )_1 b Recall that we called 齿齿T , 俭 (why?). Then:

x = 齿T (λ| + 俭)_1 b

The discriminant score of some test point 2r is:

g(2r ) = xT ϕ(2r ) = bT (λ| + 俭)_1 齿ϕ(2r )

However, rows of 齿 are zi ϕ(2i )T ’s, so the last term in the above expression k(2r ) = 齿ϕ(2r ) is a vector whose elements are zi ϕ(2i )T ϕ(2r ) = zi κ(2i , 2r ). Therefore, the discriminant score of any new point can be written only based on the kernel function and without explicit knowledge of the transformation ϕ(.), i.e.:

g(2r ) = xT ϕ(2r ) = bT (λ| + 女)_1 k(2r )

Note that we have assumed that we used that the expanded feature space is augmented, so the dicriminant function in the expanded feature space does not contain an implicit w0 term.

3. Answer the following questions on PCA and Fisher’s LDA:

(a) In the following Figure, draw the ﬁrst principal component direction in the left

ﬁgure, and Fisher’s linear discriminant direction in the right ﬁgure. For linear discriminant, consider round points as the positive class, and both diamond and square points as the negative class. You do not need to determine the directions computationally.

(b) Consider 3 data points in R2 : 21 = [_1 1]T , 22 = [0 0]T , 23 = [1 _ 1]T .

i. What are the ﬁrst and second principal component directionss (write down the actual vectors)?

ii. If we project the original data points into the new coordinate system rep- resented by the principal component directions, what are their coordinates? And what is the variance of the data in each direction? Verify that it is equal to the total variance of the data.

良n|utinl:

(a)

(b) You do not solve any eigenvalue or SVD problem to see that the ﬁrst and second principal directions are e1 = [_ ]T and e2 = [ ]T . Note that _e1 and _e2 are also acceptable solutions.

(c) The coordinates of the data in the new coordinate system are computed as: _i = [2i(T)e1 2i(T)e2]T . Therefore, _1 = [^2 0]T , _2 = [0 0]T , _3 = [_^2 0]T . The variance of the ﬁrst component is i(3)=1 zi(2)1 = (Note that both the original data and principal components are centered). The variance of the second prin- cipal component is obviously zero. The total variance of the original data is: ((_1)2 + 02 + 12 ) + ((1)2 + 02 + (_1)2 ) = , which is equivalent to the total variance of the principal components. The second principal component is actually redundant.

4. Answer the following questions about the ﬁgure:

(a) What is the leave-one-out cross-validation error estimate for maximum margin

separation in the ﬁgure? (The answer is a number)

(b) True or False? We would expect the support vectors to remain the same in general

as we move from a linear kernel to higher order polynomial kernels.

(c) If each of the classes is modeled with a normal distribution and the variances are assumed equal, how many parameters are needed to describe the decision boundary for classiﬁcation?

(d) Assume that all data points s.tside the margin are unlabeled. Mark on the ﬁgure with a - the ﬁrst data point that has to be labeled, when implementing active learning. Also, mark on the ﬁgure with a + the ﬁrst data point that has to be labeled, when implementing self-training.

so|ution:

(a) Based on the ﬁgure we can see that removing any single point would not chance the

resulting maximum margin separator. Since all the points are initially classiﬁed correctly, the leave-one-out error is zero.

(b) There are no guarantees that the support vectors remain the same. The feature

vectors corresponding to polynomial kernels are non-linear functions of the original input vectors and thus the support points for maximum margin separation in the feature space can be quite diﬀerent.

(c) Modeling classes with normal densities with euqal variances creates linear decision boundaries. A linear decision boundary with two variables x1 and x2 is described by three parameters (w0 , w1 , w2 ), i.e.: g(x1 , x2 ) = w0 + w1 x1 + w2 x2 .

(d)

5. Consider the task of training a support vector machine using the Laplace kernel κ(2i , 2j ) = exp(_γ|2i _ 2j |). In this problem you will show that as long as there are no two identical points in the training set, we can always ﬁnd a value for the pa- rameter γ such that the SVM achieves zero training error. This is equivalent to saying that SVM with a Laplace kernel has inﬁnite VC dimension.

(a) Remember that the discriminant function can be written only using the kernel

function as:

g(2) = λi zi κ(2, 2i ) + w0

i=1

where 2 is the point whose discriminant score (and hence class) has to be decided. Assume that the training data {(21 , z1 ), . . . , (2N , zN )} consists of points which are separated by at least a distance of ∈; that is, |2i _ 2j | 2 ∈ > 0 for any i j . Find the Lagrange multipliers λi , w0 , and γ such that all 2i are correctly classiﬁed. Hint: Let λi = 1 for all i and w0 = 0. Now notice that for z e {_1, +1} the prediction on 2i will be correct if |g(2i ) _ zi | < 1, so ﬁnd a value of γ that satisﬁes this inequality for all i.

(b) Can we simply train a SVM with a Laplace Kernel by using the value of γ you

found in 5a and by setting λi = 1, Ai and w0 = 0 for any data set?

so|ution:

(a) First we set λi = 1 for all i = 1, . . . , N and w0 = 0. Then, for a training example (2i , zi ), we have:

|g(2i ) _ zi | = | zj κ(2j , 2i ) _ zi |

j=1

= | zj exp (_γ|2j _ 2i |) _ zi |

j=1

= |zi + zj exp (_γ|2j _ 2i |) _ zi |

j i

= | zj exp (_γ|2j _ 2i |)|

j i

< |zj exp (_γ|2j _ 2i |)|

j i

= |zj ||exp (_γ|2j _ 2i |)|

j i

= exp (_γ|2j _ 2i |)

j i

< exp(_γ∈) = (N _ 1) exp(_γ∈)

j i

where we used the triangle inequality and |2i _ 2j | 2 ∈ > 0, Ai j . We need to choose γ such that (N _ 1) exp(_γ∈) < 1. So, γ > . One can choose γ = .

(b) The answer is obviously no. Zero training error does not mean that we have a

good classiﬁer, because it may be heavily overﬁt to the noise in the data set. Remember, there is no ubiquitos training algorithm that works well for all data sets, i.e., there is NO FREE LUNCH!