
CS M146

Homework Set #3: SVM and Boosting

2022

Problem 1 (Support Vector Machines [9 pts])

Suppose we are looking for a maximum-margin linear classifier through the origin, i.e., with bias b = 0, under the hard-margin SVM formulation (no slack variables). In other words,

min_w ||w||^2    s.t.    y^(i) w^T x^(i) ≥ 1,   i = 1, . . . , n.
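As a quick aid for checking your work (not part of the assignment), here is a minimal NumPy sketch that tests whether a candidate w satisfies the hard-margin constraints above and reports the geometric margin 1/||w||. The helper name check_hard_margin and the example w are ours and purely illustrative; the w shown is just a feasible guess, not a claimed minimizer.

    import numpy as np

    def check_hard_margin(w, X, y):
        # Check y^(i) w^T x^(i) >= 1 for all i and report the geometric margin 1/||w|| (b = 0 here).
        margins = y * (X @ w)                  # y^(i) w^T x^(i) for every example
        feasible = bool(np.all(margins >= 1))  # all hard-margin constraints satisfied?
        return feasible, 1.0 / np.linalg.norm(w)

    # The single point from part (a); w = (-1, -1) is only a feasible guess, not the optimum.
    X = np.array([[1.0, 1.0]])
    y = np.array([-1.0])
    print(check_hard_margin(np.array([-1.0, -1.0]), X, y))   # (True, 0.707...)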

(a) (3 pts) Given a single training vector x = (1, 1)^T ∈ R^2 with label y = −1, what is the w that satisfies the above constrained minimization?

(b) (3 pts) Suppose we have two training examples, x^(1) = (1, 1)^T ∈ R^2 and x^(2) = (1, 0)^T ∈ R^2, with labels y^(1) = 1 and y^(2) = −1. What is w in this case?

(c) (3 pts) Suppose we now allow the offset parameter b to be non-zero. How would the classifier and the margin change in the previous question? What are (w,b)? Compare your solutions with and without offset.


Problem 2 (Kernelized Support Vector Machines [20 pts])

Suppose we have six one-dimensional points: three with negative label (y = −1): x^(1) = −1, x^(2) = 0, x^(3) = 1, and three with positive label (y = +1): x^(4) = −3, x^(5) = −2, x^(6) = 3. You might notice that if a feature map such as ϕ(u) = (u, u^2) is used to transform the points, then a hyperplane in the new R^2 feature space induced by ϕ can perfectly separate the points. The kernel corresponding to this feature map is k(u, v) = uv(1 + uv).
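As a sanity check on that correspondence (assuming NumPy is available; this is only an illustration, not something to submit), the snippet below verifies that k(u, v) equals the inner product ϕ(u) · ϕ(v) on the six given points.

    import numpy as np

    def phi(u):
        # Feature map from the problem statement: phi(u) = (u, u^2).
        return np.array([u, u**2])

    def k(u, v):
        # Kernel from the problem statement: k(u, v) = uv(1 + uv).
        return u * v * (1 + u * v)

    points = [-1, 0, 1, -3, -2, 3]
    for u in points:
        for v in points:
            assert np.isclose(k(u, v), phi(u) @ phi(v))   # k(u, v) == <phi(u), phi(v)>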

(a) (4 pts) Construct a maximum-margin separating hyperplane in R^2, which can be parameterized by its normal equation, i.e., w1 x + w2 x^2 + w0 = 0 for appropriate choices of w ∈ R^3. Here (x, x^2) = ϕ(x) is the result of applying the feature map ϕ to the original feature x. Also explicitly compute the margin for the hyperplane. Hint: Identify the support vectors by looking at the graph of the points in the induced feature space.

(b) (4 pts) Apply ϕ(x) to the data and plot the points in the new R^2 feature space. On the plot of the transformed points, plot the separating hyperplane you found in (a) and the margin, and circle the support vectors (SVs).

(c) (3 pts) Draw the decision boundary of the separating hyperplane you found in (a) in the original R^1 feature space. Hint: Find the decision boundary in the original R^1 feature space by solving for x in the quadratic equation found in part (a).

(d) (6 pts) Find the αi and b in

h(x) = sign( Σi αi y^(i) k(x^(i), x) + b )

for the kernel k and the support vectors chosen earlier. Do this by solving the dual form of the quadratic program:

max_α    Σi αi − (1/2) Σi,j y^(i) y^(j) αi αj k(x^(i), x^(j))
s.t.     αi ≥ 0,   ∀ i ∈ {1, . . . , n}
         Σi αi y^(i) = 0
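If you want to sanity-check a hand-derived solution numerically, here is a minimal sketch using scipy.optimize (an assumption on our part; the problem itself asks you to solve the dual by hand). The two data points in it are placeholders, not the support vectors you are asked to identify.

    import numpy as np
    from scipy.optimize import minimize

    def k(u, v):
        return u * v * (1 + u * v)

    # Placeholder data; substitute the support vectors you identified.
    x = np.array([-1.0, 2.0])
    y = np.array([-1.0, 1.0])
    K = np.array([[k(xi, xj) for xj in x] for xi in x])

    def neg_dual(alpha):
        # Negated dual objective, since scipy minimizes and the dual is a maximization.
        return -(alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y))

    res = minimize(neg_dual, x0=np.zeros(len(x)), method="SLSQP",
                   bounds=[(0, None)] * len(x),                          # alpha_i >= 0
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y^(i) = 0
    alpha = res.x
    i = int(np.argmax(alpha))                 # any support vector with alpha_i > 0
    b = y[i] - (alpha * y) @ K[:, i]          # from y^(i) (sum_j alpha_j y^(j) k(x^(j), x^(i)) + b) = 1
    print(alpha, b)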

(e) (3 pts) If we add another negatively labelled training point at x = 0.5, will the hyperplane or the margin change? Explain.


Problem 3 (Boosting [24 pts])

Consider the following examples (x1, x2) ∈ R^2 (i is the example index):

 i    x1    x2    Label
 1     0     8      −
 2     1     4      −
 3     3     7      +
 4    −2     1      −
 5    −1    13      −
 6     9    11      −
 7    12     7      +
 8    −7    −1      −
 9    −3    12      +
10     5     9      +

In this problem, you will use Boosting to learn a hidden Boolean function from this set of examples. We will use two rounds of AdaBoost to learn a hypothesis for this data set.

Columns (3)-(6) of Table 1 belong to Hypothesis 1 (1st iteration), with f1 = sign(x1 − j1) and f2 = sign(x2 − j2); columns (7)-(10) belong to Hypothesis 2 (2nd iteration), with f1′ = sign(x1 − j1′) and f2′ = sign(x2 − j2′). Columns (3)-(10) are blank and are to be filled in.

  i   Label   w0   f1   f2   h1   w1   f1′   f2′   h2
 (1)   (2)   (3)  (4)  (5)  (6)  (7)   (8)   (9)  (10)
  1     −
  2     −
  3     +
  4     −
  5     −
  6     −
  7     +
  8     −
  9     +
 10     +

Table 1: Table for Boosting results

In each round, AdaBoost chooses a weak learner that minimizes the error ϵ with respect to the current distribution. As weak learners, use hypotheses of the form (a) sign(x1 − j1) or (b) sign(x2 − j2), for some integers j1, j2 (either one of the two forms, not a disjunction of the two). There should be no need to try many values of j1, j2; appropriate values should be clear from the data, and they may be re-chosen in each round. When using log, use base e.
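For reference on the mechanics only (this is not a solution to Table 1, and the data below are placeholders), a minimal sketch of one AdaBoost round with this family of threshold stumps might look as follows; the helper names stump, best_stump, and adaboost_round are ours, not part of the assignment.

    import numpy as np

    def stump(X, d, j):
        # Predict sign(x_d - j) for each row of X (treating 0 as +1 for simplicity).
        return np.where(X[:, d] - j >= 0, 1, -1)

    def best_stump(X, y, w):
        # Pick the coordinate d and integer threshold j minimizing the weighted error under w.
        best = None
        for d in range(X.shape[1]):
            for j in np.unique(X[:, d]).astype(int):
                pred = stump(X, d, j)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, d, j, pred)
        return best

    def adaboost_round(X, y, w):
        err, d, j, pred = best_stump(X, y, w)
        beta = 0.5 * np.log((1 - err) / err)     # hypothesis weight, natural log as specified (assumes 0 < err < 1)
        w_new = w * np.exp(-beta * y * pred)     # up-weight mistakes, down-weight correct predictions
        return w_new / w_new.sum(), (d, j, beta)

    # Placeholder usage (NOT the Table 1 data):
    X = np.array([[0.0, 2.0], [3.0, -1.0], [-2.0, 4.0], [1.0, 1.0]])
    y = np.array([-1, 1, 1, -1])
    w0 = np.full(len(y), 1.0 / len(y))
    w1, chosen = adaboost_round(X, y, w0)
    print(w1, chosen)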

(a) [6 points] Start the first round with a uniform distribution w0, i.e., w0,i = 0.1. Place the value of w0 for each example in the third column of Table 1. Write the new representation of the data in terms of the rules of thumb, f1 and f2, in the fourth and fifth columns of Table 1.

(b) [6 points] Find the candidate hypothesis (i.e., one of f1 or f2) given by the weak learner that minimizes the training error ϵ for that distribution. Place this chosen hypothesis as the heading of the sixth column of Table 1, and give its prediction for each example in that column.

(c) [6 points] Now compute w1 for each example using h1, find the new best weak learners f1′ and f2′ given these weights, and select the hypothesis h2 that minimizes error on this distribution, placing these values and predictions in the seventh to tenth columns of Table 1.

(d) [6 points] Write down the final hypothesis produced by AdaBoost.

What to submit: Fill out Table 1 as explained, show computation of w1  and βi  for the chosen hypothesis at each round, and give the final hypothesis, H(x).


Problem 4 (Twitter analysis using SVM [52 pts])

In this project, you will be working with Twitter data.  Specifically, we have supplied you with a number of tweets that are reviews/reactions to movies,


e.g.,  “@nickjfrost just saw  The Boat  That Rocked/Pirate Radio and I thought it was brilliant!  You and the rest of the cast were fantastic!  < 3” .

You will learn to automatically classify such tweets as either positive or negative reviews.  To do this, you will employ Support Vector Machines (SVMs), a popular choice for a large number of classification problems.

Download the code and data sets from the course website. It contains the following data files:

• tweets.txt contains 630 tweets about movies. Each line in the file contains exactly one tweet, so there are 630 lines in total.

• labels.txt contains the corresponding labels. If a tweet praises or recommends a movie, it is classified as a positive review and labeled +1; otherwise it is classified as a negative review and labeled −1. These labels are ordered, i.e., the label for the ith tweet in tweets.txt corresponds to the ith number in labels.txt. (A minimal loading sketch follows this list.)
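A minimal sketch of loading these two files is below (assuming plain one-value-per-line parsing; the provided twitter.py skeleton may already read them, possibly differently).

    # Read the raw data; one tweet per line, with labels aligned by line index.
    with open("tweets.txt", encoding="utf-8") as f:
        tweets = [line.rstrip("\n") for line in f]
    with open("labels.txt") as f:
        labels = [int(line) for line in f]
    assert len(tweets) == len(labels) == 630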

Skim through the tweets to get a sense of the data.

The Python file twitter.py contains skeleton code for the project. Skim through the code to understand its structure.


1 Feature Extraction [4 pts]

We will use a bag-of-words model to convert each tweet into a feature vector. A bag-of-words model treats a text file as a collection of words, disregarding word order. The first step in building a bag-of-words model involves building a "dictionary". A dictionary contains all of the unique words in the text file. For this project, we will be including punctuation in the dictionary too. For example, a text file containing "John likes movies. Mary likes movies2!!" will have a dictionary {'John':0, 'Mary':1, 'likes':2, 'movies':3, 'movies2':4, '.':5, '!':6}. Note that the (key, value) pairs are (word, index), where the index keeps track of the number of unique words (size of the dictionary).

Given a dictionary containing d unique words, we can transform the n variable-length tweets into n feature vectors of length d by setting the ith  element of the jth  feature vector to 1 if the ith dictionary word is in the jth  tweet, and 0 otherwise.
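To make the model concrete, here is a small, self-contained illustration of the idea (it uses a naive str.split() rather than the provided extract_words(...), so the exact indices may differ from the example dictionary above; it is not the implementation you are asked to write).

    import numpy as np

    docs = ["John likes movies .", "Mary likes movies2 ! !"]   # toy 'tweets' with punctuation pre-separated

    dictionary = {}
    for doc in docs:
        for word in doc.split():
            if word not in dictionary:
                dictionary[word] = len(dictionary)    # index = number of unique words seen so far

    X = np.zeros((len(docs), len(dictionary)))
    for j, doc in enumerate(docs):
        for word in doc.split():
            X[j, dictionary[word]] = 1                # element i of feature vector j is 1 iff word i is in tweet j

    print(dictionary)   # {'John': 0, 'likes': 1, 'movies': 2, '.': 3, 'Mary': 4, 'movies2': 5, '!': 6}
    print(X)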

(a) (2 pts) We have implemented extract_words(...), which processes an input string to return a list of unique words. This method takes a simplistic approach to the problem, treating any string of characters (that does not include a space) as a "word" and also extracting and including all unique punctuation marks.

Implement extract_dictionary(...), which uses extract_words(...) to read all unique words contained in a file into a dictionary (as in the example above). Process the tweets in the order they appear in the file to create this dictionary of d unique words/punctuation marks.