
CSCI567: Machine Learning

Fall 2022

Homework 1

Instructions

We recommend that you use LaTeX to write up your homework solution. However, you can also scan handwritten notes. The homework will need to be submitted on D2L. We will announce detailed submission instructions later.

Theory-based Questions

Problem 1: Perceptron Convergence (15pts)

Recall the perceptron algorithm that we saw in class. The perceptron algorithm comes with strong theory, and you will explore some of this theory in this problem.  We begin with some remarks related to notation which are valid throughout the homework. Unless stated otherwise, scalars are denoted by small letters in normal font, vectors are denoted by small letters in bold font, and matrices are denoted by capital letters in bold font.

Problem 1 asks you to show that when the two classes in a binary classification problem are linearly separable, then the perceptron algorithm will converge. For the sake of this problem, we define convergence as predicting the labels of all training instances perfectly.  The perceptron algorithm is described in Algorithm 1.  It gets access to a dataset of n instances (xi ,yi ), where xi  ∈ Rd and yi  ∈ {−1, 1}. It outputs a linear classifier w.

Algorithm 1 Perceptron

while not converged do
    Pick a data point (x_i, y_i) randomly
    Make a prediction ŷ = sgn(w^T x_i) using the current w
    if ŷ ≠ y_i then
        w ← w + y_i x_i
    end if
end while
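For concreteness, here is a minimal NumPy sketch of Algorithm 1 (the function name, random seed, and iteration cap are illustrative choices, not part of any provided code):

import numpy as np

def perceptron(X, y, max_iters=10000, seed=0):
    # X: (n, d) array of training points, y: (n,) array of labels in {-1, +1}
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(max_iters):
        if np.all(np.sign(X @ w) == y):  # converged: every point predicted correctly
            break
        i = rng.integers(n)              # pick a data point at random
        if np.sign(w @ X[i]) != y[i]:    # mistake on (x_i, y_i)
            w = w + y[i] * X[i]          # perceptron update
    return w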

Assume there exists an optimal hyperplane $\mathbf{w}_{opt}$ with $\|\mathbf{w}_{opt}\|_2 = 1$ and some $\gamma > 0$ such that $y_i(\mathbf{w}_{opt}^T \mathbf{x}_i) \geq \gamma$ for all $i \in \{1, 2, \dots, n\}$. Additionally, assume $\|\mathbf{x}_i\|_2 \leq R$ for all $i \in \{1, 2, \dots, n\}$. Following the steps below, show that the perceptron algorithm makes at most $R^2/\gamma^2$ errors, and therefore the algorithm must converge.

1.1 (5pts) Show that if the algorithm makes a mistake, the update rule moves it towards the direction of the optimal weights $\mathbf{w}_{opt}$. Specifically, denoting explicitly the updating iteration index by $k$, the current weight vector by $\mathbf{w}_k$, and the updated weight vector by $\mathbf{w}_{k+1}$, show that, if $y_i(\mathbf{w}_k^T \mathbf{x}_i) < 0$, we have

$$\mathbf{w}_{k+1}^T \mathbf{w}_{opt} \geq \mathbf{w}_k^T \mathbf{w}_{opt} + \gamma\,\|\mathbf{w}_{opt}\|_2. \qquad (1)$$

1.2 (4pts) Show that the length of the updated weights does not increase by a large amount. In particular, show that if $y_i(\mathbf{w}_k^T \mathbf{x}_i) < 0$, then

$$\|\mathbf{w}_{k+1}\|_2^2 \leq \|\mathbf{w}_k\|_2^2 + R^2. \qquad (2)$$

1.3 (5pts) Assume that the initial weight vector w0  = 0 (an all-zero vector). Using results from the previous two parts, show that for any iteration k +1, with M being the total number of mistakes the algorithm has made for the first k iterations, we have

$$\gamma M \leq \|\mathbf{w}_{k+1}\|_2 \leq R\sqrt{M}. \qquad (3)$$

Hint: use the Cauchy-Schwarz inequality: $\mathbf{a}^T\mathbf{b} \leq \|\mathbf{a}\|_2\,\|\mathbf{b}\|_2$.

1.4 (1pt) Use 1.3 to conclude that $M \leq R^2/\gamma^2$. Therefore the algorithm can make at most $R^2/\gamma^2$ mistakes (note that there is no direct dependence on the dimensionality of the datapoints).

Problem 2: Logistic Regression (10pts)

This problem explores some properties of the logistic function to get you more comfortable with it. Consider the following univariate function first:

$$F(x; A, k, b) = \frac{A}{1 + e^{-(kx + b)}}. \qquad (4)$$

2.1 (3pts) Describe in words how A, k, and b affect or are related to the shape of F(x;A,k,b) by using the plot at https://www.geogebra.org/graphing/mxw7wp8b. Note: We are trying to learn the parameters of a logistic function that can fit the dataset. For example, see the figure below and guess which of the four functions fits the data well.

 

Figure 1: Learning the parameters of logistic function for a dataset.
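If you prefer to experiment locally in addition to the GeoGebra plot, a short sketch like the one below can help build intuition for 2.1. It assumes the parameterization of F(x;A,k,b) written in Equation (4) above, and the particular parameter values are arbitrary examples:

import numpy as np
import matplotlib.pyplot as plt

def logistic(x, A, k, b):
    # univariate logistic function, following the form assumed in Eq. (4)
    return A / (1.0 + np.exp(-(k * x + b)))

xs = np.linspace(-10, 10, 400)
for A, k, b in [(1, 1, 0), (1, 3, 0), (2, 1, 0), (1, 1, 4)]:   # arbitrary settings
    plt.plot(xs, logistic(xs, A, k, b), label=f"A={A}, k={k}, b={b}")
plt.xlabel("x")
plt.ylabel("F(x; A, k, b)")
plt.legend()
plt.show()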

2.2 (1pt) For what values of A,k and b is F(x;A,k,b) a cumulative distribution function (CDF)?

2.3 (2pts) Now let F(x;A,k,b) = Pr[y = 1|x;A,k,b], the probability of x belonging to class 1 in a binary classification problem. This is similar to the logistic model we saw in Lecture 2. You can verify that for the values of A,k,b you found in the previous part, F(x;A,k,b) is a valid probability as well. Suppose we know F(x;A,k,b) for a datapoint x and want to predict its label. We need to decide on a threshold value for F(x;A,k,b), above which we will predict the label 1 and below which we will predict −1. Show that setting the threshold to be F(x;A,k,b) = 1/2 minimizes the probability of misclassifying x.

For the logistic model, the value x = x0 for which F(x0) = 0.5 is called the decision boundary. In the univariate case, it is a point on the x-axis; in the multivariate case (next question), it is a hyperplane. There is a rich literature on decision theory which studies how to make decisions if we know the probabilities of the underlying events (see https://mlstory.org/pdf/prediction.pdf to learn more if you’re interested).

2.4 (4pts) We now consider the multivariate version. Consider the function:

$$F(\mathbf{x}; \mathbf{w}, b) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}. \qquad (5)$$

You can explore a 2D version of this function at https://www.geogebra.org/3d/g9fjtdeh, where w = (w1, w2). Match the items in the left column with those on the right. It will help if you can get the expression for the gradient of F(x;w,b) w.r.t. x (a small numerical sketch for exploring this is given after the matching list below). Provide a short explanation of your answers.

1) level sets of F(x;w,b)

2) direction of gradient at any point

3) distance of the level set F(x) = 1/2 from the origin

a) parallel to w

b) ||w||/4

c) |b|/||w||

d) |b|

e) orthogonal to w
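As mentioned above, a small numerical experiment can help with the matching. The sketch below is illustrative only (the point x, the weights, and the bias are arbitrary); it evaluates the 2D version of F(x;w,b) and a finite-difference approximation of its gradient with respect to x:

import numpy as np

def F(x, w, b):
    # multivariate logistic function from Eq. (5)
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def numerical_grad(x, w, b, eps=1e-6):
    # central finite differences of F with respect to each coordinate of x
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (F(x + e, w, b) - F(x - e, w, b)) / (2 * eps)
    return g

w = np.array([1.0, 2.0])       # arbitrary weights
b = -1.0                       # arbitrary bias
x = np.array([0.5, -0.3])      # arbitrary point
g = numerical_grad(x, w, b)
print("gradient at x:", g, "  direction:", g / np.linalg.norm(g))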

Problem 3: Learning rectangles (15pts)

An axis-aligned rectangle classifier is a classifier that assigns the value 1 to a point if and only if it is inside a certain rectangle. Formally, given real numbers $a_1 \leq b_1$, $a_2 \leq b_2$, define the classifier $f_{(a_1, b_1, a_2, b_2)}$ on an input $\mathbf{x}$ with coordinates $(x_1, x_2)$ by

$$f_{(a_1, b_1, a_2, b_2)}(x_1, x_2) = \begin{cases} 1 & \text{if } a_1 \leq x_1 \leq b_1 \text{ and } a_2 \leq x_2 \leq b_2 \\ -1 & \text{otherwise.} \end{cases}$$

The function class of all axis-aligned rectangles in the plane is defined as

$$\mathcal{F}_{rec}^2 = \{f_{(a_1, b_1, a_2, b_2)}(x_1, x_2) : a_1 \leq b_1,\ a_2 \leq b_2\}.$$

We will assume that the true labels y of the datapoints (x,y) are given by some axis-aligned rectangle (this is the realizability assumption discussed in class).  The goal of this question is to come up with an algorithm which gets small classification error with respect to any distribution D over (x,y) with good probability.

The loss function we use throughout the question is the 0-1 loss. It will be convenient to denote a rectangle marked by corners $(a_1, b_1, a_2, b_2)$ as $B(a_1, b_1, a_2, b_2)$. Let $B^* = B(a_1^*, b_1^*, a_2^*, b_2^*)$ be the rectangle corresponding to the function $f_{(a_1^*, b_1^*, a_2^*, b_2^*)}$ which labels the datapoints. Let $S = \{(\mathbf{x}_i, y_i), i \in [n]\}$ be a training set of $n$ samples drawn i.i.d. from $D$. Please see Fig. 2 for an example.

3.1 (3pts) We will follow the general supervised learning framework from class. Given the 0-1 loss, and the func- tion class of axis-aligned rectangles, we want to find an empirical risk minimizer. Consider the algorithm which returns the smallest rectangle enclosing all positive examples in the training set. Prove that this algorithm is an empirical risk minimizer.

3.2 (2pts) Our next task is to show that the algorithm from the previous part not only does well on the training data, but also gets small classification error with respect to the distribution $D$. Let $B_S$ be the rectangle returned by the algorithm in the previous part on training set $S$, and let $f_S^{ERM}$ be the corresponding hypothesis. First, we will convince ourselves that generalization is inherently a probabilistic statement. Let a bad training set $S'$ be a training set such that $R(f_{S'}^{ERM}) \geq 0.5$. Pick some simple distribution $D$ and ground-truth rectangle $B^*$, and give a short explanation for why there is a non-zero probability of seeing a bad training set.

3.3 (5pts) We will now show that with good probability over the training dataset $S$, $f_S^{ERM}$ does get small error.

Show that if $n \geq \frac{4}{\epsilon}\log\frac{4}{\delta}$, then with probability at least $1 - \delta$, $R(f_S^{ERM}) \leq \epsilon$.

 

Figure 2: Learning axis-aligned rectangles in two dimensions. Here + (red) denotes datapoints in training set $S$ with label 1 and − (green) denotes datapoints with label −1. The true labels are given by rectangle $B^*$ (solid blue line), with $B_S$ learned by the algorithm in Part (a). $B_1$ (dashed black line) is defined in Part (c).

To prove this, follow these steps. Let $a_1 \geq a_1^*$ be such that the probability mass (with respect to $D$) of the rectangle $B_1 = B(a_1^*, a_1, a_2^*, b_2^*)$ is exactly $\epsilon/4$. Similarly, let $b_1, a_2, b_2$ be numbers such that the probability masses (with respect to $D$) of the rectangles $B_2 = B(b_1, b_1^*, a_2^*, b_2^*)$, $B_3 = B(a_1^*, b_1^*, a_2^*, a_2)$, and $B_4 = B(a_1^*, b_1^*, b_2, b_2^*)$ are all exactly $\epsilon/4$.

• Show that $B_S \subseteq B^*$.

• Show that if $S$ contains (positive) examples in all of the rectangles $B_1, B_2, B_3, B_4$, then $R(f_S^{ERM}) \leq \epsilon$.

• For each $i \in \{1, \dots, 4\}$, upper bound the probability that $S$ does not contain an example from $B_i$.

• Use the union bound to conclude the argument.

3.4 (5pts) Repeat the previous question for the function class of axis-aligned rectangles in $\mathbb{R}^d$. To show this, try to generalize all the steps in the above question to higher dimensions, and find the number of training samples $n$ required to guarantee that $R(f_S^{ERM}) \leq \epsilon$ with probability at least $1 - \delta$.

Programming-based Questions

Before you start the homework problems in this part, you need to set up the coding environment. We use python3 (version ≥ 3.7) in our programming-based questions. There are multiple ways you can install python3, for example:

• You can use conda to configure a python3 environment for all programming assignments.

• Alternatively, you can use virtualenv to configure a python3 environment for all programming assignments.

After you have a python3 environment, you will need to install the following python packages:

• numpy

• matplotlib (for plotting figures)

Note: You are not allowed to use other packages, such as tensorflow, pytorch, keras, scikit-learn, scipy, etc. to help you implement the algorithms you learned. If you have other package requests, please ask first before using them.

Problem 4: k-nearest neighbor classification and the ML pipeline (25pts)

In class, we talked about how we need to do a training/test split to make sure that our model is generalizing well. We also discussed how we should not reuse a test set too much, because otherwise the test accuracy may not be an accurate measure of the accuracy on the data distribution. In reality, an ML model often has many hyper-parameters which need to be tuned (we will see an example in this question). We don’t want to use the test set over and over to see what the best value of this hyper-parameter is. The solution is to have a third split of the data, and create a validation set. The idea is to use the validation set to evaluate results from the training set and tune any hyper-parameters. Then, use the test set to double-check your evaluation after the model has “passed” the validation set. Please see this nice explanation for more discussion: https://developers.google.com/machine-learning/crash-course/validation/another-partition.

With this final piece, we are now ready to build a real ML pipeline. It usually consists of three parts. (1) Load and pre-process the data. (2) Train a model on the training set and use the validation set to tune hyper-parameters. (3) Evaluate the final model on the test set and report the result.

In this problem, you will implement the k-nearest neighbor (k-NN) algorithm to conduct classification tasks. We provide the bootstrap code and you are expected to complete the classes and functions. You can find the code and data on https://vatsalsharan.github.io/fall22/knn_hw1.zip.

k-NN algorithm

The k-nearest neighbor (k-NN) algorithm is one of the simplest machine learning algorithms in the supervised learning paradigm. The idea behind k-NN is simple, and we explain it first for the case of k = 1. The 1-NN algorithm predicts the label of any new datapoint x by finding its closest neighbor x′ in the training set, and then predicts the label of x to be the same as the label of x′. For general k, the k-NN algorithm predicts the label by taking a majority vote on the k nearest neighbors.

We now describe the algorithm more rigorously. Given a hyper-parameter k, training instances (xi ,yi ) (xi  ∈ Rd and yi is the label), and a test example x, the k-NN algorithm can be executed based on the following steps,

1. Calculate the distances between the test example and each of the training examples.

2. Take the k nearest neighbors based on the distances calculated in the previous step.

3. Among these k nearest neighbors, count the number of the data points in each class.

4. Predict the label yˆ of x to be the most frequent class among these neighbors (we describe how to break any ties later).

You are asked to implement the missing functions in knn.py  following each of the steps.
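To make the four steps concrete, here is a compact NumPy sketch of the prediction logic. This is only an illustration: the function names and signatures in the provided knn.py may differ, and the tie-breaking rule assumed here is the one described in Part 4.1 below.

import numpy as np

def knn_predict(X_train, y_train, X_test, k=4):
    # step 1: Euclidean distances between every test point and every training point
    diffs = X_test[:, None, :] - X_train[None, :, :]   # shape (m, n, d)
    dists = np.sqrt((diffs ** 2).sum(axis=2))          # shape (m, n)
    # step 2: indices of the k nearest training points for each test point
    nearest = np.argsort(dists, axis=1)[:, :k]
    preds = np.empty(X_test.shape[0], dtype=int)
    for row, idx in enumerate(nearest):
        # steps 3-4: count class labels among the neighbors; argmax over the
        # counts returns the smallest label when there is a tie
        counts = np.bincount(y_train[idx].astype(int))
        preds[row] = np.argmax(counts)
    return preds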

Part 4.1 Report 4-nearest neighbor accuracy (8pts)

Euclidean distance calculation    Compute the distance between the data points based on the following equation:

$$d(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - \mathbf{x}'\|_2 = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}. \qquad (6)$$

Task: Fill in the code for the function compute_l2_distances.

k-NN classifier    Implement a k-NN classifier based on the above steps. Your algorithm should output the predictions for the test set. Note: You do not need to worry about ties in the distance when finding the k nearest neighbor set. However, when there are ties in the majority label of the k nearest neighbor set, you should return the label with the smallest index. For example, when k = 4, if the labels of the 4 nearest neighbors happen to be 0, 0, 1, 1, your prediction should be the label 0.

Task: Fill in the code for the function predict_labels.

Report the error rate    We want you to report the error rate for the classification task. The error rate is defined as:

$$\text{error rate} = \frac{\text{number of incorrectly classified examples}}{\text{total number of examples}}. \qquad (7)$$

Task: Fill in the code for the function compute_error_rate.

Report Item: Report the error rate of your k nearest neighbor algorithm on the validation set when k = 4 using Euclidean distance.

Part 4.2 Data transformation (2+2pts)

We are going to add one more step (data transformation) in the data processing part and see how it works. Data transformations and other feature engineering steps often play a crucial role in making a machine learning model work. Here, we take two different data transformation approaches. This link might be helpful: https://en.wikipedia.org/wiki/Feature_scaling.

Normalizing the feature vector    This one is simple but sometimes may work well. Given a feature vector $\mathbf{x}$, the normalized feature vector is given by $\tilde{\mathbf{x}} = \mathbf{x} / \|\mathbf{x}\|_2$. If a vector is an all-zero vector, define the normalized vector to also be the all-zero vector (in practice, a useful trick is to add a very small value to the norm of $\mathbf{x}$, so that we do not need to worry about division-by-zero errors).

Min-max scaling for each feature    The above normalization is independent of the rest of the data.  On the other hand, min-max normalization scales each sample in a way that depends on the rest of the data. More specifically, for each feature in the training set, we normalize it linearly so that its value is between 0 and 1 across all samples, and in addition, the largest value becomes exactly 1 while the smallest becomes exactly 0. Then, we will apply the same scaling parameters we get from training data to our testing instances.
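A possible way to implement both transformations is sketched below. This is only an illustration; the provided data_processing_with_transformation function may organize the data differently, and the small constant eps is the "very small value" trick mentioned above.

import numpy as np

def normalize_rows(X, eps=1e-12):
    # scale each feature vector (row) to unit L2 norm; eps guards against
    # dividing by zero, so an all-zero row stays (essentially) all-zero
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / (norms + eps)

def min_max_scale(X_train, X_test, eps=1e-12):
    # fit per-feature min and max on the training set only, then apply the
    # same scaling parameters to the test set
    mins = X_train.min(axis=0)
    maxs = X_train.max(axis=0)
    scale = (maxs - mins) + eps
    return (X_train - mins) / scale, (X_test - mins) / scale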

Task: Fill in the code for the function data_processing_with_transformation.

Report Item: Report the error rate of your k nearest neighbor algorithm on the validation set for k = 4 using Euclidean distance when the data is transformed by (1) feature-vector normalization, and (2) min-max scaling.

Part 4.3 Different distance measurement (3pts)

In this part, we will change the way that we measure distances. We will work with the original (unnormalized) data for simplicity, and continue the experiments from Part 4.1.

Cosine distance calculation    Compute the distance between data points based on the following equation:

$$d(\mathbf{x}, \mathbf{x}') = \begin{cases} 1 & \text{if } \|\mathbf{x}\|_2 = 0 \text{ or } \|\mathbf{x}'\|_2 = 0 \\[4pt] 1 - \dfrac{\mathbf{x} \cdot \mathbf{x}'}{\|\mathbf{x}\|_2\, \|\mathbf{x}'\|_2} & \text{otherwise.} \end{cases}$$

Similar to when we are normalizing the feature vector, a useful trick in practice is to add a very small value to the norm of x, so that we do not need to worry about division-by-zero issues.
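A vectorized sketch of this distance computation is shown below; it is illustrative only, and the provided compute_cosine_distances function may expect different inputs.

import numpy as np

def cosine_distances(X_test, X_train, eps=1e-12):
    # 1 - cosine similarity between every test point and every training point;
    # eps in the denominator handles zero-norm vectors, which then get distance 1
    dots = X_test @ X_train.T                                  # shape (m, n)
    norms = (np.linalg.norm(X_test, axis=1)[:, None] *
             np.linalg.norm(X_train, axis=1)[None, :])         # shape (m, n)
    return 1.0 - dots / (norms + eps)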

Task: Fill in the code for the function compute_cosine_distances and change the distance function used in the main function of the code to get results.

Report Item: Report the error rate of your k nearest neighbor algorithm on the validation set for k = 4 using cosine distance on the original data.

Part 4.4 Tuning the hyper-parameter k (10pts)

Again, follow Part 4.1; however, this time we conduct experiments with k ∈ {1, 2, 4, 6, 8, 10, 12, 14, 16, 18}.

Task: Fill in the code for the function find_best_k.

Report Item: (1) Report and draw a curve based on the error rate of your model on the training set for each k. What do you observe? (2pts) (2) Report and draw a curve based on the error rate of your model on the validation set for each k. What is your best k? (2pts) (3) What do you observe by comparing the difference between the two curves? (2pts) (4) What is the final test set error rate you get using your best k? (1pt) (5) Comment on these results. (3pts)

Problem 5: Linear Regression (20pts)

We will consider the problem of fitting a linear model. Given d-dimensional input data x1 , ··· , xn  ∈ Rd  with real- valued labels y1 , ··· ,yn  ∈ R, the goal is to find the coefficient vector w that minimizes the sum of the squared errors.

The total squared error of $\mathbf{w}$ can be written as $F(\mathbf{w}) = \sum_{i=1}^{n} f_i(\mathbf{w})$, where $f_i(\mathbf{w}) = (\mathbf{w}^T\mathbf{x}_i - y_i)^2$ denotes the squared error on the $i$-th data point.

The data in this problem will be drawn from the following linear model. For the training data, we select n data points $\mathbf{x}_1, \dots, \mathbf{x}_n$, each drawn independently from a d-dimensional Gaussian distribution. We then pick the “true” coefficient vector $\mathbf{w}^*$ (again from a d-dimensional Gaussian), and give each training point $\mathbf{x}_i$ a label equal to $(\mathbf{w}^*)^T\mathbf{x}_i$ plus some noise (which is drawn from a 1-dimensional Gaussian distribution).

The following Python code will generate the data used in this problem.

import numpy as np

d = 100  # dimensions of data
n = 1000  # number of data points
X = np.random.normal(0, 1, size=(n, d))
w_true = np.random.normal(0, 1, size=(d, 1))
y = X.dot(w_true) + np.random.normal(0, 0.5, size=(n, 1))

5.1 (6pts) Least-squares regression has the closed-form solution $\mathbf{w}_{LS} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, which minimizes the squared error on the data. (Here $\mathbf{X}$ is the $n \times d$ data matrix as in the code above, with one row per data point, and $\mathbf{y}$ is the $n$-vector of their labels.) Solve for $\mathbf{w}_{LS}$ on the training data and report the value of the objective function $F(\mathbf{w}_{LS})$. For comparison, what is the total squared error $F(\mathbf{w})$ if you just set $\mathbf{w}$ to be the all-zeros vector? Using similar Python code, draw $n = 1000$ test data points from the same distribution, and report the total squared error of $\mathbf{w}_{LS}$ on these test points. What is the gap between the training and test objective function values? Comment on the result.

Note: Computing the closed-form solution requires time $O(nd^2 + d^3)$, which is slow for large $d$. Although gradient descent methods will not yield an exact solution, they do give a close approximation in much less time. For the purpose of this assignment, you can use the closed-form solution as a good sanity check in the following parts.
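As a sanity-check sketch, the closed-form solution and its training objective can be computed as follows. This assumes X, y, and d from the data-generation code above are in scope, and uses np.linalg.solve instead of an explicit matrix inverse for numerical stability.

# assumes X, y, d from the data-generation code above are in scope
w_ls = np.linalg.solve(X.T @ X, X.T @ y)     # solves (X^T X) w = X^T y
F_train = np.sum((X @ w_ls - y) ** 2)        # objective F(w_LS) on the training data
F_zero = np.sum(y ** 2)                      # objective with w set to the all-zeros vector
print("F(w_LS) =", F_train, "  F(0) =", F_zero)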

5.2 (7pts) In this part, you will solve the same problem via gradient descent on the squared-error objective function $F(\mathbf{w}) = \sum_{i=1}^{n} f_i(\mathbf{w})$. Recall that the gradient of a sum of functions is the sum of their gradients. Given a point $\mathbf{w}_t$, what is the gradient of $F$ at $\mathbf{w}_t$?

Now use gradient descent to find a coefficient vector w that approximately minimizes the least squares objective function over the training data. Run gradient descent three times, once with each of the step sizes 0.00005, 0.0005, and 0.0007. You should initialize w to be the all-zero vector for all three runs. Plot the objective function value for 20 iterations for all 3 step sizes on the same graph.  Comment in 3-4 sentences on how the step size can affect the convergence of gradient descent (feel free to experiment with other step sizes). Also report the step size that had the best final objective function value and the corresponding objective function value.
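A minimal, generic gradient-descent loop is sketched below. It is not the provided code: grad_fn is a placeholder for the gradient expression you derive above, and the other names are illustrative.

import numpy as np

def gradient_descent(grad_fn, w0, step_size, num_iters=20):
    # generic update w_{t+1} = w_t - step_size * grad_fn(w_t);
    # returns the final iterate and the sequence of iterates for plotting
    w = w0.copy()
    iterates = [w.copy()]
    for _ in range(num_iters):
        w = w - step_size * grad_fn(w)
        iterates.append(w.copy())
    return w, iterates

For example, once you have the gradient of F, you could call gradient_descent(my_grad, np.zeros((d, 1)), 0.0005) and plot F evaluated at each iterate.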

5.3 (7pts) In this part you will run stochastic gradient descent to solve the same problem. Recall that in stochastic gradient descent, you pick one training datapoint at a time, say $(\mathbf{x}_i, y_i)$, and update your current value of $\mathbf{w}$ according to the gradient of $f_i(\mathbf{w}) = (\mathbf{w}^T\mathbf{x}_i - y_i)^2$.
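A generic SGD loop in the same spirit is sketched below; it is illustrative only, and per_example_grad is a placeholder for the gradient of f_i(w) that you derive.

import numpy as np

def sgd(per_example_grad, X, y, w0, step_size, num_iters=1000, seed=0):
    # at each step, pick one training example at random and take a gradient
    # step on its individual loss f_i(w)
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(num_iters):
        i = rng.integers(X.shape[0])
        # X[i:i+1] keeps the row as a (1, d) array so shapes stay consistent
        w = w - step_size * per_example_grad(w, X[i:i+1], y[i:i+1])
    return w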