Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

CS229 Final Exam - Summer 2019

1    [10 points] True or False

Each question is for 1 point.  To earn 1 point per question, provide the correct True or False answer along with the correct justification. Incorrect answers or incorrect justifications will earn 0 points. Justifications must be short (a couple of sentences). There will be no negative points.

1. True or False.  Overfitting occurs when models are too complex” for a given situation.  One way to prevent it is by adding a regularization penalty. Your friend proposes an alternative: to train the model on fewer training data which is then less likely to overfit. Your friend is correct.

2. True  or  False.   Given a model y  =  log (x1(θ)1 x2(3)θ2 ) x3(5)  + e, where x1 ,x2   R+ ,  y,x3 ,θ 1 ,θ2   R, e ∼ N(0, 1) i.i.d., the Maximum Likelihood value of the model parameters (θ) can be estimated with linear regression.

3. True or False. In a binary classification problem, let’s consider an example that is correctly classified and sufficiently far from the decision boundary. Claim: Logistic regression’s decision boundary will be unaffected by this point but the one learnt by SVM will be affected.

4. True or False. Consider a Bernoulli distribution with parameter θ ∈ [0, 1]. The Maximum Likelihood estimate (MLE) of the parameter θ is equal to the Maximum-a-posteriori (MAP) estimate when the prior probability p(θ) is the Uniform distribution over [0,1].

5. True or False. The more features that we use to represent our data, the better the learning algorithm will generalize to new data points.

6. True or False. Like the Logistic regression algorithm, the Perceptron algorithm does not converge if the training examples are perfectly separable.

7. True or  False.  Given an arbitrary Neural Network where we use Stochastic Gradient Descent to optimize the parameters, initializing all the weights to 0 is a good practice.

8. True or False.  The Exponential Family of probability distributions are obtained as a natural con- sequence of the Maximum Entropy principle.  The entropy of any exponential family distribution is therefore +∞ .

9. True  or  False.   The Kernel Trick enables us to restructure learning algorithms and make them scalable to very large number of training examples, and with certain kernels even to an infinite number of examples.

10. True or False. Gaussian Processes enjoy a closed form solution once the kernel matrix has been eval- uated. Alternatively, they can also be trained with Gradient Ascent/Descent (or Stochastic Gradient Ascent/Descent).

2    [10 points] Short Answers

Each sub-question is worth 2 points. Provide a short justification (one or two lines) for each answer.

1.  Short Answer.  How many parameters are necessary to specify a Naive Bayes classifier under the Bernoulli Event Model  assumption where the size of the vocabulary is d words, and there are three

classes?

2.  Short  Answer.  How many parameters are necessary to specify a Naive Bayes classifier with the Multinomial Event Model assumption where the size of the vocabulary is d words, and there are three

classes?

3.  Short Answer.  Consider a Markov Decision Process (MDP) is defined as (S,A,{Psa },γ,R), where S is a finite set of states of size |S|, A is a finite set of actions of size |A|, {Psa } are the state transition probabilities, γ ∈ [0, 1) is the discount factor, and R is the reward function. How many distinct policies π : S → A can possibly be defined for this MDP?

4.  Short Answer.  In Reinforcement Learning, unlike the Value Iteration algorithm, the Fitted Value Iteration algorithm does not have a guarantee of convergence to the true optimal value function V . This could be because V may not be representable by our model class. Would you consider this as a High Bias or High Variance scenario?

5.  Short Answer. In the EM algorithm, we are given a training set S = {x(i)} that corresponds to a model p(x,z;θ). We do not observe the corresponding z (i)’s, which are also called the latent variables. Our goal is to maximize  log p(x(i);θ) w.r.t θ .

Suppose our dataset is not too large (i.e n is not very big), and out of concern of overfitting, we decide to add regularization in the M-step, resulting in the following modified algorithm:

• Repeat until convergence:

  E-Step. For each i = 1, . . . ,n, set

Qt) (z(i)) = p(z(i) |x(i) ;θ(t))

  M-Step. Update the parameters with

n

θ (t+1)  = arg m x  ELBO(xθ(a) (i);Qt),θ) + λθ2(2)

i=1

Is the above algorithm guaranteed to monotonically increase  log p(x(i);θ) with each iteration? Provide an informal justification.

3    [10 points] Adding features to Linear regression

In our study of Bias and Variance, we have seen that adding more features (and hence parameters) to a model can increase the variance.  One way to think about variance is as the difference between test error and training error. So, generally, increasing the variance tends to reduce training error (and/or increase test error). In the case of linear regression, we can actually show that adding a new feature (it could be anything, even random numbers!) will always reduce the training error (or keep it the same).

Let X ∈ Rn ×d be a design matrix, and ⃗y ∈ Rn the labels associated with it. Let θ ∈ Rd be the parameters of the linear regression model. Now we will add a new random column ⃗r ∈ Rn to the design matrix, resulting

in an updated design matrix

  Rn × (d+1)  = [X,⃗r] .

Note that this new feature is not a random variable, but just a pre-computed vector of random numbers, that are unrelated to X .  Correspondingly, we add a new parameter t  ∈ R to the model, resulting in an updated parameter vector

Θ Rd+1  = [ ]t(θ) .

Let us define the projection and residual matrices of X  (assuming X  is full-rank, so X T X  is invertible) as follows:

P = X(XT X)1X T ,

R = P − I,

where for example for any vector ⃗v ∈ Rn , P⃗v is the projection of the vector ⃗v onto the range of X , I is the Identity matrix, and R⃗v is the residual P⃗v − ⃗v, i.e. the projection minus the original vector.

Recall that the cost function for linear regression is the squared error. Let us denote the cost functions of the two models as

J(θ) = ||Xθ − ⃗y||2(2)

J˜(Θ) = ||Θ − ⃗y||2(2)

We basically need to show that

min J˜(Θ) = J˜≤ J = minJ(θ)

Θ                                                          θ

We will break down the task into sub-problems and solve them in steps. The overall strategy will be to minimize J˜(Θ) in two stages, first with respect to a subset of its parameters θ, and next with respect to the remaining parameter t . This will allow us to show that its minima is smaller than the minima of J(θ). i.e., across the following sub-questions, we will show

t(mi)n ( θ(mi)n J˜(Θ)) = J − (·)2 ,

which will complete the proof.

1.  [2 points] Minimizer of J(θ)

Calculate J, the lowest value achieved by J(θ) in the form of a closed form expression. Your expression can involve only R and ⃗y terms.

2.  [2 points] Estimate θ

We will now estimate parameters (closed form) that minimize the updated loss J˜(Θ). We will do that by separately estimating θˆ and tˆ.  In this sub-problem we will estimate θˆ.  Calculate θˆ by solving for θ  in the equation ∇θ J˜(Θ) = 0. You may refer to lecture notes and cite the appropriate filename and page number to shorten your calculations. Your expression for θˆ can only include X , ⃗y, ⃗r and t terms.

3.  [2 points] Updated J˜

Plug the estimated value of θˆ into J˜(Θ), and call it J˜θˆ(t). This is because once we minimize J˜(Θ) with respect to θ, we will be left with a function of only t . Now express J˜θˆ(t) as J + f(t,⃗r,R,⃗y). Clearly state the expression for f .

4.  [2 points] Estimate t

Now we will estimate t  by solving for t  in the equation ∇t J˜θˆ(t) = 0.  Come up with a closed form expression for tˆ. Your expression can only involve R , ⃗y and ⃗r terms.

5.  [2 points] Minimizer J˜

Now we are ready to calculate the minimizer J˜by plugging in the estimated tˆ into J˜θˆ(t). Show that J˜

Remark: Note, what we have shown is that adding a new feature decreases train error, and we have shown nothing about the test error. It is very likely that if we add a random feature as above, the test error is only going to get worse.

4    [10 points] Regression with weak supervision

In this question we will derive a method to learn a regression model, i.e.  predict y  ∈ R given x  ∈ Rd , with only weak supervision. By weak supervision we mean, rather than observing a point value for y of an example, we are instead given an interval [l,u] such that l ≤ y ≤ u, where u,l ∈ R are the upper and lower ends of the interval. Our training set consists of n 3-tuples {(x(i),l(i),u(i))} where each x(i)  ∈ Rd  is the input, and the corresponding unknown output lies in the range [l(i),u(i)] where l(i),u(i)  ∈ R. Note that the width of each interval can be different for different examples, and the true unobserved y (i) could lie anywhere in that interval, not necessarily at the mid-point. We are also provided a feature map ψ : Rd  → Rp  that we will use for learning.

 

Figure 1: Training Dataset. In place of a point estimate of y, we instead have [l,u] intervals for each example where the y value resides in. The true y is unknown.

We make the assumption that for each example,  y|x is distributed according a Normal distribution N(µ,σ2 ) with mean µ ∈ R and variance σ 2  ∈ R+ .  For simplicity we will assume σ 2  = 1 throughout this problem. We make the linearity assumption

µ = ψ(x)T θ,

where θ ∈ Rp  are the learnable parameters. The predictions from our model on a new example will be hθ (x) = E[y|x;θ].

In order to fit our model to the data, and recognizing the fact that the true y lies in the interval [l,u], we define the likelihood function of θ given the training set as:

n

(θ) = log u P (l(i)  y(i)  u(i)  | x(i);θ) .

i=1

Based on the above likelihood objective, we will use gradient ascent to perform maximum likelihood estimation of the parameters θ .

1.  [5  points] Derive the full-batch gradient ascent update rule for the above log-likelihood objective. You may use the notation ϕµ (y) and Φµ (y) to be the probability density function and cumulative

distribution function of a Normal distribution with mean µ (and σ 2  = 1) evaluated at y . You may also leave the update rule in terms of these functions.

2.  [2  points]  Assuming we have found the model parameters θˆ (using gradient ascent),  provide an

expression for the hypothesis to make a prediction on an unseen example as hθˆ(x) = E[y|x; θˆ] = ...

3.  [3  points] Use the provided starter code to fill in the update rule and the prediction rule derived above. The starter code already uses a suitable feature map. Run the starter code and obtain a plot of the learned hypothesis against the training set. Include only the obtained plot as your answer. You do not need to submit your code.

You may use the norm module from the scipy .stats package (already imported in the starter code), in particular, the methods norm .cdf() and norm .pdf() to evaluate the cumulative distribution and probability density function of the Standard Normal distribution.