闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

CS 229, Summer 2022

Problem Set #1

1. [40 points] Linear Classiﬁers (logistic regression and GDA)

In this problem, we cover two probabilistic linear classiﬁers we have covered in class so far. First, a discriminative linear classiﬁer: logistic regression. Second, a generative linear classiﬁer: Gaussian discriminant analysis (GDA). Both the algorithms ﬁnd a linear decision boundary that separates the data into two classes, but make diﬀerent assumptions. Our goal in this problem is to get a deeper understanding of the similarities and diﬀerences (and, strengths and weaknesses) of these two algorithms.

For this problem, we will consider two datasets, along with starter codes provided in the following ﬁles:

❼ src/linearclass/ds1_{train,valid} .csv

❼ src/linearclass/ds2_{train,valid} .csv

❼ src/linearclass/logreg.py

❼ src/linearclass/gda .py

Each ﬁle contains n examples, one example (x(i), y (i)) per row. In particular, the i-th row contains columns x i) e R, xi) e R, and y (i) e {0, 1}. In the subproblems that follow, we will investigate using logistic regression and Gaussian discriminant analysis (GDA) to perform binary classiﬁcation on these two datasets.

(a) [10 points]

In lecture we saw the average empirical loss for logistic regression:

J(θ) = - ╱y (i) log(hθ (x(i))) + (1 - y (i)) log(1 - hθ (x(i)))、,

where y (i) e {0, 1}, hθ (x) = g(θT x) and g(z) = 1/(1 + e-z ).

Find the Hessian H of this function, and show that for any vector z, it holds true that zT Hz ≥ 0.

Hint: You may want to start by showing that )i )j zi xi xj zj = (xT z)2 ≥ 0. Recall also that g\ (z) = g(z)(1 - g(z)).

Remark: This is one of the standard ways of showing that the matrix H is positive semi- deﬁnite, written “H > 0.” This implies that J is convex, and has no local minima other than the global one. If you have some other way of showing H > 0, you’re also welcome to use your method instead of the one above.

(b) [5 points] Coding problem. Follow the instructions in src/linearclass/logreg .py

to train a logistic regression classiﬁer using Newton’s Method. Starting with θ = , run Newton’s Method until the updates to θ are small: Speciﬁcally, train until the ﬁrst iteration k such that |θk - θk - 1 |1 < c, where c = 1x10-5 . Make sure to write your model’s predicted probabilities on the validation set to the ﬁle speciﬁed in the code.

Include a plot of the validation data with x1 on the horizontal axis and x2 on the vertical axis. To visualize the two classes, use a diﬀerent symbol for examples x(i) with y (i) = 0 than for those with y (i) = 1. On the same ﬁgure, plot the decision boundary found by logistic regression (i.e, line corresponding to p(y|x) = 0.5).

p(y) =

p(x|y = 0) = exp ╱ - (x - µ0 )T Σ - 1 (x - µ0 )、 p(x|y = 1) = exp ╱ - (x - µ 1 )T Σ - 1 (x - µ 1 )、,

where φ , µ0 , µ 1 , and Σ are the parameters of our model.

Suppose we have already ﬁt φ , µ0 , µ 1 , and Σ, and now want to predict y given a new point x. To show that GDA results in a classiﬁer that has a linear decision boundary, show the posterior distribution can be written as

p(y = 1 | x; φ, µ0 , µ 1 , Σ) =

1 + exp(-(θT x + θ0 )) ,

where θ e Rd and θ0 e R are appropriate functions of φ, Σ, µ0 , and µ 1 .

(d) [7 points] Given the dataset, we claim that the maximum likelihood estimates of the pa- rameters are given by

) 1{y(i) = 0}

) 1{y(i) = 1}

Σ = (x(i) - µy (i) )(x(i) - µy (i) )T

The log-likelihood of the data is

y(φ, µ0 , µ 1 , Σ) = log | p(x(i), y (i); φ, µ0 , µ 1 , Σ)

i=1

= log | p(x(i) |y(i); µ0 , µ 1 , Σ)p(y(i); φ).

i=1

By maximizing y with respect to the four parameters, prove that the maximum likelihood estimates of φ , µ0 , µ 1 , and Σ are indeed as given in the formulas above. (You may assume that there is at least one positive and one negative example, so that the denominators in the deﬁnitions of µ0 and µ 1 above are non-zero.)

(e) [5 points] Coding problem. In src/linearclass/gda .py, ﬁll in the code to calculate

φ , µ0 , µ 1 , and Σ, use these parameters to derive θ, and use the resulting GDA model to make predictions on the validation set. Make sure to write your model’s predictions on the validation set to the ﬁle speciﬁed in the code.

(f) [2 points] For Dataset 1, compare the validation set plots obtained in part (b) and part (e)

from logistic regression and GDA respectively, and brieﬂy comment on your observation in a couple of lines.

(g) [5 points] Repeat the steps in part (b) and part (e) for Dataset 2. Create similar plots on

the validation set of Dataset 2 and include those plots in your writeup.

On which dataset does GDA seem to perform worse than logistic regression? Why might this be the case?

(h) [1 points] For the dataset where GDA performed worse in parts (f) and (g), can you ﬁnd

a transformation of the x(i)’s such that GDA performs signiﬁcantly better? What might this transformation be?

2. [30 points] Incomplete, Positive-Only Labels

In this problem we will consider training binary classiﬁers in situations where we do not have full access to the labels. In particular, we consider a scenario, which is not too infrequent in real life, where we have labels only for a subset of the positive examples. All the negative examples and the rest of the positive examples are unlabelled.

We formalize the scenario as follows. Let {(x(i), t(i))} be a standard dataset of i.i.d distributed examples. Here x(i)’s are the inputs/features and t(i) are the labels. Now consider the situation where t(i)’s are not observed by us. Instead, we only observe the labels of some of the positive examples. Concretely, we assume that we observe y(i)’s that are generated by

Ax, p(y(i) = 1 | t(i) = 1, x(i) = x) = α,

Ax, p(y(i) = 0 | t(i) = 1, x(i) = x) = 1 - α

Ax, p(y(i) = 1 | t(i) = 0, x(i) = x) = 0,

Ax, p(y(i) = 0 | t(i) = 0, x(i) = x) = 1

where α e (0, 1) is some unknown scalar. In other words, if the unobserved “true” label t(i) is 1, then with α chance we observe a label y (i) = 1. On the other hand, if the unobserved “true” label t(i) = 0, then we always observe the label y (i) = 0.

Our ﬁnal goal in the problem is to construct a binary classiﬁer h of the true label t, with only access to the partial label y . In other words, we want to construct h such that h(x(i)) s p(t(i) = 1 | x(i)) as closely as possible, using only x and y .

Real world example: Suppose we maintain a database of proteins which are involved in transmit- ting signals across membranes . Every example added to the database is involved in a signaling process, but there are many proteins involved in cross-membrane signaling which are missing from the database . It would be useful to train a classiﬁer to identify proteins that should be added to the database . In our notation, each example x(i) corresponds to a protein, y (i) = 1 if the protein is in the database and 0 otherwise, and t(i) = 1 if the protein is involved in a

cross-membrane signaling process and thus should be added to the database, and 0 otherwise . For the rest of the question, we will use the dataset and starter code provided in the following ﬁles:

❼ src/posonly/{train,valid,test} .csv

❼ src/posonly/posonly .py

Each ﬁle contains the following columns: x1 , x2 , y, and t. As in Problem 1, there is one example per row. The y(i)’s are generated from the process deﬁned above with some unknown α .

(a) [5 points] Coding problem: ideal (fully observed) case

First we will consider the hypothetical (and uninteresting) case, where we have access to the true t-labels for training. In src/posonly/posonly .py, write a logistic regression classiﬁer that uses x1 and x2 as input features, and train it using the t-labels. We will ignore the y-labels for this part. Output the trained model’s predictions on the test set to the ﬁle speciﬁed in the code.

Create a plot to visualize the test set with x1 on the horizontal axis and x2 on the vertical axis. Use diﬀerent symbols for examples x(i) with true label t(i) = 1 than those with t(i) = 0. On the same ﬁgure, plot the decision boundary obtained by your model (i.e, line corresponding to model’s predicted probability = 0.5) in red color. Include this plot in your writeup.

(b) [5 points] Coding problem: The naive method on partial labels

We now consider the case where the t-labels are unavailable, so you only have access to the y-labels at training time. Extend your code in src/posonly/posonly .py to re-train the classiﬁer (still using x1 and x2 as input features), but using the y-labels only. Output the predictions on the test set to the appropriate ﬁle (as described in the code comments).

Create a plot to visualize the test set with x1 on the horizontal axis and x2 on the vertical axis. Use diﬀerent symbols for examples x(i) with true label t(i) = 1 (even though we only used the y (i) labels for training, use the true t(i) labels for plotting) than those with t(i) = 0. On the same ﬁgure, plot the decision boundary obtained by your model (i.e, line corresponding to model’s predicted probability = 0.5) in red color. Include this plot in your writeup.

Note that the algorithm should learn a function h(.) that approximately predicts the prob- ability p(y(i) = 1 | x(i)). Also note that we expect it to perform poorly on predicting the probability of interest, namely p(t(i) = 1 | x(i)).

In the following sub-questions we will attempt to solve the problem with only partial observations. That is, we only have access to {(x(i), y (i))}, and will try to predict p(t(i) = 1|x(i)).

p(t(i) = 1 | y (i) = 1, x(i)) = 1

(1)

That is, observing a positive partial label y (i) = 1 tells us for sure the hidden true label is 1. Use Bayes rule to derive this (an informal explanation will not earn credit).

(d) [5 points] Show that for any example, the probability that true label t(i) is positive is 1/α times the probability that the partial label is positive. That is, show that

p(t(i) = 1 | x(i)) = . p(y(i) = 1 | x(i))

(2)

Note that the equation above suggests that if we know the value of α, then we can convert a function h(.) that approximately predicts the probability h(x(i)) s p(y(i) = 1 | x(i)) into a function that approximately predicts p(t(i) = 1 | x(i)) by multiplying the factor 1/α .

(e) [5 points] Estimating α

The solution to estimate p(t(i)|x(i)) outlined in the previous sub-question requires the knowl- edge of α which we don’t have. Now we will design a way to estimate α based on the function h(.) that approximately predicts p(y(i) = 1 | x(i)) (which we obtained in part b).

To simplify the analysis, let’s assume that we have magically obtained a function h(x) that perfectly predicts the value of p(y(i) = 1 | x(i)), that is, h(x(i)) = p(y(i) = 1 | x(i)). We make the crucial assumption that p(t(i) = 1 | x(i)) e {0, 1}. This assumption means that the process of generating the “true” label t(i) is a noise-free process. This assumption is not very unreasonable to make. Note, we are NOT assuming that the observed label y (i) is noise-free, which would be an unreasonable assumption!

Now we will show that:

α = 匝[h(北(i)) | y (i) = 1] (3)

The above result motivates the following algorithm to estimate α by estimating the RHS of the equation above using samples: Let V+ be the set of labeled (and hence positive) examples in the validation set V , given by V+ = {北(i) e V | y (i) = 1}.

Then we use

α s ù h(北(i)).

to estimate α . (You will be asked to implement this algorithm in the next sub-question. For this sub-question, you only need to show equation (3). Moreover, this sub-question may be slightly harder than other sub-questions.)

(f) [5 points] Coding problem.

Using the validation set, estimate the constant α by averaging your classiﬁer’s predictions over all labeled examples in the validation set:1

α s ù h(北(i)).

Add code in src/posonly/posonly .py to rescale your predictions h(y(i) = 1 | 北(i)) of the classiﬁer that is obtained from part b, using the equation (2) obtained in part (d) and using the estimated value for α .

Finally, create a plot to visualize the test set with 北1 on the horizontal axis and 北2 on the vertical axis. Use diﬀerent symbols for examples 北(i) with true label t(i) = 1 (even though we only used the y (i) labels for training, use the true t(i) labels for plotting) than those with t(i) = 0. On the same ﬁgure, plot the decision boundary obtained by your model (i.e, line corresponding to model’s adjusted predicted probability = 0.5) in red color. Include this plot in your writeup.

Remark: We saw that the true probability p(t | 北) was only a constant factor away from p(y | 北). This means, if our task is to only rank examples (i. e . sort them) in a particular order (e.g, sort the proteins in order of being most likely to be involved in transmitting signals across membranes), then in fact we do not even need to estimate α . The rank based on p(y | 北) will agree with the rank based on p(t | 北).

3. [25 points] Poisson Regression

In this question we will construct another kind of a commonly used GLM, which is called Poisson Regression. In a GLM, the choice of the exponential family distribution is based on the kind of problem at hand. If we are solving a classiﬁcation problem, then we use an exponential family distribution with support over discrete classes (such as Bernoulli, or Categorical). Simiarly, if the output is real valued, we can use Gaussian or Laplace (both are in the exponential family). Sometimes the desired output is to predict counts, for e.g., predicting the number of emails expected in a day, or the number of customers expected to enter a store in the next hour, etc. based on input features (also called covariates). You may recall that a probability distribution with support over integers (i.e. counts) is the Poisson distribution, and it also happens to be in the exponential family.

In the following sub-problems, we will start by showing that the Poisson distribution is in the exponential family, derive the functional form of the hypothesis, derive the update rules for training models, and ﬁnally using the provided dataset train a real model and make predictions on the test set.

(a) [5 points] Consider the Poisson distribution parameterized by λ:

e-λ λy

(Here y has positive integer values and y! is the factorial of y . ) Show that the Poisson distribution is in the exponential family, and clearly state the values for b(y), η , T (y), and a(η).

(b) [3 points] Consider performing regression using a GLM model with a Poisson response

variable. What is the canonical response function for the family? (You may use the fact that a Poisson random variable with parameter λ has mean λ .)

(c) [7 points] For a training set {(x(i), y (i)); i = 1, . . . , n}, let the log-likelihood of an example be log p(y(i)|x(i); θ). By taking the derivative of the log-likelihood with respect to θj , derive the stochastic gradient ascent update rule for learning using a GLM model with Poisson responses y and the canonical response function.

(d) [10 points] Coding problem

Consider a website that wants to predict its daily traﬃc. The website owners have collected a dataset of past traﬃc to their website, along with some features which they think are useful in predicting the number of visitors per day. The dataset is split into train/valid sets and the starter code is provided in the following ﬁles:

❼ src/poisson/{train,valid} .csv

❼ src/poisson/poisson .py

We will apply Poisson regression to model the number of visitors per day. Note that ap- plying Poisson regression in particular assumes that the data follows a Poisson distribution whose natural parameter is a linear combination of the input features (i. e ., η = θ T x). In src/poisson/poisson.py, implement Poisson regression for this dataset and use full batch gradient ascent to maximize the log-likelihood of θ . For the stopping criterion, check if the change in parameters has a norm smaller than a small value such as 10-5 .

Using the trained model, predict the expected counts for the validation set, and create a scatter plot between the true counts vs predicted counts (on the validation set). In the scatter plot, let x-axis be the true count and y-axis be the corresponding predicted expected count. Note that the true counts are integers while the expected counts are generally real values.

4. [15 points] Convexity of Generalized Linear Models

In this question we will explore and show some nice properties of Generalized Linear Models, speciﬁcally those related to its use of Exponential Family distributions to model the output.

Most commonly, GLMs are trained by using the negative log-likelihood (NLL) as the loss func- tion. This is mathematically equivalent to Maximum Likelihood Estimation (i. e ., maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood). In this problem, our goal is to show that the NLL loss of a GLM is a convex function w.r.t the model parameters. As a reminder, this is convenient because a convex function is one for which any local minimum is also a global minimum, and there is extensive research on how to optimize various types of con- vex functions eﬃciently with various algorithms such as gradient descent or stochastic gradient descent.

To recap, an exponential family distribution is one whose probability density can be represented p(y; η) = b(y) exp(ηT T (y) - a(η)),

where η is the natural parameter of the distribution. Moreover, in a Generalized Linear Model, η is modeled as θ T x, where x e Rd are the input features of the example, and θ e Rd are learnable parameters. In order to show that the NLL loss is convex for GLMs, we break down the process into sub-parts, and approach them one at a time. Our approach is to show that the second derivative (i. e ., Hessian) of the loss w.r.t the model parameters is Positive Semi-Deﬁnite (PSD) at all values of the model parameters. We will also show some nice properties of Exponential Family distributions as intermediate steps.

For the sake of convenience we restrict ourselves to the case where η is a scalar. Assume p(Y |X; θ) ~ ExponentialFamily(η), where η e R is a scalar, and T (y) = y . This makes the exponential family representation take the form

p(y; η) = b(y) exp(ηy - a(η)).

(a) [5 points] Derive an expression for the mean of the distribution. Show that 匝[Y ; η] = a(η)

(note that 匝[Y ; η] = 匝[Y |X; θ] since η = θ T x). In other words, show that the mean of an exponential family distribution is the ﬁrst derivative of the log-partition function with respect to the natural parameter.

Hint: Start with observing that ( p(y; η)dy = ( p(y; η)dy .

(b) [5 points] Next, derive an expression for the variance of the distribution. In particular, show that Var(Y ; η) = a(η) (again, note that Var(Y ; η) = Var(Y |X; θ)). In other words, show that the variance of an exponential family distribution is the second derivative of the log- partition function w.r.t. the natural parameter.

Hint: Building upon the result in the previous sub-problem can simplify the derivation.

(c) [5 points] Finally, write out the loss function y(θ), the NLL of the distribution, as a function of θ . Then, calculate the Hessian of the loss w.r.t θ, and show that it is always PSD. This concludes the proof that NLL loss of GLM is convex.

Hint 1: Use the chain rule of calculus along with the results of the previous parts to simplify your derivations.

Hint 2: Recall that variance of any probability distribution is non-negative. Remark: The main takeaways from this problem are:

❼ Any GLM model is convex in its model parameters.

❼ The exponential family of probability distributions are mathematically nice. Whereas cal-

culating mean and variance of distributions in general involves integrals (hard), surprisingly we can calculate them using derivatives (easy) for exponential family.

5. [25 points] Linear regression: linear in what?

In the ﬁrst two lectures, you have seen how to ﬁt a linear function of the data for the regression problem. In this question, we will see how linear regression can be used to ﬁt non-linear functions of the data using feature maps. We will also explore some of its limitations, for which future lectures will discuss ﬁxes.

(a) [5 points] Learning degree-3 polynomials of the input

Suppose we have a dataset {(x(i), y (i))} where x(i), y (i) e R. We would like to ﬁt a third degree polynomial hθ (x) = θ3 x3 + θ2 x2 + θ 1 x1 + θ0 to the dataset. The key observation here is that the function hθ (x) is still linear in the unknown parameter θ, even though it’s not linear in the input x. This allows us to convert the problem into a linear regression problem as follows.

Let φ : R → R4 be a function that transforms the original input x to a 4-dimensional vector deﬁned as

- 1 -

φ(x) = e R4 (4)

│ x3 '

Let e R4 be a shorthand for φ(x), and let (i) 全 φ(x(i)) be the transformed input in the training dataset. We construct a new dataset {(φ(x(i)), y(i))} = {((i), y (i))} by replacing the original inputs x(i)’s by (i)’s. We see that ﬁtting hθ (x) = θ3 x +θ32 x +θ2 1 x1 + θ0 to the old dataset is equivalent to ﬁtting a linear function hθ () = θ3 3 + θ2 2 + θ1 1 + θ0 to the new dataset because

hθ (x) = θ3 x3 + θ2 x2 + θ1 x1 + θ0 = θ3 φ(x)3 + θ2 φ(x)2 + θ1 φ(x)1 + θ0 = θ T (5)

In other words, we can use linear regression on the new dataset to ﬁnd parameters θ0 , . . . , θ3 . Please write down 1) the objective function J(θ) of the linear regression problem on the new dataset {((i), y (i))} and 2) the update rule of the batch gradient descent algorithm for linear regression on the dataset {((i), y (i))} .

Terminology: In machine learning, φ is often called the feature map which maps the original input x to a new set of variables. To distinguish between these two sets of variables, we will call x the input attributes, and call φ(x) the features. (Unfortunately, diﬀerent authors use diﬀerent terms to describe these two things. In this course, we will do our best to follow the above convention consistently.)

(b) [5 points] Coding question: degree-3 polynomial regression

For this sub-question question, we will use the dataset provided in the following ﬁles: src/featuremaps/{train,valid,test} .csv

Each ﬁle contains two columns: x and y . In the terminology described in the introduction, x is the attribute (in this case one dimensional) and y is the output label.

Using the formulation of the previous sub-question, implement linear regression with nor- mal equations using the feature map of degree-3 polynomials. Use the starter code pro- vided in src/featuremaps/featuremap .py to implement the algorithm.

Create a scatter plot of the training data, and plot the learnt hypothesis as a smooth curve over it. Submit the plot in the writeup as the solution for this problem.

Remark: Suppose is the design matrix of the transformed dataset. You may sometimes

-' x(1) -(

φ(x) = '(') x2 e Rk+1 (6)

' (

│ xk '

Follow the same procedure as the previous sub-question, and implement the algorithm with k = 3, 5, 10, 20. Create a similar plot as in the previous sub-question, and include the hypothesis curves for each value of k with a diﬀerent color. Include a legend in the plot to indicate which color is for which value of k .

Submit the plot in the writeup as the solution for this sub-problem. Observe how the ﬁtting of the training dataset changes as k increases. Brieﬂy comment on your observations in the plot.