闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Homework 1

CSE 4820/5819, Spring 2023

Released: January 31, 2023

Due: February 14, 2023

Note: You may use any programming language and plotting software to complete Ex- ercise 2. However, Python is recommended. Please submit your code for full credit (answers and plots without code will only receive half credit).

Exercise 1 [40 points]. Mathematical Background

1. (5 points) Let x e Rn and A, B e Rn xn . What is an equivalent expression to (xT AB)T ?

2. (5 points) Consider the set of points }x e R2 : lxlp = 1I.

(a) Plot this set of points for p = 2. This is the unit sphere in L2 .

(b) On the same ﬁgure, plot the unit sphere in L1 .

(d) What do you notice about the unit spheres as p increases from 1 to o?

3. (5 points) Let x, y e Rn .

(a) Express lx - yl2(2) using the dot product.

(b) If you have not already done so in part (a), simplify your answer above as the

sum of three terms.

4. (20 points) Consider the multivariate quadratic function f (x) = xT –0(1) 1(4)! x + – !2(1) T x + 1

(a) What is the gradient Vf (x)?

(b) What is the Hessian V2 f (x)?

5. (5 points) Suppose we are solving a minimization problem. In class, we discussed two diﬀerent examples of a type of iterative optimization algorithm called m之n| s|步+|| ,|i|← 。|s: (1) steepest descent (gradient descent), and (2) Newton’s method. In gradient descent, at every iteration n we need to calculate the gradient Vf evaluated at xn in order to compute the next guess xn+1 = xn - αVf(xn ). When using Newton’s method to minimize a function f, at every iteration in addition to Vf(xn ) what additional information about f at xn do we need to calculate in order to compute xn+1? What are the advantages and disadvantages of using this extra information?

Exercise 2 [60 points]. Logistic Regression

1. (10 points) Recall that in logistic regression, we have the model P (y = 1|x; θ) = g(θT x) =

(a) Suppose x, θ e R. Plot hθ (x) = g(θT x) for θ = 1, 2, and 5 on the same plot. How does the shape of g change as θ increases? What is the shape of hθ (x) = g(θT x) in the limit θ → o?

(b) When the training data are linearly separable (there exists a line/hyperplane that

perfectly separates the labeled data into diﬀerent classes), we will maximize the likelihood of the data when θ is chosen so that any positive distance from the decision boundary corresponds to P (y = 1|x; θ) = 1 and any negative distance corresponds to P (y = 1|x; θ) = 0. Another way to say this is that we want to ﬁnd θ such that g(θT x) = 1 whenever θT x > 0, and g(θT x) = 0 whenever θT x < 0. What θ gives this model? (Hint: see plots above.)

● Note this shows logistic regression does not converge when the training data are linearly separable. In practice, data are rarely lin- early separable, but there are also diﬀerent ways to modify logistic regression to take the separable case into account, which we will return to when we discuss regularization.

2. (20 points) Recall that for m datapoints (x(i), y(i)), where x e Rn is the feature vector, y e }0, 1I is the class label, and i = 1, ..., m is the index of the datapoint, the log likelihood for logistic regression is

l(θ) = y(i) log hθ (x(i)) + (1 - y(i)) log(1 - hθ (x(i)))

i=1

where hθ (x(i)) = g(θT x(i)).

(a) Let m = 1 so that we consider only the log likelihood of the single datapoint

(x, y). Use the known fact that g\ (z) = g(z)(1 - g(z)) to show that

∂

Note that this gives us the jth element of the gradient Vθ l.

(b) What is the expression for the gradient for arbitrary m?

3. (30 points) Download the ﬁles xtaal .dat and ytaal .dat included in the homework. The ﬁrst contains the features and the second contains the labels for 99 datapoints.

(a) Save the data from xtaal .dat in a matrix, X . Each row of xtaal .dat corre-

sponds to a datapoint, and each column corresponds to a feature. In this dataset, how many features does each datapoint have?

(b) Something we did not go into detail in class is that in general, the decision bound-

ary separating the two classes does not pass through the origin. Therefore, the linear model θT x = 0 for the decision boundary actually implicitly includes an intercept term θ0 . Let [x1 , x2] be the feature vector describing a given datapoint. Then in logistic regression, we add a constant feature 1 to the feature vector in order to get an intercept term:

θT x = [θ0 θ 1 θ2] ┌x(1)1 ┐

Add this constant feature to your matrix X .

(c) Use your answer from part 2(a) above to write code for ﬁnding θ * that maximizes the log likelihood using gradient ascent. Start from an initial guess of θ = 0 (the vector of all zeros), and stop when lVθ ll2 < 10-6 (the tolerance we’re using to approximate the condition Vθ l = 0).

i. Use a learning rate of α = 0.01. How many steps did you need to take?

ii. Now try using α = 0.1. What happens?

iii. How many steps do you need for convergence when α = 0.001?

iv. What do you conclude about how the learning rate aﬀects convergence?

(d) Plot the data, using diﬀerent colors for each label, and the decision boundary that you found in part 3(c) above. Recall the decision boundary is the line where θ0(*) + θ1(*)x1 + θ2(*)x2 = 0.

(e) Now let’s make a prediction on a new datapoint. According to your trained model in part 3(c) above, what is P (y = 1|x = [2 1]T ; θ* )? On the other hand, what is P (y = 0|x = [2 1]T ; θ* )? Therefore, what class label y do we predict for x = [2 1]T ?

2023-02-18

Java

物理(Physical)

LINUX

C++

Python

Processing

sas

ios

maths

maple

C语言