Homework 3: Gaussian Classifiers & Logistic Regression

CS275P: Graphical Models & Statistical Learning

2022

In the first two questions, we use multinomial logistic regression models to predict one of K discrete classes. Let $t_{nk} = 1$ if observation n is an instance of class k, and $t_{nk} = 0$ otherwise. Given input variable $x_n$, let $\phi(x_n) \in \mathbb{R}^M$ be a fixed feature function. The multinomial logistic regression model then defines the conditional probability of class k as follows:

$$p(t_{nk} = 1 \mid x_n, w) = \frac{\exp(w_k^T \phi(x_n))}{\sum_{\ell=1}^{K} \exp(w_\ell^T \phi(x_n))}.$$

Here $w_k \in \mathbb{R}^M$ are the feature weights associated with class k, and $w \in \mathbb{R}^{KM}$ is a vector concatenating the weights for all classes.
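The class probabilities defined above can be computed for a whole dataset at once. The following is a minimal sketch, assuming the weights are stored as a K x M array W (one row per class) and the features as an N x M array Phi; these names and layouts are illustrative choices, not part of the assignment's sample code. The last lines also check numerically that adding the same vector to every $w_k$ leaves the probabilities unchanged.

```python
import numpy as np

def class_probabilities(W, Phi):
    """Return the N x K matrix whose (n, k) entry is p(t_nk = 1 | x_n, w).

    W   : K x M array, row k holds the weights w_k (assumed layout).
    Phi : N x M array, row n holds the features phi(x_n).
    """
    scores = Phi @ W.T                                    # (n, k) entry is w_k^T phi(x_n)
    scores = scores - scores.max(axis=1, keepdims=True)   # avoids overflow, probabilities unchanged
    expo = np.exp(scores)
    return expo / expo.sum(axis=1, keepdims=True)

# Illustrative random inputs (made up for this sketch).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))
W = rng.normal(size=(4, 3))
shift = rng.normal(size=3)

P = class_probabilities(W, Phi)
P_shifted = class_probabilities(W + shift, Phi)   # same vector added to every w_k
```

Each row of P sums to one, and P_shifted matches P up to floating-point error.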

This model is not identifiable: coordinated translations of the weight vectors $w_k$ for different classes can produce identical probabilities. To avoid this ambiguity and improve robustness, we place a zero-mean, diagonal-covariance Gaussian prior on the weight vector:

$$p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I_{KM}) \propto \exp\left\{ -\frac{\alpha}{2} w^T w \right\}.$$

Here, $\alpha$ is a tunable inverse-variance parameter that controls the degree of regularization. The regularized negative log conditional likelihood, or equivalently the objective whose minimum equals the MAP estimate of the weight vector w, is then

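$$f(w) = -\log p(w) - \sum_{n=1}^{N} \log p(t_n \mid x_n, w)$$

$$= \frac{\alpha}{2} w^T w - \sum_{n=1}^{N} \left[ \sum_{k=1}^{K} t_{nk} w_k^T \phi(x_n) - \log\left( \sum_{\ell=1}^{K} \exp(w_\ell^T \phi(x_n)) \right) \right],$$

ignoring constants which do not depend on w.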
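To make the objective concrete, here is a hedged sketch of how f(w) might be evaluated with NumPy. It assumes w is a flat vector of length KM that reshapes to a K x M array with row k equal to $w_k$, that Phi is the N x M feature matrix, and that T is the N x K matrix of indicators $t_{nk}$; these conventions are assumptions of this sketch. The gradient requested in Question 1(a) is deliberately not included.

```python
import numpy as np

def objective(w, Phi, T, alpha):
    """Regularized negative log conditional likelihood f(w), up to constants.

    w     : flat vector of length K*M; reshaped so that row k is w_k (assumed layout).
    Phi   : N x M feature matrix.
    T     : N x K one-hot label matrix with entries t_nk.
    alpha : inverse variance of the Gaussian prior.
    """
    K = T.shape[1]
    W = w.reshape(K, -1)
    scores = Phi @ W.T                                   # N x K, entry (n, k) is w_k^T phi(x_n)
    # Numerically stable log(sum_l exp(scores[n, l])) for each row n.
    row_max = scores.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1))
    log_lik = np.sum(T * scores) - np.sum(log_norm)
    return 0.5 * alpha * (w @ w) - log_lik

# Sanity check on made-up data: at w = 0 every class has probability 1/K,
# so the objective equals N * log(K) regardless of alpha.
N, M, K = 6, 3, 4
rng = np.random.default_rng(0)
Phi = rng.normal(size=(N, M))
T = np.eye(K)[rng.integers(0, K, size=N)]
f0 = objective(np.zeros(K * M), Phi, T, alpha=0.1)   # alpha here is an arbitrary demo value
```

Checking a known value such as f(0) = N log K is a cheap way to catch sign or indexing mistakes before optimizing.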

Question 1: (40 points)

a) Give an equation for the partial derivative of the regularized negative log conditional likelihood f(w) with respect to a particular feature weight $w_{km}$. This derivative is one entry of the gradient vector $\nabla f(w)$.

We reexamine the gamma ray data set from Homework 1, but instead apply a logistic regression model for the binary classification of star showers. We use the same split of the data into 15,216 training examples and 3,804 test examples. We compare the performance of three different feature mappings of the D = 10 raw inputs. In all cases, the first feature should be a constant bias or offset term, $\phi_0(x_n) = 1$. The three feature sets are then:

1. D + 1 linear features, the bias feature plus the raw inputs $\phi_m(x_n) = x_{nm}$, $1 \le m \le D$.

2. 2D + 1 diagonal quadratic features, including the D + 1 linear features from set 1, as well as D quadratic features $\phi_{D+m}(x_n) = x_{nm}^2$, $1 \le m \le D$.

3. (D + 1)D/2 + D + 1 general quadratic features, including the 2D + 1 features from set 2, as well as products of all pairs of input dimensions $x_{nm} x_{nm'}$, $m \neq m'$.
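The three feature sets above can be built from an N x D input matrix as sketched below. The function names are hypothetical, and the column ordering within each set is one reasonable choice rather than something the assignment prescribes. Each unordered pair of distinct inputs contributes one product, which is what makes the feature count in set 3 come out to (D + 1)D/2 + D + 1.

```python
import numpy as np
from itertools import combinations

def linear_features(X):
    """Set 1: constant bias column followed by the D raw inputs (D + 1 columns)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def diagonal_quadratic_features(X):
    """Set 2: set 1 plus the D squared inputs (2D + 1 columns)."""
    return np.hstack([linear_features(X), X ** 2])

def general_quadratic_features(X):
    """Set 3: set 2 plus one product per unordered pair of distinct inputs."""
    D = X.shape[1]
    base = diagonal_quadratic_features(X)
    cross = [X[:, i] * X[:, j] for i, j in combinations(range(D), 2)]
    if not cross:                      # D = 1 has no pairs of distinct inputs
        return base
    return np.hstack([base, np.column_stack(cross)])

# Made-up inputs with D = 10, matching the dimensionality in the text.
X_demo = np.arange(20, dtype=float).reshape(2, 10)
shapes = [f(X_demo).shape for f in (linear_features,
                                    diagonal_quadratic_features,
                                    general_quadratic_features)]
```

For D = 10 the three mappings produce 11, 21, and 66 columns respectively.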

Given this data and the objective from part (a), we will use the minimize function from the scipy.optimize package to find the weight vector w that minimizes the regularized negative log conditional likelihood f(w). You should provide the optimizer with the gradient of the regularized negative log conditional likelihood, and write your own function to compute the objective and its gradient. (You may not simply use numerical approximations to the gradient.) To get started with minimize, see the sample code released with the homework, which includes suggestions for appropriate convergence tolerance parameters.
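For readers unfamiliar with the calling convention, the sketch below shows one way to pass an analytic gradient to scipy.optimize.minimize: with jac=True, the objective function returns a (value, gradient) pair. The quadratic toy_objective is a stand-in chosen for this illustration, not the logistic regression objective, and the method and tolerance shown are assumptions rather than the settings recommended in the homework's sample code.

```python
import numpy as np
from scipy.optimize import minimize

def toy_objective(w, A, b):
    """Return (value, gradient) of 0.5 * w^T A w - b^T w, a stand-in for f(w)."""
    value = 0.5 * w @ A @ w - b @ w
    grad = A @ w - b                  # exact gradient, valid because A is symmetric
    return value, grad

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # symmetric positive definite, so a unique minimum exists
b = np.array([1.0, -1.0])

result = minimize(toy_objective, x0=np.zeros(2), args=(A, b),
                  jac=True,           # tells minimize the function also returns the gradient
                  method="L-BFGS-B", tol=1e-10)
w_hat = result.x
```

The minimizer of this toy problem solves A w = b, which gives an independent check on w_hat. While debugging a hand-written gradient, scipy.optimize.check_grad can compare it against finite differences; the optimization itself should still use the analytic gradient.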

For all of the following questions, set the regularization constant to $\alpha = 10^{-6}$, and test each of the three feature sets defined above.

b) For each of the three feature sets, train the logistic regression model by running gradient-based optimization to convergence. Report the accuracy and (regularized) negative log-likelihood of the classifier when training and evaluating on train.

c) For each of the three feature sets, and the weight vectors w learned in part (b), evaluate and report the test accuracy and test (regularized) negative log-likelihood of the corresponding classifiers.

d) Compare the test accuracies of the logistic regression models to the Gaussian naive Bayes classifier from Homework 1. Which method is more accurate? Using concepts from lecture, briefly discuss possible reasons for the observed performance differences.

Question 2: (30 points)

This question uses synthetic data to compare the properties of logistic regression and linear regression for classification. Each “toy” data item has two continuous features $x \in \mathbb{R}^2$ and is labeled as one of either K = 2 or K = 3 classes.

Linear regression code (as in Homework 2) can be adapted for classification as follows. For K classes, each response is encoded as a row vector $t_n = [t_{n1}, \ldots, t_{nK}]$, where $t_{nk} = 1$ for an example of class k, and zero otherwise. For N data samples we define the $N \times K$ matrix T as a matrix of 0’s and 1’s, with each row having a single 1. We fit a linear regression model to each of the columns of T as follows:

$$\hat{T} = \Phi (\Phi^T \Phi)^{-1} \Phi^T T$$

Here, $\Phi$ is the $N \times 3$ model matrix corresponding to the feature function $\phi(x_n) = [1, x_{n1}, x_{n2}]^T$, i.e. the raw 2D input data augmented by a constant bias feature. The weights corresponding to the least squares prediction above equal

$$\hat{W} = (\Phi^T \Phi)^{-1} \Phi^T T$$

Here, $\hat{W}$ is a $3 \times K$ matrix where the kth column $\hat{w}_k$ represents the linear regression fit for class k. Finally, we can use this linear regression model to classify a new observation as

$$y(x) = \arg\max_k \; \phi(x)^T \hat{w}_k.$$
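The least squares classifier described above can be sketched in a few lines. This version assumes zero-based integer labels (0 to K-1, unlike the 1 to K indexing in the text) and uses np.linalg.lstsq in place of the explicit inverse; when Phi has full column rank the two give the same weights, and lstsq is usually the more numerically stable choice. The function names and the toy data are made up for illustration.

```python
import numpy as np

def fit_least_squares(Phi, labels, K):
    """Least squares weights W_hat (M x K) for one-hot encoded targets.

    labels holds integer classes 0..K-1 (an assumption of this sketch).
    """
    T = np.eye(K)[labels]                               # N x K matrix with a single 1 per row
    W_hat, *_ = np.linalg.lstsq(Phi, T, rcond=None)     # minimizes ||Phi W - T||^2
    return W_hat

def classify(W_hat, Phi):
    """Pick the class whose linear regression output phi(x)^T w_k is largest."""
    return np.argmax(Phi @ W_hat, axis=1)

# Toy data (not the assignment's dataset): two well-separated clouds in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5.0, 1.0, size=(50, 2)),
               rng.normal(+5.0, 1.0, size=(50, 2))])
labels = np.repeat([0, 1], 50)
Phi = np.hstack([np.ones((100, 1)), X])                 # bias feature plus raw inputs
W_hat = fit_least_squares(Phi, labels, K=2)
accuracy = np.mean(classify(W_hat, Phi) == labels)
```

Here W_hat has one column per class, and the arg max over columns implements the classification rule.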

The supplied function plotter_classifier can be used to visualize decision boundaries.

We compare the performance of this linear least squares classifier to a multinomial logistic regression classifier, both using the same features. To fit multinomial logistic regression models, use your implementation from Question 1, with a small but positive regularization constant $\alpha = 10^{-6}$ to ensure identifiability.

a) The first dataset contains two classes which lie in well-separated clouds. Implement the least squares classifier described above. Estimate weights from training data, and plot the learned decision boundary together with the training points. If implemented correctly, your test accuracy should be 100%. Is this the case?

b) The second dataset contains three classes with means arranged in a triangular pattern. Train a least squares classifier as above, as well as a multinomial logistic regression classifier using the same features. Plot the training decision boundaries for both classifiers, and report test accuracy for each. Explain any performance differences.

c) The third dataset contains three classes with means arranged in a straight line. Train a least squares classifier as above, as well as a multinomial logistic regression classifier using the same features. Plot the training decision boundaries for both classifiers, and report test accuracy for each. Explain any performance differences.

Question 3: (30 points)

We now consider a binary categorization problem, where $t_n \in \{0, 1\}$ is the output label for example n, and $x_n \in \mathbb{R}^2$ is a two-dimensional vector of input features. Assume that the two classes are equally likely a priori, so that $p(t_n) = \mathrm{Ber}(t_n \mid 0.5)$. Under the true data generation process, the features are distributed according to class-specific Gaussians:

$$p(x_n \mid t_n = 1) = \mathrm{Normal}(x_n \mid \mu_1, \Sigma), \qquad p(x_n \mid t_n = 0) = \mathrm{Normal}(x_n \mid \mu_0, \Sigma).$$

The mean vectors $\mu_1, \mu_0$ are discussed below. The shared covariance matrix equals:

$$\Sigma = [\,\ldots\,] = [\,\ldots\,], \qquad \Sigma^{-1} = [\,\ldots\,].$$
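As a reminder of a standard identity (stated in general form, without plugging in this assignment's numbers): when two Gaussian classes have equal prior probability and share a covariance matrix, the terms quadratic in x cancel from the log posterior odds, leaving a function that is linear in x. A sketch of the algebra:

$$\log \frac{p(t_n = 1 \mid x_n)}{p(t_n = 0 \mid x_n)} = -\tfrac{1}{2}(x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1) + \tfrac{1}{2}(x_n - \mu_0)^T \Sigma^{-1} (x_n - \mu_0) = (\mu_1 - \mu_0)^T \Sigma^{-1} \left( x_n - \tfrac{1}{2}(\mu_0 + \mu_1) \right).$$

The probability of error is minimized by predicting the class with the larger posterior, that is, class 1 exactly when this quantity is positive.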

a) Suppose that $\mu_0 = [0, 0]^T$, $\mu_1 = [2, 0]^T$. Given knowledge of the true joint distribution $p(x_n, t_n)$, derive a classification rule $y(x_n)$ which minimizes the probability of error. Plot the corresponding decision boundary graphically.

b) Suppose that $\mu_0 = [0, 0]^T$, $\mu_1 = [2, 2]^T$. Given knowledge of the true joint distribution $p(x_n, t_n)$, derive a classification rule $y(x_n)$ which minimizes the probability of error. Plot the corresponding decision boundary graphically.

Now suppose that we do not have knowledge of the true data generating process, but instead assume a naive Bayes model with Gaussian features:

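$$p(x_n \mid t_n = 1) = \mathrm{Normal}(x_{n1} \mid \theta_{11}, \nu_{11}) \, \mathrm{Normal}(x_{n2} \mid \theta_{12}, \nu_{12}),$$

$$p(x_n \mid t_n = 0) = \mathrm{Normal}(x_{n1} \mid \theta_{01}, \nu_{01}) \, \mathrm{Normal}(x_{n2} \mid \theta_{02}, \nu_{02}).$$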

Consider a training dataset with N observations (xn , tn ) independently sampled from the true joint distribution p(x, t). In each question below, assume that the parameters θ and ν of the naive Bayes model are estimated via the maximum likelihood (ML) criterion.
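For the Gaussian naive Bayes model, the ML estimates are the per-class, per-feature sample means and (biased) sample variances. The sketch below illustrates this on simulated data. The covariance Sigma_demo is a hypothetical value chosen only for the demonstration and is not the assignment's Σ; the function name and the array layout (row c for class c, column j for feature j) are likewise assumptions.

```python
import numpy as np

def fit_gaussian_naive_bayes(X, t):
    """ML estimates of per-class, per-feature means (theta) and variances (nu).

    Returns two arrays of shape (2, D); row c holds the estimates for class c.
    """
    theta = np.stack([X[t == c].mean(axis=0) for c in (0, 1)])
    nu = np.stack([X[t == c].var(axis=0) for c in (0, 1)])    # ddof=0 is the ML estimate
    return theta, nu

rng = np.random.default_rng(0)
Sigma_demo = np.array([[1.0, 0.5],
                       [0.5, 1.0]])           # hypothetical covariance, NOT the assignment's
mu = np.array([[0.0, 0.0],
               [2.0, 0.0]])                   # row c is the mean of class c
t = rng.integers(0, 2, size=20000)            # equal class priors
X = mu[t] + rng.multivariate_normal(np.zeros(2), Sigma_demo, size=t.size)
theta, nu = fit_gaussian_naive_bayes(X, t)
```

With many samples, theta approaches the true class means and nu approaches the diagonal of the true covariance; the off-diagonal correlation is never estimated, because the naive Bayes model has no parameter for it.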

c) Suppose that $\mu_0 = [0, 0]^T$, $\mu_1 = [2, 0]^T$. As $N \to \infty$, what classification rule will the naive Bayes classifier approach? Will it be as accurate as the optimal rule from part (a)? Justify your answer and plot the corresponding decision boundary graphically.

d) Suppose that $\mu_0 = [0, 0]^T$, $\mu_1 = [2, 2]^T$. As $N \to \infty$, what classification rule will the naive Bayes classifier approach? Will it be as accurate as the optimal rule from part (b)? Justify your answer and plot the corresponding decision boundary graphically.