
Homework Assignment 1

Math 156, Summer 2022

Problem 1 (30 points): Weight-decay Regularization and Regularization by Adding Input Noise


Consider a linear model of the form

y(x, w) = w_0 + \sum_{i=1}^{D} w_i x_i                                                          (1)

together with a sum-of-squares error function of the form

E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2.                                                    (2)


Now suppose that Gaussian noise ϵ_i with zero mean and variance σ^2 is added independently to each of the input variables x_i. By making use of E[ϵ_i] = 0 and E[ϵ_i ϵ_j] = δ_{ij} σ^2, show that minimizing E_D averaged over the noise distribution is equivalent to minimizing the sum-of-squares error for noise-free input variables with the addition of a weight-decay regularization term, in which the bias parameter w_0 is omitted from the regularizer.
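
The problem asks for an algebraic derivation; the following sketch is only a numerical sanity check of the claim. It is written in Python/NumPy, and the values of N, D, σ, the weights, and the targets are all invented for illustration. It averages E_D of Eq. (2) over many draws of the input noise and compares the resulting excess over the noise-free error with a quadratic penalty on w_1, . . . , w_D (with w_0 omitted).

    import numpy as np

    # Monte Carlo sanity check of the Problem 1 claim (not the requested derivation).
    # N, D, sigma, the weights, and the targets are illustrative choices, not given data.
    rng = np.random.default_rng(0)
    N, D, sigma = 200, 5, 0.3

    X = rng.normal(size=(N, D))                  # noise-free inputs x_n
    w0, w = 0.7, rng.normal(size=D)              # fixed parameters of the linear model
    t = rng.normal(size=N)                       # arbitrary targets t_n

    def E_D(X_in):
        # Sum-of-squares error of Eq. (2) for the linear model of Eq. (1).
        y = w0 + X_in @ w
        return 0.5 * np.sum((y - t) ** 2)

    # Average E_D over independent draws of the input noise eps_ni ~ N(0, sigma^2).
    draws = 5000
    avg_noisy = np.mean([E_D(X + sigma * rng.normal(size=X.shape)) for _ in range(draws)])

    excess = avg_noisy - E_D(X)                          # extra error caused by the noise
    penalty = 0.5 * N * sigma**2 * np.sum(w ** 2)        # quadratic penalty on w_1..w_D, w_0 omitted
    print(excess, penalty)                               # should agree up to Monte Carlo error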

Problem 2 (30 points): Multiple Outputs

Consider a linear basis function regression model for a multivariate target variable t having a Gaussian distribution of the form

p(t|W, Σ) = N(t|y(x, W), Σ)                                         (3)

where

y(x, W) = W^\top ϕ(x)                                                 (4)

together with a training dataset comprising input basis vectors ϕ(x_n) and corresponding target vectors t_n, with n = 1, . . . , N. Show that the maximum likelihood solution W_{ML} for the parameter matrix W has the property that each of its columns is given by an expression of the form w_{ML} = (Φ^\top Φ)^{-1} Φ^\top t, which was the solution for an isotropic noise distribution (see Section 3.1.5 on page 146 of Bishop's book, Pattern Recognition and Machine Learning).

Note that this is independent of the covariance matrix Σ. Show that the maximum likelihood solution for Σ is given by

Σ = \frac{1}{N} \sum_{n=1}^{N} \left( t_n - W_{ML}^\top ϕ(x_n) \right) \left( t_n - W_{ML}^\top ϕ(x_n) \right)^\top.                     (5)
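
The derivation itself is the exercise; the sketch below (Python/NumPy, with an invented design matrix, target matrix, and covariance, since none are given) evaluates the column-wise solution for every column at once, checks that this W is a stationary point of the Σ-weighted log-likelihood for an arbitrary positive-definite Σ (one way to see the claimed independence), and then forms the estimate of Eq. (5).

    import numpy as np

    # Illustration of Problem 2 (not the derivation). The design matrix, targets,
    # and the covariance Sigma_any below are invented for this sketch.
    rng = np.random.default_rng(1)
    N, M, K = 100, 4, 3                       # data points, basis functions, target dimensions

    Phi = rng.normal(size=(N, M))             # design matrix with rows phi(x_n)^T
    T = rng.normal(size=(N, K))               # target matrix with rows t_n^T

    # Column-wise maximum likelihood solution: w_k = (Phi^T Phi)^{-1} Phi^T t_k for each k.
    W_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ T)       # shape (M, K)

    # The same W is a stationary point of the Sigma-weighted log-likelihood for any
    # covariance Sigma: the gradient Phi^T (Phi W - T) Sigma^{-1} vanishes at W_ml.
    A = rng.normal(size=(K, K))
    Sigma_any = A @ A.T + np.eye(K)                      # a random positive-definite covariance
    grad = Phi.T @ (Phi @ W_ml - T) @ np.linalg.inv(Sigma_any)
    print(np.max(np.abs(grad)))                          # ~ 0, independent of Sigma_any

    # Maximum likelihood covariance, Eq. (5).
    R = T - Phi @ W_ml                                   # rows are t_n - W_ml^T phi(x_n)
    Sigma_ml = (R.T @ R) / N
    print(Sigma_ml)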

Problem 3 (40 points): Probabilistic Generative Classification Model for K Classes

(i) (20 points) Consider a probabilistic generative classification model for K classes defined by prior class probabilities p(C_k) = π_k and general class-conditional densities p(ϕ|C_k), where ϕ is the input feature vector. Suppose we are given a training dataset {ϕ_n, t_n}, n = 1, . . . , N, where t_n is a binary target vector of length K that uses the 1-of-K coding scheme, so that it has components t_{nj} = I_{jk} if pattern n is from class C_k (I_{jk} = 1 if j = k and 0 otherwise). Assuming that the data points are drawn independently from this model, show that the maximum likelihood solution for the prior probabilities is given by

π_k = \frac{N_k}{N}                                                                              (6)

where N_k is the number of data points assigned to class C_k.
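
A brief numerical illustration of part (i), not a proof: with the targets stored as rows of a one-hot matrix, N_k is just a column sum, so the stated solution is one line of NumPy. The class labels below are invented.

    import numpy as np

    # 1-of-K coded targets for an invented set of class labels (Problem 3(i)).
    labels = np.array([0, 2, 1, 2, 2, 0, 1, 2])   # example class indices
    K = 3
    T = np.eye(K)[labels]                         # rows t_n with components t_nk in {0, 1}

    N_k = T.sum(axis=0)                           # N_k = sum_n t_nk
    pi = N_k / len(labels)                        # Eq. (6): pi_k = N_k / N
    print(pi)                                     # class proportions 0.25, 0.25, 0.5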

(ii) (20 points) Consider the classification model of part (i) above, and now suppose that the class-conditional densities are given by Gaussian distributions with a shared covariance matrix, so that

p(ϕ|C_k) = N(ϕ|µ_k, Σ).                                                   (7)

Show that the maximum likelihood solution for the mean of the Gaussian distribution for class C_k is given by

µ_k = \frac{1}{N_k} \sum_{n=1}^{N} t_{nk} ϕ_n                                                              (8)

which represents the mean of those feature vectors assigned to class C_k. Similarly, show that the maximum likelihood solution for the shared covariance matrix is given by

Σ = \sum_{k=1}^{K} \frac{N_k}{N} S_k                                                                       (9)

where

S_k = \frac{1}{N_k} \sum_{n=1}^{N} t_{nk} (ϕ_n - µ_k)(ϕ_n - µ_k)^\top.                                    (10)

Thus Σ is given by a weighted average of the covariances of the data associated with each class, in which the weighting coefficients are given by the prior probabilities of the classes.
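
As with part (i), Eqs. (8)-(10) can be evaluated directly from a one-hot target matrix. The sketch below (Python/NumPy; the number of classes, the dimensionality, and the dataset are all invented) computes the class means and then forms Σ as the N_k/N-weighted combination of the per-class covariances S_k.

    import numpy as np

    # Shared-covariance Gaussian class-conditionals (Problem 3(ii)); the dataset
    # below is invented purely to exercise Eqs. (8)-(10).
    rng = np.random.default_rng(2)
    K, D, N = 3, 2, 300

    labels = rng.integers(0, K, size=N)
    T = np.eye(K)[labels]                              # one-hot targets, entries t_nk
    Phi = rng.normal(size=(N, D)) + labels[:, None]    # feature vectors phi_n, class-dependent mean

    N_k = T.sum(axis=0)                                # class counts
    mu = (T.T @ Phi) / N_k[:, None]                    # Eq. (8): mu_k = (1/N_k) sum_n t_nk phi_n

    # Per-class covariances S_k, Eq. (10), combined with weights N_k/N as in Eq. (9).
    Sigma = np.zeros((D, D))
    for k in range(K):
        R = Phi - mu[k]                                # phi_n - mu_k for every n
        S_k = (R.T @ (T[:, k:k + 1] * R)) / N_k[k]     # only class-k points contribute
        Sigma += (N_k[k] / N) * S_k

    print(mu)
    print(Sigma)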