闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Semester 1, 2023

Assignment 1

COMP4670/8600: Statistical Machine Learning

Release Date. 27th Feburary 2023.

Due Date. 27th March 2023 at 1200 AEDT.

Maximum credit. 100 Marks for COMP4670 and 120 Marks for COMP8600.

For submission, we are using Ed. Check under the submission tab and make sure to follow instructions on what to submit. Although using Ed for submission, we recommend that you work on the assignment locally on your computer as we will not be providing display .ipynb on Ed.

If you want to submit in LATEX, we have a general template which you can use: https://quicklink.anu.edu.au/a4me.

In addition to the ﬁles which you need to submit, we have provided an display .ipynb Jupter Notebook which you can use to debug the functions that you implement. A correct implementation should run the

notebook from start to ﬁnish without any errors. DO NOT change any ﬁles in framework/ .. Do not submit:

● The data/ . folder;

● The framework/ . folder;

● The Jupyter Notebook display .ipynb.

Grading is diﬀerent for COMP4670 and COMP8600 students. Grades will be calculated as a percentage out of 100 or 120. Questions which are required for COMP8600 students are marked in their titles. These are optional for COMP4670 students. For COMP4670, their ﬁnal grade will be taken as the maximum percentage of only COMP4670 questions vs all questions.

Installation can be done by using requirements .txt provided. For a fresh install, create a new Python environment (e .g ., via Conda) and then run pip install -r requirements .txt in said environment:

$ conda create -n sml

$ conda activate sml

$ pip install -r requirements .txt

Collaboration, you are free to discuss the material in the assignment with your classmates. However, every proof and code submitted must be written by yourself without assistant. You should be able to explain your answers if requested.

Regrade requests will be processed via a Microsoft Form. This will be released closer to the due date. Other notes:

● When writing proofs, use the equation numbers when referring the equations in the assignment, i. e ., “Through Equation (3.1) we can show ...”;

● You are not required to complete the assignment linearly, you may want to solve which-ever ques- tions you can regardless of order of appearance;

● If you have trouble with proving a statement, try reducing dimensionality of the problem ﬁrst;

● Code is only graded on correctness of functions implemented, not the output of display .ipynb.

Section 1: Bayesian Thinking (5 Total Points)

A historic grouping of machine learning methods comes in the form of non-Bayesian vs. Bayesian meth- ods. You will notice that the ﬁrst half of the course (and much of the Bishop’s textbook) follows a pattern of introducing a non-Bayesian model / algorithm and subsequently extendeding it to a Bayesian counter-part, e .g ., Logistic Regression to Bayesian Logistic Regression; or Kernel Regression to Gaussian Processes. Although much of the course focuses on mathematical detail, for this ﬁrst question we will consider some of the “whys” of using Bayesian methods in the ﬁrst place.

Figure 1: From http://yann.lecun.com/ex/fun/index.html#allyourbayes.

Let’s ﬁrst refresh on the application of Bayes’ Rule to models and data in ML. Given a training dataset D and a model θ, the posterior distribution is deﬁned as

“likelihood” “prior”

4尸 4尸

尸 (尸 4 Pr(D)

“posterior” 尸 (尸 4

“data”

On the right-hand side, Pr(D) is ignored and interpreted as a normalization constant for the posterior. Bayes’ Rule incorporates some amount of prior knowledge into a model and we can consider using maxi- mizing the posterior rather than the likelihood in parameter ﬁtting. With this interpretation in mind, we should think about when a non-Bayesian method and a Bayesian method are “equivalent.”

Question 1.1: Am I Bayes?

Consider using two methods—maximum likelihood (MLE, non-Bayesian) and maximum a pos- teriori (MAP, Bayesian)—to estimate the parameters of some probability distribution from the same dataset and parameter space. When would the resulting estimates be equal?

Finding the MAP estimate consists in solving the optimization problem

max Pr(θ | D), (1.2)

二eΘ

where Θ is the set of all possible values for θ . Sometimes, ﬁnding or approximating the solution to Problem (1.2) is straightforward. Unfortunately, most of the time it is not. For the rest of this question, we will make the following assumption.

Assumption: For the problem max二eΘ Pr(θ | D), we assume there exists no method for ﬁnding or numerically approximating (with high enough precision) its optimal solution. However, we have access to N > 0 i.i.d. samples {θi ~ Pr(. | D)} and their corresponding probabilities Pr(θi | D), Ai = 1, . . . , N .

You might notice from the Bayesian Regression lecture that, by the properties of Gaussian distributions,

we can ﬁnd the MAP estimator for θ if sampling from the posterior Pr(θ | D) is possible. That is, assuming that (i) the posterior is Gaussian (*) and (ii) we have N i.i.d. samples from Pr(θ | D), we have

arg max Pr(θ | D) = z [θ] s θi .

i=1 (1.3)

The sample mean on the right-hand side provides a convenient method of approximating the “best” model.

Now consider Pr(θ | D) is not Gaussian. A tantalizing question emerges: When is it suitable to approxi- mate the MAP estimator using a sample mean?

Question 1.2: Approximating the MAP (2 Points) Consider a posterior distribution Pr(θ | D) supported on Θ and N i.i.d. samples {θi e Θ} from it. Deﬁne the sample mean as θi .

θi to approximate

distribution for which

Even in the general case where (*) does not hold, with access to both the samples θi and probabil- ities Pr(θi | D), we can approximate the optimal solution simply by ﬁnding the sample maximum: arg maxi=1之]]]之N Pr(θi | D).

... But should we?

One topic in machine learning which has recently garnered attention is conformal prediction.1 The rise in the method’s popularity is thanks to its ability to characterize the uncertainty of predictions through a conﬁdence interval. Of course, a Bayesian model should naturally allow us to quantify the uncertainty of a “best” model as we are working with distributions.

Question 1.3: Being Uncertain

For the sample mean N(1)

θi and sample maximum arg maxi=1之]]]之N Pr(θi | D), discuss how we might (or might not be able to) quantify the uncertainty of the parameter estimate θ given a ﬁxed set of N samples. In other words, is it possible to establish “error bars” for these estimates?

Remark. On a more philosophical note, given a posterior Pr(θ | D), we may want think about what

constitutes a “best” model θ 农 . Do we want one that maximizes the posterior, one equal to the average model, or maybe something else like the median? In diﬀerent scenarios, this deﬁnition of “best” can be radically diﬀerent.

Section 2: Exponential Families (45/65 Total Points) For this question, we will explore a uniﬁed family of distributions: the exponential family (not to be confused with an exponential distribution). The exponential family provides a generalization many com-

monly used distribution (Gaussian distribution, Beta distributions, Binomial distributions, etc .) and has many interesting properties which are used throughout ML.

First, let us deﬁne the exponential family distributions we will be considering in this question.

Deﬁnition 1 (Exponential Family2). Given a function n : R。 → Rm , we denote an n-dimensional expo- nential family distribution as Exp(n, u), where u e p c Rm designates the m-dimensional parameters of the distribution within an exponential family3. The corresponding densities of the distributions are given

q(X | u) = exp(uT n(X) _ ψ(u)), (2.1)

where

ψ(u) = log exp(uT n(X)) dX . (2.2)

The function n is called the suﬃcient statistics of the exponential family and the function ψ is called

the log-partition function of the exponential family. Sometimes, it will be convenient to deﬁne Z(u) = Rn exp(uT n(X)) dX = exp(ψ(u)) as the normalizer of the distribution.

Remark. We note that the deﬁnite integral in Eq. (2.2) is over the domain of X . In this assignment, the domain of integration will simply be over R。; which can be simpliﬁed to integrating over (_o, o) for diﬀerent coordinates.

Remark. Notice that Deﬁnition 1 represent a broad set of special cases of the deﬁnition given in the Bishop textbook Eq. (2.194), with h(x) = 1 and g(u) = exp(_ψ(u)). For notational convenience in the tasks speciﬁed below, we use this deﬁnition.

At this point, it is useful to verify that the Exponential family indeed generalizes distributions we are familiar with.

Question 2.1: Gaussian R.V. as an Exponential Family (10 Points)

Let n(z) = (z, z2 ) and u = (µ/σ2 , _1/2σ ) deﬁne an (n2 = 1)-dimensional exponential family distribution Exp(n, u) over R.

Verify that Exp(n, u) is equivalent to the 1-dimensional Gaussian distribution with mean µ and variance σ 2 .

What is the distribution’s normalizer Z(u)?

One interesting aspect of Exponential families is its connections to loss functions in ML one typically ﬁnds. The simplest connection comes from ‘equating’ a loss function minimization procedure to the MLE of an exponential family, i. e .,

arg min “A loss function e” = arg max log q(X | u). (2.3)

Of course, to do this, one needs to ﬁnd the correct matching between losses and exponential family. To do so, let us introduce some data.

Assumption: Let us assume that we have data D = {(xi , yi )} with features xi e R。 and corresponding labels / targets yi e R.

Question 2.2: Exponential Families and Losses (5 Points) Let f (x; w) = wT x with weights w e R。. Let us deﬁne a conditional random variable using a 1-dimensional exponential family, such that

(2.4)

where u(y) = (y, y2 ) and η = η(x) = (f (x; w), _1/2).

That is, as per Deﬁnition 1, we are deﬁning the following conditional p.d.f. such that:

p(y | x; w) = q(y | η(x)) = exp(ηT (x)u(y) _ ψ(η(x))).

Show that

N N

arg max log p(yi | xi ; w) = arg min (yi _ f (xi ; w))2 .

peRn

Remark. One might notice that in Question 2.2, the function f and its corresponding weights w does not eﬀect the maximization / minimization. Thus such an observation can be identically made when consider f as a neural network whilst optimizing over weights W1 , . . ..

In the general case, optimizing either Eq. (2.6) analytically (‘by hand’) can be diﬃcult, e .g ., when f is a neural network. As such, it is useful to consider the gradients of the log-likelihood function.

Question 2.3: Parameterizing an Exponential Families (15 Points) Let f (x; θ) be some mapping from feature x parameterized by θ . Let us deﬁne a conditional random variable using an exponential family, such that

y | x ~ Exp(u, η), (2.7)

where η = f (x; θ) for any function u : R。 → Rm (we hide the dependence of x in η to reduce notation). We also deﬁne p(y | x; θ) similar to Eq. (2.5).

Show that

1. Show that

∂ψ(η) R

R = z [u(y)].

R9 =f (m;u)

2. Show that

= ╱u(y) _ p(|m)[u(y\ )]、T ╱ 、 .

Eq. (2.9) allows us to use gradient ascent (confer, gradient descent) to iteratively update the parameters of our model. Of course, we can practically use Eq. (2.9) by using auto-diﬀerentiation packages in Python.

However, one will need to do some tricks to make the computation eﬃcient.

Question 2.4: Implementing Gradient Ascent

In ‘exponential_family.py ’ complete the following functions:

● grad_psi_wrt_eta (Eq. (2.8));

● log_likelihood_grad (Eq. (2.9));

● update;

and thus train a neural network in display .ipynb for the MNIST dataset. You will need to use from torch .autograd .functional import jacobian, vjp.

Hint: You may want take a look at framework/exp_fam_model .py (or the copied comment in exponential_family.py).

You will not be graded on optimality of the neural network’s performance, only the correctness of implemented functions.

In the next few questions, we will be considering basic fairness properties of the exponential family distributions. There are many diﬀerent criteria for fairness in ML, and rightfully so as fairness needs to take into account the surrounding social context. We will be considering a simple form of data fairness, where we will measure the representation of diﬀerent sensitive groups (e .g ., gender, age, race, etc .).

In the sequel, we will consider n-dimensional distribution, where the ﬁrst (n _ 1)-dimensions consists of non-sensitive features x e x (those which do not aﬀects fairness) and the ﬁnal dimension consisting of a sensitive feature s e s .

Assumption: We will assume in the following questions that s is a ﬁnite domain.

Deﬁnition 2. Given a p.d.f. p(x, s), its representation rate is given by:

RR(p) = min RR(p, s, s\ ) e [0, 1], (2.10)

s之s\ es

where RR(p, s, s\ ) = p(s)/p(s\ ) (with p(s) as the marginal distribution of p(x, s)).

Representation rate simply measure how well the balance of sensitive features are being represented by p(x, s).

Question 2.5 [COMP8600]: Verifying Representation Rate (5 Points)

We have maximal fairness when RR(p) = 1 and minimal fairness when RR(p) = 0.

Argue why this deﬁnition of fairness make sense in terms of marginal probabilities of demographic information p(s).

Thus to study the representation rate of (separable) exponential families, we should consider what the

sensitive marginal distributions look like.

Question 2.6 [COMP8600]: Sensitive Marginal of Exponential Families (10 Points)

Suppose that we have a separable Exponential Family n-distribution such that the natural pa- rameters η e Rm and suﬃcient statistics u : x x s → Rm are separable as:

and η 1 e Rm1 ;

and u1 : x x s → Rm1 ;

such that its p.d.f. q(x, s) is given by

exp(ηT u(x, s))

Z(η) .

Prove that

q(s) = q0 (s) . . q0 |s)[exp(η1(T)u1 (x, s))],

where q0 and Z0 (η0 ) are deﬁned via Exp(u0 , η0 ).

Now we can use this result to prove a bound on the fairness of a (separable) exponential family.

Question 2.7 [COMP8600]: Representation Rate of Exponential Families

Given the same setup as Question 2.6, prove that

RR(q) > RR(q0 ) . minses zq0 (m|s)[exp(η1(T)u1 (x, s))]

maxses zq0 (m|s) [exp(η1(T)u1 (x, s))] .

Section 3: Classiﬁcation with Rejection (50 Total Points) In the majority of the course, we will be learning about how do we teach an algorithm to make a certain prediction. However, should we always make an algorithm make prediction? The Classiﬁcation with Rejection (CwR) framework allows for an algorithm to ‘abstain’ from making a prediction. But before that, we will ﬁrst review some basic in loss function theory. In classiﬁcation, in essence our goal is to simple assign the correct labels to a set of input features. This ‘correctness’ can be simply quantiﬁed by the 0- 1- loss function: given an input x e x and a label y e y = {1 . . . K}, the correctness

of a hypothesis f : x → y can be evaluated by:

e01 (f (x), y) = ,0(1)

if f (x) y

otherwise

(3.1)

One can verify, that this loss function is simply assigning incorrect prediction a penalty of 1.

As per the lectures, we will be consider the Risk Minimization of these losses. In particular, given a distribution of data points p(x, y) we will denote:

R01 [f] = z[e01 (f (x), y)]. (3.2)

Here we note that we use the subscript to designate what type of loss we are calculating the risk for.

To modify the typical notion of classiﬁcation to add in the option of rejection, we consider a modiﬁed version of the 0-1-loss function. First, we modify the class of functions we are interested in to those that can predict an additional rejection class, denoted as ⑩ . That is, we consider functions f : x → y n {⑩}. This leads us to the 0- 1- c- loss function, a modiﬁed 0-1-loss function which accounts for rejection:

e01e(f (x), y) = ,e(c)01 (f (x), y)

if f (x) = ⑩

otherwise ,

(3.3)

where c e [0, 1/2). We deﬁne the corresponding risk with “01c” subscripts, i. e ., R01e[f].

We will refer to this approach of learning a classiﬁer with rejection as the general approach, i. e ., learning functions of the type f : x → y n {⑩}.

We note that although the classiﬁer f now has the option to reject, the data we are evaluating the loss function is still the same.

Question 3.1: Verify Preferable Rejection (5 Points) Verify that setting c e [0, 1/2) in e01e ensures that choosing a rejection is preferable if there is less than 1/2 chance of f being correct.

Hint: Fix an input x and consider two classiﬁers, one which reject and on which does not. How do the expected losses compare?

Assumption: In sequel, we will assume c e [0, 1/2).

One question we can ask, is what is the optimal classiﬁer in the CwR setting, given perfect information.

Question 3.2: Optimal General Rejection Classiﬁer

Given the true distribution of data p(x, y), with class posterior function

ηy (x) p(y | x) for y e y .

Prove that

f农 = arg min R01e[f],

f农 (x) =

Hint: Consider the best response to a ﬁxed x. When is a ⑩ the best response? Otherwise, what is best response?

More practically, obtaining the optimal CwR classiﬁer diﬃcult. As a result, a number of diﬀerent methods of altering the learning task has been consider. One of these is called the classiﬁer-rejector approach. The intuition of this approach is simple: we learn two separate classiﬁers where one determines rejection; while the other is a regular classiﬁer,

fer (x; h, r) = ,x)

if r(x) < 0

otherwise ,

(3.7)

where r : x → R and h : x → y . The learned function r(x) acts as a proxy for the ‘conﬁdence’ of h(x)’s prediction. One can verify that

e01e(fer (x; h, r), y) = e01er (h(x), r(x), y), (3.8)

e01er (h(x), r(x), y) = ,e(c)01 (h(x), y)

if r(x) < 0

otherwise.

(3.9)

Question 3.3: Optimal Classiﬁer-Rejection Approach

Given

h农 (x) = arg max ηy (x);

yey

r农 (x) = max ηy (x) _ (1 _ c).

yey

For the corresponding risk R01er [h农 , r农] z[e01er(h农 (x), r农 (x), y)], prove that

R01er [h农 , r农] = R01e [f农],

thus proving that h农 , r农 are optimal.

When would Eq. (3.12) hold for arbitrary h, r, f?

From Question 3.2 and 3.3, we have shown that there are two diﬀerent ways of minimizing e01e: a general method, and one which requires two learned classiﬁers. There are many pros and cons for why one would pick between these two. We will explore the sample complexity of calculating the risks. First, let us deﬁne the empirical risk of each of the losses:

Rˆ01e[f] = z e01e(f (xi ), yi )