
J3AS 30553 Level H

Applied Statistics (J-BJI)

Mathematics Programmes

J-BJI Semester 2 Examinations 2021-22

1. (a) Explain one similarity and one difference between classification and regression problems. Use one or two sentences for each. [2]

(b) The boiling point of water (in degrees Fahrenheit) and the barometric pressure (in inches of mercury) at 8 locations in the Swiss Alps were measured and stored in the dataset forbes in R. A researcher makes two attempts to apply simple linear regression to this dataset (lm1 and lm2), with the aim of predicting pressure from the boiling point (so that the altitude can be estimated). Here is an extract from their R output, where bp gives the boiling point and pres the atmospheric pressure.

> forbes
   bp pres
1 204   25
2 198   22
3 212   30
4 201   24
5 205   27
6 211   29
7 200   23
8 199   23
> lm1 <- lm(bp ~ pres, data = forbes)
> lm2 <- lm(pres ~ bp, data = forbes)

(i) Which of the two models, lm1 or lm2, is the most sensible for this investigation and why? [1]

(ii) Proceeding with the most sensible of the two models, the mathematical presentation of the model is given by $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, 2, \ldots, 8$. Clearly define all the notation in the equation, state any underlying assumptions, and identify the unknown parameters. [3]

(iii) The researcher computed the ordinary least squares estimate of $\beta = (\beta_0, \beta_1)^T$ by minimising the residual sum of squares (RSS), $\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$. Specify the design matrix $X$ and the output vector $y$ of the forbes data. Find the least squares estimate $\hat{\beta}$ using $X$ and $y$. [5]

(iv) Explain in words what the slope estimate represents. [2]

(v) The researcher fits the most sensible of the two models (named lmfinal) in R; an edited version of their output appears below. Test the hypothesis $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$. Write down the numerical value of the corresponding t-statistic and the relevant degrees of freedom. Based on the corresponding p-value, do you accept $H_0$ or not?

> summary(lmfinal)
Call:
lm(formula = y ~ x, data = forbes)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)             7.21182          2.01e-05 ***
x                       0.03538          4.48e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(vi) Calculate a 95% confidence interval for $\beta_1$. Is that what you expected given the results of the hypothesis test? Explain why. Some additional code and output from R is given below to help you with this task.

> qt(0.95,6)
[1] 1.94318
> qt(0.975,6)
[1] 2.446912
> qt(0.99,6)
[1] 3.142668
> qt(0.995,6)
[1] 3.707428
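
The quantities discussed in parts (iii) to (vi) can be cross-checked numerically. The following is a minimal sketch, not a model answer: it assumes the sensible model is the one that predicts pres from bp (as the stated aim suggests), uses the closed-form simple-regression formulas rather than the matrix form, and takes the critical value from the qt(0.975,6) output above. The function name simple_ols is purely illustrative.

```python
import math

# forbes data as printed in the question
bp = [204, 198, 212, 201, 205, 211, 200, 199]   # boiling point (deg F)
pres = [25, 22, 30, 24, 27, 29, 23, 23]         # pressure (inches of mercury)


def simple_ols(x, y):
    """Return (intercept, slope, slope standard error) for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)            # residual variance on n - 2 df
    se_b1 = math.sqrt(sigma2_hat / sxx)   # standard error of the slope
    return b0, b1, se_b1


# Assumption: response is pres, predictor is bp
b0, b1, se_b1 = simple_ols(bp, pres)
t_stat = b1 / se_b1                       # t-statistic for H0: beta1 = 0
t_crit = 2.446912                         # qt(0.975, 6) from the R output
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print("intercept:", b0)
print("slope:", b1, "std. error:", se_b1)
print("t statistic:", t_stat)
print("95% CI for slope:", ci)
```

The slope standard error computed this way agrees with the 0.03538 shown in the edited summary output, which supports reading that column as the Std. Error column.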

(c) Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ represent $n$ independent observation pairs, where $n \geq 3$. A student considers two linear regression models:

Model 1: $y_i = a_0 + a_1 x_i + \varepsilon_i$, $i = 1, 2, \ldots, n$,

Model 2: $x_i = \beta_0 + \beta_1 y_i + \varepsilon_i$, $i = 1, 2, \ldots, n$,

where the $\varepsilon_i$ are independent and identically distributed random variables, each having a $N(0, \sigma^2)$ distribution.

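The student successfully fitted each model to the same set of observation pairs $\{(x_i, y_i)\}_{i=1}^{n}$ using the least squares method. Suppose that $\mu_y = 2\mu_x$ and $\sigma_y^2 = 4\sigma_x^2$, where $\mu_x$ and $\sigma_x^2$ are the sample mean and sample variance of $\{x_i\}_{i=1}^{n}$, and $\mu_y$ and $\sigma_y^2$ are the sample mean and sample variance of $\{y_i\}_{i=1}^{n}$.

(i) What is the least squares estimate $\hat{\beta}_1$ of $\beta_1$? Express your solution in terms of the least squares estimate $\hat{a}_1$ of $a_1$. [3]

(ii) What is the least squares estimate $\hat{\beta}_0$ of $\beta_0$? Express your solution in terms of the least squares estimate $\hat{a}_0$ of $a_0$. [2]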

(d) Suppose we have a linear regression problem with $P$ features and $n$ observations. We estimate the coefficients in the linear regression model by minimizing the RSS for the first $p$ features:

$\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$,

where $p \leq P$.

(i) As we increase $p$ from 1 to $P$, describe how the variance of the statistical learning method will typically change with $p$. Briefly justify your answer.

(ii) As we increase $p$ from 1 to $P$, describe how the test RSS will typically change with $p$. Briefly justify your answer.

2. (a) A medical test for asthma is the peak flow test, in which a patient blows as hard as they can into a small handheld device called a peak flow meter. The peak flow test score for each patient is displayed on the side of their peak flow meter and is given in litres of air breathed out per minute (l/min). Suppose that two researchers are hired by ABC clinic to help predict whether a child has asthma or not, “yes” or “no” respectively, based on their peak flow test score, $X$. From previous studies of the disease, they know that the test score in children without asthma is normally distributed with mean 180 (l/min) and standard deviation 18 (l/min). In addition, the test score in children with asthma is normally distributed with mean 130 (l/min) and standard deviation 14 (l/min). Finally, the prevalence of asthma in children is 10%, i.e. 10% of the child population have asthma.

(i) Based on these findings, the first researcher decides to use Quadratic Discriminant Analysis (QDA) to classify the children in terms of the output variable $Y \in \{\text{yes}, \text{no}\}$. What is the probability that child A has asthma given that they get a score of 145 (l/min) when they blow into the peak flow meter? Provide your answer to 2 decimal places.

(ii) Using the QDA classifier proposed by the first researcher, provide an expression for the decision boundary given that the classification rule is to classify to the class with the highest posterior probability. Show your work, starting from the posterior probabilities. State clearly the regions where the classifier will predict “yes” and where it will predict “no”. What is the predicted output for child A? Hint: The roots of a quadratic equation of the form $ax^2 + bx + c = 0$, where $a, b, c$ are constants with $a \neq 0$, are given by $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$.

(iii) The second researcher decides to use an alternative distribution for the test scores of children. More specifically, they assume that the test score in children with asthma is exponentially distributed with mean 130 (l/min), while the test score in those without asthma is exponentially distributed with mean 180 (l/min). The probability density function of an exponential distribution with mean $1/\lambda$ is given by

$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \geq 0, \\ 0, & \text{otherwise}. \end{cases}$

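Show that the posterior probability that a child has asthma given their test score $x$ can be written in the form $\sigma(g(x))$, where $\sigma$ is the logistic function and $g$ is another function. [3]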

(iv) Use the equation derived in part 2(a)(iii) to calculate the probability that child A has asthma given that they get a score of 145 (l/min) when they blow into the peak flow meter. Provide your answer to 2 decimal places. [1]

(v) Using the classifier proposed by the second researcher, provide an expression for the decision boundary given that the classification rule is to classify to the class with the highest posterior probability. As before, show your work starting from the posterior probabilities and state clearly the regions where the classifier will predict “yes” and where it will predict “no”. What is the predicted output for child A? [4]

(vi) Comment on the usefulness of the classifier proposed by the second researcher. [2]
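
The posterior probabilities in parts (i), (iii) and (iv) all follow from Bayes' rule with the stated priors and class-conditional densities. The sketch below is an illustration under that reading, not a model answer; the helper names (normal_pdf, exponential_pdf, posterior_yes, g) are hypothetical. It also checks numerically that, for the exponential model, the posterior equals the logistic function applied to g(x) = log(0.1/0.9) + log(180/130) - (1/130 - 1/180) x, which is one way of writing the log posterior odds.

```python
import math

PRIOR_YES = 0.10   # prevalence of asthma, from the question
PRIOR_NO = 0.90


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def exponential_pdf(x, mean):
    """Density of an exponential distribution with the given mean (rate 1/mean)."""
    rate = 1.0 / mean
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def posterior_yes(x, pdf_yes, pdf_no):
    """P(Y = yes | X = x) by Bayes' rule."""
    num = PRIOR_YES * pdf_yes(x)
    return num / (num + PRIOR_NO * pdf_no(x))


def sigmoid(u):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-u))


def g(x):
    """Log posterior odds of 'yes' under the exponential model (linear in x)."""
    return (math.log(PRIOR_YES / PRIOR_NO)
            + math.log(180.0 / 130.0)
            - (1.0 / 130.0 - 1.0 / 180.0) * x)


score = 145.0
qda_post = posterior_yes(score,
                         lambda x: normal_pdf(x, 130.0, 14.0),
                         lambda x: normal_pdf(x, 180.0, 18.0))
exp_post = posterior_yes(score,
                         lambda x: exponential_pdf(x, 130.0),
                         lambda x: exponential_pdf(x, 180.0))

print("QDA posterior:", round(qda_post, 2))
print("Exponential-model posterior:", round(exp_post, 2))
print("sigmoid(g(x)):", round(sigmoid(g(score)), 2))
```

Because g is linear and decreasing in x, the exponential model's posterior is monotone in the score, which is relevant when thinking about where its decision boundary falls.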

(b) The table below shows a training data set with $n = 6$ observations of two inputs, $x_1$ and $x_2$, and one qualitative output $y$, with $y \in \{0, 1\}$:

i	1	2	3	4	5	6
x1	−2	2	0	−4	0	4
x2	−1	−1	2	4	−6	4
y	1	1	1	0	0	0

where i is the data point index.

(i) Illustrate the data points in a graph with $x_1$ and $x_2$ on the two axes. Represent the points belonging to class $y = 0$ with a circle and those belonging to class $y = 1$ with a cross. Further, annotate the data points with their data point indices.

(ii) A student has trained four different classifiers with the above dataset using (I) logistic regression, (II) quadratic discriminant analysis (QDA), (III) K-nearest neighbour (K-NN) with K = 1 and (IV) K-NN with K = 3. For the K-NN classifier, the student used the Euclidean distance. She computed the misclassification error on the training data for each of the four classifiers and got 50% for one classifier, 33% for another classifier and 0% for the remaining two classifiers. Which misclassification error corresponds to which classifier? Hint: For some classifiers a short argument is enough and you don’t need to calculate the exact misclassification error.
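
For the two K-NN classifiers, the training error can be computed directly from the table. The sketch below is an illustration only; it assumes that each training point counts as one of its own neighbours when the training error is evaluated (the usual convention), and the function names knn_predict and training_error are hypothetical.

```python
import math
from collections import Counter

# Training data from the table: (x1, x2) and class label y
points = [(-2, -1), (2, -1), (0, 2), (-4, 4), (0, -6), (4, 4)]
labels = [1, 1, 1, 0, 0, 0]


def knn_predict(query, k):
    """Majority vote among the k training points closest to query (Euclidean)."""
    order = sorted(range(len(points)),
                   key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def training_error(k):
    """Fraction of training points misclassified by K-NN with the given k."""
    wrong = sum(knn_predict(p, k) != y for p, y in zip(points, labels))
    return wrong / len(points)


for k in (1, 3):
    print("K =", k, "training error =", training_error(k))
```

With K = 1 every point is its own nearest neighbour, so the training error is zero by construction; with K = 3 the two nearest other points decide the vote for each point.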

3. (a) In this question we consider data of the generic form $(x_i^T, y_i) \in \mathbb{R}^p \times \mathbb{R}$, $i = 1, \ldots, n$, i.e. we have $n$ observations and each observation $(x_i^T, y_i)$ comprises an $\mathbb{R}^p$-valued predictor $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T \in \mathbb{R}^p$ and a scalar-valued response $y_i \in \mathbb{R}$.

We are modeling this data using a linear regression of the form

$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma^2)$, $i = 1, \ldots, n$,

where the error terms are assumed to be independent and normally distributed with standard deviation $\sigma > 0$.

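Consider the following estimates of the regression coefficients $\beta = (\beta_0, \ldots, \beta_p)^T$.

$\hat{\beta}^A := \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \right\}$, (Method A)

$\hat{\beta}^B := \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$, (Method B)

$\hat{\beta}^C := \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$. (Method C)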

(ii) What basic condition should the scalar λ satisfy in order for Method B and Method C to be sensible regression methods? [2]

(iii) If $n = 100$ and $p = 200$, which of the three methods would you advise against using? [2]

(iv) Provide an explicit expression for $\hat{\beta}^A$ in terms of the corresponding response vector $y$ and design matrix $X$. [2]

(v) Consider a dataset with n = 100 and p = 6. Each of the regression methods was applied to this dataset. For Method B and Method C the regularization parameter was set to λ = 1.

(1) Between Method A and Method B which method do you expect to have higher bias?

(2) Between Method A and Method B which method do you expect to have higher variance?

(3) The following estimates of the regression coefficients were obtained:

• (2.01, −1.95, −0.42, 5.53, 0.30, 0.52, 5.17)

• (1.99, −2.64, −0.69, 8.33, 0.11, 2.09, 7.82)

• (2.03, 0, 0, 3.31, 0, 0, 2.148)

Match each estimate to the corresponding regression method. Justify your guess!

(4) The following table shows the estimates of the test mean squared error obtained by 10-fold Cross-Validation (denoted as CV(10)) for each method.

	CV(10)
Method A	1.90
Method B	1.06
Method C	1.23

If we aim at obtaining a model that generalizes well to new observations, does the value of CV(10) provide information on which method out of Method A, Method B and Method C we should use to parametrize the linear regression model? If yes, which method would you select? Otherwise, explain why CV(10) is not a useful quantity in this context.

(5) The following table shows the estimates of the Akaike Information Criterion (AIC) for each method.

	AIC
Method A	108.04
Method B	222.17
Method C	602.08

If we again aim at obtaining a model that generalizes well to new observations, does the value of the AIC provide information on which method out of Method A, Method B and Method C we should use to parametrize the linear regression model? If yes, which method would you select? Otherwise, explain why AIC is not a useful quantity in this context.

(vi) Let p = 200 and assume that the true underlying model of the data is of the form [3]

$y_i = \beta_0 + \sum_{j} \beta_j x_{ij} + \varepsilon_i$, $i = 1, \ldots, n$,