
Introduction to Data Science and Statistical Modelling Assignment

Please make sure that the submitted work is your own. This is NOT a group assignment, so approaches and solutions should not be discussed with other students. Plagiarism and collusion with other students are examples of academic misconduct and will be reported. More information on academic honesty can be found here.

The submitted work should be your own work! The questions apart from Q1(c), Q1(d), Q3(c), Q3(d) and Q4 are theoretical exercises, and should be solved using results we covered in lectures. Make sure you justify each step of the theoretical reasoning by clearly stating the theorem or property you are using (marks will be awarded for these). Also make sure that you add comments to each section of your R code, explaining what you are doing. All the relevant R output (computed probabilities, plots, etc.) should be included in your submission! Submit a single PDF document containing your R code, your R output and your solutions to the theoretical exercises.

1. Assume that a new Conservative Party leadership election has been triggered in the UK at a time when there are 361 Conservative MPs in Parliament. Two of these MPs, M and B, join the leadership contest, where the aim is to get the majority support of the remaining 359 Conservative MPs. We further assume that on the day the leadership contest is announced, 184 of these MPs support M and the remaining 175 MPs support B in becoming the next party leader. The announcement is followed by an election campaign, during which MPs can decide to change their allegiance. In particular, we know that on any given day, there is a probability of 0.005 that an MP who has been supporting M will become a B supporter by the end of the day, while the probability that an MP who has been supporting B will become an M supporter by the end of the day is 0.004. Each MP makes their decision independently of the other MPs, and independently of the decision they made the day before.
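The daily switching mechanism described above can be sketched in R as follows. This is only an illustration of a single simulated day, not a solution to any part of the question. The variable names and the seed are hypothetical choices, and modelling the number of switchers in each camp as a binomial count relies on the stated independence between MPs.

```r
# Illustrative sketch: one simulated day of allegiance changes.
set.seed(1)                 # arbitrary seed, for reproducibility only

n_M <- 184                  # M supporters at the start of the day
n_B <- 175                  # B supporters at the start of the day
p_M_to_B <- 0.005           # daily probability that an M supporter switches to B
p_B_to_M <- 0.004           # daily probability that a B supporter switches to M

# Each MP decides independently, so the number of switchers in each
# camp is a binomial count.
m_to_b <- rbinom(1, size = n_M, prob = p_M_to_B)
b_to_m <- rbinom(1, size = n_B, prob = p_B_to_M)

# Update both camps; the total number of voting MPs stays at 359.
n_M_next <- n_M - m_to_b + b_to_m
n_B_next <- n_B - b_to_m + m_to_b
c(M = n_M_next, B = n_B_next)
```

Repeating this update day after day, starting each day from the previous day's counts, gives one simulated campaign.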

(a) [4 marks] Introduce the following random variables:

$$X_i^{(1)} = \begin{cases} 1, & \text{B supporter number } i \text{ still supports B at the end of day 1,} \\ 0, & \text{B supporter number } i \text{ changes to an M supporter at the end of day 1,} \end{cases}$$

for i = 1,...,175; and

$$X_i^{(2)} = \begin{cases} 1, & \text{M supporter number } i \text{ changes to a B supporter at the end of day 1,} \\ 0, & \text{M supporter number } i \text{ still supports M at the end of day 1,} \end{cases}$$

for i = 1,...,184.

Using these random variables express the number of B supporters at the end of the first day, then use your formula to find the expected number of B supporters at the end of the first day. Justify every step of your argument.

(b) [3 marks] Define random variables $\hat{X}_i^{(1)}$, i = 1,...,175, and $\hat{X}_i^{(2)}$, i = 1,...,184, whose sum gives you the number of M supporters at the end of the first day. What is the expected number of M supporters at the end of the first day?

(c) [6 marks] R: The election campaign is set to last for 2 weeks. This means that each MP would vote according to the allegiance they have at the end of day 14, that is, the candidate they would vote for is the one they are supporting after the first 14 days of the campaign. Using simulation find the probability that in this election B would hold the majority of the votes among the 359 MPs.

(d) [3 marks] R: Now suppose that the election had to be postponed, and with the new date, candidates now have a 60-day campaign period (as opposed to 14 days). Adjust your code from part 1(c) to find the probability that B will win the delayed election. How does this probability compare to the one computed in part 1(c)?

2. Observations Y1,Y2,...,Yn are assumed to be independent and identically distributed samples from a data model following a Rayleigh distribution, with probability density function:

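$$f(y;\theta) = \frac{y}{\theta}\, e^{-y^2/(2\theta)} \quad \text{for } \theta > 0 \text{ and } 0 < y < \infty.$$

The mean of this distribution is

$$\mu = \sqrt{\frac{\pi\theta}{2}},$$

and the variance is

$$\sigma^2 = \frac{(4-\pi)\theta}{2}.$$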

(Note that here π is not a parameter, it is the usual mathematical constant i.e. 3.14...)
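As a quick sanity check of the data model, the sketch below draws Rayleigh samples in R by inverse-transform sampling and compares the sample moments with the standard Rayleigh mean $\sqrt{\pi\theta/2}$ and variance $(4-\pi)\theta/2$. It assumes the density $f(y;\theta) = (y/\theta)e^{-y^2/(2\theta)}$, whose distribution function is $1 - e^{-y^2/(2\theta)}$. The value of theta, the sample size and the seed are hypothetical choices for illustration.

```r
# Illustrative sketch: simulate Rayleigh samples by inverse transform.
set.seed(1)          # arbitrary seed, for reproducibility only
theta <- 2           # hypothetical parameter value
n <- 1e5             # hypothetical sample size

# If U ~ Uniform(0, 1), then sqrt(-2 * theta * log(U)) has distribution
# function 1 - exp(-y^2 / (2 * theta)).
u <- runif(n)
y <- sqrt(-2 * theta * log(u))

# Compare the sample moments with the theoretical values.
c(sample_mean = mean(y), theory_mean = sqrt(pi * theta / 2))
c(sample_var = var(y), theory_var = (4 - pi) * theta / 2)
```

With a sample this large, the two entries in each pair should agree closely.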

(a) [2 marks] Find the method of moments estimator $\tilde{\theta}$ of θ.

(b) [5 marks] Is your estimator $\tilde{\theta}$ unbiased? If not, then suggest an adjustment to this estimator that would make it unbiased and report your final unbiased estimator. Hint: If $E(\tilde{\theta}) = c\theta$, then the estimator $\frac{1}{c}\tilde{\theta}$ is unbiased. Also, remember that we can express second moments using the formula of the variance.

(c) [4 marks] An alternative estimator is $\hat{\theta} = \frac{1}{2n}\sum_{i=1}^{n} Y_i^2$. Is this estimator unbiased? If not, suggest an adjustment that makes it unbiased. See the hints given in part 2(b).

(d) [5 marks] Using the fact that the random variable $X = Y^2$ is exponentially distributed with rate $\frac{1}{2\theta}$, assess whether the estimator $\hat{\theta}$ from part 2(c) is consistent.

(e) [6 marks] We have 150 samples from a Rayleigh distribution with sample mean 3.2. Using an appropriate point estimator of θ, suggest a suitable estimate of the variance, and use this variance estimate to construct an approximate 95% confidence interval for the mean of the distribution. (You can use R to find the relevant quantiles).

3. Consider the data set Y1,Y2,...,Yn that is assumed to have arisen from the data model with probability density function

$$f(y;\theta) = \begin{cases} k(1-y)\,y^{\theta+1}, & 0 < y < 1, \\ 0, & \text{otherwise,} \end{cases}$$

where θ > 0.

(a) [4 marks] Find the constant k that makes the above function a probability density function.

(b) [6 marks] Show that the maximum likelihood estimator $\hat{\theta}$ of θ is given by the solution to the equation:

$$\hat{\theta}^2 \sum_{i=1}^{n} \log(Y_i) + \hat{\theta}\left[5\sum_{i=1}^{n} \log(Y_i) + 2n\right] + 6\sum_{i=1}^{n} \log(Y_i) + 5n = 0.$$

(c) [5 marks] R: Let $y_1, \dots, y_{30}$ below correspond to 30 samples of this distribution:

0.573 0.770 0.652 0.827 0.821 0.789 0.898 0.718 0.382 0.668 0.647 0.477 0.661 0.380 0.870 0.794 0.783 0.732 0.629 0.777 0.600 0.724 0.553 0.693

0.687 0.935 0.494 0.411 0.530 0.478

Use the polyroot function in R to produce a maximum likelihood estimate for θ based on these data.

Hint: polyroot finds the roots of a polynomial. Its argument is the vector of polynomial coefficients in increasing order. For example, to find the roots of the polynomial $p(x) = x^2 + 2x - 3$ we can use

rt <- polyroot(c(-3,2,1))

Even though both roots that you will get are real, polyroot gives these roots in complex form (don’t worry about what this means). You can use the Re() function to extract the real part of complex numbers. That is if the outcome of the polyroot function is stored in the variable rt, then we can use the following to get the desired roots.

rt_real <- Re(rt)
rt_real

## [1] 1 -3

Note that this code lists all the roots of a polynomial. You will have to check which one of these is a local maximum.
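One way such a check might look is sketched below, using only the toy polynomial from the hint. Purely for illustration, it assumes that $p(x) = x^2 + 2x - 3$ is the derivative of some hypothetical function g, so that the roots of p are the stationary points of g and the sign of $g''(x) = 2x + 2$ classifies them. The function name second_deriv is an assumption made for this sketch.

```r
# Toy example only: treat p(x) = x^2 + 2x - 3 as the derivative g'(x)
# of a hypothetical function g, so the roots of p are stationary points of g.
rt <- Re(polyroot(c(-3, 2, 1)))

# g''(x) = p'(x) = 2x + 2; a negative value marks a local maximum.
second_deriv <- function(x) 2 * x + 2
is_max <- second_deriv(rt) < 0

# Keep only the stationary point(s) that are local maxima.
rt[is_max]
```

Here the second derivative is positive at the root 1 and negative at the root -3, so only the latter is kept as a local maximum of g.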

(d) [3 marks] R: Produce a plot of the fitted probability density function using the estimate of θ obtained in part 3(c).

4. The file ‘ozone.csv’, available on the course ELE page, contains information on ozone levels recorded over 111 days from May to September 1973 in New York. The variables measured were:

ozone          Ozone levels, in parts per billion (ppb)

radiation      in langleys

temperature    in Fahrenheit

wind           in miles per hour (mph)

Read these data into R and answer the following questions.

(a) [6 marks] Carry out exploratory data analysis, and produce a matrix scatterplot of the dataset. Comment on your findings and what these plots suggest about the likely relationships between the response variable (ozone) and the other variables.

(b) [9 marks] Fit a multiple regression of ozone as the response variable, against radiation, temperature and wind as the explanatory variables (use all three, when fitting the model). Comment on the summary of the model. What do these coefficients suggest about the relationship between ozone and the other variables? Are these findings consistent with your earlier descriptive plots? Also include suitable residual plots, commenting as appropriate.

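(c) [10 marks] A colleague suggests you implement the following model:

$$\log(\text{ozone}_i) = \beta_0 + \beta_1 \log(\text{radiation}_i) + \beta_2 \log(\text{temperature}_i) + \beta_3 \log(\text{wind}_i) + \epsilon_i, \quad \text{where } \epsilon_i \sim N(0, \sigma^2).$$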

Fit this new model to the data to obtain estimates for the regression coefficients. Produce a plot of the residuals against the fitted values, and a Q-Q plot of the residuals. Comment on the outputs from the modelling (comparing it to the previously fitted model), paying particular attention to the interpretation of the coefficients. Express the impact of the explanatory variables on the ozone levels, with the latter expressed on the original (untransformed) scale.