MTHM502 Introduction to Data Science and Statistical Modelling

Assignment

Please make sure that the submitted work is your own. This is NOT a group assignment, so approaches and solutions should not be discussed with other students. Plagiarism and collusion with other students are examples of academic misconduct and will be reported. More information on academic honesty can be found here.

1. The colour of the human eye is determined by a pair of genes. If both of these genes code the colour blue, then the given person will have blue eyes. If at least one of the genes codes the colour brown, then the person will have brown eyes. That is, if we denote by ‘A’ the gene coding the colour brown, and by ‘a’ the gene coding the colour blue, then we have the following:

Gene-pair    Eye colour
AA           brown
Aa           brown
aA           brown
aa           blue

A child inherits one gene from each of their parents. That is, one gene is chosen randomly (with equal probability) from the gene-pair of their father, and one gene is chosen randomly (with equal probability) from the gene-pair of their mother. Below are two examples, where the entries of the tables show the possible gene-pairs of the children. Note that each of these gene-pairs has equal probability.

Example 1:

                 Father’s genes
                   A        a
Mother’s    A     AA       Aa
genes       a     aA       aa

Example 2:

                 Father’s genes
                   A        A
Mother’s    A     AA       AA
genes       a     aA       aA
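The inheritance tables above can also be generated mechanically. The following is a minimal R sketch, assuming a hypothetical helper `child_pairs()` (not part of the assignment) that lists the four equally likely gene-pairs of a child, with the mother's gene written first as in Example 1.

```r
# Hypothetical helper (an assumption for illustration): build the 2 x 2 table
# of equally likely child gene-pairs, mother's gene first, as in Example 1.
child_pairs <- function(mother, father) {
  tab <- outer(mother, father, paste0)
  dimnames(tab) <- list(mother = mother, father = father)
  tab
}

# Example 1: both parents carry one brown (A) and one blue (a) gene.
ex1 <- child_pairs(mother = c("A", "a"), father = c("A", "a"))
print(ex1)

# A child has blue eyes only if neither gene is "A".
is_blue <- !grepl("A", ex1, fixed = TRUE)
p_blue <- mean(is_blue)  # each of the four cells has probability 1/4
p_blue
```

Because every cell of the table is equally likely, a probability is simply the proportion of cells with the property of interest.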

Assume that Aaron and both of his parents have brown eyes, but Aaron’s sister has blue eyes.

(a)  [3 marks] What is the probability that Aaron has a blue eye gene?

(b)  [6 marks] Assume that Aaron’s wife has blue eyes. What is the probability that their first child will have blue eyes?

(c)  [10 marks] Suppose that Aaron and his wife’s first child ended up having brown eyes (and not blue). How does this information change the probability that Aaron has a blue eye gene? What is the probability that their second child will have brown eyes too?

2. Assume that a new Conservative Party leadership election has been triggered in the UK at a time when there are 361 Conservative MPs in parliament. Two of these MPs, M and B, join the leadership contest, where the aim is to get the majority support of the remaining 359 Conservative MPs. We further assume that on the day the leadership contest is announced, 184 of these MPs support M, and the remaining 175 MPs support B in becoming the next party leader. The announcement is followed by an election campaign, during which MPs can decide to change their allegiance. In particular, we know that on any given day, there is a probability of 0.005 that an MP who has been supporting M will become a B supporter by the end of the day, while the probability that an MP who has been supporting B will become an M supporter by the end of the day is 0.004. Each MP makes their decision independently of each other, and independently of the decision they made the day before.
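The daily switching mechanism described above can be sketched in R. This is a minimal illustration of a single day only, assuming a hypothetical helper `step_day()`; the seed and the helper name are not part of the assignment.

```r
set.seed(502)  # arbitrary seed, only for reproducibility of this sketch

n_M <- 184; n_B <- 175  # supporters on announcement day (from the text)
p_M_to_B <- 0.005       # daily probability that an M supporter switches to B
p_B_to_M <- 0.004       # daily probability that a B supporter switches to M

# Hypothetical helper: advance the supporter counts by one day.
# Independent switching means each group's number of switchers is binomial.
step_day <- function(n_M, n_B) {
  m_to_b <- rbinom(1, size = n_M, prob = p_M_to_B)
  b_to_m <- rbinom(1, size = n_B, prob = p_B_to_M)
  c(M = n_M - m_to_b + b_to_m, B = n_B - b_to_m + m_to_b)
}

day1 <- step_day(n_M, n_B)
day1
```

The total number of MPs is preserved by construction, since every MP who leaves one camp joins the other.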

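(a)  [4 marks] Introduce the following random variables:

$$X_i^{B} = \begin{cases} 1 & \text{if the } i\text{th B supporter is still a B supporter at the end of the first day,} \\ 0 & \text{otherwise,} \end{cases}$$

for i = 1, . . . , 175; and

$$X_i^{M} = \begin{cases} 1 & \text{if the } i\text{th M supporter has become a B supporter by the end of the first day,} \\ 0 & \text{otherwise,} \end{cases}$$

for i = 1, . . . , 184.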

Using these random variables express the number of B supporters at the end of the first day, then use your formula to find the expected number of B supporters at the end of the first day. Justify every step of your argument.

(b)  [3 marks] Define random variables $\hat{X}_i^{B}$, i = 1, . . . , 175 and $\hat{X}_i^{M}$, i = 1, . . . , 184 whose sum gives you the number of M supporters at the end of the first day. What is the expected number of M supporters at the end of the first day?

(c)  [6 marks] R: The election campaign is set to last for 2 weeks. This means that each MP would vote according to the allegiance they have at the end of day 14, that is, the candidate they would vote for is the one they are supporting after the first 14 days of the campaign. Using simulation find the probability that in this election B would hold the majority of the votes among the 359 MPs.

(d)  [3 marks] R: Now suppose that the election had to be postponed, and with the new date, candidates now have a 60-day campaign period (as opposed to 14 days). Adjust your code from part 2c to find the probability that B will win the delayed election. How does this probability compare to the one computed in part 2c?

3. Observations Y1, Y2, . . . , Yn are assumed to be independent and identically distributed samples from a data model following a Rayleigh distribution, with probability density function:

f (y; θ) =     for θ > 0 and 0 < y < ∞.

The mean of this distribution is

µ =  ,

and the variance is

2      .

(Note that here π is not a parameter, it is the usual mathematical constant i.e. 3.14...)

(a)  [2 marks] Find the method of moments estimator θ˜ of θ .

(b)  [5 marks] Is your estimator θ˜ unbiased? If not, then suggest an adjustment to this estimator that would make it unbiased and report your final unbiased estimator. Hint: If E(θ˜) = cθ, then the estimator θ˜/c is unbiased. Also, remember that we can express second moments using the formula E(X²) = Var(X) + [E(X)]².
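The two facts in the hint can be written out in general form. This is a generic sketch for any estimator with a known non-zero constant c, and any i.i.d. sample with mean µ and variance σ²; it does not use the specific Rayleigh density.

```latex
% If the expectation of an estimator is a known multiple of the parameter,
% dividing by that multiple removes the bias:
\mathbb{E}(\tilde\theta) = c\,\theta
\quad\Longrightarrow\quad
\mathbb{E}\!\left(\frac{\tilde\theta}{c}\right) = \frac{c\,\theta}{c} = \theta .

% Second moment of the sample mean of n i.i.d. observations:
\mathbb{E}(\bar Y^{2})
  = \operatorname{Var}(\bar Y) + \bigl[\mathbb{E}(\bar Y)\bigr]^{2}
  = \frac{\sigma^{2}}{n} + \mu^{2} .
```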

(c)  [4 marks] An alternative estimator is θˆ =   Yi2 . Is this estimator unbiased? If not, suggest

(d)  [5 marks] Using the fact that the random variable X = Y2  is exponentially distributed with rate

 , assess whether the estimator θˆ from part 3c is consistent.

(e)  [6 marks] We have 150 samples from a Rayleigh distribution with sample mean 3.2.  Using an appropriate point estimator of θ, suggest a suitable estimate of the variance, and use this variance estimate to construct an approximate 95% confidence interval for the mean of the distribution. (You can use R to find the relevant quantiles).
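The quantile step can be sketched in R as follows. This is a minimal illustration of a normal-approximation interval, assuming a hypothetical helper `approx_ci()`; the variance value passed to it is a made-up placeholder and is not derived from the Rayleigh model.

```r
# Hypothetical helper: normal-approximation confidence interval for a mean,
# given the sample mean, an estimate of the variance of one observation,
# and the sample size.
approx_ci <- function(ybar, var_hat, n, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)  # 0.975 quantile for a 95% interval
  half_width <- z * sqrt(var_hat / n)
  c(lower = ybar - half_width, upper = ybar + half_width)
}

# ybar and n come from the question; var_hat = 2.8 is a made-up placeholder,
# NOT the value implied by the Rayleigh model.
ci <- approx_ci(ybar = 3.2, var_hat = 2.8, n = 150)
ci
```

The interval is symmetric about the sample mean, and its width shrinks at rate 1/sqrt(n).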

4. Consider the data set Y1, Y2, . . . , Yn that is assumed to have arisen from the data model with probability density function

$$f(y; \theta) = k\, y^{\theta + 1} (1 - y), \qquad 0 < y < 1,$$

where θ > 0.

(a)  [4 marks] Find the constant k that makes the above function a probability density function.

(b)  [6 marks] Show that the maximum likelihood estimator, θˆ of θ is given by the solution to the equation:

$$\hat{\theta}^{2} \sum_{i=1}^{n} \log(Y_i) + \hat{\theta} \left[ 5 \sum_{i=1}^{n} \log(Y_i) + 2n \right] + 6 \sum_{i=1}^{n} \log(Y_i) + 5n = 0.$$

(c)  [5 marks] R: Let y1, . . . , y30 below correspond to 30 samples of this distribution:

0.573  0.770  0.652  0.827  0.821  0.789  0.898  0.718  0.382  0.668
0.647  0.477  0.661  0.380  0.870  0.794  0.783  0.732  0.629  0.777
0.600  0.724  0.553  0.693  0.687  0.935  0.494  0.411  0.530  0.478

To produce a maximum likelihood estimate for θ based on these data, use the polyroot function of R.

Hint: polyroot finds the roots of a polynomial. Its argument is the vector of polynomial coefficients in increasing order. For example, to find the roots of the polynomial p(x) = x² + 2x − 3 we can use

rt <- polyroot(c(-3, 2, 1))

Even though both roots that you will get are real, polyroot gives these roots in complex form (don’t worry about what this means). You can use the Re() function to extract the real part of complex numbers. That is, if the outcome of the polyroot function is stored in the variable rt, then we can use the following to get the desired roots.

rt_real <- Re(rt)

rt_real

## [1]  1 -3

Note that this code lists all the roots of a polynomial.  You will have to check which one of these is a local maximum.
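One way to check which root is a local maximum is the sign of the second derivative. The sketch below uses a toy function g (an assumption for illustration, unrelated to the likelihood in this question) whose derivative is g'(x) = −(x² + 2x − 3), so its stationary points are the roots found in the hint.

```r
# Toy function (hypothetical, for illustration only):
#   g(x)   = -(x^3 / 3 + x^2 - 3 * x)
#   g'(x)  = -(x^2 + 2 * x - 3)
#   g''(x) = -(2 * x + 2)
g_second <- function(x) -(2 * x + 2)

# Stationary points of g: real parts of the roots of x^2 + 2x - 3.
rt_real <- Re(polyroot(c(-3, 2, 1)))

# A stationary point is a local maximum when the second derivative is negative.
is_max <- g_second(rt_real) < 0
data.frame(root = rt_real, g_second = g_second(rt_real), local_max = is_max)

x_max <- rt_real[is_max]
x_max
```

The same pattern carries over to a log-likelihood: evaluate its second derivative (or simply the log-likelihood itself) at each admissible root and keep the one that is a maximum.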

(d)  [3 marks] R: Produce a plot of the fitted probability density function using the estimate of θ obtained from 4c.

5. The file ‘ozone.csv’, available on the course ELE page, contains information on ozone levels recorded over 111 days from May to September 1973 in New York. The variables measured were:

ozone          Ozone levels, in parts per billion (ppb)
radiation      Radiation, in langleys
temperature    Temperature, in degrees Fahrenheit
wind           Wind speed, in miles per hour (mph)

Read these data into R and answer the following questions.

(a)  [6 marks] Carry out exploratory data analysis, and produce a matrix scatterplot of the dataset. Comment on your findings and what these plots suggest about the likely relationships between the response variable (ozone) and the other variables.
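A minimal sketch of the exploratory step is given below. Since ‘ozone.csv’ is not reproduced here, the code uses R's built-in `airquality` data (New York, May to September 1973) as a stand-in, with incomplete rows dropped and columns renamed; treating this as equivalent to the course file is an assumption.

```r
# Stand-in data (an assumption): the built-in airquality data with incomplete
# rows dropped and columns renamed. With the real file one would instead use
#   ozone_df <- read.csv("ozone.csv")
ozone_df <- na.omit(airquality)[, c("Ozone", "Solar.R", "Temp", "Wind")]
names(ozone_df) <- c("ozone", "radiation", "temperature", "wind")

# Numerical summaries and pairwise correlations.
summary(ozone_df)
cor_mat <- cor(ozone_df)
round(cor_mat, 2)

# Matrix scatterplot of all four variables.
pairs(ozone_df, main = "Ozone data: matrix scatterplot")
```

The first row (or column) of the scatterplot matrix shows ozone against each explanatory variable, which is the part most relevant to the later regression.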

(b)  [9 marks] Fit a multiple regression of ozone as the response variable, against radiation, temperature and wind as the explanatory variables (use all three when fitting the model). Comment on the summary of the model. What do these coefficients suggest about the relationship between ozone and the other variables? Are these findings consistent with your earlier descriptive plots? Also include suitable residual plots, commenting as appropriate.
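The fitting and residual-plot steps might look like the following sketch. It again uses the built-in `airquality` data as a stand-in for ‘ozone.csv’ (an assumption), with the column names taken from the variable list in the question.

```r
# Stand-in data (an assumption); with the real file use read.csv("ozone.csv").
ozone_df <- na.omit(airquality)[, c("Ozone", "Solar.R", "Temp", "Wind")]
names(ozone_df) <- c("ozone", "radiation", "temperature", "wind")

# Multiple linear regression with all three explanatory variables.
fit <- lm(ozone ~ radiation + temperature + wind, data = ozone_df)
summary(fit)

# Residuals against fitted values, and a normal Q-Q plot of the residuals.
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(fit))
qqline(resid(fit))
```

Curvature or a funnel shape in the first plot would suggest that the linear, constant-variance assumptions are questionable.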

(c)  [10 marks] A colleague suggests you implement the following model,

$$\log(\text{ozone}_i) = \beta_0 + \beta_1 \log(\text{radiation}_i) + \beta_2 \log(\text{temperature}_i) + \beta_3 \log(\text{wind}_i) + e_i, \quad \text{where } e_i \sim N(0, \sigma^2).$$

Fit this new model to the data to obtain estimates for the regression coefficients. Produce a plot of the residuals against the fitted values, and a Q-Q plot of the residuals. Comment on the outputs from the modelling (comparing it to the previously fitted model), paying particular attention to the interpretation of the coefficients. Express the impact of the explanatory variables on the ozone levels, with the latter expressed on the original (untransformed) scale.
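A sketch of the log-log fit and of one way to express coefficients on the original scale follows. The data are again the built-in `airquality` stand-in (an assumption), and the 10% change used in the example is an arbitrary illustrative choice.

```r
# Stand-in data (an assumption); with the real file use read.csv("ozone.csv").
ozone_df <- na.omit(airquality)[, c("Ozone", "Solar.R", "Temp", "Wind")]
names(ozone_df) <- c("ozone", "radiation", "temperature", "wind")

# Log-log regression: every variable enters on the log scale.
fit_log <- lm(log(ozone) ~ log(radiation) + log(temperature) + log(wind),
              data = ozone_df)
b <- coef(fit_log)

# Diagnostic plots on the log scale.
plot(fitted(fit_log), resid(fit_log),
     xlab = "Fitted values (log scale)", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(fit_log))
qqline(resid(fit_log))

# In a log-log model, multiplying one explanatory variable by a factor f
# (others held fixed) multiplies the predicted ozone level by f^beta.
# Example: effect of a 10% increase in each variable.
mult_10pct <- 1.10^b[-1]
round(mult_10pct, 3)
```

A multiplier above 1 corresponds to a positive coefficient, and a multiplier below 1 to a negative one, so the effects are read as percentage changes in ozone rather than additive changes in ppb.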

Total for paper = 100 marks.

The submitted work should be your own work! The questions apart from Q2(c), Q2(d), Q4(c), Q4(d) and Q5 are theoretical exercises, and should be solved using results we covered in lectures. Make sure you justify each step of the theoretical reasoning by clearly stating the theorem/property you are using (marks will be awarded for these). Also make sure that you add comments to each section of your R code, explaining what you’re doing. All the relevant R output (computed probabilities, plots, etc.) should be included in your submission! A pdf document with your R code, R output and the solutions to the theoretical exercises should be submitted through EBART by Noon (12pm), 2nd December. Note that late submissions will be penalised.