

FIT2086 Assignment 1


Introduction

There are a total of five questions worth 4 + 6 + 6 + 7 + 7 = 30 marks in this assignment. Please note that working and/or justification must be shown for all questions that require it.

This assignment is worth a total of 10% of your final mark, subject to hurdles and any other matters (e.g., late penalties, special consideration, etc.) as specified in the FIT2086 Unit Guide or elsewhere in the FIT2086 Moodle site (including Faculty of I.T. and Monash University policies).

Students are reminded of Monash University’s policies on academic integrity. In submitting this assignment, you acknowledge your awareness of Monash University’s policies on academic integrity and that work is done and submitted in accordance with these policies.

Submission: Please submit a single PDF file containing your answers via Moodle. Scans of hand-written answers are acceptable but they must be clean and legible. You must ensure your submission contains answers to the questions in the order they appear in the assignment. Submission must occur before 11:55 PM Friday, 20th of August, and late submissions will incur penalties as per Faculty of I.T. policies.


Question 1 (4 marks)

In Lecture 1 we learned about several different types of general data science techniques/applications: (i) risk prediction, (ii) recommendation systems, (iii) forecasting, (iv) anomaly detection, (v) image recognition systems. For each of the following problems, suggest which of these application types the problem belongs to and justify your selection:

1. Measuring brain activity to determine whether or not someone is likely to suffer an epileptic seizure within the next day? [1 mark]

2. Estimating the number of people at a sporting event from television footage? [1 mark]

3. Discovering the video preferences of YouTube users? [1 mark]

4. Predicting the number of new face masks that will be sold over the next month? [1 mark]

Table 1: Empty table of the joint proportions of Manchester United winning (W = 1)/not winning (W = 0) a football match when playing at home (H = 1)/playing away (H = 0).


Question 2 (6 marks)

It is common in many sporting leagues for teams to alternate between playing games at their own venue (i.e., “at home”) and at other teams’ home venues (i.e., “away”). It is usually assumed that teams play better at home, where they have support from their own fans, than when they play away. Manchester United is a major football club in the English Premier League; let us consider the home and away win rates for this particular team. This is a (simple) example of sports analytics, an area of data science which has grown rapidly in importance over the last few years. The numbers of home (H = 1) and away (H = 0) wins (W = 1), as well as home and away losses or draws (W = 0), for the two-year period from December 2018 through to December 2020 are as follows:

● 27 home games won;

● 24 home games not won;

● 33 away games won;

● 23 away games not won.

Using this data please answer the following questions; you must provide working/justification.

1. Using the frequencies provided above fill in the entries of Table 1 with the proportions of the times those events occurred, i.e., estimates of the joint probabilities of a win/no-win for home/away games (up to 3 decimal places). [1 mark]

2. Using these proportions, calculate the marginal probability of Manchester United winning a game, regardless of whether it is played at home or away, i.e., P(W = 1). [1 mark]

3. What is the probability that Manchester United will win a game, given that they are playing at home? [1 mark]

4. What is the probability that Manchester United will win a game, given that they are playing away? [1 mark]

5. Do you believe that Manchester United plays better when at home versus when they play away? [1 mark]

6. Imagine that Manchester United will play a home game, then an away game, and then a home game in their next three games. What is the probability that they will win two out of three of these games? [1 mark]
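
As a reminder of how joint, marginal and conditional probabilities relate, the following is a minimal Python sketch (the assignment itself does not require code for this question). The counts in `toy_counts` are hypothetical and are not the Manchester United data; the function names are likewise chosen only for illustration.

```python
from fractions import Fraction

def joint_table(counts):
    """Turn a dict {(h, w): count} into joint proportions P(H = h, W = w)."""
    total = sum(counts.values())
    return {cell: Fraction(n, total) for cell, n in counts.items()}

def marginal_w(joint, w):
    """P(W = w): sum the joint proportions over both values of H."""
    return sum(p for (_, w_val), p in joint.items() if w_val == w)

def conditional_w_given_h(joint, w, h):
    """P(W = w | H = h) = P(H = h, W = w) / P(H = h)."""
    p_h = sum(p for (h_val, _), p in joint.items() if h_val == h)
    return joint[(h, w)] / p_h

# Hypothetical counts (NOT the assignment data); keys are (H, W).
toy_counts = {(1, 1): 6, (1, 0): 4, (0, 1): 3, (0, 0): 7}
joint = joint_table(toy_counts)

print("P(H=1, W=1) =", joint[(1, 1)])
print("P(W=1)      =", marginal_w(joint, 1))
print("P(W=1|H=1)  =", conditional_w_given_h(joint, 1, 1))
```

Exact fractions are used so that rounding to 3 decimal places can be left to the final step.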


Question 3 (6 marks)

Imagine that we roll a fair six-sided die and a fair four-sided die (i.e., all sides have the same probability). Let X1 and Y1 be the independent random variables representing the outcomes of those rolls respectively. Let S = X1 + 2 Y1 be the sum of the outcome of the roll of the six-sided die and two times the outcome of the roll of the four-sided die. Please answer the following questions with appropriate working/justification.

1. What is the variance of S, i.e., what is V [S]? [1 mark]

2. Determine the probability distribution of S, i.e., the probability that S = s for each s ∈ {3, ..., 14}. [1 mark]

3. What is the expected value of S^3, i.e., what is E [S^3]? [1 mark]

4. Calculate the approximate value of E [S^3] using the Taylor-series procedure discussed in Lecture 2. [2 marks]

5. Imagine that we roll a second fair six-sided die; call the outcome of this roll X2. What is the expected value of (X1 + 2Y1 + X2)^2, i.e., what is E [(X1 + 2Y1 + X2)^2]? [1 mark]
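
The kind of calculation involved can be sketched in Python by enumerating every equally likely pair of outcomes. The example below uses two hypothetical three-sided dice rather than the dice in this question, and it assumes that the Lecture 2 procedure is the usual second-order Taylor expansion about the mean, E [g(S)] ≈ g(μ) + g''(μ) V [S] / 2; check the lecture notes if the procedure taught there differs.

```python
from fractions import Fraction
from itertools import product

def distribution_of_sum(faces_a, faces_b, weight_b=1):
    """Exact distribution of A + weight_b * B for two independent fair dice."""
    p_each = Fraction(1, len(faces_a) * len(faces_b))
    dist = {}
    for a, b in product(faces_a, faces_b):
        s = a + weight_b * b
        dist[s] = dist.get(s, 0) + p_each
    return dist

def expectation(dist, g=lambda s: s):
    """E[g(S)] for a discrete distribution given as {value: probability}."""
    return sum(g(s) * p for s, p in dist.items())

# Toy example (NOT the assignment's dice): two three-sided dice, second doubled.
dist = distribution_of_sum([1, 2, 3], [1, 2, 3], weight_b=2)
mean = expectation(dist)
var = expectation(dist, lambda s: (s - mean) ** 2)

exact_cube = expectation(dist, lambda s: s ** 3)
# Second-order Taylor approximation with g(s) = s^3, so g''(mean) = 6 * mean.
approx_cube = mean ** 3 + 3 * mean * var

print("mean =", mean, " variance =", var)
print("exact E[S^3] =", exact_cube, " Taylor approximation =", approx_cube)
```

Because g(s) = s^3 is a cubic, the error of this approximation is exactly the third central moment E [(S − μ)^3], which is a useful check on your own working.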


Question 4 (7 marks)

Imagine that a continuous random variable X defined on the range [0, s] follows the probability density function

Answer the following questions; you must include appropriate working.

1. Plot the probability density function of X when s = 1 and s = 2 for x ∈ (0, 3). [2 marks]

2. Determine the expected value of X, i.e., E [X]. [1 mark]

3. Determine the variance of X, i.e., V [X]. [1 mark]

4. Determine the expected value of 1/X, i.e., E [1/X]. [1 mark]

5. Determine the median of X. [2 marks]

(hint: the answers to Q4.2 through Q4.5 will all be functions of s).
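
Answers obtained by hand can be sanity-checked numerically. The Python sketch below does this for a stand-in density f(x) = 2x/s^2 on [0, s], which is chosen purely for illustration and is not the density defined in this question; `integrate` and `median` are hypothetical helper names. For the stand-in density the exact mean is 2s/3 and the exact median is s/√2.

```python
import math

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def median(pdf, lo, hi, tol=1e-6):
    """Bisection for the point m at which the integral of pdf over [lo, m] is 1/2."""
    a, b = lo, hi
    while b - a > tol:
        m = (a + b) / 2
        if integrate(pdf, lo, m) < 0.5:
            a = m  # too little probability mass to the left: move right
        else:
            b = m
    return (a + b) / 2

s = 2.0

def stand_in_pdf(x):
    """Stand-in density on [0, s]; NOT the assignment's density."""
    return 2 * x / s ** 2

mean = integrate(lambda x: x * stand_in_pdf(x), 0, s)
med = median(stand_in_pdf, 0, s)

print("numerical mean   =", mean, " exact =", 2 * s / 3)
print("numerical median =", med, " exact =", s / math.sqrt(2))
```

Swapping in a different density only requires changing `stand_in_pdf` and, if needed, the integration limits.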


Question 5 (7 marks)

In this question we will analyse some topical and relevant data: recovery times for people infected with the novel coronavirus (COVID-19). The data we will use is a rough version of the data used in a paper published by researchers from New South Wales in July, 2021, and covers 84% of all COVID-19 cases diagnosed within the state of New South Wales from January 2020 through to May 2020. A number of outlying patients with very long recovery times were pruned from the original data to make analysis a bit easier. As you will likely know, COVID-19 is a highly infectious virus with potentially life-threatening flu-like symptoms. Accurately determining time-to-recovery of infected individuals is extremely useful for predicting loads on health care systems.

The file covid.19.recovery.csv contains the number of (whole) days taken by n = 2,586 individuals to be deemed “recovered” after contracting COVID-19. Answer the following questions; you must provide relevant R statements, working or justification as appropriate to obtain full marks.

1. Fit a Poisson distribution to the COVID-19 recovery data using the maximum likelihood estimator for λ. What is the value of this parameter for this data? [1 mark]

2. Plug the estimate of λ into the Poisson distribution, and use this to answer the following questions:

(a) What is the probability of a recovery time being at least two weeks (i.e., 14 days)? [1 mark]

(b) What are the two most likely numbers of days that it will take an individual to recover? [1 mark]

(c) Imagine that five people have contracted COVID-19. What is the probability that at least 2 of these people will have recovered before day 13? [2 marks]

3. A model’s predictions are only as good as the model is representative of the population. Do you believe that the Poisson distribution is an appropriate model for the recovery time data? Plot the observed probabilities for the various days until recovery against the probabilities predicted by your Poisson model, and use this to justify whether or not this model is a good fit to the data. [2 marks]
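
The workflow in this question (fit by maximum likelihood, plug in the estimate, compare observed and predicted probabilities) can be sketched as follows. The sketch is in Python for illustration only, since the assignment asks for R statements, and `toy_days` is a hypothetical sample, not the contents of covid.19.recovery.csv. It relies on the standard result that the maximum likelihood estimate of the Poisson rate is the sample mean.

```python
import math

def poisson_mle(samples):
    """Maximum likelihood estimate of the Poisson rate: the sample mean."""
    return sum(samples) / len(samples)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed on the log scale for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(poisson_pmf(j, lam) for j in range(k + 1))

# Hypothetical recovery times in days (NOT the assignment data).
toy_days = [9, 12, 15, 11, 14, 10, 13, 12]
lam_hat = poisson_mle(toy_days)

# "At least 14 days" is the complement of "13 days or fewer".
p_at_least_14 = 1 - poisson_cdf(13, lam_hat)
print("lambda estimate:", lam_hat, " P(X >= 14):", round(p_at_least_14, 4))

# Observed proportion of each day count next to the fitted Poisson probability.
for k in sorted(set(toy_days)):
    observed = toy_days.count(k) / len(toy_days)
    print(k, round(observed, 3), round(poisson_pmf(k, lam_hat), 3))
```

In R, the corresponding built-in functions are `mean`, `dpois` and `ppois`.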