闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

MATH6166 Statistical Computing for Data Scientist — Coursework

Instructions for Coursework

1. This coursework assignment is worth 100% of the overall mark for the module MATH6166.

2. The deadline is 1500 on Thursday 17 November 2022, and your completed work must be submitted electronically via Blackboard.

3. Your submission must consist of exactly one PDF ﬁle, which includes all the plots/graphics asked in the questions, together with one R script ﬁle containing all the R code you used to obtain your results for all questions. Missing R codes will be penalised. In the PDF, you must label clearly with which questions the ﬁgures are associated. In the R script, you must use comments to make it clear to which questions the R codes are answering. Please ensure the submitted ﬁles have the correct ﬁle format; otherwise it cannot be marked. Note an R markdown ﬁle is a diﬀerent format to an R script ﬁle.

4. Please name your PDF ﬁle as .pdf. For example, if your student ID is 12345678, then your report must have the ﬁle name 12345678 .pdf.

5. Your R script ﬁle must have the ﬁle name .R. For example, if your student ID is 12345678, then your script must have the ﬁle name 12345678 .R.

6. Your entire R script ﬁle must be reproducible and run without any errors.

7. You must present elegant and informative plots in the PDF ﬁle, including the use of appropriate plotting area, points/lines sizes and colours, as well as including meaningful labels, titles and legends.

8. You must informatively comment your R code indicating where each question and task begins. When writing an R function, the 1) function and 2) argument names must be exactly the same as those speciﬁed in the questions; otherwise a heavy penalty will be applied.

● Please note that uppercase and lowercase of the same alphabet are not the same character, so if the question asks to write a function/argument called, apple, but a function/argument called

applE is presented in your answer, than that would be wrong.

9. You must not load any add-on packages, i.e., using functions library or require etc.

10. The standard Faculty rules on late submissions apply: for every weekday, or part of a weekday, that the coursework is late, the mark will be reduced by a factor of 10%. No credit will be given for work which is more than one week late.

11. All coursework must be carried out independently. You are reminded of the University’s Academic Integrity Policy, see https://www.southampton.ac.uk/quality/assessment/academic_integrity.page.

12. In the interests of fairness, queries which relate directly to the coursework must be made via the MATH6166 Discussion Forum on Blackboard. This ensures that everybody has access to the same information.

Total marks available: 100

Question 1: Wild Life Population Density

[30 marks] An ecologist is interested in the population density of a bird species at various sites across three diﬀerent habitat types. The data collected by the ecologist is stored in the data ﬁle ecoStudy .txt, which has two variables:

● density: the population density recorded at each site (count per km2 ),

● habitat: the habitat type of the site.

(a) [ 6 marks] Import the dataset from the ecoStudy .txt ﬁle into the R workspace.

(i) Find the minimum, ﬁrst quartile, median, mean, third quartile, and maximum values of the density variable.

(ii) Find the number of observations in each habitat type.

Report your answer as comments below your R commands for each part of this question in the R script. Where the answer is not an integer, round to 3 decimal places.

(b) [ 6 marks] Create a new variable called logDensity in the data .frame object containing the data imported from ecoStudy .txt, where logDensity is deﬁned by

log(density + 1). (1)

(i) Create a side-by-side box plot of density for all habitat types.

(ii) Create a side-by-side box plot of logDensity for all habitat types.

(iii) Use the appropriate R command(s) to arrange the graphs from (i) & (ii) into a lattice of two rows and one column.

(c) [8 marks] We would like to test whether the population mean density is the same across all habitat types. A common approach is to apply the F-test in an one-way analysis of variance (ANOVA). The F-test statistic, Fdl ,d2 , is given by

Ni (y¯i ì y¯)2 /(K ì 1) (2)

where yij is the density value of the jth observation in the ith habitat type, y¯i is the sample mean density of the ith habitat type and y¯ is the overall sample mean density. The term K represents the number of habitat types and Ni is the number observations in the ith habitat type.

The degrees of freedom d1 and d2 of the F-statistic are respectively deﬁned by K ì 1 and N ì K, where N is the total number of observations in the dataset.

(i) Write your own R commands to calculate the F-test statistic for testing the hypothesis that the the population mean density is same across all habitat types.

Important: for the mathematical calculations, you may only use (but do not need to include all) the operators +, -, *, /, ˆ, %% and the functions mean() and sum().

(ii) Use the pf() function to ﬁnd the p-value of the F-test.

Hint: the p-value is the probability of observing an F-statistic equal or larger than the one calculated in part (i).

Report your answer (to 3 signiﬁcant ﬁgures) as comments below your R commands for each part of this question in the R script.

(d) [4 marks] Repeat part (c); however, this time you must use one of the apply functions. Apply functions include apply(), sapply(), tapply etc. You need to choose the appropriate function to perform the task correctly. Report your answer (to 3 signiﬁcant ﬁgures) as comments below your R commands for this question in the R script.

(e) [ 6 marks] Calculate the residuals of the F-test and create a Q-Q plot to visually examine whether the residuals are normally distributed. The residual value of the jth observation of the ith habitat is yij ì y¯i , where yij and y¯i are deﬁned in part (c). Make sure to include a reference line in your Q-Q plot. In the R script, as comments below the R commands for this question, write a brief description on whether the normality assumption has been satisﬁed.

Question 2: Mixture Models

[ 16 marks]

The probability density function of a gamma mixture model with two components is given by

fX (x) = p + (1 ì p) , (3)

where x is the observed value of the random variable X , which follows the above gamma mixture model. The parameters α 1 , α2 , β 1 , and β2 only take positive values, while the parameter p is a probability.

(a) [ 6 marks] Write a function calcLogMix2Gamma that calculates the natural logarithm of the joint probability density for any number of independent random variables X1 , ..., Xn assumed to be distributed according to the gamma mixture model in equation (3). The calcLogMix2Gamma function have the following features.

Arguments:

(1) x is a numeric vector of positive values only, representing the observed values of X1 , ..., Xn , where n can be any positive integer, which means x can be of any length.

(2) alpha is a positive numeric vector of length two, where the ﬁrst and second elements respectively correspond to the values of α 1 and α2 in equation (3).

(3) beta is a positive numeric vector of length two, where the ﬁrst and second elements respectively represent the values of β 1 and β2 in equation (3).

(4) p is a single numeric value between 0 and 1 inclusive, representing p in equation (3).

Computation:

● The calcLogMix2Gamma function calculates the natural logarithm of the joint probability density of x, given alpha, beta and p according equation (3).

Return:

● A single numeric value as calculated in Computation.

Important: You must write you own code to perform the speciﬁed computation, i.e., you must not use any existing functions for calculating the density of gamma distributions or mixture models; otherwise a heavy penalty will be applied.

Hint: the functions gamma() or lgamma() maybe used.

(b) [ 6 marks] To enable the calculations in part (a), the calcLogMix2Gamma function needs to ensure objects passed the arguments satisfy the following requirements.

(i) x is a numeric vector containing positive values only.

(ii) alpha and beta are both positive numeric vectors of length two.

(iii) p is a single numeric value between 0 and 1 inclusive.

Add additional R commands at the beginning of your calcLogMix2Gamma function in part (a) to check that valid inputs have been provided before any calculations. If invalid inputs are passed to this function, it will stop prematurely by using the appropriate R function and print out an informative error message.

(c) [4 marks] Let 尔 be the observed values of random variables (X1 , ..., X10 ), each distributed according to the model in equation (3). Use the calcLogMix2Gamma function to calculate the natural logarithm of the joint probability density for the observed values

尔 = (0.06, 2.32, 4.81, 0.02, 2.33, 2.18, 0.83, 2.45, 2.10, 3.27), (4)

given the parameter values (α1 , α2 ) = (0.5, 6), (β1 , β2 ) = (0.5, 2.5) and p = 0.35. Report your answer (to 3 decimal places) as comments below your R commands for this question in the R script.

Question 3: Riemann Integration

[31 marks] Figure 1 illustrates the function below:

g(x) = sin ╱ log ╱ 、、 , for ì ↓ < x < ↓ ,

where α , β , p, q and k are the parameters of function g(x), while α > 0 and β > 0.

(a)

1.0

0.5

0.0

−0.5

−1.0

a b

(b)

1.0

0.5

0.0

−0.5

−1.0

Figure 1: The curve represented by function g(x) in equation (5): In panel (a), the area shaded in grey is the area between g(x) and the x-axis within the x-interval [a, b], while the shaded region in panel (b) is the area between g(x) and the x-axis within the x-interval [a\ , b\]

When a curve g(x) is above the x-axis for the x-interval [a, b] as in Figure 1 (a), we can calculate the area

between g(x) and the x-axis within [a, b] using the middle Riemann sum to approximate the integrationa(b) g(x).

(ab g(x) ↓ i1 (xi ì xi μ1)g ╱ 、

(6)

with subintervals deﬁned by a = x0 < ... < xN = b.

(a) [ 6 marks] Write an R function smoothCurve that calculates the mathematical function g(x) deﬁned in equation (5).

Arguments:

(1) x is a numeric value or a numeric vector.

(2) alpha is a single positive numeric value.

(3) beta is a single positive numeric value.

(4) p is a single numeric value.

(5) q is a single numeric value.

(6) k is a single numeric value.

Computation:

● This function calculates the value(s) of g(x) given by equation (5).

Return:

● If x is a numeric value, this function returns the numeric value calculated in Computation, or

● if x is a numeric vector, the function returns the numeric vector calculated in Computation.

(b) [5 marks] Write a function called calcMidRiemannLoop that calculates the area between g(x) in equation (5) and the x-axis for any given x-interval [a, b] by using the middle Riemann sum.

Arguments:

(1) xVec is a numeric vector with elements corresponding to x0 , ..., xN in equation (6).

Note: You can assume that for marking, the x-intercepts will not fall within any of the subintervals; however, the subintervals may have diﬀerent lengths. The x-intercepts are the values of x that give f (x) = 0.

(2) — (6) alpha, beta, p, q and k have the exact same speciﬁcations as in part (a) .

Computation:

● This function calculates the area between the x-axis and g(x) in equation (5) according to the middle Riemann sum with subintervals deﬁned by xVec.

Hint: An area is always positive.

Return:

● A numeric value which is the area calculated in Computation.

Important: You must use a loop to answer this question; otherwise a heavy penalty will be applied.

(c) [4 marks] Write a function called calcMidRiemann which has the exact same Arguments, Computation and Return speciﬁcations as in part (b), except you must not use a loop for this part.

(d) [4 marks] For the calcMidRiemann function to work the function need to check that the valid inputs are passed to the argument, speciﬁcally, its argument xVec must be a numeric vector with values in increasing order.

Add additional R commands at the beginning of your calcMidRiemann function in part (c) to check that valid inputs have been provided before any calculations. If invalid inputs are passed to this function, it will stop prematurely by using the appropriate R function and print out an informative error message.

(e) [3 marks] Use the calcMidRiemann to calculate the area between x = 2 and x = 8.5 with the sequence a = x0 = 2.0 < 2.01 < ... < 8.49 < 8.5 = xN = b, given the parameter values α = 2.1, β = 0.5, p = 3, q = 6 and k = 2. Report your answer (to 3 decimal places) as comments below your R commands for

this question in the R script.

(f) [ 5 marks] Write a function called calcMidRiemannAreas, which calculates the areas between the g(x) curve and the x-axis for multiple intervals on the x-axis. This function has the following speciﬁcations.

Arguments:

(1) xSeqList is a list of numeric vectors, where each vector corresponds to an increasing numeric sequence x0 , ..., xN , and the lengths of the vectors do not need to be the same.

(2) — (6) alpha, beta, p, q and k have the exact same speciﬁcations as in part (a) .

Computation:

● For each numeric sequence (deﬁned by each vector in xSeqList), this function calculates the area between the x-axis and g(x) in equation (5) according to the middle Riemann sum with subintervals deﬁned by each vector of xSeqList.

Return:

● A numeric vector containing the areas as calculated in Computation.

(g) [4 marks] Use the calcMidRiemannAreas to calculate the areas within the x-intervals deﬁned by the following sequences

3.50 < 3.51 < ... < 7.99 < 8.00 2.0 < 2.1 < ... < 4.2 < 4.3

1.07 < 1.071 < ... < 9.011 < 9.012

given the parameter values α = 1.9, β = 0.75, p = 2.15, q = 7 and k = 1.65. Report your answer (to 3 decimal places) as comments below your R commands for this question in the R script.

Question 4: Hidden Markov Model

[23 marks] Consider a sequence of weather events X1 , X2 , ..., Xn , where each Xi can be one of the three possible weather states: sunny, cloudy or rainy on the ith day for all i in {1, ..., n≠ . Assume that the probability of the weather on the ith day, Xi only depends on the weather on the previous day Xi μ1 for all i in {2, ..., n≠ . This means that

p(Xi |X1 , ..., Xi μ1) = p(Xi |Xi μ1), (7)

where the conditional probability p(Xi = b|Xi μ1 = a) is termed the transition probability from weather state a to whether state b. Thus, we can deﬁne the transition probability matrix PT as

PT = ╱

．prs

psc

pcc

prc

、

prr ． ,

(8)

where the subscripts s, c and r represent sunny, cloudy and rainy, respectively. For example, pcs = p(Xi = sunny|Xi μ1 = cloudy) for the all i in {2, ..., n≠ . The initial probabilities for the weather state of X1 is given by π = (πs , πc , πr ), so for example, p(X1 = sunny) = πs .

The probability of a sequence of weather events p(X1 , X2 , ..., Xn ) is then given by

p(X1 , X2 , ..., Xn ) = p(X1 ) p(Xi |Xi μ1).

i=2

For example, the probability of observing cloudy on day 1, rainy on day 2 and sunny on day 3 is therefore

p(X1 = cloudy, X2 = rainy, X3 = sunny)

=p(X3 = sunny|X2 = rainy)p(X2 = rainy|X1 = cloudy)p(X1 = cloudy) =prspcr πc .

(a) [8 marks] Write a function called weatherSeqProb, which has the following speciﬁcations.

Arguments:

(1) weatherSeq is a character vector, where each element must one of the three character "s", "c" or "r".

(2) trProbs is a numeric matrix object with three rows and three columns, where the values are the transition probabilities arranged according to equation (11).

(3) initProbs is a numeric vector with three elements, where the ﬁrst, second and third elements correspond to the initial probabilities πs , πc and πr .

Computation:

– Calculate the natural logarithm of the probability of observing the sequence of weather specify by weatherSeq given the transition probability matrix trProbs and initial state probabilities initProbs.

Return:

– A numeric value, which is the natural log probability calculated in Computation. (b) [3 marks] Calculate the natural logarithm of the probability for observing

X1 = cloudy, X2 = sunny, X3 = cloudy, X4 = rainy, X5 = sunny and X6 = sunny,

given π = (0.45, 0.25, 0.3) and

、

0.40． .

(10)

(11)

Report your answer (to 3 decimal places) as comments below your R commands for this question in the R script.

(c) [8 marks] Our hypothetical friend chooses their jacket (black or white) depending on weather with some randomness. The conditional probability matrix for the jacket chosen given the weather is deﬁned by the matrix

PE = ╱、 (12)

．prB prW ． ,

where the subscripts B and W stand for black and white (jackets), respectively. For example, pcW is probability that our friend wears a White jacket on a cloudy day. Note that in the subscripts, the weather states are in lowercase, while the colour states are in uppercase!

Let Yi represent the jacket colour on the ith day. The joint probability of a sequence of events weather and colour events p(X1 , X2 , ..., Xn , Y1 , Y2 , ..., Yn ) is then given by

p(X1 , X2 , ..., Xn , Y1 , Y2 , ..., Yn ) = p(Y1 |X1 )p(X1 ) p(Yi |Xi )p(Xi |Xi μ1). (13)

i=2

For example, the joint probability of observing cloudy weather and white jacket on day 1, rainy weather and black jacket on day 2 and sunny weather and black jacket on day 3 is

p(X1 = cloudy, X2 = rainy, X3 = sunny, Y1 = white, Y2 = black, Y3 = black)

= p(Y3 = black|X3 = sunny)p(X3 = sunny|X2 = rainy)

K p(Y2 = black|X3 = rainy)p(X2 = rainy|X1 = cloudy)p(Y1 = white|X1 = cloudy)p(X1 = cloudy) = psBprsprBpcrpcW πc .

Write a function called weatherColourProbs with the following speciﬁcations.

Arguments:

(1) colourSeq is a character vector, wherein each element is restricted to be either "B" or "W".

(2) emitProbs is a numeric matrix object with three rows and two columns, wherein the values are the probabilities of the jacket colour conditioned on weather as arranged according to equation (12).

(3) — (5) weatherSeq, trProbs and initProbs are speciﬁed exactly the same as in part (a).

Computation:

– Calculate the natural logarithm of the joint probability of observing the sequence of weather speciﬁed by weatherSeq and the sequence of jacket colour speciﬁed by colourSeq, given the probability arguments emitProbs, trProbs and initProbs.

Return:

– A numeric value, which is the natural log probability calculated in Computation. Important: In Computation, you must not repeat the codes in weatherSeqProb here.

(d) [4 marks] Calculate the natural logarithm of the joint probability for observing

X = {X1 , X2 , ..., X8 ≠ </

2022-11-12

Java

物理(Physical)

LINUX

C++

Python

Processing

sas

ios

maths