Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

StatComp Project 2: Hints

Monte Carlo standard errors and confidence intervals

In plain Monte Carlo estimation of an expectation value, we can usually construct approximate confidence intervals by estimating the standard deviation of the estimator and construct an interval based on a Normal distribution for the estimator.  However, this breaks down in some cases.  For example, when using a randomisation test to estimate a p-value, the Normal approximation only works if the p-value p or the number of samples N are large enough, so that we actually observe non-zero counts.  Fortunately, if we observe zero counts, we can construct a confidence interval using an exact method instead of relying on the Normal approximation.

Monte Carlo randomisation test variability

Let p be the unknown p-value, and X ∼ Bin(N, p) be the random variable for how many times we observe a randomised test statistic as extreme as or more extreme than the observed test statistic. We observe X = x and estimate the p-value with  = x/N . Then

E(X ) = Np,

Var (X ) = Np(1 − p),

E() = p,

Var () =  .

We see that the variance decreases towards 0 when p → 0. We can control the expectation of  the absolute error.  Due to Jensen’s inequality (you may or may not have heard of that!)  E(| p|) ≤ ^E [|− p|2] = ^Var () ≤ , where the last step uses that the variance is maximised for p = 1/2. Thus, to guarantee E(|− p|) ≤ ϵ for some ϵ > 0, we can choose N ≥  .

The relative error has variance Var [ p ] = p(p)  which goes to when p 0, so we cannot control the relative error uniformly for all p by increasing N . We therefore need to be careful when assessing results for small values of p.

Normal approximation

When x > 0, a basic approximate 95% confidence interval for p is given by

CIp  = ± z0 .975 4  =  ± z0 .975 4  .

With the approximation z0 .975  ≈ 2, we can limit the interval width with 4^  ≤ ϵ by taking N ≥  . For ϵ = 0.02, we get N ≥ 10, 000.

Exact interval for x = 0

When the observed count is x = 0, we can go back to the definition of a confidence interval and how we can construct confidence intervals by “inverting a test”; the interval is taken to be the set for which the corresponding null hypothesis is not rejected.

Imagine that we reject the hypothesis H0  : p = p0  for some p0  if PX (X ≤ x | p0 ) < 0.025 or PX (X ≥ x | p0 ) < 0.025 (to nominally give equal tail error probability). When x = 0, the second probability is equal to 1, so that condition is never used. The test is therefore only rejected when PX (X = 0 | p0 ) < 0.025. We solve for p0 :

PX (X = 0 | p0 ) = (1 − p0 )N  < 0.025

1 p0  < 0.0251/N

p0  > 1 0.0251/N .

The set of p0  values for which the test is not rejected is p0  ≤ 1 − 0.0251/N, so when x = 0 we can define the confidence interval for p as

CIp  = (0, 1 − 0.0251/N).

To limit the width of such confidence intervals to at most some ϵ, we need 1 − 0.0251/N  ≤ ϵ, and N ≥ l0(1(.)) . This grows much more slowly thane2(4)    when ϵ → 0, so we can safely use the N that’s required to bound  the Normal approximation interval width and still fulfill the interval width criterion for the x = 0 interval construction.

Bayesian credible interval construction

An alternative approach is to construct a Bayesian credible interval for p. Let p ∼ Unif(0, 1) a priori. The posterior density for p given x is then proportional to

P(X = x|p) = (x(N))px (1 p)N x ,

which shows that the posterior distribution for p is Beta(1 + x, 1 + N − x), and a credible interval is provided by the 0.025 and 0.975 quantiles.  In R, qbeta(c(0 .025,  0 .975),  shape1  =  1  +  x,  shape2  =  1  +  N  - x). This construction works for all N ≥ 1 and all 0 ≤ x ≤ N .

Prediction standard deviations

The predict() function can provide prediction standard errors for the linear predictor, with se .fit, but those are only half the story when predicting new data. The standard errors only include the uncertainty information about the linear predictor curve. For full prediction uncertainty, we need to take the observation variation into account, which lm() estimated via the variance of the residuals. Since the residuals for new observations is assumed to be conditionally independent of the predictor curve, the prediction variance can be estimated as the sum of the square of the prediction standard error and the residual variance, if the degrees of freedom is large. For the help text for lm() we see that when se .fit=TRUE, the output list contains the elements

•  fit: vector or matrix (depending on the interval argument)

•  se .fit:  standard error of predicted means

•  residual .scale: residual standard deviations

•  df: degrees of freedom for residual

#  When  creating  a  tibble,  the  construct  can  use  variables  defined

#  first  in  the   later  variables;  here  we  use  x  when  constructing  y:

df  <-  tibble(x  =  rnorm ( 10),

y  =  2  +  x  +  rnorm ( 10 ,  sd  =  0.1))

#  Estimate  a  model:

fit  <-  lm(y  ~  x,  data  =  df)

#  Compute  prediction  mean  and  standard  deviations  and  add  to  df_pred:

df_pred  <-  data .frame(x  =  seq (-2 ,  2 ,  length .out  =  100))

pred  <-  predict(fit,  newdata  =  df_pred,  se .fit  =  TRUE)

df_pred  <-  df_pred  %>%

mutate(mean  =  pred$fit,

se.fit  =  pred$se .fit,

sd  =  sqrt(pred$se .fitˆ2  +  pred$residual .scaleˆ2))

ggplot(df_pred)  +

geom_line(aes(x,  se.fit ,  colour  =  "se .fit"))  +

geom_line(aes (x,  sd,  colour  =  "sd"))  +

ylab("Std .  deviations")

0.16

0.12

0.08

0.04

 

colour

sd

se.fit

−2                            −1                             0                              1                              2

x

If we also ask for prediction intervals, we need to modify the code a bit. From comparing the interval width

results from predict() with those from an interval assuming t-distributions, we see that they are identical up to floating point accuracy.

#  Compute  prediction  mean  and  standard  deviations  and  add  to  df_pred:

df_pred  <-  data .frame(x  =  seq (-2 ,  2 ,  length .out  =  100))

pred  <-  predict(fit,  newdata  =  df_pred,  se .fit  =  TRUE ,  interval  =  "prediction") df_pred  <-  df_pred  %>%

mutate(mean  =  pred$fit[,  "fit"],

lwr  =  pred$fit[,  "lwr"],

upr  =  pred$fit[,  "upr"],

se.fit  =  pred$se .fit,

sd  =  sqrt(pred$se .fitˆ2  +  pred$residual .scaleˆ2))

ggplot(df_pred)  +

geom_line(aes (x,  upr  -  lwr  -  (qt(0.975 ,  pred$df)  -  qt(0.025 ,  pred$df))  *  sd))  +

ylab("Interval  width  difference")  +

xlab("x")

 

4.884981e−16

2.442491e−16

0.000000e+00

−2.442491e−16

−4.884981e−16

 

−2                              −1                               0                                1                                2

x

Handling non-Gaussian precipitation

While the assessment methods requested in the project description are valid for non-Gaussian predictions, the

non-Gaussianity of the precipitation data is still very noticeable on the monthly average scale. An effect of this is that the constant variance assumption of a basic lm() model isn’t a good fit to the data. To improve

this, you can take the square root of the monthly averages before applying the modelling in part 2. You may also do all the model and prediction assessment on this square-root version of the data.