关键词 > STA465/2016
Homework 2: STA465/2016
发布时间:2023-02-23
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
Homework 2: STA465/2016
Homework 2 is due on Friday, February 17th. The homework assignment is worth 25 points in total.
Question 1: Prior Predictive Distributions (5 pts)
Given a prior distribution, the distribution of unknown but observable y is,
p(y) = \ p(y|θ)p(θ)dθ
where p(y) reflects the distribution of the observations and p(θ) denotes the prior distribution of the param- eters in the model. This is called the prior predictive distribution.
To sample from this distribution, we can do so in two steps:
• first sample θj from the prior distribution
• using the value of θj , sample yj from p(y|θj )
Question 1.1: Linear Regression
Let’s return to the simple linear regression model. Simulate the following data set:
# Setting the values of the parameters
#
beta0 <- 1
beta1 <- 0.5
sigma <- 1
# Simulating covariate values + data
#
set .seed(17)
x <- runif(n = 100 , min = 1 , max=5)
y .mean <- beta0 + beta1*x
y <- rnorm (n = 100 ,
mean = y .mean,
sd = sigma)
sim.data <- tibble(x,y, y .mean)
For the three following sets of prior distributions, generate 20 data sets from the respective prior predictive distribution. Prior distribution candidates:
• β0 ∼ N (0, σβ0 = 1), β1 ∼ N (0, σβ 1 = 1), σ ∼ Gamma(shape = 1, scale = 1)
• β0 ∼ N (0, σβ0 = 1000), β1 ∼ N (0, σβ 1 = 1000), σ ∼ Gamma(shape = 1000, scale = 1000)
• β0 ∼ Unif (0, 1), β1 ∼ Unif (−1, 0), σ ∼ Exp(1)
For each sets of prior distributions, make two plots to relay the results:
• Graph the 20 prior predictive data sets and overlay the true line.
• Draw the curve of the prior distribution and add a vertical line at the true value of β0 , β 1 , σ .
Write a short paragraph on the major implications of the choice of the prior distribution as seen via simulation from the prior predictive.
Question 1.2: Multilevel/Hierarchical Model
Let’s extend the simple linear model to a multilevel model. We will use the following simulated data set for this assignment:
# Setting the values of the parameters
#
set .seed(17)
nu.mu <- 2
tau.mu <- 0.5
nu .beta <- -1
tau .beta <- 0.5
mu.hm <- rnorm (n=20 , mean = nu .mu, sd=tau .mu)
beta.hm <- rnorm (n=20 , mean = nu .beta, sd= tau .beta)
sigma <- 1
# Simulating covariate values + data
#
x.hm <- runif(n = 100 , min = 1 , max=5)
y .mean .hier <- c (rep (mu .hm, each = 100) +
rep(beta.hm , each = 100)*
rep(x.hm , 20))
y .hier <- rnorm (n = 20*100 , mean = y .mean .hier, sigma)
sim.data.hier <- tibble(x = rep (x .hm, 20), y .hier,
y .mean .hier,
group = paste("Group" ,
rep ( 1 :20 , each = 100)))
For the three following sets of prior distributions, generate a data set of the same size from the respective prior predictive distribution. Prior distribution candidates:
• νµ ∼ N (0, 1), νβ ∼ N (0, 1), τµ ∼ Gamma(shape = 1, scale = 1), τβ ∼ Gamma(shape = 1, scale = 1), σ ∼ Gamma(shape = 1, scale = 1)
• νµ ∼ N (0, σ = 1000), νβ ∼ N (0, σ = 1000), τµ ∼ Gamma(shape = 1000, scale = 1000), τβ ∼ Gamma(shape = 1, scale = 1000), σ ∼ Gamma(shape = 1000, scale = 1000)
• νµ ∼ Unif (0, 1), νβ ∼ Unif (0, 1), τµ ∼ Exp(1), τβ ∼ Exp(1), σ ∼ Exp(1)
For each sets of prior distributions, make three plots to relay the results:
• Graph the prior predictive data sets and overlay the line from which the data was generated.
• Overlay the 20 original lines (individually, in separate panels) onto the 20 simulated lines
• Draw the curve of the prior distributions and add a vertical line at the true value of νµ , νβ , τµ , τβ , σ .
Write a short paragraph on the major implications of the choice of the prior distribution as seen via simulation from the prior predictive.
Question 2: Fitting Linear Models (5 pts)
Question 2.1: Fitting a Linear Regression Model
load("bayes-vis .RData")
latcab <- st_as_sf (GM[GM$super_region == 5 ,])
Let’s fit a linear regression model in INLA to the PM2.5 data set, using only the data from Latin Amer- ica/Caribbean and two different sets of priors (your choice). Plot histograms of the posterior draws for each parameter. Report the 95% credible intervals for each parameter estimate. Organize the results in a table. Remember that we are modeling log(pm25) ~ log(sat_2014).
Question 2.2: Fitting a Multilevel Regression Model
Let’s fit a multilevel regression model in INLA to the PM2.5 data set using two different sets of priors (your choice). Plot histograms of the posterior draws. Report the 95% credible intervals for each parameter estimate.
Question 3: Posterior Predictive Distributions (5 pts)
The posterior predictive distribution for a new observation y~, can be written as,
\
To sample from the posterior predictive distribution, we can do so in two steps:
• first sample θj from the posterior distribution p(θ|y) in INLA
• using the value of θj , sample y~j from p(y~|θj )
Question 3.1
Using the results from the two models in Question 2.1, generate 100 data sets from the posterior predictive distribution. Draw density curves comparing the simulated data sets to the original data set. Write 1-2 sentences per plot about how the simulated data sets compare to the original data set.
Question 3.2
Using the results from the two models in Question 2.2, generate 100 data sets from the posterior predictive distribution. Draw density curves comparing the simulated data sets to the original data set. Recall the hierarchical/multilevel structure of the model. Write 1-2 sentences per plot about how the simulated data sets compare to the original data set.
Question 4 (10 pts)
For this question, we will use the data set that provides counts of sudden infant deaths (SID) in each of the counties of North Carolina, USA, in the year 1974.
library(sf)
nc_sf <- st_read(system .file("shape/nc .shp" , package="sf"),
quiet=TRUE)
st_crs (nc_sf)
## Coordinate Reference System:
## User input: NAD27
## wkt:
## GEOGCRS["NAD27",
##
##
##
##
##
##
##
##
##
##
##
##
##
DATUM["North American Datum 1927",
ELLIPSOID["Clarke 1866",6378206 .4,294 .978698213898,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0 .0174532925199433]],
CS[ellipsoidal,2],
AXIS["latitude",north,
ORDER[1],
ANGLEUNIT["degree",0 .0174532925199433]],
AXIS["longitude",east,
ORDER[2],
ANGLEUNIT["degree",0 .0174532925199433]],
ID["EPSG",4267]]
To construct the expected number of SID per county, use the following R code:
global_rate <- sum (nc_sf$SID74)/sum (nc_sf$BIR74)
nc_sf$Expected <- global_rate * nc_sf$BIR74
We can visualize both the counts and expected counts using tmap:
library(tmap)
tm_shape(nc_sf) + tm_fill("SID74")
SID74
20 to 30 40 to 50 |
tm_shape(nc_sf) + tm_fill("Expected")
Expected 20 to 30 40 to 50 |
Question 4.1
We will fit the BYM model to the SID74 data, but first, for the set of priors β0 ∼ N (0, 1), τv ∼ Gamma(shape = 1, scale = 1), τu ∼ Gamma(shape = 1, scale = 1) simulate 100 data sets from the prior predictive distribution. For counties Ashe, Alleghany, Surry, and Northampton make a histogram of the prior predictive draws and add a vertical line at the observed value.
Question 4.2
Fit the model in INLA using the set of priors in Question 1.1 and using the default priors specified in INLA (a total of 2 models). Organize the results of the estimated parameters (mean and 95% credible intervals) in a table. Include R code used.
Question 4.3
Create three maps of the results for each fitted model.
• Map 1: mean expected count of SID cases across North Carolina.
• Map 2: standard deviation of expected counts of SID cases across North Carolina
• Map 3: mean relative risk