2023 554 SUMMER Package R Notes
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
2023 554 SUMMER Package R Notes
Departments of Biostatistics and Statistics
2023-02-16
Small Area Estimation (SAE)
In these notes, SAE via the SUMMER package will be illustrated.
Details on SUMMER, including a vignette, can be found at https://cran.r-project.org/web/packages/SUMME R/index.html.
We illustrate with the Washington State BRFSS diabetes example and will obtain:
● Naive estimates
● Weighted estimates
● Estimates from a binomial BYM2 model
● Estimates from Fay-Herriot models
Load SUMMER package
We first load the SUMMER package.
This package also depends on INLA, so we need to make sure INLA is installed. Note that INLA is not on CRAN so it has a special installation process. Here, we check if INLA is available and install it if it is not. library (SUMMER)
if (!isTRUE (requireNamespace ( "INLA" , quietly = TRUE))) {
install .packages("INLA" ,
repos = c (getOption("repos"),
INLA="https://inla.r-inla-download.org/R/stable "), dep=TRUE)
}
Read in Data
BRFSS contains the full BRFSS dataset with 16,283 observations:
● diab2 variable is the binary indicator of Type II diabetes
● strata is the strata indicator and
● rwt_llcp is the final weight.
For the purpose of this analysis, we first remove records with missing HRA code or diabetes status from this dataset.
data (BRFSS)
BRFSS <- subset (BRFSS, !is .na (BRFSS$diab2))
BRFSS <- subset (BRFSS, !is .na (BRFSS$hracode))
KingCounty contains the map of the King County HRAs. In order to fit spatial smoothing model, we first need to compute the adjacency matrix for the HRAs, mat, and make sure both the column and row names
correspond to the HRA names.
library(sf) # Load sf for spatial analysis
library(prioritizr) # Allows us to create an adjacency matrix
data(KingCounty)
KingCounty <- st_as_sf(KingCounty)
mat <- adjacency_matrix(KingCounty)
colnames(mat) <- rownames (mat) <- KingCounty$HRA2010v2_
mat <- as .matrix(mat[1:dim(mat)[1], 1:dim(mat)[1]])
mat[1 :2 , 1:2]
## Auburn-North Auburn-South
## Auburn-North 0 1
## Auburn-South 1 0
Direct weighted estimates using the survey package
We load the survey package and then define the survey object for the BRFSS data. We have stratified, disproportionate sampling, so note the arguments:
● weights
● strata
We then calculate the direct (weighted) estimates. These are also known as the Horvitz-Thompson estimates. library(survey)
design <- svydesign(ids = ~ 1 , weights = ~rwt_llcp, strata = ~strata,
data = BRFSS)
direct <- svyby(~diab2, ~hracode, design, svymean)
head(direct, n = 7)
##
## Auburn -North
## Auburn -South
## Ballard
## Beacon/Gtown/S .Park
## Bear Creek/Carnation/Duvall Bear Creek/Carnation/Duvall 0 .05166773 0 .01190146
## Bellevue-Central
## Bellevue-NE
Bellevue-Central 0 .05914082 0 .01485885
Bellevue-NE 0 .05772789 0 .01509705
Binomial spatial smoothing model
We ignore the design and fit the model:
yi Ipi ~ Binomial(ni , pi )
θi = log ╱ 、 = α + bi
with bi following a BYM2 model, i.e., an iid normal random effect and an intrinsic CAR (ICAR) random effect.
The smoothSurvey function in the SUMMER package
The binomial smoothing model is fit with the smoothSurvey function in the SUMMER package by specifying NULL for the survey characteristics, i.e. strata, weights, and cluster variables. For this example we are using a
BYM2 spatial effect, so we include the polygon information and the adjacency matrix in the geo and Amat arguments - this is required for the ICAR component.
smoothed <- smoothSurvey(data = BRFSS, geo = KingCounty, Amat = mat,
responseType = "binary" , responseVar = "diab2" , strataVar = NULL ,
weightVar = NULL , regionVar = "hracode" , clusterVar = NULL ,
CI = 0.95)
The usual INLA summaries can be found in smoothed$fit:
smoothed$fit$summary .fixed
## mean sd 0 .025quant 0 .5quant 0 .975quant mode
## (Intercept) -2 .353603 0 .03303589 -2 .419006 -2 .353481 -2 .288934 -2 .353278
## kld
## (Intercept) 1 .431232e -09
smoothed$fit$summary .hyper
## mean sd 0 .025quant 0 .5quant ## Precision for region .struct 15 . 1709723 4 .949984 7 . 7539676 14 .4056668 ## Phi for region .struct 0 .8558337 0 . 137880 0 .4888591 0 .9009513 ## 0 .975quant mode
## Precision for region .struct 27 .0305684 12 .9899024
## Phi for region .struct 0 .9955581 0 .9923834
Now examine some of the other components:
names (smoothed)
## [1] "HT" "smooth" "smooth .overall" "fit"
## [5] "CI" "Amat" "responseType" "formula"
## [9] "msg"
names (smoothed$HT)
## [1] "region" "HT .est" "HT .var" "HT .logit .est"
## [5] "HT .logit .var" "HT .logit .prec" "n" "y"
names (smoothed$smooth)
"var"
"logit .var"
head(smoothed$HT, n = 4)
## region HT .est
## 1 Auburn-North 0 . 14028777
## 2 Auburn-South 0 .23204420
## 3 Ballard 0 .06666667
## 4 Beacon/Gtown/S .Park 0 .08571429
## HT .logit .prec n y
## 1 33 .52878 278 39
## 2 32 .25414 181 42
## 3 34 .53333 555 37
## 4 16 .45714 210 18
The smoothed estimates of pi and θi can be found in the smooth object returned by the function, and the direct estimates are stored in the HT object (without specifying survey weights, these are the simple binomial probabilities, i.e. naive direct estimates).
head(smoothed$smooth, n = 1)
## region mean var median lower upper logit .mean
## 1 Auburn -North 0 . 1352512 0 .0002478172 0 . 1344345 0 . 1067145 0 . 1684375 - 1 .862334
## logit .var logit .median logit .lower logit .upper
## 1 0 .01796179 - 1 .862617 -2 . 124403 - 1 .598306
head(smoothed$HT, n = 1)
## region HT .est HT .var HT .logit .est HT .logit .var HT .logit .prec
## 1 Auburn -North 0 . 1402878 0 .0004338385 - 1 .812902 0 .02982513 33 .52878
## n y
## 1 278 39
We map the posterior mean estimates for the binomial smoothing model.
data(KingCounty)
toplot <- smoothed$smooth
mapPlot (data = toplot, geo = KingCounty, variables = c ("mean"),
labels = c ( "Posterior Mean"), by .data = "region" , by .geo = "HRA2010v2_ ")
Posterior Mean |
|
We map the naive direct estimates, which are available in the smoothSurvey fit.
toplot$HTest <- smoothed$HT$HT .est
mapPlot (data = toplot, geo = KingCounty, variables = c ( "HTest"),
labels = c ( "Naive Direct Estimates"), by .data = "region" ,
by .geo = "HRA2010v2_ ")
Naive Direct Estimates |
|
Now map the lower and upper endpoints of 95% CI for direct estimates.
lo <- smoothed$HT$HT .est - 1.96 * sqrt(smoothed$HT$HT .var)
hi <- smoothed$HT$HT .est + 1.96 * sqrt(smoothed$HT$HT .var)
toplot$HTlower <- lo
toplot$HTupper <- hi
mapPlot (data = toplot, geo = KingCounty, variables = c ( "HTlower" , "HTupper"), labels = c ( "Naive Direct Lower" , "Naive Direct Upper"),
by.data = "region" , by .geo = "HRA2010v2_ ")
Naive Direct Lower |
|
Value
0.2
0.1
And also map the posterior coefficient of variation (CV) for the smoothed estimates, which we compute as 100 x sd(pi Iy)/匝[pi Iy], i.e. the posterior standard deviation relative to the posterior mean. The CV is often a more easily interpretable uncertainty measure than the posterior standard deviation.
toplot$cv <- sqrt(smoothed$smooth$var)/smoothed$smooth$mean
mapPlot (data = toplot, geo = KingCounty, variables = c ("cv"),
labels = c ( "Posterior CV, Smoothed Estimates"), by .data = "region" ,
by .geo = "HRA2010v2_ ")
Posterior CV, Smoothed Estimates |
|
2023-03-15