Survey weights are common in large-scale government-funded data collections. For example, NHIS and NHANES are two large scale surveys that track the health and well-being of Americans that have survey weights. These data collections use complex and multi-stage survey sampling to ensure that results are representative of the U.S. population. Although use of survey weights is sometimes contested in regression analyses, they are needed for simple means and proportions. The general guidance is that if analysts can control for the factors that were used to create the weights in their analyses, then using weights might not be necessary and will inflate standard errors (and therefore p-values). In some of my analyses, however, there are variables used to create weights that I do not have access to such as geographic variables or specific household demographics, and therefore, I use the weights.
Below is code that will get you from start to finish using survey weights to produce means and a regression. I put explanations of what is happening within each block of code.
Simply open RStudio (with the latest version of R installed), paste this into a new R script, run each section of code, and voila, you just used survey weights in R!
Try it out!
## Packages --------------------------------------------------------------------
# For reading SAS XPT file from NHANES website
# haven::read_xpt
library(haven)
# For using survey weights
# survey::svydesign, svymean, svyglm
library(survey)
# For data wrangling
#dplyr::select, mutate, select, recode
library(dplyr)
## Load the dataset ------------------------------------------------------------
# Import NHANES demographic data
nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2021/DataFiles/DEMO_L.xpt"))
## Data Wrangling --------------------------------------------------------------
# Copy and rename variables so they are more intuitive. "fpl" is percent of the
# of the federal poverty level. It ranges from 0 to 5.
nhanesDemo$fpl <- nhanesDemo$INDFMPIR
nhanesDemo$age <- nhanesDemo$RIDAGEYR
nhanesDemo$gender <- nhanesDemo$RIAGENDR
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
nhanesDemo$psu <- nhanesDemo$SDMVPSU
nhanesDemo$strata <- nhanesDemo$SDMVSTRA
# Since there are 47 variables, we will select only the variables we will use in
# this analysis.
nhanesAnalysis <- nhanesDemo %>%
select(fpl,
age,
gender,
persWeight,
psu,
strata)
# Recode gender
nhanesAnalysis <- nhanesAnalysis %>%
mutate(gender = recode(gender, `1` = 0L,
`2` = 1L)
)
# Convert "gender" to a factor variable. We need to do this so it isn't treated
# as a continuous variable in our analyses
nhanesAnalysis$gender <- as.factor(nhanesAnalysis$gender)
## Survey Weights --------------------------------------------------------------
# Here we use "svydesign" to assign the weights. We will use this new design
# variable "nhanesDesign" when running our analyses.
nhanesDesign <- svydesign(id = ~psu,
strata = ~strata,
weights = ~persWeight,
nest = TRUE,
data = nhanesAnalysis)
# Here we use "subset" to tell "nhanesDesign" that we want to only look at a
# specific subpopulation (i.e., those age between 18-79 years). This is
# important to do. If you don't do this and just restrict it in a different way
# your estimates won't have correct SEs.
ageDesign <- subset(nhanesDesign, age > 17 &
age < 80)
## Statistics ------------------------------------------------------------------
# We will use "svymean" to calculate the population mean for age. The na.rm
# argument "TRUE" excludes missing values from the calculation. We see that
# the mean age is 45.648 and the standard error is 0.5131.
svymean(~age, ageDesign, na.rm = TRUE)
# Since gender is a factor variable, "svymean" will treat it as such and give us
# the proportion of women. We see that men are 48.601% and woman are 51.399% of
# the population in this age of 18 to 79.
svymean(~gender, ageDesign, na.rm = TRUE)
# Now we will run a general linear model (glm) with a gaussian link function.
# We tell svyglm that nhanesAnalysis is the dataset to use and to apply the
# "svydesign" object "ageDesign." I won't dive into the results here, but you
# can see that age is positively correlated with FPL and that women are
# predicted to have a lower FPL than men.
output <- svyglm(fpl ~ age +
gender,
family = gaussian(),
data = nhanesAnalysis,
design = ageDesign)
# Get a summary of the regression output
summary(output)