如何解决从 svyglm() 模型获得协变量调整的患病率估计值
所以我对使用调查数据计算协变量调整的流行率很感兴趣。我的问题与特定的专有数据集有关,但以下是使用 data(nhanes)
数据的示例。对于帖子的长度,我提前道歉......
第 1 步
我拟合了两个回归模型:(1) 一个原始模型和 (2) 一个调整后的模型
#Setup
if(!require (survey)) install.packages("survey")
library(survey)
#Read in data
data("nhanes")
#Create survey design
d.obj = svydesign(id = ~SDMVPSU,strata = ~SDMVSTRA,weights = ~WTMEC2YR,nest = TRUE,data = nhanes)
#Regression models
mod.1 = svyglm(formula = HI_CHOL ~ as.factor(agecat),design = d.obj,family = quasibinomial(link = "log"))
mod.2 = svyglm(formula = HI_CHOL ~ as.factor(agecat) + as.factor(race) + as.factor(RIAGENDR),family = quasibinomial(link = "log"))
第 2 步
我对系数求幂以获得流行率和 95% CI。下面的摘要显示了“NA”线上方的原始模型的 PR 和该线下方的调整模型的 PR。
#Regression Summary
mod.sum = rbind(cbind("Estimate" = exp(coef(mod.1)),exp(confint(mod.1))),c(NA,NA,NA),cbind("Estimate" = exp(coef(mod.2)),exp(confint(mod.2))))
> mod.sum
Estimate 2.5 % 97.5 %
(Intercept) 0.008660272 0.004735950 0.01583638
as.factor(agecat)(19,39] 9.109573963 4.818424917 17.22229550
as.factor(agecat)(39,59] 20.610647336 10.441625642 40.68320376
as.factor(agecat)(59,Inf] 17.932147454 9.227318351 34.84890193
NA NA NA
(Intercept) 0.008597533 0.004623207 0.01598838
as.factor(agecat)(19,39] 9.051050139 4.816409068 17.00883531
as.factor(agecat)(39,59] 20.472879940 10.415264252 40.24274401
as.factor(agecat)(59,Inf] 17.641599353 9.050357148 34.38825923
as.factor(race)2 0.939905542 0.824583833 1.07135550
as.factor(race)3 0.693581683 0.536698646 0.89632339
as.factor(race)4 0.876065546 0.494248225 1.55284491
as.factor(RIAGENDR)2 1.210035838 1.048810926 1.39604451
第 3 步
按年龄类别对 HI_CHOL
的流行率估计相当简单...
#Calculting prevalence of HI_CHOL by agecat
prv = svyby(formula = ~HI_CHOL,by = ~agecat,FUN = svymean,vartype = "ci",na.rm = TRUE)
> prv
agecat HI_CHOL ci_l ci_u
(0,19] (0,19] 0.008660267 0.003433241 0.01388729
(19,39] (19,39] 0.078891392 0.061116023 0.09666676
(39,59] (39,59] 0.178493821 0.156964219 0.20002342
(59,Inf] (59,Inf] 0.155297283 0.130664250 0.17993032
使用 prv
的点估计值,如果我使用 mod.1
作为参考组,我可以手动计算 agecat == (0,19]
的点估计值。但是,我希望计算每个年龄类别的协变量调整流行率(这样,如果我再次使用 mod.2
作为参考组,我可以类似地手动计算 agecat == (0,19]
的点估计值)。我不知道该怎么做。
在 Stata 中,我可以运行 margins
命令来产生这样的估计(或者我相信,但是当我在我的专有数据上运行该命令时,置信区间的下限是负的,我不是确定如何解释它们)。
. svyset SDMVPSU [pweight = WTMEC2YR],strata(SDMVSTRA) singleunit(centered)
pweight: WTMEC2YR
VCE: linearized
Single unit: centered
Strata 1: SDMVSTRA
SU 1: SDMVPSU
FPC 1: <zero>
. svy: glm HI_CHOL i.agecat i.race i.RIAGENDR,family(binomial) link(log) eform allbase
(running glm on estimation sample)
Survey: Generalized linear models
Number of strata = 15 Number of obs = 7,846
Number of PSUs = 31 Population size = 255,345,910
Design df = 16
------------------------------------------------------------------------------
| Linearized
HI_CHOL | exp(b) Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
agecat |
(0,19] | 1 (base)
(19,39] | 9.051065 2.917803 6.83 0.000 4.569875 17.92648
(39,59] | 20.47291 7.054992 8.76 0.000 9.86093 42.50513
(59,Inf] | 17.64163 5.994874 8.45 0.000 8.583895 36.25709
|
race |
1 | 1 (base)
2 | .9399055 .0616344 -0.95 0.359 .8179215 1.080082
3 | .6935817 .0908676 -2.79 0.013 .5253873 .9156208
4 | .8760655 .2516647 -0.46 0.651 .4764973 1.610693
|
RIAGENDR |
1 | 1 (base)
2 | 1.210036 .0872042 2.65 0.018 1.0386 1.409769
|
_cons | .0085975 .0027195 -15.04 0.000 .0043969 .0168111
------------------------------------------------------------------------------
. margins agecat,expression(exp(predict(xb))) vce(unconditional)
Predictive margins
Number of strata = 15 Number of obs = 7,910
Design df = 16
Expression : exp(predict(xb))
------------------------------------------------------------------------------
| Linearized
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
agecat |
(0,19] | .0087239 .002677 3.26 0.005 .0030489 .0143989
(19,39] | .0789607 .009204 8.58 0.000 .059449 .0984724
(39,59] | .1786039 .0107848 16.56 0.000 .1557412 .2014666
(59,Inf] | .153904 .0129656 11.87 0.000 .1264183 .1813898
------------------------------------------------------------------------------
附加上下文
在我上面的 nhanes
示例中,原始模型和调整模型之间的点估计值没有显着差异。然而,对于我的专有数据,类似的“原始”模型更像是一个“最小调整模型”,它采用 outcome ~ exposure + covar
的形式,其中 covar
是一个结构协变量,在设置调查设计。如果我按传统方式按暴露组计算结果患病率,而不考虑 covar
,那么得出的估计值将没有意义。鉴于此,我知道如何计算患病率的唯一方法是基于回归模型(就像我在 Stata 中所做的那样),但我不知道如何在 R 中执行此操作,并且非常想学习。
更新 3/21/2021
所以我相信我已经通过代码解决了这个问题,但是解释结果让我摸不着头脑。简而言之,我可以按曝光分组,并对协变量的每个排列的指数模型拟合值求和。我能够使用 nhanes
执行此操作,如下所示:
#Setup
library(survey)
library(magrittr)
library(dplyr)
library(tidyverse)
library(broom)
#Read in data
data("nhanes")
#Factoring covariates
nhanes$race %<>% factor()
nhanes$RIAGENDR %<>% factor()
#Create survey design
d.obj = svydesign(id = ~SDMVPSU,data = nhanes)
#Regression models
mod.1 = svyglm(formula = HI_CHOL ~ agecat,family = quasibinomial(link = "log"))
mod.2 = svyglm(formula = HI_CHOL ~ agecat + race + RIAGENDR,family = quasibinomial(link = "log"))
#Regression Summary
mod.sum =
list(Model.1 = data.frame(tidy(mod.1,exponentiate = TRUE,conf.int = TRUE)),Model.2 = data.frame(tidy(mod.2,conf.int = TRUE)))
#Appending regression fitted values to data frame
df.1 = augment(mod.1)
df.2 = augment(mod.2)
#Estimating probabilities for each permutation of regression formula
x.prob.1 = aggregate(cbind(est= exp(.fitted)) ~ agecat,data = df.1,FUN = mean)
x.prob.2 = aggregate(cbind(est= exp(.fitted)) ~ race + RIAGENDR + agecat,data = df.2,FUN = mean)
#Estimating probability of outcome given exposure,independant of covars
tab.x.1 = aggregate(est~ agecat,data = x.prob.1,FUN = sum)
tab.x.1 %<>% mutate(rr = est/est[1])
tab.x.2 = aggregate(est~ agecat,data = x.prob.2,FUN = sum)
tab.x.2 %<>% mutate(rr = est/est[1])
#Summarizing tab.x
tab.x.sum = list(Model.1 = tab.x.1,Model.2 = tab.x.2)
#Estimating prevalence through traditional method
prev = svyby(formula = ~HI_CHOL,na.rm = TRUE)
汇总对象的结果如下...
> mod.sum
$Model.1
term estimate std.error statistic p.value conf.low conf.high
1 (Intercept) 0.008660272 0.3079463 -15.421547 9.802627e-10 0.00473595 0.01583638
2 agecat(19,39] 9.109573963 0.3249442 6.799094 1.264216e-05 4.81842492 17.22229550
3 agecat(39,59] 20.610647336 0.3469490 8.721189 8.585730e-07 10.44162564 40.68320376
4 agecat(59,Inf] 17.932147454 0.3389994 8.515045 1.122087e-06 9.22731835 34.84890193
$Model.2
term estimate std.error statistic p.value conf.low conf.high
1 (Intercept) 0.008597533 0.31652962 -15.0263347 1.111045e-07 0.004623207 0.01598838
2 agecat(19,39] 9.051050139 0.32186925 6.8440237 7.523606e-05 4.816409068 17.00883531
3 agecat(39,59] 20.472879940 0.34481686 8.7556654 1.068803e-05 10.415264252 40.24274401
4 agecat(59,Inf] 17.641599353 0.34054477 8.4284357 1.455588e-05 9.050357148 34.38825923
5 race2 0.939905542 0.06678723 -0.9279602 3.776433e-01 0.824583833 1.07135550
6 race3 0.693581683 0.13083519 -2.7965432 2.082994e-02 0.536698646 0.89632339
7 race4 0.876065546 0.29204774 -0.4530573 6.612323e-01 0.494248225 1.55284491
8 RIAGENDR2 1.210035838 0.07295691 2.6131862 2.812343e-02 1.048810926 1.39604451
> tab.x.sum
$Model.1
agecat est rr
1 (0,19] 0.008660272 1.000000
2 (19,39] 0.078891392 9.109574
3 (39,59] 0.178493821 20.610647
4 (59,Inf] 0.155297283 17.932147
$Model.2
agecat est rr
1 (0,19] 0.0666845 1.00000
2 (19,39] 0.6035648 9.05105
3 (39,59] 1.3652238 20.47288
4 (59,Inf] 1.1764213 17.64160
> prev
agecat HI_CHOL se
(0,19] 0.008660267 0.002666899
(19,39] 0.078891392 0.009069233
(39,59] 0.178493821 0.010984693
(59,Inf] 0.155297283 0.012568105
这种方法存在一些(如果不是很多)问题。首先,来自 R 的对数二项式点估计与来自 STATA 的不一致,尽管这可能归因于潜在的程序差异。其次,对于 R,虽然流行率估计值及其相应的流行率比率相加,但调整后模型的估计值没有意义。例如,tab.x.sum$Model.2
的流行率估计值高于 1.00!这让我相信,虽然我已经在数学/代码方面解决了我的问题,但我产生的估计量与我想象的完全不同。
对我实际计算的任何见解都会有所帮助...!
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。