Obtaining covariate-adjusted prevalence estimates from a svyglm() model

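I'm interested in calculating covariate-adjusted prevalences from survey data. My question concerns a specific proprietary dataset, but below is an example using the data(nhanes) data from the survey package. I apologize in advance for the length of the post...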

Step 1

I fit two regression models: (1) a crude model and (2) an adjusted model.

#Setup
if(!require (survey)) install.packages("survey")
library(survey)

#Read in data
data("nhanes")

#Create survey design
d.obj = svydesign(id = ~SDMVPSU,strata = ~SDMVSTRA,weights = ~WTMEC2YR,nest = TRUE,data = nhanes)

#Regression models
mod.1 = svyglm(formula = HI_CHOL ~ as.factor(agecat),design = d.obj,family = quasibinomial(link = "log"))

mod.2 = svyglm(formula = HI_CHOL ~ as.factor(agecat) + as.factor(race) + as.factor(RIAGENDR),design = d.obj,family = quasibinomial(link = "log"))

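Step 2

I exponentiate the coefficients to get prevalence ratios (PRs) and 95% CIs. The summary below shows the PRs from the crude model above the "NA" row and the PRs from the adjusted model below it.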
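For reference, this is why an exponentiated coefficient from a log-link model reads as a prevalence ratio. The notation is mine: a, r and g index the agecat, race and RIAGENDR levels, and each reference level has coefficient zero.

```latex
\log \Pr(Y = 1 \mid a, r, g) = \beta_0 + \beta_a + \beta_r + \beta_g
\qquad\Longrightarrow\qquad
\mathrm{PR}_a
  = \frac{\Pr(Y = 1 \mid a, r, g)}{\Pr(Y = 1 \mid a_{\mathrm{ref}}, r, g)}
  = \frac{\exp(\beta_0 + \beta_a + \beta_r + \beta_g)}{\exp(\beta_0 + \beta_r + \beta_g)}
  = \exp(\beta_a)
```

The ratio does not depend on r or g, so under this model the PR for an age category is the same in every race-by-sex cell, while the prevalence itself differs from cell to cell.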

#Regression Summary
mod.sum = rbind(cbind("Estimate" = exp(coef(mod.1)),exp(confint(mod.1))),c(NA,NA,NA),cbind("Estimate" = exp(coef(mod.2)),exp(confint(mod.2))))

> mod.sum
                              Estimate        2.5 %      97.5 %
(Intercept)                0.008660272  0.004735950  0.01583638
as.factor(agecat)(19,39]   9.109573963  4.818424917 17.22229550
as.factor(agecat)(39,59]  20.610647336 10.441625642 40.68320376
as.factor(agecat)(59,Inf] 17.932147454  9.227318351 34.84890193
                                    NA           NA          NA
(Intercept)                0.008597533  0.004623207  0.01598838
as.factor(agecat)(19,39]   9.051050139  4.816409068 17.00883531
as.factor(agecat)(39,59]  20.472879940 10.415264252 40.24274401
as.factor(agecat)(59,Inf] 17.641599353  9.050357148 34.38825923
as.factor(race)2           0.939905542  0.824583833  1.07135550
as.factor(race)3           0.693581683  0.536698646  0.89632339
as.factor(race)4           0.876065546  0.494248225  1.55284491
as.factor(RIAGENDR)2       1.210035838  1.048810926  1.39604451

Step 3

Estimating the prevalence of HI_CHOL by age category is fairly straightforward...

#Calculating prevalence of HI_CHOL by agecat
prv = svyby(formula = ~HI_CHOL,by = ~agecat,design = d.obj,FUN = svymean,vartype = "ci",na.rm = TRUE)

> prv
           agecat     HI_CHOL        ci_l       ci_u
(0,19]     (0,19] 0.008660267 0.003433241 0.01388729
(19,39]   (19,39] 0.078891392 0.061116023 0.09666676
(39,59]   (39,59] 0.178493821 0.156964219 0.20002342
(59,Inf] (59,Inf] 0.155297283 0.130664250 0.17993032

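Using the point estimates from prv, I can manually calculate the point estimates of mod.1 if I use agecat == (0,19] as the reference group. However, I am hoping to calculate covariate-adjusted prevalences for each age category (so that, if I again use agecat == (0,19] as the reference group, I could similarly calculate the point estimates of mod.2 by hand). I'm not sure how to go about this.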
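As a quick check of the hand calculation just described, the crude prevalence ratios can be recovered by dividing each prevalence by the reference-group prevalence. The numbers are copied from the prv output in this post; the names prev.crude and pr.crude are only illustrative.

```r
# Crude prevalences of HI_CHOL by agecat, copied from the prv output in this post
prev.crude <- c("(0,19]"   = 0.008660267,
                "(19,39]"  = 0.078891392,
                "(39,59]"  = 0.178493821,
                "(59,Inf]" = 0.155297283)

# Prevalence ratios relative to the (0,19] reference group
pr.crude <- prev.crude / prev.crude[["(0,19]"]]
print(round(pr.crude, 2))
```

The ratios come out at roughly 9.11, 20.61 and 17.93, which agrees with the exponentiated agecat coefficients of mod.1.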

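In Stata, I can run the margins command to produce such estimates (or so I believe; when I run the command on my proprietary data, the lower bounds of the confidence intervals are negative, and I am not sure how to interpret them).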

. svyset SDMVPSU [pweight = WTMEC2YR],strata(SDMVSTRA) singleunit(centered)

      pweight: WTMEC2YR
          VCE: linearized
  Single unit: centered
     Strata 1: SDMVSTRA
         SU 1: SDMVPSU
        FPC 1: <zero>

. svy: glm HI_CHOL i.agecat i.race i.RIAGENDR,family(binomial) link(log) eform allbase
(running glm on estimation sample)

Survey: Generalized linear models

Number of strata   =        15                Number of obs     =        7,846
Number of PSUs     =        31                Population size   =  255,345,910
                                              Design df         =           16

------------------------------------------------------------------------------
             |             Linearized
     HI_CHOL |     exp(b)   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      agecat |
     (0,19]  |          1  (base)
    (19,39]  |   9.051065   2.917803     6.83   0.000     4.569875    17.92648
    (39,59]  |   20.47291   7.054992     8.76   0.000      9.86093    42.50513
   (59,Inf]  |   17.64163   5.994874     8.45   0.000     8.583895    36.25709
             |
        race |
          1  |          1  (base)
          2  |   .9399055   .0616344    -0.95   0.359     .8179215    1.080082
          3  |   .6935817   .0908676    -2.79   0.013     .5253873    .9156208
          4  |   .8760655   .2516647    -0.46   0.651     .4764973    1.610693
             |
    RIAGENDR |
          1  |          1  (base)
          2  |   1.210036   .0872042     2.65   0.018       1.0386    1.409769
             |
       _cons |   .0085975   .0027195   -15.04   0.000     .0043969    .0168111
------------------------------------------------------------------------------

. margins agecat,expression(exp(predict(xb))) vce(unconditional)

Predictive margins

Number of strata   =        15                Number of obs     =        7,910
                                              Design df         =           16

Expression   : exp(predict(xb))

------------------------------------------------------------------------------
             |             Linearized
             |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      agecat |
     (0,19]  |   .0087239    .002677     3.26   0.005     .0030489    .0143989
    (19,39]  |   .0789607    .009204     8.58   0.000      .059449    .0984724
    (39,59]  |   .1786039   .0107848    16.56   0.000     .1557412    .2014666
   (59,Inf]  |    .153904   .0129656    11.87   0.000     .1264183    .1813898
------------------------------------------------------------------------------

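Additional context

In my nhanes example above, the point estimates do not differ meaningfully between the crude and adjusted models. For my proprietary data, however, the analogous "crude" model is more of a "minimally adjusted" model of the form outcome ~ exposure + covar, where covar is a structural covariate that is used in setting up the survey design. If I calculate outcome prevalence by exposure group the traditional way, without accounting for covar, the resulting estimates are meaningless. Given this, the only way I know to calculate prevalence is from a regression model (as I did in Stata), but I don't know how to do this in R and would very much like to learn.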
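One possible route in R, offered as a sketch rather than a confirmed equivalent of the Stata call: the survey package has svypredmeans(), which computes predictive margins (marginal standardization). It takes a model fitted without the grouping factor plus a one-sided formula naming that factor. The object names adj.mod and pm are illustrative, and I am assuming the log-link model converges inside svypredmeans() the same way mod.2 does.

```r
library(survey)

data(nhanes)
nhanes$race     <- factor(nhanes$race)
nhanes$RIAGENDR <- factor(nhanes$RIAGENDR)

# Same design as in Step 1
d.obj <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
                   nest = TRUE, data = nhanes)

# Adjustment model: the covariates only, WITHOUT the exposure (agecat)
adj.mod <- svyglm(HI_CHOL ~ race + RIAGENDR, design = d.obj,
                  family = quasibinomial(link = "log"))

# Predictive margins for agecat: svypredmeans() adds agecat to the model,
# sets everyone to each age category in turn, and averages the predictions
pm <- svypredmeans(adj.mod, ~agecat)

print(pm)           # adjusted prevalence and SE for each age category
print(confint(pm))  # Wald-type intervals on the prevalence scale
```

The intervals from confint() are symmetric around the estimate on the prevalence scale, so a lower bound below zero is possible when a prevalence is small and its standard error is large.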

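Update 3/21/2021

I believe I've solved the problem code-wise, but interpreting the results has left me scratching my head. In short, I can group by exposure and sum the exponentiated model fitted values across each permutation of the covariates. I was able to do this with nhanes as follows: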

#Setup
library(survey)
library(magrittr)
library(dplyr)
library(tidyverse)
library(broom)

#Read in data
data("nhanes")

#Factoring covariates
nhanes$race %<>% factor()
nhanes$RIAGENDR %<>% factor()

#Create survey design
d.obj = svydesign(id = ~SDMVPSU,strata = ~SDMVSTRA,weights = ~WTMEC2YR,nest = TRUE,data = nhanes)

#Regression models
mod.1 = svyglm(formula = HI_CHOL ~ agecat,design = d.obj,family = quasibinomial(link = "log"))

mod.2 = svyglm(formula = HI_CHOL ~ agecat + race + RIAGENDR,design = d.obj,family = quasibinomial(link = "log"))

#Regression Summary
mod.sum =
  list(Model.1 = data.frame(tidy(mod.1,exponentiate = TRUE,conf.int = TRUE)),Model.2 = data.frame(tidy(mod.2,exponentiate = TRUE,conf.int = TRUE)))

#Appending regression fitted values to data frame
df.1 = augment(mod.1)
df.2 = augment(mod.2)

#Estimating probabilities for each permutation of regression formula
x.prob.1 = aggregate(cbind(est= exp(.fitted)) ~ agecat,data = df.1,FUN = mean)
x.prob.2 = aggregate(cbind(est= exp(.fitted)) ~ race + RIAGENDR + agecat,data = df.2,FUN = mean)

#Estimating probability of outcome given exposure, independent of covars
tab.x.1 = aggregate(est~ agecat,data = x.prob.1,FUN = sum)
tab.x.1 %<>% mutate(rr = est/est[1])

tab.x.2 = aggregate(est~ agecat,data = x.prob.2,FUN = sum)
tab.x.2 %<>% mutate(rr = est/est[1])

#Summarizing tab.x
tab.x.sum = list(Model.1 = tab.x.1,Model.2 = tab.x.2)

#Estimating prevalence through traditional method
prev = svyby(formula = ~HI_CHOL,by = ~agecat,design = d.obj,FUN = svymean,na.rm = TRUE)

The resulting summary objects are as follows...

> mod.sum
$Model.1
            term     estimate std.error  statistic      p.value    conf.low   conf.high
1    (Intercept)  0.008660272 0.3079463 -15.421547 9.802627e-10  0.00473595  0.01583638
2  agecat(19,39]  9.109573963 0.3249442   6.799094 1.264216e-05  4.81842492 17.22229550
3  agecat(39,59] 20.610647336 0.3469490   8.721189 8.585730e-07 10.44162564 40.68320376
4 agecat(59,Inf] 17.932147454 0.3389994   8.515045 1.122087e-06  9.22731835 34.84890193

$Model.2
            term     estimate  std.error   statistic      p.value     conf.low   conf.high
1    (Intercept)  0.008597533 0.31652962 -15.0263347 1.111045e-07  0.004623207  0.01598838
2  agecat(19,39]  9.051050139 0.32186925   6.8440237 7.523606e-05  4.816409068 17.00883531
3  agecat(39,59] 20.472879940 0.34481686   8.7556654 1.068803e-05 10.415264252 40.24274401
4 agecat(59,Inf] 17.641599353 0.34054477   8.4284357 1.455588e-05  9.050357148 34.38825923
5          race2  0.939905542 0.06678723  -0.9279602 3.776433e-01  0.824583833  1.07135550
6          race3  0.693581683 0.13083519  -2.7965432 2.082994e-02  0.536698646  0.89632339
7          race4  0.876065546 0.29204774  -0.4530573 6.612323e-01  0.494248225  1.55284491
8      RIAGENDR2  1.210035838 0.07295691   2.6131862 2.812343e-02  1.048810926  1.39604451

> tab.x.sum
$Model.1
    agecat         est        rr
1   (0,19] 0.008660272  1.000000
2  (19,39] 0.078891392  9.109574
3  (39,59] 0.178493821 20.610647
4 (59,Inf] 0.155297283 17.932147

$Model.2
    agecat       est       rr
1   (0,19] 0.0666845  1.00000
2  (19,39] 0.6035648  9.05105
3  (39,59] 1.3652238 20.47288
4 (59,Inf] 1.1764213 17.64160

> prev
           agecat     HI_CHOL          se
(0,19]     (0,19] 0.008660267 0.002666899
(19,39]   (19,39] 0.078891392 0.009069233
(39,59]   (39,59] 0.178493821 0.010984693
(59,Inf] (59,Inf] 0.155297283 0.012568105

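There are some (if not many) problems with this approach. First, the log-binomial point estimates from R do not agree with those from Stata, though this may be attributable to underlying procedural differences. Second, in R, although the prevalence estimates and their corresponding prevalence ratios add up, the estimates from the adjusted model make no sense. For example, tab.x.sum$Model.2 contains prevalence estimates above 1.00! This leads me to believe that, although I've solved my problem in terms of math and code, the estimand I've produced is quite different from what I had in mind.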
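One way to see what tab.x.sum$Model.2 is measuring: mod.2 contains only categorical terms, so exp(.fitted) is constant within each race-by-RIAGENDR-by-agecat cell. Summing the cell means over the 4 x 2 = 8 covariate cells therefore adds up eight separate prevalences instead of averaging them. The sketch below rebuilds the (0,19] value from the exponentiated coefficients printed in mod.sum; the variable names are illustrative.

```r
# Exponentiated coefficients copied from mod.sum$Model.2 in this post
base <- 0.008597533                                   # intercept: agecat (0,19], race 1, RIAGENDR 1
race <- c(1, 0.939905542, 0.693581683, 0.876065546)   # race 1 (reference) through 4
sex  <- c(1, 1.210035838)                             # RIAGENDR 1 (reference) and 2

# Model-predicted prevalence in each of the 4 x 2 covariate cells for agecat (0,19]
cells <- base * outer(race, sex)

print(sum(cells))   # what aggregate(..., FUN = sum) reports for (0,19]
print(mean(cells))  # unweighted average over the same 8 cells
```

The sum reproduces the 0.0666845 shown for (0,19], whereas the unweighted average is about 0.0083 and stays on the prevalence scale. The prevalence ratios survive the summation because every age category is multiplied by the same factor, which is why rr still matches the model. A predictive margin would instead weight each cell by its share of the population.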

Any insight into what I've actually calculated would be helpful...!
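For comparison, here is a minimal sketch of marginal standardization done by hand, which I understand to be the idea behind a predictive margin: set every respondent's agecat to one level, predict on the response scale, and take the survey-weighted mean. It gives point estimates only, with no design-based standard errors. The name adj.prev is illustrative, and I have not checked that the numbers match the Stata margins output exactly.

```r
library(survey)

data(nhanes)
nhanes$race     <- factor(nhanes$race)
nhanes$RIAGENDR <- factor(nhanes$RIAGENDR)

d.obj <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
                   nest = TRUE, data = nhanes)

mod.2 <- svyglm(HI_CHOL ~ agecat + race + RIAGENDR, design = d.obj,
                family = quasibinomial(link = "log"))

# For each age category: assign that category to everyone, predict the
# prevalence, then average the predictions using the sampling weights
adj.prev <- sapply(levels(nhanes$agecat), function(lev) {
  nd <- nhanes
  nd$agecat <- factor(lev, levels = levels(nhanes$agecat))
  p <- as.numeric(predict(mod.2, newdata = nd, type = "response", se.fit = FALSE))
  weighted.mean(p, nhanes$WTMEC2YR)
})

print(adj.prev)                 # covariate-adjusted prevalence by agecat
print(adj.prev / adj.prev[1])   # ratios relative to (0,19]
```

Because the average is weighted rather than summed over covariate cells, each estimate stays between 0 and 1 here, and the ratios still equal the exponentiated agecat coefficients of mod.2.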
