Chapter 5 Applying Covariate Adjustment: ANCOVA & G-Computation

In this next section, we will use R to produce the ANCOVA estimator, get an estimate of its standard error, and compare its precision to an unadjusted analysis using R. For this example we will use the Buprenorphine tapering trial data (CTN-03).

data_url <-
  "https://github.com/jbetz-jhu/CovariateAdjustmentTutorial/raw/main/SIMULATED_CTN03_220506.Rdata"

load(file = url(data_url))

This .Rdata file contains two datasets: ctn03_sim, which has no missing data, and ctn03_sim_mar, where a simulated missing data mechanism has been applied.

The stats::lm function can be used to fit a linear regression model, such as the ANCOVA. While this estimator is conditional on the baseline value of the outcome, because the model is linear with an identity link, the conditional treatment effect is the same as the marginal treatment effect. When binary outcomes or nonlinear link functions are used in a regression model, the marginal and conditional associations are almost always different.

In order to generalize the ANCOVA to other settings, a different approach must be taken to compute the marginal treatment effect. We will use the steps outlined earlier in the discussion of the ANCOVA estimator.


5.1 Computing the Estimate

To compute the ANCOVA estimator, first regress the final outcome on the treatment assignment indicator and the baseline value of the outcome:

# 1. Regress final outcome (VAS at end of taper) on treatment assignment and
# outcome assessed at baseline (VAS at baseline)
vas_ancova_1 <-
  lm(
    formula = vas_crave_opiates_eot ~ arm + vas_crave_opiates_bl,
    data = ctn03_sim_mar
  )

summary(vas_ancova_1)
## 
## Call:
## lm(formula = vas_crave_opiates_eot ~ arm + vas_crave_opiates_bl, 
##     data = ctn03_sim_mar)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.193 -21.747  -6.874  15.151  71.637 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          24.05363    2.27269  10.584  < 2e-16 ***
## arm7-day             -2.17932    2.82195  -0.772     0.44    
## vas_crave_opiates_bl  0.35910    0.07595   4.728 3.29e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.96 on 350 degrees of freedom
##   (163 observations deleted due to missingness)
## Multiple R-squared:  0.06006,    Adjusted R-squared:  0.05469 
## F-statistic: 11.18 on 2 and 350 DF,  p-value: 1.962e-05

Note that the standard error reported by lm() is the model-based standard error, not a robust standard error or bootstrap standard error. Next, we generate the predicted outcome for each randomized individual under each treatment assignment:

# 2. Generate predictions based on fitted model:
# Their predicted outcome if they were assigned to treatment (7-day taper)
# Their predicted outcome if they were assigned to control (28-day taper)
expected_vas_7_day <-
  predict(
    object = vas_ancova_1,
    newdata =
      # Set all treatment to "7-day"
      within(
        data = ctn03_sim_mar,
        expr = {arm = "7-day"}
      ),
    type = "response"
  )

expected_vas_28_day <-
  predict(
    object = vas_ancova_1,
    newdata =
      # Set all treatment to "28-day"
      within(
        data = ctn03_sim_mar,
        expr = {arm = "28-day"}
      ),
    type = "response"
  )


data.frame(
  expected_vas_7_day,
  expected_vas_28_day,
  ctn03_sim_mar[, c("arm", "vas_crave_opiates_eot", "vas_crave_opiates_bl")]
) %>% 
  head()
##   expected_vas_7_day expected_vas_28_day   arm vas_crave_opiates_eot
## 1           27.97901            30.15833 7-day                    NA
## 2           24.38801            26.56733 7-day                     5
## 3           32.64731            34.82663 7-day                    31
## 4           24.38801            26.56733 7-day                    40
## 5           28.69721            30.87653 7-day                    51
## 6           26.18351            28.36283 7-day                     1
##   vas_crave_opiates_bl
## 1                   17
## 2                    7
## 3                   30
## 4                    7
## 5                   19
## 6                   12

Once we’ve generated these predictions, we can compute the ANCOVA estimate of the average treatment effect:

mean(expected_vas_7_day)
## [1] 26.88292
mean(expected_vas_28_day)
## [1] 29.06224
vas_ancova_estimate <- mean(expected_vas_7_day) - mean(expected_vas_28_day)
vas_ancova_estimate
## [1] -2.179321

Notice this is the exact same as the regression coefficient for arm in the regression model, since there’s no treatment-covariate interactions and an identity link function.


5.2 Standard Error & Inference

We can obtain the standard error of the estimate two ways. The first way is using the margins::margins() command, using the robust standard errors from sandwich::vcovHC:

vas_ancova_margins <-
  margins::margins(
    model = vas_ancova_1,
    # Specify treatment variable
    variables = "arm",
    # Convert to outcome scale, not link scale
    type = "response",
    # Obtain robust standard errors
    vcov = sandwich::vcovHC(x = vas_ancova_1, type = "HC3")
  )

summary(object = vas_ancova_margins, level = 0.95)
##    factor     AME     SE       z      p   lower  upper
##  arm7-day -2.1793 2.8680 -0.7599 0.4473 -7.8005 3.4418

You’ll see that we now have a standard error, p-value under the hypothesis that the marginal effect is 0, and a 95% Confidence Interval for the estimate. Another way to obtain these is the bias corrected and accelerated (BCa) non-parametric bootstrap:

# Write a function to produce the ANCOVA estimate
margins_fun <-
  function(
    data,
    indices = NULL,
    formula,
    family,
    term,
    contrast = c("difference", "ratio")[1]
  ){
    # Input data must be a data.frame
    if(!all(class(data) == "data.frame")){
      stop("`data` must be a data.frame: use `as.data.frame()` for a tibble.")
    }
    
    # If bootstrap indices not supplied, use entire dataset
    if(is.null(indices)) indices <- 1:nrow(data)
    
    data <- data[indices,]
    
    glm_fit <-
      glm(
        formula = formula,
        family = family,
        data = data
      )
    
    tx_levels <- levels(data[, term])
    
    e_y_1 <-
      predict(
        object = glm_fit,
        newdata = 
          within(
            data,
            expr = assign(x = term, value = tx_levels[2])
          ),
        
        type = "response"
      )
    
    e_y_0 <-
      predict(
        object = glm_fit,
        newdata = 
          within(
            data,
            expr = assign(x = term, value = tx_levels[1])
          ),
        
        type = "response"
      )
    
    if(contrast == "difference"){
      return(mean(e_y_1) - mean(e_y_0))
    } else if (contrast == "ratio"){
      return(mean(e_y_1)/mean(e_y_0))
    }
  }

vas_ancova_boot <-
  boot::boot(
    data = ctn03_sim_mar,
    statistic = margins_fun,
    R = 10000,
      formula = vas_crave_opiates_eot ~ arm + vas_crave_opiates_bl,
    family = gaussian(link = "identity"),
    term = "arm"
  )
# Bootstrap results
vas_ancova_boot
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot::boot(data = ctn03_sim_mar, statistic = margins_fun, R = 10000, 
##     formula = vas_crave_opiates_eot ~ arm + vas_crave_opiates_bl, 
##     family = gaussian(link = "identity"), term = "arm")
## 
## 
## Bootstrap Statistics :
##      original    bias    std. error
## t1* -2.179321 0.0631477    2.816499
# Bootstrap Standard Error
sd(vas_ancova_boot$t[,1])
## [1] 2.816499

We can extract the standard error and produce a 95% confidence interval:

vas_ancova_boot_ci <-
  boot::boot.ci(
    boot.out = vas_ancova_boot,
    conf = 0.95,
    type = "bca"
  )
vas_ancova_boot_ci

We can compare these to the results from margins::margins():

# Margins SE
ancova_margins_se <- summary(object = vas_ancova_margins, level = 0.95)$SE
ancova_margins_se
## [1] 2.867989
# Bootstrap SE
ancova_boot_se <- sd(vas_ancova_boot$t[,1])
ancova_boot_se
## [1] 2.816499
# Compare as ratio
ancova_margins_se/ancova_boot_se
## [1] 1.018282

The standard errors are nearly identical using the estimated marginal effect with robust standard errors or the bootstrap.


5.3 Calculating Precision Gain

Asymptotically, the ANCOVA estimate should have precision that is equal or better than an unadjusted analysis. We can calculate the precision gain by taking the ratio of the variances, i.e. the squared standard errors, of the adjusted estimator to the unadjusted estimator.

vas_t_test <-
  t.test(
    formula = vas_crave_opiates_eot ~ arm,
    data = ctn03_sim_mar,
    var.equal = FALSE
  )

vas_t_test
## 
##  Welch Two Sample t-test
## 
## data:  vas_crave_opiates_eot by arm
## t = 0.093598, df = 316.77, p-value = 0.9255
## alternative hypothesis: true difference in means between group 28-day and group 7-day is not equal to 0
## 95 percent confidence interval:
##  -5.418971  5.960313
## sample estimates:
## mean in group 28-day  mean in group 7-day 
##             28.01325             27.74257
# Get unadjusted standard error
t_test_se <- vas_t_test$stderr
t_test_se
## [1] 2.891841

The precision gain is equal to the ratio of the variance, which is the square of the standard error:

# Percentage reduction in variance adjusting for baseline outcome
100*(1 - (ancova_margins_se/t_test_se)^2)
## [1] 1.642819

The precision gain in this particular example is rather small, but we have only adjusted for one potential covariate. However, there are several other baseline variables we did not include in the model. We can see if the gain in precision is larger when these other variables are included. We should also include the stability dose in the outcome, since randomization was stratified by this variable:

vas_ancova_2 <-
  glm(
    formula = vas_crave_opiates_eot ~ 
      arm + vas_crave_opiates_bl + stability_dose +
      arsw_score_bl + cows_total_score_bl +
      vas_current_withdrawal_bl + vas_study_tx_help_bl + uds_any_positive_bl,
    data = ctn03_sim_mar
  )

vas_ancova_margins_2 <-
  margins::margins(
    model = vas_ancova_2,
    # Specify treatment variable
    variables = "arm",
    # Convert to outcome scale, not link scale
    type = "response",
    # Obtain robust standard errors
    vcov = sandwich::vcovHC(x = vas_ancova_2, type = "HC3")
  )

ancova_margins_2_se <- summary(object = vas_ancova_margins_2, level = 0.95)$SE

# Percentage reduction in variance adjusting for baseline outcome
100*(1 - (ancova_margins_2_se/t_test_se)^2)
## [1] 7.590248

Including these other covariates leads to a larger gain in precision.


5.4 Covariate Adjustment with Binary Outcomes

When the outcome is binary, the only difference in the approach is fitting a logistic regression with a nonlinear link instead of multiple regression with an identity link:

uds_opioids_glm <-
  glm(
    formula = 
      uds_opioids_eot == "Negative" ~ 
      arm + uds_opioids_bl + stability_dose +
      arsw_score_bl + cows_total_score_bl +
      vas_current_withdrawal_bl + vas_study_tx_help_bl,
    data = ctn03_sim_mar,
    family = binomial(link = "logit")
  )

summary(uds_opioids_glm)
## 
## Call:
## glm(formula = uds_opioids_eot == "Negative" ~ arm + uds_opioids_bl + 
##     stability_dose + arsw_score_bl + cows_total_score_bl + vas_current_withdrawal_bl + 
##     vas_study_tx_help_bl, family = binomial(link = "logit"), 
##     data = ctn03_sim_mar)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                0.019885   0.759146   0.026  0.97910    
## arm7-day                   0.257685   0.265805   0.969  0.33232    
## uds_opioids_blPositive    -2.607045   0.297263  -8.770  < 2e-16 ***
## stability_dose16 mg        1.168914   0.459625   2.543  0.01098 *  
## stability_dose24 mg        1.170684   0.428484   2.732  0.00629 ** 
## arsw_score_bl             -0.002279   0.012769  -0.178  0.85837    
## cows_total_score_bl        0.187241   0.110601   1.693  0.09047 .  
## vas_current_withdrawal_bl -0.015190   0.013636  -1.114  0.26529    
## vas_study_tx_help_bl       0.001393   0.006972   0.200  0.84165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 467.70  on 352  degrees of freedom
## Residual deviance: 362.37  on 344  degrees of freedom
##   (163 observations deleted due to missingness)
## AIC: 380.37
## 
## Number of Fisher Scoring iterations: 4

Logistic regression gives coefficients on the log odds scale, which can be exponentiated to obtain a conditional odds ratio, a relative measure. The margins::margins function gives the marginal estimate on the risk difference scale, which is an absolute measure:

uds_opioids_margins <-
  margins::margins(
    model = uds_opioids_glm,
    # Specify treatment variable
    variables = "arm",
    # Convert to outcome scale, not link scale
    type = "response",
    # Obtain robust standard errors
    vcov = sandwich::vcovHC(x = uds_opioids_glm, type = "HC3")
  )

summary(uds_opioids_margins)
##    factor    AME     SE      z      p   lower  upper
##  arm7-day 0.0433 0.0469 0.9238 0.3556 -0.0486 0.1353
confint(uds_opioids_margins)
##                lower     upper
## arm7-day -0.04860374 0.1352633

A risk ratio could be obtained by using margins_fun from above, specifying contrast = "ratio", and obtaining the standard error and confidence interval via the bootstrap. In the event that a marginal odds ratio is of interest, specifying type = "link" in the margins::margins call gives a marginal log odds ratio, which could be exponentiated to obtain a marginal odds ratio.


5.5 Stratified Randomization & Covariate Adjustment

When stratified randomization is employed, additional benefits in precision can be realized by using covariate adjustment and a variance estimator designed for covariate adaptive designs. First, source the ICAD code from Github:

icad_link <-
  "https://raw.githubusercontent.com/BingkaiWang/covariate-adaptive/master/R/ICAD.R"
source(url(icad_link))

Next, we specify which covariates are included in the model, excluding those in the strata variable, and call ICAD:

baseline_covariates <-
  c("uds_opioids_bl", "arsw_score_bl", "cows_total_score_bl",
    "vas_current_withdrawal_bl", "vas_study_tx_help_bl")

vas_icad <-
  ICAD(
    Y = 1*(ctn03_sim_mar$uds_opioids_eot == "Negative"), 
    A = 1*(ctn03_sim_mar$arm == "7-day"), # Treatment indicator - Must be 1/0
    Strata = ctn03_sim_mar$stability_dose, # Randomization Stratum
    W = ctn03_sim_mar[, baseline_covariates], # Baseline Covariates
    pi = 0.5, # 1:1 Randomization,
    family = "binomial"
  )
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
vas_icad
##                            est         var    CI.lower  CI.upper
## unadjsed            0.07068389 0.002601561 -0.02928505 0.1706528
## adjusted-for-strata 0.07374103 0.002601386 -0.02622455 0.1737066
## adjusted-for-all    0.04332980 0.001889907 -0.04187583 0.1285354
## drwls               0.04427818 0.002044182 -0.04433696 0.1328933

We can see that the ‘adjusted-for-all’ estimator, which uses a specialized variance estimator for covariate adaptive designs, has a lower variance than the covariate adjusted estimator obtained using margins, even though we are adjusting for the same covariates. This is the result of the additional precision gains from the covariate-adaptive variance estimator. The doubly robust weighted least squares (drwls) estimator will be discussed later on.

Specifying family = "binomial" allows for marginal effects with binary outcomes:

baseline_covariates <-
  c("vas_crave_opiates_bl", "arsw_score_bl",
    "cows_total_score_bl", "vas_current_withdrawal_bl", "vas_study_tx_help_bl",
    "uds_any_positive_bl")

uds_icad <-
  ICAD(
    Y = ctn03_sim_mar$vas_crave_opiates_eot, 
    A = 1*(ctn03_sim_mar$arm == "7-day"), # Treatment indicator - Must be 1/0
    Strata = ctn03_sim_mar$stability_dose, # Randomization Stratum
    W = ctn03_sim_mar[, baseline_covariates], # Baseline Covariates
    pi = 0.5 # 1:1 Randomization
  )

uds_icad
##                            est      var  CI.lower CI.upper
## unadjsed            -0.2706708 8.068289 -5.837896 5.296555
## adjusted-for-strata -0.5224149 8.068084 -6.089570 5.044740
## adjusted-for-all    -0.9106801 6.581495 -5.938855 4.117495
## drwls               -0.9485718 6.956200 -6.117900 4.220757

5.6 Missing Outcome Data: Inverse Weighting

So far, we have not discussed the issue of missing outcome data. While the G-computation estimator is robust to arbitrary model misspecification, it is only valid if data are missing completely at random (MCAR): missingness is unrelated to either the observed or unobserved data. If this were the case, we should not see any association between the baseline covariates and missingness in the VAS opiate craving scores at the end-of-taper visit. We can assess this using a generalized additive model (GAM):

vas_mar_glm <-
  mgcv::gam(
    formula = 
      is.na(vas_crave_opiates_eot) ~
      arm + stability_dose +
      s(age) + sex +
      s(vas_crave_opiates_bl) + s(arsw_score_bl) + s(cows_total_score_bl) +
      s(vas_current_withdrawal_bl) + s(vas_study_tx_help_bl) +
      uds_any_positive_bl,
    family = binomial(link = "logit"),
    data = ctn03_sim_mar
  )

summary(vas_mar_glm)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## is.na(vas_crave_opiates_eot) ~ arm + stability_dose + s(age) + 
##     sex + s(vas_crave_opiates_bl) + s(arsw_score_bl) + s(cows_total_score_bl) + 
##     s(vas_current_withdrawal_bl) + s(vas_study_tx_help_bl) + 
##     uds_any_positive_bl
## 
## Parametric coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -1.80921    0.56096  -3.225  0.00126 ** 
## arm7-day                    -1.01189    0.20978  -4.824 1.41e-06 ***
## stability_dose16 mg          1.74449    0.57581   3.030  0.00245 ** 
## stability_dose24 mg          1.46083    0.56381   2.591  0.00957 ** 
## sexFemale                    0.11516    0.22339   0.516  0.60620    
## uds_any_positive_blPositive -0.07541    0.21609  -0.349  0.72711    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                edf Ref.df Chi.sq p-value   
## s(age)                       1.758  2.210  1.947  0.4526   
## s(vas_crave_opiates_bl)      1.000  1.000  1.436  0.2308   
## s(arsw_score_bl)             1.000  1.000  7.327  0.0068 **
## s(cows_total_score_bl)       3.110  3.839  6.372  0.1944   
## s(vas_current_withdrawal_bl) 2.120  2.692  7.016  0.0561 . 
## s(vas_study_tx_help_bl)      1.000  1.000  1.542  0.2143   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0853   Deviance explained = 9.61%
## UBRE = 0.18961  Scale est. = 1         n = 516

The GAM model shows that, all other things being equal, missingness was lower in the 7-day arm, higher in individuals on 16 or 24 mg stability doses, and was associated with some of the baseline outcome measures. In the next section, we will discuss methods that can be robust to model misspecification and do not require the MCAR assumption.