7 min read

Sandwiches and aggregation

Demetri Pananos asked a question on Stack Overflow and then on Twitter about the behaviour of the sandwich package for aggregated count data, and Achim Zeileis pinged me1.

If you have a Poisson regression and you have \(N_i\) observations \(Y_{ij}\) that share the same covariates you can aggregate them into a single observation \(Y_{i\cdot}=\sum_j Y_{ij}\) with an offset \(\log N_i\), and note that \[\log E[Y_{ij}|X_{i}]=X_i\beta\] implies \[\log E[Y_{i\cdot}|X_{i},N_i]=X_i\beta+\log N_i.\]

These were actually binary observations, not counts; Demetri was fitting a log-binomial regression (relative risk regression) using a Poisson working model to reduce the influence of outliers compared to using maximum likelihood. The only reason this matters is that if they had been straightforward counts, or binary observations in a logistic regression, he probably would have used maximum likelihood and the model-based standard error estimates. With maximum likelihood and the model-based standard error estimates, you get the same estimated standard errors for the individual-level model and the aggregate model. With the sandwich package vcovHC() estimator, you don’t.

As an example, from the StackOverflow post

library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(sandwich)
options(digits=4)
grouped_data <- tibble::tribble(
  ~treatment,        ~g,    ~y,     ~N,
  "A",      "a", 1338L, 20669L,
  "A",      "b",   36L,  1237L,
  "A",      "c", 2555L, 39438L,
  "A",      "d",  402L,  5713L,
  "B",      "a", 1281L, 19986L,
  "B",      "b",   38L,  1224L,
  "B",      "c", 2495L, 36749L,
  "B",      "d",  382L,  5646L
)
yes_outcomes <-grouped_data %>% 
  mutate(yy=1) %>% 
  uncount(y)

no_outcomes <-grouped_data %>% 
  mutate(yy=0) %>% 
  uncount(N-y)

# This is equivalent to the data we had before grouping
unit_data <- bind_rows(yes_outcomes, no_outcomes) %>% 
  select(-y) %>% 
  rename(y=yy)
  
fit <- glm(y ~ treatment + g, data = unit_data, family = poisson)
offset_fit <- glm(y ~ treatment + g, data = grouped_data, family = poisson,
  offset = log(N)) 
  
vcovHC(fit)
##             (Intercept) treatmentB         gb         gc         gd
## (Intercept)   0.0004690 -2.214e-04 -3.600e-04 -3.622e-04 -3.546e-04
## treatmentB   -0.0002214  4.386e-04  5.532e-06  9.844e-06 -5.332e-06
## gb           -0.0003600  5.532e-06  1.347e-02  3.573e-04  3.572e-04
## gc           -0.0003622  9.844e-06  3.573e-04  5.423e-04  3.572e-04
## gd           -0.0003546 -5.332e-06  3.572e-04  3.572e-04  1.545e-03
vcovHC(offset_fit)
##             (Intercept) treatmentB         gb         gc         gd
## (Intercept)    0.001820 -1.568e-03 -1.032e-03 -1.028e-03 -1.033e-03
## treatmentB    -0.001568  3.147e-03 -1.422e-05 -2.244e-05 -1.131e-05
## gb            -0.001032 -1.422e-05  1.978e-03  1.039e-03  1.039e-03
## gc            -0.001028 -2.244e-05  1.039e-03  2.949e-03  1.039e-03
## gd            -0.001033 -1.131e-05  1.039e-03  1.039e-03  3.286e-03

The natural response to seeing this sort of difference is to assume it’s a simple coding problem, so Stack Overflow is a sensible place to try2. However, it’s not a simple coding problem. Getting different answers is not obviously not correct, and we need to think carefully about both what the user should want and what vcovHC should provide.

How could they possibly be different?

The two sandwich estimators are estimators for different semiparametric models. The individual-observation one is for a model where observations are all independent. The aggregated one is for a model where groups are all independent. Those are different assumptions. Since this is all frequentist, the standard errors describe how \(\hat\beta\) would vary in replicates of the data set, and while for any specific data set the \(\hat\beta\) is the same in the two models, the set of replicate data sets you need to consider is different.3

If the mean model is correctly specified4, I think the two sandwich estimators are consistent for the same true variance. But if the mean model isn’t correctly specified, it matters whether there’s really just one observation per covariate pattern or many observations per covariate pattern. In ANOVA terms, the model with aggregated data has the interaction confounded with the residual error, and the model with individual observations doesn’t.

Another way of saying the same thing is to think about bootstraps. The sandwich estimator can be seen as an approximation to the bootstrap5. The sandwich estimator in the offset model is approximating a bootstrap that resamples whole groups; the one in the individual model is approximating a bootstrap that resamples individual observations.

We can verify that the grouping in the sandwich estimator is the issue, by using a clustered sandwich estimator on the individual-level data, thus assuming only cluster-level independence in both settings. The results are now the same:

unit_data$cluster<-with(unit_data, interaction(treatment,g))
vcovCL(fit, cluster=as.numeric(unit_data$cluster))
##             (Intercept) treatmentB         gb         gc         gd
## (Intercept)   1.911e-04 -9.848e-05 -1.416e-04 -1.444e-04 -1.414e-04
## treatmentB   -9.848e-05  1.992e-04 -1.696e-06  3.950e-06 -2.089e-06
## gb           -1.416e-04 -1.696e-06  4.062e-04  1.424e-04  1.424e-04
## gc           -1.444e-04  3.950e-06  1.424e-04  2.332e-04  1.424e-04
## gd           -1.414e-04 -2.089e-06  1.424e-04  1.424e-04  6.718e-04
vcovCL(offset_fit, cluster=NULL)
##             (Intercept) treatmentB         gb         gc         gd
## (Intercept)   1.911e-04 -9.848e-05 -1.416e-04 -1.444e-04 -1.414e-04
## treatmentB   -9.848e-05  1.992e-04 -1.696e-06  3.950e-06 -2.089e-06
## gb           -1.416e-04 -1.696e-06  4.062e-04  1.424e-04  1.424e-04
## gc           -1.444e-04  3.950e-06  1.424e-04  2.332e-04  1.424e-04
## gd           -1.414e-04 -2.089e-06  1.424e-04  1.424e-04  6.718e-04

So the issue genuinely is about what units are independent, and it’s not a coding error

Who’s right?

The individual-observation model is right, because it actually describes how the data were sampled. So either they are both right, or the individual-level model is right and the aggregate model is wrong.

I’m inclined to say that the aggregate model is wrong: that the hypothetical replicates it calculates the variance over are not the relevant set of hypothetical replicates, at least if the mean model is misspecified.

Doing better

Since the point here is to fit a log-binomial model, assumed correctly specified, it might be better to estimate the middle of the sandwich estimator differently. The middle of the sandwich is the variance of the likelihood score. In the usual sandwich estimator this is estimated just by an empirical variance, which is appropriate when you don’t want to assume anything about the distribution of \(Y|X\). Here, though, we know the distribution of \(Y|X\); it’s Bernoulli. We also don’t have any particular reservations about assuming the mean model is correctly specified, as we routinely would for actual Poisson regression or logistic regression

If the \(i\)th element of the score in the Poisson working model is \[U_i = x_i(Y_i-\hat\mu_i)\] the variance is \[\widehat{\mathrm{var}}[U_i]= x_i^T\widehat{\mathrm{var}}[Y_i]x_i=x_i^T\hat\mu_i(1-\hat\mu_i)x_i\] There’s a potentially issue here in that \(\hat\mu\) need not be less than 1 – that’s the reason the Poisson working model approach to relative risk regression is more robust. Still, there can’t be a large fraction of \(\hat\mu>1\) so it may be reasonable to set their contribution to zero.

This variance estimate will be the same for individual and aggregate data

scorebinom<-function(model){
  X<-model.matrix(model)
  N<-1
  logN<-model.offset(model.frame(model))
  if (!is.null(logN)){
      N<-exp(logN)
  }
  mu<-fitted(model)/N
  v<-mu*(1-mu)*N
  crossprod(X, v*X)
}
scorebinom(offset_fit)
##             (Intercept) treatmentB    gb   gc    gd
## (Intercept)     7967.15    3917.51 71.77 4715 729.9
## treatmentB      3917.51    3917.51 36.07 2298 366.4
## gb                71.77      36.07 71.77    0   0.0
## gc              4715.23    2298.09  0.00 4715   0.0
## gd               729.88     366.45  0.00    0 729.9
scorebinom(fit)
##             (Intercept) treatmentB    gb   gc    gd
## (Intercept)     7967.15    3917.51 71.77 4715 729.9
## treatmentB      3917.51    3917.51 36.07 2298 366.4
## gb                71.77      36.07 71.77    0   0.0
## gc              4715.23    2298.09  0.00 4715   0.0
## gd               729.88     366.45  0.00    0 729.9

And now variances

binomsandwich<-function(model){
  variability<-scorebinom(model)
  sensitivity<-vcov(model)
  sensitivity%*%variability%*%sensitivity
} 
binomsandwich(fit)
##             (Intercept) treatmentB         gb         gc         gd
## (Intercept)   0.0004657 -2.181e-04 -3.560e-04 -3.592e-04 -3.560e-04
## treatmentB   -0.0002181  4.384e-04 -2.438e-06  4.052e-06 -2.402e-06
## gb           -0.0003560 -2.438e-06  1.346e-02  3.572e-04  3.572e-04
## gc           -0.0003592  4.052e-06  3.572e-04  5.422e-04  3.572e-04
## gd           -0.0003560 -2.402e-06  3.572e-04  3.572e-04  1.545e-03
binomsandwich(offset_fit)
##             (Intercept) treatmentB         gb         gc         gd
## (Intercept)   0.0004657 -2.181e-04 -3.560e-04 -3.592e-04 -3.560e-04
## treatmentB   -0.0002181  4.384e-04 -2.438e-06  4.052e-06 -2.402e-06
## gb           -0.0003560 -2.438e-06  1.346e-02  3.572e-04  3.572e-04
## gc           -0.0003592  4.052e-06  3.572e-04  5.422e-04  3.572e-04
## gd           -0.0003560 -2.402e-06  3.572e-04  3.572e-04  1.545e-03

These are the same. They are also reassuringly close to the individual-level one from before, vcovHC(fit).

diag(binomsandwich(fit))
## (Intercept)  treatmentB          gb          gc          gd 
##   0.0004657   0.0004384   0.0134643   0.0005422   0.0015447
diag(vcovHC(fit))
## (Intercept)  treatmentB          gb          gc          gd 
##   0.0004690   0.0004386   0.0134749   0.0005423   0.0015451

Further Research Is Needed – we’d like to know how well these actually perform before recommending them.


  1. we both worked on sandwich estimators when we were Very Young↩︎

  2. if you can get past the site’s general reluctance to actually answer questions rather than argue about how you asked the wrong question↩︎

  3. Yes, this problem would go away, and be replaced by completely different problems, if you were Bayesian↩︎

  4. and also the observations really are independent↩︎

  5. I think this is in an Appendix of the Efron & Tibshirani bootstrap book?↩︎