
Crossed clustering and parallel invention

This week, I was prompted to find some old R code:

xeffect.glm<-function(glm.obj,g1,g2){
    if (!exists("rowsum")) require(survival4)
    umat<-estfun.glm(glm.obj)
    usum1<-rowsum(umat,g1,reorder=F)
    usum2<-rowsum(umat,g2,reorder=F)
    g1a<-as.numeric(as.factor(g1))
    g2a<-as.numeric(as.factor(g2))
    g12<-g1a*(1+max(g2a))+g2a
    usum12<-rowsum(umat,g12,reorder=F)
    utu<-(t(usum1)%*%usum1)+t(usum2)%*%usum2-t(usum12)%*%usum12
    modelv<-summary(glm.obj)$cov.unscaled
    modelv%*%utu%*%modelv
}

You can tell it’s old: the use of F rather than FALSE indicates I was still using S-PLUS reasonably often, and it was a time when rowsum wasn’t in base R (or had only recently been added). I could find the code because it’s been sitting on my old UW web page, which they kindly haven’t taken down yet. The tarball files are dated August 1998; I’m fairly sure some of the code even pre-dates CRAN.

The code computes standard error estimates for a generalised linear model with clustering on two crossed factors. This sort of data was the original (1996) idea for my PhD, and I did cover standard error estimation for this problem there, though not the more-general problem of GEE-type estimation that I had originally planned; I was diverted to space and time correlation.

The variance estimation idea is fairly easy. The way I’d phrase it now is that you want to find the variance of \(\hat\beta\), so you represent \(\hat\beta\) in terms of its influence functions \[\sqrt{n}(\hat\beta-\beta_0)=\frac{1}{\sqrt{n}}\sum_{i=1}^n h_i(\beta_0)+o_p(1).\] Now, \[n\,\mathrm{var}[\hat\beta] \approx \frac{1}{n}\sum_{i,j} \mathrm{cov}[h_i(\beta_0),h_j(\beta_0)]\] and \[\widehat{\mathrm{var}}[\hat\beta]=\frac{1}{n^2}\sum_{i,j\textrm{ correlated}} h_i(\hat\beta)h^T_j(\hat\beta)\] Back then, I thought in terms of sandwich estimators, where \(h_i=I^{-1}U_i\) with \(U_i\) the estimating function and \(I\) the per-observation derivative matrix that would be the Fisher information if this were maximum likelihood. The middle of the sandwich is \[\hat J=\frac{1}{n}\sum_{i,j\textrm{ correlated}} U_i(\hat\beta)U^T_j(\hat\beta)\]
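For concreteness, here’s a minimal sketch in current R of the same computation for a single clustering factor, using sandwich::estfun for the per-observation estimating functions; it is essentially the one-factor piece of the 1998 function above. The data, the grouping factor, and the function name are all invented for illustration, and dispersion and finite-sample corrections are ignored.

library(sandwich)

## toy data with one clustering factor (purely illustrative)
set.seed(1)
d <- data.frame(x = rnorm(200),
                g1 = factor(sample(1:20, 200, replace = TRUE)))
d$y <- rbinom(200, 1, plogis(d$x))
fit <- glm(y ~ x, family = binomial, data = d)

## one-way clustered sandwich: I^{-1} (sum of U_i U_j^T over within-cluster pairs) I^{-1}
clustered_vcov <- function(glm.obj, g) {
    umat <- estfun(glm.obj)                  # n x p matrix of U_i(betahat)
    usum <- rowsum(umat, g)                  # cluster totals of the U_i
    utu <- crossprod(usum)                   # sum of U_i U_j^T over correlated pairs
    modelv <- summary(glm.obj)$cov.unscaled  # inverse of the *total* information, so the 1/n factors are absorbed
    modelv %*% utu %*% modelv
}
clustered_vcov(fit, d$g1)

Up to finite-sample corrections, this should agree with sandwich::vcovCL(fit, cluster = ~ g1).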

You need the restriction in the sum to \((i,j)\) correlated. If you sum over all pairs \((i,j)\), the estimator collapses to \((\sum_i U_i(\hat\beta))^{\otimes 2}\), which is identically zero. Even if you could evaluate at \(\beta=\beta_0\) instead of \(\beta=\hat\beta\) it wouldn’t work: you’d get an unbiased estimator, but not a consistent one. Basically, you need the sum to be over \(o(n^2)\) terms – or, to get some leverage for proofs, \(O(n^{2-\delta})\) terms. In terms of clustering, you need the number of clusters to go to infinity, though the cluster size can also grow. The proof is by Chebyshev’s inequality and just counting things, together with (as always in Statistics) a second-order Taylor series expansion. In practice, as the asymptotics suggests, you need the number of terms in the sum to be a fairly small fraction of the \(n^2\) possible terms. You don’t need the correlation to be generated by crossed random effects, which was important to me at the time though in fact most examples of this sort of sparse correlation do come from crossed random effects.
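To see the collapse concretely in the toy sketch above: the estimating equations are (numerically) zero at \(\hat\beta\), so summing the cross-products over all pairs just gives the outer square of a vector of zeros.

utotal <- colSums(estfun(fit))  # the estimating equations at betahat: zero up to solver tolerance
tcrossprod(utotal)              # outer square of the total score: a p x p matrix of numerical zeros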

When you do have crossed sets of clustering, there’s an interesting choice to make. Suppose (as in the S code) you have clustering on \(g_1\) and on \(g_2\). An easy computational approach (avoiding quadratic time or space computations) is to compute variance matrices \(V_1\) clustering on \(g_1\), \(V_2\) clustering on \(g_2\), and \(V_{12}\) clustering on the combinations of \(g_1\) and \(g_2\). The variance estimator \(V_1+V_2-V_{12}\) is unbiased (well, would be unbiased if computed at \(\beta_0\)), but it is not necessarily positive definite. The estimator \(V_1+V_2\) is biased upwards but is necessarily positive definite. The bias is small if the crossed factors are nearly orthogonal, but if they’re correlated – eg, primary school district and secondary school district – the bias can be quite big. The bias would approach a factor of 2 if \(g_1\) and \(g_2\) were nearly identical. The code above uses the unbiased estimator, but there’s some argument that the biased estimator would typically be better since sandwich estimators tend to underestimate in practice.
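Continuing the sketch, with a second invented crossed factor g2, the choice looks like this (the \(V_{12}\) piece clusters on the combinations of g1 and g2, as the old code does with its g12 index):

## a second, crossed clustering factor for the toy data
d$g2 <- factor(sample(1:15, nrow(d), replace = TRUE))

V1  <- clustered_vcov(fit, d$g1)                                  # cluster on g1
V2  <- clustered_vcov(fit, d$g2)                                  # cluster on g2
V12 <- clustered_vcov(fit, interaction(d$g1, d$g2, drop = TRUE))  # cluster on the g1/g2 combinations

V_unbiased     <- V1 + V2 - V12  # unbiased (at beta_0), but not necessarily positive definite
V_conservative <- V1 + V2        # biased upwards, but necessarily positive definite

sqrt(diag(V_unbiased))           # standard errors, unbiased version
sqrt(diag(V_conservative))       # standard errors, conservative version

I believe sandwich::vcovCL(fit, cluster = ~ g1 + g2) does something equivalent to the unbiased version, but check its documentation rather than taking my word for it.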

Apart from being in my thesis in 1998, this approach led to a few papers by me and other Seattle people: one with Nicole Mayer-Hamblett and Steve Self, a couple by Patrick Heagerty and Diana Miglioretti, one by Patrick with some political science researchers. Patrick has used the biased but guaranteed-positive version of the estimator.

I looked for the code because someone had been trying to do this sort of variance estimation with the survey package – that’s not going to work, because multistage sampling is about nested rather than crossed effects. There’s apparently code for Stata to do this sort of thing, and a nice paper in the Journal of Business and Economic Statistics independently inventing the idea and applying it to economics examples. They use the unbiased version of the estimator. I wouldn’t be very surprised if other groups of people had also independently come up with the same idea – it’s a straightforward-enough extension of the basic Huber/White/Domowitz sandwich estimators. I’m not even arguing that it would be better if people didn’t reinvent statistical ideas; at some point it’s easier just to solve a problem than to think how someone else in another field might have described their idea for solving an analogous problem. We probably could use a few more generalists in academic statistics, but that’s a balance that’s not so easy to get right.

Some form of this will probably be in the next version of the survey package, but I need to decide how general and with what user interface.