# Computing the (simplest) sandwich estimator incrementally

The biglm package in R does {incremental, online, streaming} linear regression for data potentially larger than memory. This isn’t rocket science: accumulating $$X^TX$$ and $$X^TY$$ is trivial; the package just goes one step better than this by using Alan Miller’s incremental $$QR$$ decomposition code to reduce rounding error in ill-conditioned problems.
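For concreteness, here is a minimal sketch of that trivial accumulation. It is not biglm's code (biglm pushes the rows through the incremental QR instead), and `chunks` is a hypothetical list of data frames with made-up columns `y`, `x1`, `x2`:

```r
## A minimal sketch of the trivial version: accumulate X^T X and X^T y
## over chunks of data, then solve at the end.
## 'chunks' is a hypothetical list of data frames with columns y, x1, x2.
p   <- 3                                 # intercept + x1 + x2
xtx <- matrix(0, p, p)
xty <- matrix(0, p, 1)

for (chunk in chunks) {
  X   <- cbind(1, chunk$x1, chunk$x2)    # design matrix for this chunk
  xtx <- xtx + crossprod(X)              # running X^T X
  xty <- xty + crossprod(X, chunk$y)     # running X^T y
}

betahat <- drop(solve(xtx, xty))         # least-squares coefficients
```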

The code also computes the Huber/White heteroscedasticity-consistent variance estimator (sandwich estimator). Someone wants a reference for this. There isn’t one, because it’s too minor to publish, and I didn’t have a blog ten years ago.  But I do now. So:

The Huber/White variance estimator is $$A^{-1}BA^{-1}$$, where $$A=X^TX$$ and $$B=\sum_{k=1}^N\left(x_k^T(y_k-x_k\hat\beta)\right)^{\otimes 2}$$, writing $$x_k$$ for the $$k$$th row of $$X$$.

The $$(i,j)$$ element of $$B$$ is
$\sum_{k=1}^N x_{ki}(y_{k}-x_{k}\hat\beta)x_{kj}(y_{k}-x_{k}\hat\beta)$
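With all the data in memory at once this is essentially a one-liner in R. Purely as a reference point, and using hypothetical objects `X` and `y` for the design matrix and response:

```r
## In-memory reference version, for when the data do fit in RAM:
## B = X^T diag(e^2) X with e the residuals, then the sandwich itself.
## X (n x p design matrix) and y (response) are hypothetical objects here.
betahat <- qr.solve(X, y)               # ordinary least-squares fit
e       <- as.vector(y - X %*% betahat) # residuals
B       <- crossprod(X * e)             # (i,j) entry: sum_k x_ki x_kj e_k^2
Ainv    <- solve(crossprod(X))          # A^{-1} = (X^T X)^{-1}
vcov_hw <- Ainv %*% B %*% Ainv          # Huber/White sandwich
```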

Multiplying this out, we get
$\sum_{k=1}^N x_{ki}x_{kj}y_k^2$
and about $$2p$$ terms that look like
$\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}y_k\hat\beta_{\ell}$
and about $$p^2$$ terms that look like
$\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}x_{km}\hat\beta_{\ell}\hat\beta_m$

We can move the $$\beta$$s outside the sums, so the second sort of terms look like
$\hat\beta_{\ell}\left(\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}y_k\right)$
and the third sort look like
$\hat\beta_{\ell}\left(\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}x_{km}\right)\hat\beta_m$
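Collecting everything (and restoring the signs and the factor of two), the whole $$(i,j)$$ element is
$B_{ij}=\sum_{k=1}^N x_{ki}x_{kj}y_k^2-2\sum_{\ell}\hat\beta_{\ell}\left(\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}y_k\right)+\sum_{\ell,m}\hat\beta_{\ell}\left(\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}x_{km}\right)\hat\beta_m$
so the only data-dependent pieces are sums of products of at most four of the columns of $$X$$ and $$Y$$.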

Now if we define $$Z$$ to have columns given by the elementwise products $$x_ix_j$$ and $$x_iy$$ (for all $$i,\,j$$), the matrix $$Z^TZ$$ contains all the $$x$$ and $$y$$ pieces needed for $$B$$.  The obvious thing to do is just to accumulate $$Z^TZ$$ in R code, one chunk at a time.
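Here is a sketch of that obvious thing, under the same made-up chunk structure as before. It accumulates the three blocks of $$Z^TZ$$ (pure-$$x$$, $$x$$-by-$$y$$, and pure-$$y$$) separately and assembles $$B$$ and the sandwich at the end; an illustration, not the biglm source:

```r
## Accumulate the pieces of Z^T Z chunk by chunk, then assemble
## B_ij = sum x_i x_j y^2 - 2 sum_l beta_l (sum x_i x_j x_l y)
##        + sum_{l,m} beta_l beta_m (sum x_i x_j x_l x_m).
## 'chunks' and its columns are hypothetical, as before.
p   <- 3
xtx <- matrix(0, p, p);   xty <- matrix(0, p, 1)
Sxx <- matrix(0, p^2, p^2)   # sums of x_i x_j x_l x_m
Sxy <- matrix(0, p^2, p)     # sums of x_i x_j x_l y
Syy <- matrix(0, p, p)       # sums of x_i x_j y^2

for (chunk in chunks) {
  X  <- cbind(1, chunk$x1, chunk$x2)
  y  <- chunk$y
  ZX <- X[, rep(1:p, times = p)] * X[, rep(1:p, each = p)]  # columns x_i * x_j
  ZY <- X * y                                               # columns x_i * y
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, y)
  Sxx <- Sxx + crossprod(ZX)        # the three blocks of Z^T Z
  Sxy <- Sxy + crossprod(ZX, ZY)
  Syy <- Syy + crossprod(ZY)
}

betahat <- drop(solve(xtx, xty))
bb      <- as.vector(outer(betahat, betahat))   # beta_l * beta_m, vectorised

B    <- Syy - 2 * matrix(Sxy %*% betahat, p, p) + matrix(Sxx %*% bb, p, p)
Ainv <- solve(xtx)
vcov_hw <- Ainv %*% B %*% Ainv                  # A^{-1} B A^{-1}
```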

If you were too convinced of your own cleverness you might realise that $$(X,Z)$$ could be fed into the $$QR$$ decomposition as if it were $$X$$, and that you’d get $$Z^TZ$$ For Free! Here ‘for free’ means $$O((p^2)^3)$$ extra computing time plus the mental anguish of reconstructing $$Z^TZ$$ from the $$QR$$ decomposition.  It’s not a big deal, since the computation is dominated by the $$O(np)$$ cost of reading the data, but it does look kinda stupid in retrospect.

I suppose that means I’ve learned something in ten years.