The biglm package in R does {incremental, online, streaming} linear regression for data potentially larger than memory. This isn’t rocket science: accumulating $X^TX$ and $X^TY$ is trivial; the package just goes one step better by using Alan Miller’s incremental QR decomposition code to reduce rounding error in ill-conditioned problems.
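For concreteness, here’s a minimal sketch of the ‘trivial’ version, assuming the data arrive as a list of chunks (the `chunks` object and its `$x`/`$y` components are hypothetical, not part of biglm):

```r
## Sketch only: chunked accumulation of X^T X and X^T y in base R.
## 'chunks' is a hypothetical list; each element holds a design-matrix piece
## $x (n_c rows, p columns) and the matching response $y.
p   <- ncol(chunks[[1]]$x)
XtX <- matrix(0, p, p)
Xty <- numeric(p)
for (ch in chunks) {
  XtX <- XtX + crossprod(ch$x)          # running X^T X
  Xty <- Xty + crossprod(ch$x, ch$y)    # running X^T y
}
beta_hat <- drop(solve(XtX, Xty))       # fine until X^T X gets ill-conditioned,
                                        # which is what the QR route avoids
```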
The code also computes the Huber/White heteroscedasticity-consistent variance estimator (sandwich estimator). Someone wants a reference for this. There isn’t one, because it’s too minor to publish, and I didn’t have a blog ten years ago. But I do now. So:
The Huber/White variance estimator is $A^{-1}BA^{-1}$, where $A^{-1}=(X^TX)^{-1}$ and $B=\sum_{k=1}^N\bigl(x_k^T(y_k-\hat\mu_k)\bigr)^{\otimes 2}$, writing $x_k$ for the $k$th row of $X$, $\hat\mu_k=x_k\hat\beta$ for the fitted value, and $v^{\otimes 2}$ for the outer product $vv^T$.
The $(i,j)$ element of $B$ is
$$\sum_{k=1}^N x_{ki}(y_k-x_k\hat\beta)\,x_{kj}(y_k-x_k\hat\beta).$$
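As a reference point, the whole estimator is only a few lines if $X$ and $y$ happen to fit in memory; a sketch, useful later as a check on the chunked version:

```r
## In-memory reference version of A^{-1} B A^{-1}, assuming X and y fit in RAM.
beta_hat <- qr.solve(X, y)                 # ordinary least squares
e        <- as.vector(y - X %*% beta_hat)  # residuals y_k - x_k beta
B        <- crossprod(X * e)               # sum_k e_k^2 x_k x_k^T
Ainv     <- solve(crossprod(X))            # (X^T X)^{-1}
V_hw     <- Ainv %*% B %*% Ainv            # Huber/White sandwich
```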
Multiplying out the squared residual, we get
$$\sum_{k=1}^N x_{ki}x_{kj}y_k^2,$$
and about $2p$ terms that look like
$$\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}\,y_k\hat\beta_\ell,$$
and about $p^2$ terms that look like
$$\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}x_{km}\,\hat\beta_\ell\hat\beta_m.$$
We can move the $\hat\beta$s outside the sums, so the second sort of terms look like
$$\hat\beta_\ell\left(\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}y_k\right)$$
and the third sort look like
$$\hat\beta_\ell\left(\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}x_{km}\right)\hat\beta_m.$$
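Collecting the three kinds of term (the middle ones enter with a factor of $-2$ from the square, which the ‘look like’ above glosses over), the $(i,j)$ element is
$$B_{ij}=\sum_{k=1}^N x_{ki}x_{kj}y_k^2-2\sum_{\ell}\hat\beta_\ell\left(\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}y_k\right)+\sum_{\ell,m}\hat\beta_\ell\left(\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}x_{km}\right)\hat\beta_m,$$
so every inner sum is built from products of columns of $X$ with each other and with $y$.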
Now if we define $Z$ to have columns $x_ix_j$ and $x_iy$ (for all $i,j$), the matrix $Z^TZ$ contains all the $x$ and $y$ pieces needed for $B$. The obvious thing to do is just to accumulate $Z^TZ$ in R code, one chunk at a time.
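Here is roughly what that accumulation looks like in base R. A sketch only: the `chunks` list is hypothetical as before, and biglm itself keeps the $X$ part in the incremental QR rather than forming $X^TX$, but the $Z^TZ$ bookkeeping is the same idea:

```r
## Sketch only: accumulate Z^T Z over chunks, then assemble the sandwich.
## 'chunks' is a hypothetical list; each element has $x (n_c x p) and $y.
sandwich_by_chunks <- function(chunks) {
  p   <- ncol(chunks[[1]]$x)
  XtX <- matrix(0, p, p)
  Xty <- numeric(p)
  ZtZ <- matrix(0, p^2 + p, p^2 + p)
  for (ch in chunks) {
    X <- ch$x; y <- ch$y
    ## Z: the p^2 columns x_i*x_j (ordered pairs, column-major), then x_i*y
    Z   <- cbind(X[, rep(1:p, times = p)] * X[, rep(1:p, each = p)], X * y)
    XtX <- XtX + crossprod(X)
    Xty <- Xty + crossprod(X, y)
    ZtZ <- ZtZ + crossprod(Z)
  }
  beta <- drop(solve(XtX, Xty))
  xx <- 1:p^2                  # Z columns holding x_i*x_j
  xy <- p^2 + 1:p              # Z columns holding x_i*y
  B <- ZtZ[xy, xy] -                                        # sum x_i x_j y^2
    2 * matrix(ZtZ[xx, xy] %*% beta, p, p) +                # sum_l beta_l x_i x_j x_l y
    matrix(ZtZ[xx, xx] %*% as.vector(beta %o% beta), p, p)  # sum_lm beta_l beta_m x_i x_j x_l x_m
  Ainv <- solve(XtX)
  Ainv %*% B %*% Ainv
}
```

On data small enough to fit in memory, the result should agree (up to rounding) with the in-memory sandwich above, or with an HC0-type estimate from the sandwich package.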
If you were too convinced of your own cleverness you might realise that $(X,Z)$ could be fed into the QR decomposition as if it were $X$, and that you’d get $Z^TZ$ For Free! Where ‘for free’ means at $O((p^2)^3)$ extra computing time, plus the mental anguish of reconstructing $Z^TZ$ from the QR decomposition. It’s not a big deal, since the computation is dominated by the $O(np)$ cost of reading the data, but it does look kinda stupid in retrospect.
I suppose that means I’ve learned something in ten years.