The biglm package in R does {incremental, online, streaming} linear regression for data potentially larger than memory. This isn’t rocket science: accumulating $X^TX$ and $X^TY$ is trivial; the package just goes one step better by using Alan Miller’s incremental QR decomposition code to reduce rounding error in ill-conditioned problems.
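For concreteness, here’s a minimal sketch of the ‘trivial’ version, assuming the data arrive as a list of chunks (the `chunks` object and its `$x`/`$y` components are hypothetical, not part of biglm):

```r
## Sketch only: chunked accumulation of X^T X and X^T y in base R.
## 'chunks' is a hypothetical list; each element holds a design-matrix piece
## $x (n_c rows, p columns) and the matching response $y.
p   <- ncol(chunks[[1]]$x)
XtX <- matrix(0, p, p)
Xty <- numeric(p)
for (ch in chunks) {
  XtX <- XtX + crossprod(ch$x)          # running X^T X
  Xty <- Xty + crossprod(ch$x, ch$y)    # running X^T y
}
beta_hat <- drop(solve(XtX, Xty))       # fine until X^T X gets ill-conditioned,
                                        # which is what the QR route avoids
```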
The code also computes the Huber/White heteroscedasticity-consistent variance estimator (sandwich estimator). Someone wants a reference for this. There isn’t one, because it’s too minor to publish, and I didn’t have a blog ten years ago. But I do now. So:
The Huber/White variance estimator is $A^{-1}BA^{-1}$, where $A^{-1}=(X^TX)^{-1}$ and $B=\sum_{k=1}^N\bigl(x_k^T(y_k-\hat\mu_k)\bigr)^{\otimes 2}$, writing $x_k$ for the $k$th row of $X$, $\hat\mu_k=x_k\hat\beta$ for the fitted value, and $v^{\otimes 2}$ for the outer product $vv^T$.
The $(i,j)$ element of $B$ is
$$\sum_{k=1}^N x_{ki}(y_k-x_k\hat\beta)\,x_{kj}(y_k-x_k\hat\beta).$$
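As a reference point, the whole estimator is only a few lines if $X$ and $y$ happen to fit in memory; a sketch, useful later as a check on the chunked version:

```r
## In-memory reference version of A^{-1} B A^{-1}, assuming X and y fit in RAM.
beta_hat <- qr.solve(X, y)                 # ordinary least squares
e        <- as.vector(y - X %*% beta_hat)  # residuals y_k - x_k beta
B        <- crossprod(X * e)               # sum_k e_k^2 x_k x_k^T
Ainv     <- solve(crossprod(X))            # (X^T X)^{-1}
V_hw     <- Ainv %*% B %*% Ainv            # Huber/White sandwich
```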
Multiplying out the squared residual, we get
$$\sum_{k=1}^N x_{ki}x_{kj}y_k^2,$$
and about $2p$ terms that look like
$$\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}\,y_k\hat\beta_\ell,$$
and about $p^2$ terms that look like
$$\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}x_{km}\,\hat\beta_\ell\hat\beta_m.$$
We can move the $\hat\beta$s outside the sums, so the second sort of terms look like
$$\hat\beta_\ell\left(\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}y_k\right)$$
and the third sort look like
$$\hat\beta_\ell\left(\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}x_{km}\right)\hat\beta_m.$$
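Collecting the three kinds of term (the middle ones enter with a factor of $-2$ from the square, which the ‘look like’ above glosses over), the $(i,j)$ element is
$$B_{ij}=\sum_{k=1}^N x_{ki}x_{kj}y_k^2-2\sum_{\ell}\hat\beta_\ell\left(\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}y_k\right)+\sum_{\ell,m}\hat\beta_\ell\left(\sum_{k=1}^N x_{ki}x_{kj}x_{k\ell}x_{km}\right)\hat\beta_m,$$
so every inner sum is built from products of columns of $X$ with each other and with $y$.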
Now if we define $Z$ to have columns $x_ix_j$ and $x_iy$ (for all $i,j$), the matrix $Z^TZ$ contains all the $x$ and $y$ pieces needed for $B$. The obvious thing to do is just to accumulate $Z^TZ$ in R code, one chunk at a time.
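Here is roughly what that accumulation looks like in base R. A sketch only: the `chunks` list is hypothetical as before, and biglm itself keeps the $X$ part in the incremental QR rather than forming $X^TX$, but the $Z^TZ$ bookkeeping is the same idea:

```r
## Sketch only: accumulate Z^T Z over chunks, then assemble the sandwich.
## 'chunks' is a hypothetical list; each element has $x (n_c x p) and $y.
sandwich_by_chunks <- function(chunks) {
  p   <- ncol(chunks[[1]]$x)
  XtX <- matrix(0, p, p)
  Xty <- numeric(p)
  ZtZ <- matrix(0, p^2 + p, p^2 + p)
  for (ch in chunks) {
    X <- ch$x; y <- ch$y
    ## Z: the p^2 columns x_i*x_j (ordered pairs, column-major), then x_i*y
    Z   <- cbind(X[, rep(1:p, times = p)] * X[, rep(1:p, each = p)], X * y)
    XtX <- XtX + crossprod(X)
    Xty <- Xty + crossprod(X, y)
    ZtZ <- ZtZ + crossprod(Z)
  }
  beta <- drop(solve(XtX, Xty))
  xx <- 1:p^2                  # Z columns holding x_i*x_j
  xy <- p^2 + 1:p              # Z columns holding x_i*y
  B <- ZtZ[xy, xy] -                                        # sum x_i x_j y^2
    2 * matrix(ZtZ[xx, xy] %*% beta, p, p) +                # sum_l beta_l x_i x_j x_l y
    matrix(ZtZ[xx, xx] %*% as.vector(beta %o% beta), p, p)  # sum_lm beta_l beta_m x_i x_j x_l x_m
  Ainv <- solve(XtX)
  Ainv %*% B %*% Ainv
}
```

On data small enough to fit in memory, the result should agree (up to rounding) with the in-memory sandwich above, or with an HC0-type estimate from the sandwich package.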
If you were too convinced of your own cleverness you might realise that $(X,Z)$ could be fed into the QR decomposition as if it were $X$, and that you’d get $Z^TZ$ For Free! Where ‘for free’ means at $O((p^2)^3)$ extra computing time, plus the mental anguish of reconstructing $Z^TZ$ from the QR decomposition. It’s not a big deal, since the computation is dominated by the $O(np)$ cost of reading the data, but it does look kinda stupid in retrospect.
I suppose that means I’ve learned something in ten years.