The biglm package in R does {incremental, online, streaming} linear regression for data potentially larger than memory. This isn’t rocket science: accumulating $X^TX$ and $X^Ty$ is trivial; the package just goes one step better than this by using Alan Miller’s incremental QR decomposition code to reduce rounding error in ill-conditioned problems.
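For concreteness, here is what the trivial accumulation looks like in plain R. The `chunk_list`, variable names and formula are made-up placeholders; biglm replaces the crossproduct accumulation and `solve()` with the incremental QR to avoid the rounding problems.

```r
## Sketch of the 'trivial' approach: accumulate X^T X and X^T y over
## chunks that each fit in memory, then solve the normal equations.
xtx <- NULL
xty <- NULL
for (chunk in chunk_list) {                # chunk_list: hypothetical list of data frames
  X <- model.matrix(~ x1 + x2, data = chunk)
  y <- chunk$y
  if (is.null(xtx)) {
    xtx <- crossprod(X)                    # X^T X for this chunk
    xty <- crossprod(X, y)                 # X^T y for this chunk
  } else {
    xtx <- xtx + crossprod(X)
    xty <- xty + crossprod(X, y)
  }
}
beta_hat <- solve(xtx, xty)                # fine unless X is ill-conditioned
```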
The code also computes the Huber/White heteroscedasticity-consistent variance estimator (sandwich estimator). Someone wants a reference for this. There isn’t one, because it’s too minor to publish, and I didn’t have a blog ten years ago. But I do now. So:
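In the package itself the workflow is roughly this, again with a hypothetical `chunk_list`; `sandwich = TRUE` is the documented argument that asks for the Huber/White variance to be accumulated along with the fit.

```r
library(biglm)

## Fit on the first chunk, then feed in the rest incrementally.
fit <- biglm(y ~ x1 + x2, data = chunk_list[[1]], sandwich = TRUE)
for (chunk in chunk_list[-1]) {
  fit <- update(fit, chunk)
}
summary(fit)   # coefficient table for the accumulated fit
```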
The Huber/White variance estimator is $(X^TX)^{-1}B(X^TX)^{-1}$, where $B=\sum_i e_i^2 x_ix_i^T$ and $e_i=y_i-x_i\hat\beta$.
The $(j,k)$ element of $B$ is $\sum_i x_{ij}x_{ik}(y_i-x_i\hat\beta)^2$.
Multiplying this out, we get $\sum_i x_{ij}x_{ik}y_i^2$ and about $p$ terms that look like $\sum_i x_{ij}x_{ik}x_{im}\hat\beta_m y_i$ and about $p^2$ terms that look like $\sum_i x_{ij}x_{ik}x_{im}\hat\beta_m x_{in}\hat\beta_n$.
We can move the $\hat\beta$s outside the sums, so the second sort of terms look like $\hat\beta_m\sum_i x_{ij}x_{ik}x_{im}y_i$ and the third sort look like $\hat\beta_m\hat\beta_n\sum_i x_{ij}x_{ik}x_{im}x_{in}$.
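Written out in one line, the $(j,k)$ element of $B$ is

$$\sum_i x_{ij}x_{ik}e_i^2 \;=\; \sum_i x_{ij}x_{ik}y_i^2 \;-\; 2\sum_m \hat\beta_m \sum_i x_{ij}x_{ik}x_{im}y_i \;+\; \sum_m\sum_n \hat\beta_m\hat\beta_n \sum_i x_{ij}x_{ik}x_{im}x_{in}.$$

Every piece is a sum over $i$ of a product of exactly four of the quantities $x_{ij}$ and $y_i$, which is what makes the next step work.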
Now if we define $\tilde X$ to have columns $x_jy$ and $x_jx_k$ (for all $j$ and $k$), the matrix $\tilde X^T\tilde X$ contains all the $\sum_i x_{ij}x_{ik}y_i^2$, $\sum_i x_{ij}x_{ik}x_{im}y_i$, and $\sum_i x_{ij}x_{ik}x_{im}x_{in}$ pieces needed for $B$. The obvious thing to do is just to accumulate $\tilde X^T\tilde X$ in R code, one chunk at a time.
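A sketch of that ‘obvious’ accumulation, one chunk at a time; the column construction and ordering here are my illustration, not biglm’s internal layout.

```r
## Accumulate t(Xtilde) %*% Xtilde over chunks. Columns of Xtilde are
## x_j * y (for each j) and x_j * x_k (for each j, k), so the crossproduct
## contains every sum needed for the middle of the sandwich.
meat_acc <- NULL
for (chunk in chunk_list) {
  X <- model.matrix(~ x1 + x2, data = chunk)
  y <- chunk$y
  p <- ncol(X)
  Xtilde <- cbind(X * y,                                              # columns x_j * y
                  X[, rep(1:p, each = p)] * X[, rep(1:p, times = p)]) # columns x_j * x_k
  meat_acc <- if (is.null(meat_acc)) crossprod(Xtilde) else meat_acc + crossprod(Xtilde)
}
## The blocks of meat_acc can then be combined with the coefficient
## estimates and (X^T X)^{-1} to assemble the sandwich estimator.
```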
If you were too convinced of your own cleverness you might realise that $\tilde X$ could be fed into the QR decomposition as if it were $X$, and that you’d get $\tilde X^T\tilde X$ For Free! Where ‘for free’ means at a fair bit of extra computing time plus the mental anguish of reconstructing $\tilde X^T\tilde X$ from the QR decomposition. It’s not a big deal, since the computation is dominated by the cost of reading the data, but it does look kinda stupid in retrospect.
I suppose that means I’ve learned something in ten years.