
Faster generalised linear models in largeish data

There basically isn’t an algorithm for generalised linear models that computes the maximum likelihood estimator in a single pass over the \(N\) observations in the data. You need to iterate. The bigglm function in the biglm package does the iteration in bounded memory, by reading the data in chunks and starting again at the beginning for each iteration. That works, but it can be slow, especially if the database server doesn’t communicate that fast with your R process.

There is, however, a way to cheat slightly. If we had a good starting value \(\tilde\beta\), we’d only need one iteration – and all the necessary computation for a single iteration can be done in a single database query that returns only a small amount of data.  It’s well known that if \(\|\tilde\beta-\beta\|=O_p(N^{-1/2})\), the estimator resulting from one step of Newton–Raphson is fully asymptotically efficient. What’s less well known is that for simple models like glms, we only need \(\|\tilde\beta-\beta\|=o_p(N^{-1/4})\).
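Concretely, writing \(U_N\) for the score and \(I_N\) for the Fisher information on all \(N\) observations, the one-step update from \(\tilde\beta\) is
\[\hat\beta=\tilde\beta+I_N(\tilde\beta)^{-1}U_N(\tilde\beta),\]
and all the database has to return is \(U_N(\tilde\beta)\) and (an estimate of) \(I_N(\tilde\beta)\): a \(p\)-vector and a \(p\times p\) matrix rather than \(N\) rows of data. The Math stuff section below makes this precise.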

There’s not usually much advantage in weakening the assumption that way, because in standard asymptotics for well-behaved finite-dimensional parametric models, any reasonable starting estimator will be \(\sqrt{N}\)-consistent if it’s consistent at all. In the big-data setting, though, there’s a definite advantage: a starting estimator based on a bit more than \(N^{1/2}\) observations will have error less than \(N^{-1/4}\).  More concretely, if we sample \(n=N^{5/9}\) observations and compute the full maximum likelihood estimator, we end up with a starting estimator \(\tilde\beta\) satisfying \[\|\tilde\beta-\beta\|=O_p(n^{-1/2})=O_p(N^{-5/18})=o_p(N^{-1/4}).\]
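In R, the starting estimator is just an ordinary glm fit on a subsample. A minimal sketch, with dat, the model formula, and the variable names as placeholders rather than the actual analysis code:

```r
## Starting estimator: ordinary maximum likelihood on a subsample of
## size a bit more than sqrt(N).  For N = 1e9, n is only about 1e5.
## 'dat' is a placeholder data frame; in practice the subsample would be
## drawn by the database rather than by indexing a data frame in R.
N <- nrow(dat)                        # total number of observations
n <- ceiling(N^(5/9))                 # subsample size, N^(1/2 + 1/18)
subsample <- dat[sample.int(N, n), ]  # simple random subsample

tilde_fit  <- glm(y ~ x1 + x2, family = binomial(), data = subsample)
beta_tilde <- coef(tilde_fit)         # the starting value, beta-tilde
```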

The proof is later, because you don’t want to read it. The basic idea is to do a Taylor series expansion and show that the remainder is \(O_p(\|\tilde\beta-\beta\|^2)\), not just \(o_p(\|\tilde\beta-\beta\|)\).

This approach should be faster than bigglm, because it only needs one and a bit iterations, and because the data stays in the database. It doesn’t scale as far as bigglm, because you need to be able to handle \(n\) observations in memory, but with \(N\) being a billion, \(n\) is only a hundred thousand. 

The database query is fairly straightforward because the efficient score in a generalised linear model is of the form 
\[\sum_{i=1}^N x_iw_i(y_i-\mu_i)\]
for some weights \(w_i\). Even better, \(w_i=1\) for the most common models, the ones with canonical links such as logistic and Poisson regression. We do need an exponentiation function, which isn’t standard SQL, but is pretty widely supplied.
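For instance, for a logistic regression (so \(w_i=1\)) with two predictors, the whole full-data pass can be a single aggregation query built around EXP. A rough sketch using DBI, where the connection con, the table bigtable, and the column names are made up and y is assumed to be coded 0/1:

```r
## One pass over the N rows, done inside the database: compute the fitted
## probability at beta-tilde and the p sums that make up the score
## U_N(beta-tilde).  Only p numbers come back to R.
library(DBI)

score_query <- sprintf(
  "SELECT SUM(r)      AS s0,
          SUM(x1 * r) AS s1,
          SUM(x2 * r) AS s2
     FROM (SELECT x1, x2,
                  y - 1.0 / (1.0 + EXP(-(%.10f + %.10f * x1 + %.10f * x2))) AS r
             FROM bigtable) AS scored",
  beta_tilde[1], beta_tilde[2], beta_tilde[3])

U_N <- unlist(dbGetQuery(con, score_query))   # the score vector, length p = 3
```

Formatting the coefficients into the query string is crude, but it keeps the sketch short.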

So, how well does it work? On my ageing Macbook Air, I did a 1.7-million-record logistic regression to see if red cars are faster. More precisely, using the “passenger car/van” records from the NZ vehicle database, I fit a regression model where the outcome was being red and the predictors were vehicle mass, power, and number of seats. More powerful engines, fewer seats, and lower mass were associated with the vehicle being red. Red cars are faster.
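For the record, the model specification is nothing exotic; in glm notation it is just something like

```r
## Illustrative only: invented column names for the NZ fleet data.
glm(is_red ~ mass + power + seats, family = binomial(), data = vehicles)
```

with the actual fit done by the sample+one-step recipe above (and by bigglm for comparison).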

The computation time was 1.4s for the sample+one iteration approach and 15s for bigglm.

Now I’m working on an analysis of the NYC taxi dataset, which is much bigger and has more interesting variables. My first model, with 87 million records, was a bit stressful for my laptop. It took nearly half an hour of elapsed time for the sample+one-step approach and 41 minutes for bigglm, though bigglm took about three times as long in CPU time. I’m going to try it on my desktop to see how the comparison goes there. Also, this first try used the in-process MonetDBLite database, which will make bigglm look good, so I should also try a setting where the data transfer between R and the database actually needs to happen.

I’ll be talking about this at the JSM and at useR.

Links: arXiv, package, JCGS

Math stuff

Suppose we are fitting a generalised linear model with regression parameters \(\beta\), outcome \(Y\), and predictors \(X\).  Let \(\beta_0\) be the true value of \(\beta\), \(U_N(\beta)\) be the score at \(\beta\) on \(N\) observations, and \(I_N(\beta)\) the Fisher information at \(\beta\) on \(N\) observations. Assume the second partial derivatives of the log likelihood have uniformly bounded second moments on a compact neighbourhood \(K\) of \(\beta_0\). Let \(\Delta_3\) be the tensor of third partial derivatives of the log likelihood, and assume its elements

\[(\Delta_3)_{ijk}=\frac{\partial^3}{\partial \beta_i\,\partial \beta_j\,\partial \beta_k}\log\ell(Y;X,\beta)\]
have uniformly bounded second moments on \(K\).

Theorem:  Let \(n=N^{\frac{1}{2}+\delta}\) for some \(\delta\in (0,1/2]\), and let \(\tilde\beta\) be the maximum likelihood estimator of \(\beta\) on a subsample of size \(n\).  The one-step estimators
\[\hat\beta_{\textrm{full}}= \tilde\beta + I_N(\tilde\beta)^{-1}U_N(\tilde\beta)\]
and
\[\hat\beta= \tilde\beta + \frac{n}{N}I_n(\tilde\beta)^{-1}U_N(\tilde\beta)\]
are first-order efficient.
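In R, continuing the sketch from earlier (tilde_fit, beta_tilde, n, N, and the score U_N returned by the single database pass), the second estimator takes a couple of lines, since the only information matrix it needs is the one from the subsample fit:

```r
## hat(beta) from the theorem: the subsample Fisher information, scaled by
## n/N, combined with the full-data score U_N(beta-tilde).
X_sub <- model.matrix(tilde_fit)           # subsample design matrix
w_sub <- tilde_fit$weights                 # IRLS working weights; mu(1-mu) for logistic
I_n   <- crossprod(X_sub * w_sub, X_sub)   # I_n(beta-tilde) = X'WX on the subsample

beta_hat <- beta_tilde + (n / N) * solve(I_n, U_N)

## One natural variance estimate: invert the information scaled up to N observations.
vcov_hat <- solve((N / n) * I_n)
```

\(\hat\beta_{\textrm{full}}\) would instead need \(I_N(\tilde\beta)\), whose entries could be accumulated in the same database pass as the score.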

Proof: The score function at the true parameter value is of the form
\[U_N(\beta_0)=\sum_{i=1}^N x_iw_i(\beta_0)(y_i-\mu_i(\beta_0))\]
By the mean-value form of Taylor’s theorem we have
\[U_N(\beta_0)=U_N(\tilde\beta)+I_N(\tilde\beta)(\tilde\beta-\beta_0)+\Delta_3(\beta^*)(\tilde\beta-\beta_0,\tilde\beta-\beta_0)\]
where \(\beta^*\) lies on the line segment between \(\tilde\beta\) and \(\beta_0\). With probability 1, \(\tilde\beta\), and hence \(\beta^*\), lies in \(K\) for all sufficiently large \(n\), so the remainder term is \(O_p(Nn^{-1})=o_p(N^{1/2})\).
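Spelling out the orders: \(\Delta_3(\beta^*)\) is a sum of \(N\) terms whose elements have bounded second moments, so it is \(O_p(N)\), while \(\|\tilde\beta-\beta_0\|^2=O_p(n^{-1})\). With \(n=N^{1/2+\delta}\),
\[\Delta_3(\beta^*)(\tilde\beta-\beta_0,\tilde\beta-\beta_0)=O_p(N)\times O_p(n^{-1})=O_p(N^{1/2-\delta})=o_p(N^{1/2}).\]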
Thus
\[I_N^{-1}(\tilde\beta) U_N(\beta_0) = I^{-1}_N(\tilde\beta)U_N(\tilde\beta)+\tilde\beta-\beta_0+o_p(N^{-1/2})\]

Let \(\hat\beta_{MLE}\) be the maximum likelihood estimator. It is a standard result that
\[\hat\beta_{MLE}=\beta_0+I_N^{-1}(\beta_0) U_N(\beta_0)+o_p(N^{-1/2}),\]
and the same expansion holds with \(I_N(\beta_0)\) replaced by \(I_N(\tilde\beta)\), since \(I_N^{-1}(\tilde\beta)-I_N^{-1}(\beta_0)=O_p(N^{-1}n^{-1/2})\) while \(U_N(\beta_0)=O_p(N^{1/2})\).

So
\[\begin{eqnarray*} \hat\beta_{MLE}&=& \tilde\beta+I^{-1}_N(\tilde\beta)U_N(\tilde\beta)+o_p(N^{-1/2})\\ &=& \hat\beta_{\textrm{full}}+o_p(N^{-1/2}) \end{eqnarray*}\]

Now, define \(\tilde I(\tilde\beta)=\frac{N}{n}I_n(\tilde\beta)\), the estimated full-sample information based on the subsample, and let \({\cal I}(\beta)=E_{X,Y}\left[N^{-1}I_N(\beta)\right]\) be the expected per-observation information.  By the Central Limit Theorem we have
\[I_N(\tilde\beta)=I_n(\tilde\beta)+(N-n){\cal I}(\tilde\beta)+O_p((N-n)n^{-1/2}),\]
so
\[\left(\frac{N}{n}I_n(\tilde\beta)\right)^{-1}I_N(\tilde\beta)=\mathrm{Id}_p+ O_p(n^{-1/2})\]
where \(\mathrm{Id}_p\) is the \(p\times p\) identity matrix.
We have
\[\begin{eqnarray*} \hat\beta-\tilde\beta&=&\left(\frac{N}{n}I_n(\tilde\beta)\right)^{-1}I_N(\tilde\beta)\,(\hat\beta_{\textrm{full}}-\tilde\beta)\\ &=&\left(\mathrm{Id}_p+ O_p(n^{-1/2})\right)(\hat\beta_{\textrm{full}}-\tilde\beta)\\ &=&(\hat\beta_{\textrm{full}}-\tilde\beta)+ O_p(n^{-1}) \end{eqnarray*}\]
where the last line uses \(\hat\beta_{\textrm{full}}-\tilde\beta=O_p(n^{-1/2})\). Since \(n^{-1}=N^{-1/2-\delta}=o(N^{-1/2})\), \(\hat\beta\) (without the \(\textrm{full}\)) is also asymptotically efficient.