
# Stochastic SVD

Suppose you have an $$m\times n$$ matrix $$A$$ of rank $$k$$. If $$\Omega$$ is an $$n\times k$$ matrix with iid standard Gaussian entries, then $$\Omega$$ will have rank $$k$$ with probability 1, $$A\Omega$$ will have rank $$k$$ with probability 1, and so $$A\Omega$$ spans the range of $$A$$ (its range is a $$k$$-dimensional subspace of the $$k$$-dimensional range of $$A$$). That’s all easy.
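To see the exact rank-$$k$$ case concretely, here is a minimal numpy sketch (dimensions and seed are just for illustration): build an exactly rank-$$k$$ matrix, sketch it with a Gaussian $$\Omega$$, and check that an orthonormal basis for $$A\Omega$$ already captures all of $$A$$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 80, 5

# An exactly rank-k matrix A.
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))

# Gaussian test matrix with exactly k columns.
Omega = rng.standard_normal((n, k))
Y = A @ Omega                     # rank k with probability 1

# Orthonormal basis for range(Y) = range(A).
Q, _ = np.linalg.qr(Y)

# Projecting A onto range(Q) loses nothing (up to rounding).
print(np.linalg.norm(A - Q @ (Q.T @ A)))
```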

More impressively, if $$A=\tilde A+\epsilon$$ where $$\tilde A$$ has rank $$k$$ and $$\epsilon$$ has small norm, and if $$\Omega$$ has $$k+p$$ columns, then $$A\Omega$$ spans the range of $$\tilde A$$ with high probability, for surprisingly small values of $$p$$ ($$p=5$$ or $$10$$ already works well in practice). If $$Q$$ comes from a $$QR$$ decomposition of $$A\Omega$$, then $$Q^TA$$ has approximately the same $$k$$ largest singular values as $$A$$ (or, equivalently, as $$\tilde A$$).
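Put together, that’s the entire algorithm. Here is a hedged numpy sketch (the function name and the default $$p=10$$ are my choices, but the steps follow the prototype scheme in the paper cited below):

```python
import numpy as np

def randomized_svd(A, k, p=10, rng=None):
    """Approximate rank-k SVD of A via a Gaussian sketch with oversampling p."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))  # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)           # basis for the approximate range
    B = Q.T @ A                              # small (k+p) x n matrix
    U_hat, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_hat[:, :k], s[:k], Vt[:k]   # lift U back to m dimensions

# A = rank-k signal + small noise; the sketch recovers the top-k spectrum.
rng = np.random.default_rng(1)
m, n, k = 500, 400, 10
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
A += 1e-3 * rng.standard_normal((m, n))
U, s, Vt = randomized_svd(A, k)
print(np.abs(s - np.linalg.svd(A, compute_uv=False)[:k]).max())  # tiny
```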

The same trick works with lots of other random $$\Omega$$, for fixed $$p$$. If we are prepared to oversample by a factor of $$\log k$$, it even works for a ‘structured’ random $$\Omega=RHS$$, where $$R$$ applies random signs to each row, $$H$$ does the Hadamard Transform, and $$S$$ takes a random sample of $$k\log k$$ of the columns. The point of this choice of $$\Omega$$ is that $$A\Omega$$ can be computed in $$mn\log n$$ time (with a small constant, and for any $$k$$) using the Fast Hadamard Transform, rather than the $$mnk$$ time of explicit matrix multiplication.
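Here is one way the structured sketch can look, assuming $$n$$ is a power of two so the Hadamard transform applies directly (pad $$A$$ with zero columns otherwise); `fwht` and `srht_sketch` are illustrative names, and the scaling follows the usual SRHT convention:

```python
import numpy as np

def fwht(X):
    """Fast Walsh-Hadamard transform along the last axis of a 2-D array.
    The length of the last axis must be a power of two.  O(mn log n)."""
    X = X.copy()
    m, n = X.shape
    h = 1
    while h < n:
        # Butterfly step: pair entries i and i+h within blocks of size 2h.
        Y = X.reshape(m, n // (2 * h), 2, h)
        a = Y[:, :, 0, :].copy()
        Y[:, :, 0, :] = a + Y[:, :, 1, :]
        Y[:, :, 1, :] = a - Y[:, :, 1, :]
        h *= 2
    return X

def srht_sketch(A, ell, rng=None):
    """Y = A @ Omega for Omega = RHS: a random sign on each column of A,
    Hadamard mixing along the rows, then keep ell sampled columns."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    signs = rng.choice([-1.0, 1.0], size=n)        # R
    Y = fwht(A * signs) / np.sqrt(n)               # H, orthonormal scaling
    cols = rng.choice(n, size=ell, replace=False)  # S
    return Y[:, cols] * np.sqrt(n / ell)

# Used exactly like the Gaussian sketch:
# Q, _ = np.linalg.qr(srht_sketch(A, ell=128, rng=2))
```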

The reference you want is here: “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions” by Halko, Martinsson, and Tropp, SIAM Review 53(2), 2011.

One way to think about what’s happening is a combination of “Space is Big. Really Big.” with a version of the Law of Large Numbers. The columns of $$A$$ are $$n$$ points in $$m$$-dimensional space, and if $$m\gg\log n$$ they are really sparse. Because they are really sparse, capturing one singular vector of $$A$$ is pretty much independent of capturing another one. $$\Omega$$ doesn’t have any preference for whether it captures singular vectors with large or small singular values, but $$A\Omega$$ magnifies the larger ones. As the paper notes, multiplying by $$(AA^T)^q$$ for some small $$q$$ improves things even further. If you only took $$k$$ dimensions you’d still have a good chance of missing some of the $$k$$ leading singular vectors, but using $$k+p$$ dimensions decreases that chance – more or less exponentially in $$p$$, because of independence. The actual proofs, of course, are more complicated and use some fairly deep facts.
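The $$(AA^T)^q$$ remark translates directly into code. A sketch of the power scheme (function name mine), re-orthonormalizing after every multiplication so rounding error doesn’t wash out the small singular directions:

```python
import numpy as np

def randomized_range_finder(A, ell, q=0, rng=None):
    """Orthonormal basis for the approximate range of A.  With q > 0 this
    sketches (A A^T)^q A instead of A, which sharpens the gap between large
    and small singular values before the sketch sees it."""
    rng = np.random.default_rng(rng)
    Q, _ = np.linalg.qr(A @ rng.standard_normal((A.shape[1], ell)))
    for _ in range(q):
        Q, _ = np.linalg.qr(A.T @ Q)   # re-orthonormalize at each step
        Q, _ = np.linalg.qr(A @ Q)
    return Q
```

In practice $$q=1$$ or $$2$$ is usually enough; each increment costs two more passes over $$A$$.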

The stochastic SVD is no faster on average than Lanczos-type algorithms, but it’s enormously easier to program correctly and comes with simple probabilistic error bounds that are known to be sharp. As with the Lanczos-type algorithms, it can be made much faster if $$A$$ is sparse or otherwise structured so that multiplication by $$A$$ is fast.  The stochastic algorithm is also easier to parallelize and can be modified to scan $$A$$ only a few times, or even only once.
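As a taste of the single-pass variant, here is roughly how it goes when $$A$$ is symmetric (a sketch of the idea behind the paper’s single-pass scheme, not a drop-in implementation): once you have $$Y=A\Omega$$ and its $$Q$$, the small matrix $$B\approx Q^TAQ$$ can be recovered from the sketch alone, without touching $$A$$ again, because $$B\,(Q^T\Omega)\approx Q^TA\Omega=Q^TY$$.

```python
import numpy as np

def single_pass_eigh(A, ell, rng=None):
    """Approximate eigendecomposition of a symmetric A, touching A exactly
    once (in the product Y = A @ Omega)."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    Omega = rng.standard_normal((n, ell))
    Y = A @ Omega                          # the single pass over A
    Q, _ = np.linalg.qr(Y)
    # Solve B (Q.T @ Omega) ~ Q.T @ Y for B in least squares, via transposes.
    Bt, *_ = np.linalg.lstsq(Omega.T @ Q, Y.T @ Q, rcond=None)
    B = (Bt.T + Bt) / 2                    # symmetrize
    w, V = np.linalg.eigh(B)
    return w, Q @ V                        # approximate eigenpairs of A
```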