Stochastic SVD

Suppose you have an m×n matrix A of rank k. If Ω is an n×k matrix with iid standard Gaussian entries, then Ω will have rank k with probability 1,  AΩ will have rank k with probability 1, and so AΩ spans the range of A. That’s all easy.
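This is easy to see numerically. Here is a tiny NumPy check (a sketch only; the sizes and the random seed are arbitrary): build an exactly rank-k A, sketch it with a Gaussian Ω, and confirm that projecting onto the span of AΩ reproduces A.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 100, 10

# An exactly rank-k matrix A, built as a product of two Gaussian factors.
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))

Omega = rng.standard_normal((n, k))   # iid standard Gaussian test matrix
Q, _ = np.linalg.qr(A @ Omega)        # orthonormal basis for the span of A @ Omega

# If A @ Omega spans the range of A, projecting A onto that span loses nothing.
print(np.linalg.norm(A - Q @ (Q.T @ A)) / np.linalg.norm(A))   # ~1e-15
```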

More impressively, if A = Ã + ϵ where Ã has rank k and ϵ has small norm, and if Ω has k+p columns, then AΩ spans the range of Ã with high probability, for surprisingly small values of p. If Q comes from a QR decomposition of AΩ, then QᵀA has approximately the same k largest singular values as A (or, equivalently, as Ã).
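So the whole recipe is: sketch, orthonormalize, project, then take a small SVD. A minimal NumPy sketch of that recipe follows; the function name, matrix sizes, and noise level here are illustrative choices, not anything from the paper.

```python
import numpy as np

def randomized_svd(A, k, p=10, rng=None):
    """Approximate rank-k SVD of A from a Gaussian sketch with oversampling p."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))      # Gaussian test matrix with k+p columns
    Q, _ = np.linalg.qr(A @ Omega)               # orthonormal basis for the span of A @ Omega
    B = Q.T @ A                                  # small (k+p) x n matrix
    U_B, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_B[:, :k], s[:k], Vt[:k]         # lift the left factor back to m dimensions

# Noisy low-rank test matrix: A = A_tilde (rank k) plus small noise.
rng = np.random.default_rng(1)
m, n, k = 500, 300, 10
A_tilde = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
A = A_tilde + 1e-3 * rng.standard_normal((m, n))

_, s, _ = randomized_svd(A, k, rng=rng)
s_exact = np.linalg.svd(A, compute_uv=False)[:k]
print(np.max(np.abs(s - s_exact) / s_exact))     # relative error of the top k singular values
```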

The same trick works with lots of other random Ω, for fixed p. If we are prepared to take p = log k it even works for a ‘structured’ random Ω = RHS, where R applies random signs to each row, H does the Hadamard Transform, and S takes a random sample of k log k columns. The point of this choice of Ω is that AΩ can be computed in mn log n time (with a small constant, and for any k) using the Fast Hadamard Transform, rather than the mnk time of explicit matrix multiplication.
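As a rough illustration (not code from the paper), here is one way to apply that structured sketch with an unnormalized fast Walsh–Hadamard transform. It assumes n is a power of two, and fwht and srht_sketch are made-up names rather than any library API.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform along the last axis (length a power of two)."""
    x = x.copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x

def srht_sketch(A, ell, rng):
    """Y = A @ Omega with Omega = (random signs) (Hadamard) (column subsample).

    The Hadamard step costs O(mn log n), versus O(mn ell) for a dense Gaussian Omega.
    """
    m, n = A.shape
    signs = rng.choice([-1.0, 1.0], size=n)           # random sign flip for each column of A
    Y = fwht(A * signs)                               # fast Hadamard transform of each row
    cols = rng.choice(n, size=ell, replace=False)     # keep ell randomly chosen columns
    return Y[:, cols] / np.sqrt(ell)

rng = np.random.default_rng(2)
A = rng.standard_normal((256, 512))                   # n = 512 is a power of two
Q, _ = np.linalg.qr(srht_sketch(A, ell=64, rng=rng))  # range basis from the structured sketch
```

In practice you would zero-pad A up to the next power of two, or use the FFT-based variant described in the paper.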

The reference you want is here: “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions” by Halko, Martinsson, and Tropp. 

One way to think about what’s happening is a combination of “Space is Big. Really Big.” with a version of the Law of Large Numbers. The columns of A are n points in m-dimensional space, and if m ≫ log n they are really sparse. Because they are really sparse, capturing one singular vector of A is pretty much independent of capturing another one. Ω doesn’t have any preference for whether it captures singular vectors with large or small singular values, but AΩ magnifies the larger ones. As the paper notes, multiplying by (AAᵀ)^q for some small q improves things even further. If you only took k dimensions you’d still have a good chance of missing some of the k main singular vectors, but using k+p dimensions decreases that chance – more or less exponentially in p, because of the independence. The actual proofs, of course, are more complicated and use some fairly deep facts.
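The (AAᵀ)^q step looks like this in practice; this is a sketch in the spirit of the paper’s power scheme, with names and defaults of my own choosing, and the repeated QR factorizations are there only to keep the small directions from being washed out numerically.

```python
import numpy as np

def randomized_range_finder(A, ell, q=2, rng=None):
    """Orthonormal basis for an approximate range of A, sharpened by q rounds
    of multiplication by (A Aᵀ) to widen the gap between large and small
    singular values."""
    rng = np.random.default_rng() if rng is None else rng
    Y = A @ rng.standard_normal((A.shape[1], ell))
    Q, _ = np.linalg.qr(Y)
    for _ in range(q):
        Q, _ = np.linalg.qr(A.T @ Q)   # re-orthonormalize between multiplications
        Q, _ = np.linalg.qr(A @ Q)
    return Q
```

Even q = 1 or 2 typically brings the singular values of QᵀA much closer to those of A when the spectrum decays slowly.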

The stochastic SVD is no faster on average than Lanczos-type algorithms, but it’s enormously easier to program correctly and comes with simple probabilistic error bounds that are known to be sharp. As with the Lanczos-type algorithms, it can be made much faster if A is sparse or otherwise structured so that multiplication by A is fast.  The stochastic algorithm is also easier to parallelize and can be modified to scan A only a few times, or even only once.