Suppose you have an $m \times n$ matrix $A$ of rank $k$. If $\Omega$ is an $n \times k$ matrix with iid standard Gaussian entries, then $\Omega$ will have rank $k$ with probability 1, $A\Omega$ will have rank $k$ with probability 1, and so $A\Omega$ spans the range of $A$. That’s all easy.
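For concreteness, here is a minimal numpy sketch of the easy case (the sizes and the seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 80, 5                       # sizes are arbitrary

# An m x n matrix A of exact rank k.
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))

# Gaussian test matrix with k columns; Y = A @ Omega samples the range.
Omega = rng.standard_normal((n, k))
Y = A @ Omega

print(np.linalg.matrix_rank(Y))            # k, with probability 1

# Y spans range(A): projecting A onto range(Y) loses nothing.
Q, _ = np.linalg.qr(Y)
print(np.linalg.norm(A - Q @ (Q.T @ A)))   # ~ machine precision
```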
More impressively, if $A = B + E$ where $B$ has rank $k$ and $E$ has small norm, and if $\Omega$ has $k + p$ columns, $A\Omega$ spans the range of $B$ with high probability, for surprisingly small values of $p$. If $Q$ comes from a QR decomposition of $A\Omega$, then $Q^T A$ has approximately the same largest singular values as $A$ (or, equivalently, as $B$).
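A sketch of the oversampled version; the noise level and the particular $p$ here are illustrative choices, not prescriptions from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k, p = 500, 400, 10, 5               # p is the oversampling

# A = B + E: a rank-k signal plus small-norm noise.
B = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
E = 1e-3 * rng.standard_normal((m, n))
A = B + E

# Sample the range with k + p Gaussian columns, then orthonormalize.
Omega = rng.standard_normal((n, k + p))
Q, _ = np.linalg.qr(A @ Omega)             # m x (k+p), spans range(B) w.h.p.

# The small matrix Q^T A has nearly the same top singular values as A.
print(np.linalg.svd(Q.T @ A, compute_uv=False)[:k])
print(np.linalg.svd(A, compute_uv=False)[:k])
```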
The same trick works with lots of other random $\Omega$, for fixed $p$. If we are prepared to take a somewhat larger $p$, it even works for a ‘structured’ random $\Omega = DHS$, where $D$ applies random signs to each row, $H$ does the Hadamard Transform, and $S$ takes a random sample of the rows of a matrix. The point of this choice of $\Omega$ is that $A\Omega$ can be computed in $O(mn \log n)$ time (with a small constant, and for any $k$) using the Fast Hadamard Transform, rather than the $O(mnk)$ time of explicit matrix multiplication.
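Here is one way the structured sketch might be realized in numpy. The `fwht` helper and the $1/\sqrt{n}$ normalization are my assumptions (scaling conventions for the Hadamard factor vary), and $n$ is taken to be a power of two, as the transform requires:

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform along the last axis.
    The length must be a power of two; cost is O(n log n) per row."""
    x = np.array(x, dtype=float)
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x

rng = np.random.default_rng(2)
m, n, ell = 64, 256, 16                        # n a power of two, ell = k + p

A = rng.standard_normal((m, n))
signs = rng.choice([-1.0, 1.0], size=n)        # the diagonal of D
cols = rng.choice(n, size=ell, replace=False)  # the sample taken by S

# A @ (D H S): sign-flip the columns of A, Hadamard-transform each row
# with the fast transform, then keep the ell sampled columns.  The
# 1/sqrt(n) factor makes H/sqrt(n) orthogonal.
Y = fwht(A * signs)[:, cols] / np.sqrt(n)

# Sanity check against the explicit O(m n ell) product.
H = fwht(np.eye(n))
print(np.allclose(Y, (A * signs) @ H[:, cols] / np.sqrt(n)))  # True
```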
The reference you want is here: “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions” by Halko, Martinsson, and Tropp.
One way to think about what’s happening is a combination of “Space is Big. Really Big.” with a version of the Law of Large Numbers. The columns of $\Omega$ are points in $n$-dimensional space, and if $n$ is large they are really sparsely scattered. Because they are really sparse, capturing one eigenvector of $A$ is pretty much independent of capturing another one. $\Omega$ doesn’t have any preferences for whether it captures eigenvectors with large or small eigenvalues, but $A$ magnifies the larger ones. As the paper notes, multiplying by $(AA^T)^q$ for some small $q$ improves things even further. If you only took $k$ dimensions you’d still have a good chance of missing some of the main eigenvectors, but using $k+p$ dimensions decreases that chance – more or less exponentially in $p$, because of independence.
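The power-iteration remark can be illustrated with a short sketch; the slowly decaying spectrum and the choice $q = 2$ are illustrative assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k, p, q = 500, 400, 10, 5, 2

# A matrix whose singular values decay slowly (like 1/sqrt(i)), so the
# plain sketch has trouble telling the leading directions from the rest.
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (U * (1.0 / np.sqrt(np.arange(1, n + 1)))) @ V.T

def range_error(Y):
    Q, _ = np.linalg.qr(Y)
    return np.linalg.norm(A - Q @ (Q.T @ A), 2)

Omega = rng.standard_normal((n, k + p))
Y = A @ Omega
print(range_error(Y))                      # plain sketch

# Power iteration: sample the range of (A A^T)^q A instead.  Same
# singular vectors, but singular values raised to the (2q+1) power,
# which stretches the gap between large and small ones.  (In floating
# point one would re-orthonormalize between the multiplications.)
for _ in range(q):
    Y = A @ (A.T @ Y)
print(range_error(Y))                      # noticeably smaller
```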
The stochastic SVD is no faster on average than Lanczos-type algorithms, but it’s enormously easier to program correctly and comes with simple probabilistic error bounds that are known to be sharp. As with the Lanczos-type algorithms, it can be made much faster if $A$ is sparse or otherwise structured so that multiplication by $A$ is fast. The stochastic algorithm is also easier to parallelize and can be modified to scan $A$ only a few times, or even only once.
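As a rough illustration of the one-pass idea, here is a sketch of one possible single-pass variant, with Gaussian test matrices and sizes chosen arbitrarily by me; it is not the paper’s exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, k, ell = 300, 200, 10, 20

A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) \
    + 1e-3 * rng.standard_normal((m, n))

# Sketch the column space and the row space up front; in a streaming
# setting both products can be accumulated in a single scan of A.
Omega = rng.standard_normal((n, ell))
Psi = rng.standard_normal((m, ell))
Y = A @ Omega
Z = A.T @ Psi

# From here on, A is never touched again.
Q, _ = np.linalg.qr(Y)                     # basis for the column space
Qt, _ = np.linalg.qr(Z)                    # basis for the row space

# If A ~ Q B Qt^T with B = Q^T A Qt, then Psi^T A Qt = Z^T Qt and
# Psi^T A Qt ~ (Psi^T Q) B, so B solves a small least-squares problem.
B, *_ = np.linalg.lstsq(Psi.T @ Q, Z.T @ Qt, rcond=None)
A_hat = Q @ B @ Qt.T
print(np.linalg.norm(A - A_hat) / np.linalg.norm(A))   # small
```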