4 min read

Sparse correlation and the Central Limit Theorem

Back when I was a PhD student working on generalisations of GEE, I was interested in ‘sparse’ correlations, defined by ‘most’ small sets of observations being independent. One way to get this structure is from crossed clustering variables; another is for the basic units in your analysis to be pairs (or larger tuples) of observations. If you drew a graph with the variables as vertices, and connected correlated variables, then two subsets of the variables would be independent of each other if there was no edge between them.

To see what this implies in large samples, consider a balanced incomplete multi-rater experiment with K raters and D objects to be rated, in which each rater rates 1<dD of the objects and each object is rated by 1<kK of the raters. The number of ratings is n=Kd=kD. Two ratings are connected by an edge in the dependence graph if they share a rating or share a rating. The maximal degree of the graph is M=k+d1. We are interested in the case where K and D are both large.

Consider a simple parametric random-effects model for the rating Yij by rater i of object j Yij=μ+ai+bj+ϵij where μ is a fixed constant, aiN(0,σa2), bjN(0,σb2), and ϵijN(0,σϵ2). The variance of Sn=ijYij is σn2=nσϵ2+Kd2σa2+k2Dσb2=nσϵ2+ndσa2+nkσb2.

The ratio σn2/Mn will be bounded above and bounded away from zero if σa2 and σb2 are both non-zero, and we will have Sn/n=X¯npμ as long as K,D.

Now, let’s look at a scaled version of X¯nμ. We can’t just scale it by n as in the independent setting, because the variance of X¯n is O(M/n) rather than O(1/n). We can still hope for nσn2(X¯nμ)dN(0,1), since at least it’s the right size. In the simple multi-rater case it certainly works.

Encouragingly, we get the same sort of limit if we have M identical copies of each of m independent items, as another way to get sparse correlation.

Proving it?

The first thing I tried was to see if the moments of X¯n did anything helpful. I thought I had a proof, in fact, but it was wrong. The idea does work if you do it with cumulants; Svente Janson did it correctly, ten years earlier, and published it under the title Normal Convergence by Higher Semiinvariants with Applications to Sums of Dependent Random Variables and Random Graphs. I hadn’t found it in searching, and no-one I talked to at the time had heard of it.

The second thing I tried was to adapt a Central Limit Theorem for random fields by Xavier Guyon, which I’d already used in my thesis. The proof uses something called Stein’s Method, and involves a counting bound on ‘close’ pairs of pairs of observations and a long-range weak dependence bound on ‘distant’ pairs of pairs. I could just dump the ‘distant’ part of the proof and make it work for sparse correlation. Success!!

Nicole Mayer-Hamblett and I had a manuscript about asymptotics for sparsely-correlated generalised linear models, with some nice examples, and we got revision requests from a journal and then neither of us had time to do the revision. Then I worked out how to prove an exponential tail bound for sparse correlation and sent it off to a probability journal. They thought it was maybe ok, but that I needed to look up something called graph-structured dependence in the probability literature. That turned out to be the same Central Limit Theorem proof that I had (only a bit tidier and with an explicit error bound) and had already been published. In 1989. Sigh.

The Theorem

Let X1,,Xn be random variables such that E[Xi4]<, E[Xi]=0, σ2=var[iXi] and define S=iXi/σ. Let M be the maximal degree of a dependency graph for the Xi. Then for Z a standard Normal variable dW(S,Z)M2σ3iE|Xi3|+28M3/2πσ2iE|Xi4| where dW is the Wasserstein distance.

Scaling

The example of the parametric inter-rater experiment suggests that M/n0 should be enough for a central limit theorem, but the actual theorem seems to require a stronger condition. If the Xi were identically distributed and σ2 scaled as n, the first term would have order O(M2n/n3/2) and we would need M2n1/20. However, in the setup of the inter-rater experiment, if the rater and object variance components are non-zero, we actually have σ2 scaling as nM, so σ3 scales as M3/2n3/2. The first term in the bound is O(M1/2n1/2) and M/n0 is indeed sufficient.

Janson’s result using cumulants instead of Stein’s method doesn’t need the extra variance components to be non-zero, but it does impose stronger conditions on the tails of the Xi.

When working on asymptotic distributions for mixed-model parameters I’m reasonably happy to assume that true variance components are positive – if instead they are on the boundary of the parameter space, special arguments will be needed in any case. The variance assumptions are more of a concern when the graph-structured dependence is induced by sampling rather than by a model for the outcome.