Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist, but that’s just peanuts to space. (The Hitchhiker’s Guide to the Galaxy)
There’s a simple simulation that I used in stat computing class last year and for Rob Hyndman’s working group last week. Simulate data uniformly on a \(p\)-dimensional hypercube \([0,\,1]^p\) and compute nearest-neighbour distances.
Obviously these distances increase with \(p\), since the squared Euclidean distance is a sum of non-negative terms. For two independent uniform points, each coordinate contributes \(E(X_j-Y_j)^2=1/6\), so the mean squared distance is \(p/6\) and the typical interpoint distance grows like \(\sqrt{p}\). I’ll use \(\sqrt{p}/3\), the exact mean distance when \(p=1\), as a benchmark. How does the nearest-neighbour distance compare?
> library(FNN) # knn.dist() here is assumed to come from the FNN package
> x1<-runif(5e4)
> x2<-matrix(runif(5e4*2),ncol=2)
> x10<-matrix(runif(5e4*10),ncol=10)
> x100<-matrix(runif(5e4*100),ncol=100)
> system.time(d1<-knn.dist(x1,k=1))
user system elapsed
0.044 0.002 0.045
> system.time(d2<-knn.dist(x2,k=1))
user system elapsed
0.066 0.002 0.088
> system.time(d10<-knn.dist(x10,k=1))
user system elapsed
4.005 0.010 3.993
> system.time(d100<-knn.dist(x100,k=1))
user system elapsed
705.281 0.996 702.481
> mean(d1/sqrt(1)/(1/3))
[1] 2.994883e-05
> mean(d2/sqrt(2)/(1/3))
[1] 0.004764815
> mean(d10/sqrt(10)/(1/3))
[1] 0.306989
> mean(d100/sqrt(100)/(1/3))
[1] 0.9269603
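Before reading too much into those means, it’s worth checking that they aren’t hiding a lot of spread. A minimal sketch, reusing the d1, d2, d10 and d100 objects above (the 1%/50%/99% quantiles are an arbitrary choice):

# quantiles of the nearest-neighbour distance, scaled by the sqrt(p)/3 benchmark
qs <- c(0.01, 0.5, 0.99)
rbind(p1   = quantile(d1   / (sqrt(1)   / 3), qs),
      p2   = quantile(d2   / (sqrt(2)   / 3), qs),
      p10  = quantile(d10  / (sqrt(10)  / 3), qs),
      p100 = quantile(d100 / (sqrt(100) / 3), qs))

In 100 dimensions even the lower quantile should be a sizeable fraction of the benchmark: the distribution is tight, not just centred high.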
The typical nearest-neighbour distance for 50,000 uniform points in \([0,\,1]\) is 0.00003 times the mean interpoint distance, which is sort of what you’d expect. In \([0,\,1]^2\) it’s 0.005 times the mean distance, which is not too surprising.
In \([0,\,1]^{100}\), though, the nearest-neighbour distance is over 90% of the \(\sqrt{p}/3\) benchmark, and about three-quarters of the mean interpoint distance. If you haven’t seen this sort of computation before, that seems outrageous. Roughly speaking, in 100-dimensional space, if you have 50,000 points in a unit cube they are all about equidistant from each other!
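You can check the ‘all about equidistant’ claim directly by looking at all the pairwise distances in a subsample. A sketch (the subsample size of 2,000 is arbitrary, just to keep the distance matrix manageable):

# all ~2 million pairwise distances among 2,000 of the 100-dimensional points
dd <- as.vector(dist(x100[1:2000, ]))
summary(dd)       # min, median and max should all sit near sqrt(100/6), about 4.1
sd(dd) / mean(dd) # relative spread: should be only a few percent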
If you think of the squared interpoint distance as a sum, this makes sense in terms of the Law of Large Numbers and the Central Limit Theorem. The squared Euclidean distance between two random points is a sum of \(p\) independent terms, so it should be close to its expected value, differing from it by a relative amount of order \(O_p(1/\sqrt{p})\). If we have \(n\) (roughly independent) random variables like that, and they don’t have heavier tails than a Gaussian (which they don’t), the worst departure from the expected value should scale like \[O_p\left(\sqrt{\frac{\log n}{p}}\right).\] In other words, nearest neighbours can’t be very much closer than the average interpoint distance.
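A minimal check of that scaling (my own sketch; rel_spread is just a helper name here, and the dimensions and the 10,000 random pairs are arbitrary choices):

# relative spread of the distance between independent random pairs of uniform points
rel_spread <- function(p, npairs = 1e4) {
  x <- matrix(runif(npairs * p), ncol = p)
  y <- matrix(runif(npairs * p), ncol = p)
  d <- sqrt(rowSums((x - y)^2))
  sd(d) / mean(d)
}
sapply(c(1, 10, 100, 1000), rel_spread) # should shrink roughly like 1/sqrt(p)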
Many ‘black-box’ prediction methods are based on using the outcome values at nearby points to predict the outcome at a new point. This can’t work in high-dimensional space; there are no nearby points.
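As a toy illustration of why that matters for prediction, here’s a sketch (my own example, not from the analysis above): the outcome depends only on the first coordinate, the other \(p-1\) coordinates are pure noise, and knn.reg() from the FNN package does the k-nearest-neighbour regression.

# test-set error of k-NN regression as useless dimensions are added
knn_mse <- function(p, n = 2000, k = 5) {
  xtr <- matrix(runif(n * p), ncol = p)
  xte <- matrix(runif(n * p), ncol = p)
  f <- function(x) sin(2 * pi * x[, 1]) # the signal lives entirely in coordinate 1
  ytr <- f(xtr) + rnorm(n, sd = 0.1)
  pred <- knn.reg(xtr, xte, ytr, k = k)$pred
  mean((pred - f(xte))^2)
}
sapply(c(1, 2, 10, 100), knn_mse)
# the error should climb towards the variance of the signal (about 0.5)
# as the nearest neighbours stop being meaningfully near in the first coordinate

The point isn’t the particular numbers; it’s that the neighbours such a method averages over stop being informative once the space is big enough.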
In a sense, everyone knows this. Statisticians call it the ‘curse of dimensionality’, and modern sparse regression methods are aimed at bypassing it by not working in high-dimensional spaces. It’s still more dramatic than most people realise, and it’s hard for our intuition to cope: to think of a high-dimensional unit cube or sphere and ‘believe how vastly, hugely, mind-bogglingly big it is’.