As part of adding crossed correlations (and other sparse correlation) to the survey package, I was writing code to test whether a user-supplied adjacency matrix had only 0 and 1 values. Since this is just a validity test for a user-supplied argument, it will usually pass and needs to pass as fast as possible. How fast it fails is less important. Also, since the survey package is pure R, it needs to be pure R.
For a matrix object, the obvious solution is all(m %in% c(0,1)), but these are going to be potentially large matrices that are mostly zero, so supporting sparse Matrix objects is important. These store the non-zero elements of the matrix, with a couple of integer vectors to indicate which elements those are. Converting a sparse Matrix object to a traditional matrix will force all the implicit zeroes to be filled in, making the object much larger.
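To make that storage scheme concrete, here is a minimal sketch (the tiny matrix and its values are invented for illustration) of where one common sparse class, dgCMatrix, keeps things; the other sparse classes are similar.
library(Matrix)
## a small 4 x 4 matrix with three stored entries
small <- sparseMatrix(i = c(1, 3, 4), j = c(1, 2, 4), x = c(1, 1, 1))
small@x  ## the stored values themselves
small@i  ## 0-based row indices of those values
small@p  ## column pointers saying where each column starts in @x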
If m is a sparse Matrix object, all(m %in% c(0,1)) will be inefficient. It also won’t work, because %in% doesn’t have a method for these objects, but all(as.matrix(m) %in% c(0,1)) will work and be inefficient. There are two sorts of strategies. One is to do something clever with the properties of binary matrices to get an efficient solution; the other is to open up the object and work with however it stores the non-zero elements. There’s actually also a third class of strategy worth considering: brute force and ignorance.
I’m going to work with two examples here. One is a plausible size for a large set of measurements. The other is a plausible size for a large set of pairs of measurements (and so is bigger). They are both block diagonal. The first one has 500 blocks of size \(10\times 10\); the second has 320 blocks of size \(100\times 100\).
library(Matrix)
m0<-bdiag(lapply(1:500, function(i) matrix(1L,10,10) ))
## number of elements
prod(dim(m0))
## [1] 2.5e+07
## size
object.size(m0)
## 621504 bytes
m1<-bdiag(lapply(1:320, function(i) matrix(1,100,100) ))
## number of elements
prod(dim(m1))
## [1] 1.024e+09
## size
object.size(m1)
## 38529504 bytes
On my home computer, as.matrix(m1) is a bit too big to fit in memory, but as.matrix(m0) fits easily. In sparse form, the first matrix uses about 0.025 bytes per element and the second uses about 0.04 bytes per element. They’d use about 8 bytes per element (plus a small amount of overhead) as matrix objects.
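If you want to check those per-element figures, a quick back-of-the-envelope calculation on the objects defined above (just a sketch) is
## bytes per element in sparse form (roughly 0.025 and 0.04)
as.numeric(object.size(m0)) / prod(dim(m0))
as.numeric(object.size(m1)) / prod(dim(m1))
## dense storage needs 8 bytes per double, so m1 would take about 8GB
8 * prod(dim(m1))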
Clever matrix ideas
Peter Dalgaard and either CT Bergstrom or Jevin West (@callin_bull) suggested \(M\cdot M-M=0\) as an identity satisfied by binary matrices. This has potential, because the elementwise product, the subtraction, and the test might all be done on the sparse representation. On the other hand, the subtraction operator might not realise \(M\cdot M\) and \(M\) have the same sparsity pattern, so the implementation could be unnecessarily slow. On top of that, the final check against zero could be slow.
system.time(z0<-m0*m0-m0)
## user system elapsed
## 0.005 0.000 0.006
system.time(all(z0==0))
## user system elapsed
## 0.175 0.032 0.216
system.time(z1<-m1*m1-m1)
## user system elapsed
## 0.398 0.066 0.498
system.time(all(z1==0))
## user system elapsed
## 5.139 2.649 9.628
It turns out that the test vs zero (z1==0) creates a dense logical matrix, which I wasn’t expecting. However, identical seems to work efficiently:
system.time(identical(m0*m0,m0))
## user system elapsed
## 0.003 0.001 0.004
system.time(identical(m1*m1,m1))
## user system elapsed
## 0.192 0.036 0.235
On the other hand, using identical is a bit fragile, because any additional attributes will make it fail:
m2<-m0
attr(m2,"trombones")<-76
identical(m2*m2,m2)
## [1] FALSE
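One way to make this less fragile (a sketch, which already peeks at the internals discussed in the next section) is to compare values rather than whole objects: for a binary matrix the elementwise difference \(M\cdot M-M\) should have no non-zero stored values.
d <- m2*m2 - m2
all(d@x == 0)  ## TRUE even though m2 carries the extra attribute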
Using the internal representation
In the comic fantasy novel One for the Morning Glory, there’s a reference to a book Highly Unpleasant Things it is Sometimes Necessary to Know. The internal representation of other people’s objects falls into this category. In contrast to a lot of other programming languages, R doesn’t actually stop you from meddling with these internals; it just trusts you to know when to stop.
Fortunately, the Matrix class was designed to be infrastructure for other people’s code, and sparse matrices are established technology, so it’s pretty stable and well documented. The non-zero elements of the matrix are stored in the x slot. So
system.time(all(m0@x %in% c(0,1)))
## user system elapsed
## 0 0 0
system.time(all(m1@x %in% c(0,1)))
## user system elapsed
## 0.019 0.002 0.021
There are slightly faster ways of testing whether all the non-zero entries are 0 or 1. Gabe Becker suggested
system.time(
  length(setdiff(unique(m0@x), c(0, 1)))
)
## user system elapsed
## 0.001 0.000 0.001
system.time(
  length(setdiff(unique(m1@x), c(0, 1)))
)
## user system elapsed
## 0.020 0.003 0.022
This microbenchmarks to be faster, but not so as you’d notice. Also, from Michael T Young
system.time({
  n_zero <- length(m0) - nnzero(m0)
  n_zero == length(m0) || n_zero + sum(m0@x == 1L) == length(m0)
})
## user system elapsed
## 0.000 0.000 0.002
system.time({
  n_zero <- length(m1) - nnzero(m1)
  n_zero == length(m1) || n_zero + sum(m1@x == 1L) == length(m1)
})
## user system elapsed
## 0.012 0.001 0.013
which is faster for the large example and maybe slower for the small one.
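For timing comparisons this fine-grained, system.time is fairly blunt; something along these lines (a sketch, assuming the microbenchmark package is available) gives a sharper comparison of the three tests on the larger example.
library(microbenchmark)
microbenchmark(
  in_test  = all(m1@x %in% c(0, 1)),
  setdiff  = length(setdiff(unique(m1@x), c(0, 1))) == 0,
  counting = {
    n_zero <- length(m1) - nnzero(m1)
    n_zero == length(m1) || n_zero + sum(m1@x == 1L) == length(m1)
  },
  times = 20
)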
Brute force and ignorance
Computers have gotten faster over the years, and it’s surprising how often a brute force solution to a problem is now feasible.
In this example
system.time(all(as.vector(m0) %in% c(0,1)))
## user system elapsed
## 0.155 0.042 0.198
That’s relatively slow, but it’s only about a fifth of a second. We could live with it.
For a submatrix with a quarter of the elements of the bigger example, it’s still a bit too slow:
system.time(all(as.vector(m1[1:16000,1:16000]) %in% c(0,1)))
## user system elapsed
## 1.856 1.012 3.496
Decisions
I like the identical solution that doesn’t use the matrix internals, and I … appreciate… Michael Young’s solution, but in the end I went with something less fragile than the former and easier to understand than the latter:
is.binary <- function(object) {
  if (is.matrix(object)) {
    all(object %in% c(0, 1))
  } else if (inherits(object, "lMatrix") || inherits(object, "nMatrix")) {
    TRUE
  } else {
    all(object@x %in% c(0, 1))
  }
}
The middle check is for two Matrix classes that are always binary: a logical matrix and a 0/1 ‘pattern matrix’.
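As a quick sanity check of each branch (a sketch using the objects from above; the expected results are in the comments):
is.binary(m0)                          ## sparse 0/1 matrix: TRUE
is.binary(as.matrix(m0[1:10, 1:10]))   ## ordinary dense matrix: TRUE
is.binary(m0 > 0)                      ## comparison gives a logical lMatrix: TRUE
is.binary(m1 * 2)                      ## stored values of 2: FALSE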