As part of adding crossed correlations (and other sparse correlation) to the survey package, I was writing code to test whether a user-supplied adjacency matrix had only 0 and 1 values. Since this is just a validity test for a user-supplied argument, it will usually pass and needs to pass as fast as possible. How fast it fails is less important. Also, since the survey package is pure R, it needs to be pure R.
For a matrix object, the obvious solution is all(m %in% c(0,1)), but these are going to be potentially large matrices that are mostly zero, so supporting sparse Matrix objects is important. These store the non-zero elements of the matrix, with a couple of integer vectors to indicate which elements those are. Converting a sparse Matrix object to a traditional matrix will force all the implicit zeroes to be filled in, making the object much larger.
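To make that storage scheme concrete, here is a minimal sketch (the tiny matrix and its values are invented for illustration) of where one common sparse class, dgCMatrix, keeps things; the other sparse classes are similar.
library(Matrix)
## a small 4 x 4 matrix with three stored entries
small <- sparseMatrix(i = c(1, 3, 4), j = c(1, 2, 4), x = c(1, 1, 1))
small@x  ## the stored values themselves
small@i  ## 0-based row indices of those values
small@p  ## column pointers saying where each column starts in @x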
If m is a sparse Matrix object, all(m %in% c(0,1)) will be inefficient. It also won’t work, because %in% doesn’t have a method for these objects, but all(as.matrix(m) %in% c(0,1)) will work and be inefficient. There are two sorts of strategies. One is to do something clever with the properties of binary matrices to get an efficient solution; the other is to open up the object and work with however it stores the non-zero elements. There’s actually also a third class of strategy worth considering: brute force and ignorance.
I’m going to work with two examples here. One is a plausible size for a large set of measurements. The other is a plausible size for a large set of pairs of measurements (and so is bigger). They are both block diagonal. The first one has 500 blocks of size \(10\times 10\); the second has 320 blocks of size \(100\times 100\).
library(Matrix)
m0<-bdiag(lapply(1:500, function(i) matrix(1L,10,10) ))
## number of elements
prod(dim(m0))
## [1] 2.5e+07
## size
object.size(m0)
## 621504 bytes
m1<-bdiag(lapply(1:320, function(i) matrix(1,100,100) ))
## number of elements
prod(dim(m1))
## [1] 1.024e+09
## size
object.size(m1)
## 38529504 bytes
On my home computer, as.matrix(m1) is a bit too big to fit in memory, but as.matrix(m0) fits easily. In sparse form, the first matrix uses about 0.025 bytes per element and the second uses about 0.04 bytes per element. They’d use about 8 bytes per element (plus a small amount of overhead) as matrix objects.
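If you want to check those per-element figures, a quick back-of-the-envelope calculation on the objects defined above (just a sketch) is
## bytes per element in sparse form (roughly 0.025 and 0.04)
as.numeric(object.size(m0)) / prod(dim(m0))
as.numeric(object.size(m1)) / prod(dim(m1))
## dense storage needs 8 bytes per double, so m1 would take about 8GB
8 * prod(dim(m1))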
Clever matrix ideas
Peter Dalgaard and either CT Bergstrom or Jevin West (@callin_bull) suggested \(M\cdot M-M=0\) as an identity satisfied by binary matrices. This has potential, because the elementwise product, the subtraction, and the test might all be done on the sparse representation. On the other hand, the subtraction operator might not realise \(M\cdot M\) and \(M\) have the same sparsity pattern, so the implementation could be unnecessarily slow. On top of that, the final check against zero could be slow.
system.time(z0<-m0*m0-m0)
## user system elapsed
## 0.005 0.000 0.006
system.time(all(z0==0))
## user system elapsed
## 0.175 0.032 0.216
system.time(z1<-m1*m1-m1)
## user system elapsed
## 0.398 0.066 0.498
system.time(all(z1==0))
## user system elapsed
## 5.139 2.649 9.628
It turns out that the test vs zero (z1==0) creates a dense logical matrix, which I wasn’t expecting. However, identical seems to work efficiently:
system.time(identical(m0*m0,m0))
## user system elapsed
## 0.003 0.001 0.004
system.time(identical(m1*m1,m1))
## user system elapsed
## 0.192 0.036 0.235
On the other hand, using identical is a bit fragile, because any additional attributes will make it fail:
m2<-m0
attr(m2,"trombones")<-76
identical(m2*m2,m2)
## [1] FALSE
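One way to make this less fragile (a sketch, which already peeks at the internals discussed in the next section) is to compare values rather than whole objects: for a binary matrix the elementwise difference \(M\cdot M-M\) should have no non-zero stored values.
d <- m2*m2 - m2
all(d@x == 0)  ## TRUE even though m2 carries the extra attribute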
Using the internal representation
In the comic fantasy novel One for the Morning Glory, there’s a reference to a book Highly Unpleasant Things it is Sometimes Necessary to Know. The internal representation of other people’s objects falls into this category. In contrast to a lot of other programming languages, R doesn’t actually stop you from meddling with these internals; it just trusts you to know when to stop.
Fortunately, the Matrix class was designed to be infrastructure for other people’s code, and sparse matrices are established technology, so it’s pretty stable and well documented. The non-zero elements of the matrix are stored in the x slot. So
system.time(all(m0@x %in% c(0,1)))
## user system elapsed
## 0 0 0
system.time(all(m1@x %in% c(0,1)))
## user system elapsed
## 0.019 0.002 0.021
There are slightly faster ways of testing whether all the non-zero entries are 0 or 1. Gabe Becker suggested
system.time(
  length(setdiff(unique(m0@x), c(0, 1)))
)
## user system elapsed
## 0.001 0.000 0.001
system.time(
  length(setdiff(unique(m1@x), c(0, 1)))
)
## user system elapsed
## 0.020 0.003 0.022
This microbenchmarks to be faster, but not so as you’d notice. Also, from Michael T Young
system.time({
  n_zero <- length(m0) - nnzero(m0)
  n_zero == length(m0) || n_zero + sum(m0@x == 1L) == length(m0)
})
## user system elapsed
## 0.000 0.000 0.002
system.time({
  n_zero <- length(m1) - nnzero(m1)
  n_zero == length(m1) || n_zero + sum(m1@x == 1L) == length(m1)
})
## user system elapsed
## 0.012 0.001 0.013
which is faster for the large example and maybe slower for the small one.
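For timing comparisons this fine-grained, system.time is fairly blunt; something along these lines (a sketch, assuming the microbenchmark package is available) gives a sharper comparison of the three tests on the larger example.
library(microbenchmark)
microbenchmark(
  in_test  = all(m1@x %in% c(0, 1)),
  setdiff  = length(setdiff(unique(m1@x), c(0, 1))) == 0,
  counting = {
    n_zero <- length(m1) - nnzero(m1)
    n_zero == length(m1) || n_zero + sum(m1@x == 1L) == length(m1)
  },
  times = 20
)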
Brute force and ignorance
Computers have gotten faster over the years, and it’s surprising how often a brute force solution to a problem is now feasible.
In this example
system.time(all(as.vector(m0) %in% c(0,1)))
## user system elapsed
## 0.155 0.042 0.198
That’s relatively slow, but it’s only about a fifth of a second. We could live with it.
For a submatrix with a quarter of the elements of the bigger example, it’s still a bit too slow:
system.time(all(as.vector(m1[1:16000,1:16000]) %in% c(0,1)))
## user system elapsed
## 1.856 1.012 3.496
Decisions
I like the identical solution that doesn’t use the matrix internals, and I … appreciate… Michael Young’s solution, but in the end I went with something less fragile than the former and easier to understand than the latter:
is.binary <- function(object) {
  if (is.matrix(object)) {
    all(object %in% c(0, 1))
  } else if (inherits(object, "lMatrix") || inherits(object, "nMatrix")) {
    TRUE
  } else {
    all(object@x %in% c(0, 1))
  }
}
The middle check is for two Matrix classes that are always binary: a logical matrix and a 0/1 ‘pattern matrix’.
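As a quick sanity check of each branch (a sketch using the objects from above; the expected results are in the comments):
is.binary(m0)                          ## sparse 0/1 matrix: TRUE
is.binary(as.matrix(m0[1:10, 1:10]))   ## ordinary dense matrix: TRUE
is.binary(m0 > 0)                      ## comparison gives a logical lMatrix: TRUE
is.binary(m1 * 2)                      ## stored values of 2: FALSE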