As part of adding crossed correlations (and other sparse correlation) to the `survey`

package, I was writing code to test whether a user-supplied adjacency matrix had only 0 and 1 values. Since this is just a validity test for a user-supplied argument, it will usually pass and needs to pass as fast as possible. How fast it fails is less important. Also, since the `survey`

package is pure R, it needs to be pure R.

For a `matrix`

object, the obvious solution is `all(m %in% c(0,1))`

, but these are going to be potentially large matrices that are mostly zero, so supporting sparse `Matrix`

objects is important. These store the non-zero elements of the matrix, with a couple of integer vectors to indicate which elements those are. Converting a sparse `Matrix`

object to a traditional `matrix`

will force all the implicit zeroes to be filled in, making the object much larger

If `m`

is a sparse `Matrix`

object, `all(m %in% c(0,1))`

will be inefficient. It also won’t work, because `%in%`

doesn’t have a method for these objects, but `all(as.matrix(m) %in% c(0,1))`

will work and be inefficient. There are two sorts of strategies. One is to do something clever with the properties of binary matrices to get an efficient solution, the other is to open up the object and work with however it stores the non-zero elements. There’s actually also a third class of strategy worth considering: brute force and ignorance.

I’m going to work with two examples here. One is a plausible size for a large set of measurements. The other is a plausible size for a large set of *pairs* of measurements (and so is bigger). They are both block diagonal. The first one has 500 blocks of size \(10\times 10\); the second has 320 blocks of size \(100\times 100\)

```
library(Matrix)
m0<-bdiag(lapply(1:500, function(i) matrix(1L,10,10) ))
## number of elements
prod(dim(m0))
```

`## [1] 2.5e+07`

```
## size
object.size(m0)
```

`## 621504 bytes`

```
m1<-bdiag(lapply(1:320, function(i) matrix(1,100,100) ))
## number of elements
prod(dim(m1))
```

`## [1] 1.024e+09`

```
## size
object.size(m1)
```

`## 38529504 bytes`

On my home computer, `as.matrix(m1)`

is a bit too big to fit in memory, but `as.matrix(m0)`

fits easily. In sparse form, the first matrix uses about 0.025 bytes per element and the second uses about 0.04 bytes per element. They’d use about 8 bytes per element (plus a small amount of overhead) as `matrix`

objects

### Clever matrix ideas

Peter Dalgaard and either CT Bergstrom or Jevin West (`@callin_bull`

) suggested \(M\cdot M-M=0\) as an identity satisfied by binary matrices. This has potential, because the elementwise product and subtracting and testing might all be done on the sparse representation. On the other hand, the subtraction operator might not realise \(M\cdot M\) and \(M\) have the same sparsity pattern, so the implementation could be unnecessarily slow. On top of that, the final check against zero could be slow.

`system.time(z0<-m0*m0-m0)`

```
## user system elapsed
## 0.005 0.000 0.006
```

`system.time(all(z0==0))`

```
## user system elapsed
## 0.175 0.032 0.216
```

`system.time(z1<-m1*m1-m1)`

```
## user system elapsed
## 0.398 0.066 0.498
```

`system.time(all(z1==0))`

```
## user system elapsed
## 5.139 2.649 9.628
```

It turns out that the test vs zero (`z1==0`

) creates a dense logical matrix, which I wasn’t expecting. However, `identical`

seems to work efficiently:

`system.time(identical(m0*m0,m0))`

```
## user system elapsed
## 0.003 0.001 0.004
```

`system.time(identical(m1*m1,m1))`

```
## user system elapsed
## 0.192 0.036 0.235
```

On the other hand, using `identical`

is a bit fragile, because any additional attributes will make it fail:

```
m2<-m0
attr(m2,"trombones")<-76
identical(m2*m2,m2)
```

`## [1] FALSE`

### Using the internal representation

In the comic fantasy novel *One for the Morning Glory*, there’s a reference to a book *Highly Unpleasant Things it is Sometimes Necessary to Know*. The internal representation of other people’s objects falls into this category. In contrast to a lot of other programming languages, R doesn’t actually stop you from meddling with these internals, it just trusts you to know when to stop.

Fortunately, the `Matrix`

class was designed to be infrastructure for other people’s code, and sparse matrices are established technology, so it’s pretty stable and well documented. The non-zero elements of the matrix^{1} are stored in the `x`

slot. So

`system.time(all(m0@x %in% c(0,1)))`

```
## user system elapsed
## 0 0 0
```

`system.time(all(m1@x %in% c(0,1)))`

```
## user system elapsed
## 0.019 0.002 0.021
```

There are slightly faster ways of testing whether all the non-zero entries are 0 or 1. Gabe Becker suggested

```
system.time(
length(setdiff(unique(m0@x), c(0, 1)))
)
```

```
## user system elapsed
## 0.001 0.000 0.001
```

```
system.time(
length(setdiff(unique(m1@x), c(0, 1)))
)
```

```
## user system elapsed
## 0.020 0.003 0.022
```

This microbenchmarks to be faster, but not so as you’d notice. Also, from Michael T Young

```
system.time({
n_zero <- length(m0) - nnzero(m0)
n_zero == length(m0) || n_zero + sum(m0@x == 1L) == length(m0)
})
```

```
## user system elapsed
## 0.000 0.000 0.002
```

```
system.time({
n_zero <- length(m1) - nnzero(m1)
n_zero == length(m1) || n_zero + sum(m1@x == 1L) == length(m1)
})
```

```
## user system elapsed
## 0.012 0.001 0.013
```

which is faster for the large example and maybe slower for the small one^{2}.

### Brute force and ignorance

Computers have gotten faster over the years^{3}, and it’s surprising how often a brute force solution to a problem is now feasible.

In this example

`system.time(all(as.vector(m0) %in% c(0,1)))`

```
## user system elapsed
## 0.155 0.042 0.198
```

That’s *relatively* slow, but it’s about a third of a second. We could live with it.

For a submatrix of 1/4 of the bigger example it’s still a bit too slow

`system.time(all(as.vector(m1[1:16000,1:16000]) %in% c(0,1)))`

```
## user system elapsed
## 1.856 1.012 3.496
```

### Decisions

I like the `identical`

solution that doesn’t use the matrix internals, and I … appreciate… Michael Young’s solution, but in the end I went with something less fragile than the former and easier to understand than the latter

```
is.binary<-function(object) {
if (is.matrix(object)){
all(object %in% c(0,1))
} else if (inherits(object, "lMatrix") || inherits(object,"nMatrix")){
TRUE
} else{
all(object@x %in% c(0,1))
}
}
```

The middle check is for two `Matrix`

classes that are always binary: a logical matrix and a 0/1 ‘pattern matrix’