
Rock, paper, scissors, Wilcoxon test

Based on my nerdnite talk last week.

Transitivity is a basic property of orderings: if A is better than B and B is better than C, then A must be better than C.  For example, if the All Blacks beat Tonga and Tonga beats Japan, we would expect the All Blacks to beat Japan.

Rock-paper-scissors is interesting because it is the opposite: if A beats B and B beats C then A must lose to C.  The ‘winning’ relationship is non-transitive – as non-transitive as possible.  In this case that’s the whole point, since you want all the choices to be equally good if you don’t know what your opponent is going to do. 

In preparation for the talk I looked at the history of rock-paper-scissors, which largely meant following links from Wikipedia until they reached someone who looked as though they knew what they were talking about.

There’s a very interesting paper by a Finnish ethnologist (!) on the history of rock-paper-scissors games in Asia. They date back to the Han dynasty in China, and became popular in Japan, partly as a drinking game. There were lots of variants (frog-snake-snail, fox-mayor-hunter, hero-tiger-hero’s mother) as well as the familiar rock-paper-scissors. The game apparently arrived in Europe in the early 20th century.

The paper also mentions the existence of five-way versions, eg one in Malaysia. These are independent of the more-familiar rock-paper-scissors-lizard-Spock, which was invented by Sam Kass and Karen Bryla, and popularised by the TV show “The Big Bang Theory”. Of course, once you raise the possibility of more than three categories, people get creative.

There are also examples of frequency-dependent selection preserving non-transitive games in evolution. The common side-blotched lizard comes in three colours with different mating strategies; each strategy beats one of the others and loses to the third, so for all three to stay present in the population something has to favour whichever type is currently rare, and it does.

For statisticians, the most familiar example of non-transitive games is probably the non-transitive dice first discovered by Brad Efron in the early 1970s.  Efron’s four dice have numbers

  • 4, 4, 4, 4, 0, 0
  • 3, 3, 3, 3, 3, 3
  • 6, 6, 2, 2, 2, 2
  • 5, 5, 5, 1, 1, 1

Each die beats the following one with probability 2/3, and the last one beats the first with probability 2/3. 
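
The cycle is easy to check by brute force, just by enumerating all 36 pairs of faces for each comparison. A minimal Python sketch:

```python
from itertools import product

# Efron's four dice, in the order listed above
dice = [
    (4, 4, 4, 4, 0, 0),
    (3, 3, 3, 3, 3, 3),
    (6, 6, 2, 2, 2, 2),
    (5, 5, 5, 1, 1, 1),
]

def p_beats(a, b):
    """Probability that one roll of die a is higher than one roll of die b."""
    wins = sum(x > y for x, y in product(a, b))
    return wins / (len(a) * len(b))

for i in range(4):
    j = (i + 1) % 4
    print(f"P(die {i + 1} beats die {j + 1}) = {p_beats(dice[i], dice[j]):.3f}")
# Every line prints 0.667: each die beats the next, and the last beats the first.
```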

It’s surprising that numbers can behave this badly, but we’re not really comparing single numbers; we’re comparing whole probability distributions.  Translating an ordering of values into an ordering of distributions turns out to be harder than it looks.

The problem is related to the design of voting systems: how do you turn a set of individual preferences into an ordering of candidates, so you can find the ‘most preferred’ candidate?  Condorcet, in the late 18th century, noticed that this is hard – if you have three candidates, it’s possible that each candidate would beat one of their competitors and lose to the other in two-way elections. That is, two-way voting comparisons can have the rock-paper-scissors property.  Kenneth Arrow nailed down more of the details, coming up with a simple set of obviously necessary properties for a sane voting system and then showing that they were impossible to satisfy all at once (Arrow’s impossibility theorem).

However, for statisticians there is another important and much less well known aspect of the non-transitive dice. The dice show that when you compare probability distributions by ‘probability that a random observation from A beats a random observation from B’ you can get non-transitivity. This way of doing comparisons of probability distributions is better known in statistics as the Mann-Whitney U test or Wilcoxon rank-sum test.  In fact, essentially all rank tests are non-transitive. 
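
Concretely, the Mann-Whitney U statistic counts the pairs in which an observation from the first sample beats one from the second (ties counting half), so U divided by the number of pairs is exactly the pairwise win probability from the dice. A minimal sketch, treating each die’s faces as a sample:

```python
def u_statistic(x, y):
    """Mann-Whitney U: pairs where x beats y, with ties counting 1/2."""
    return sum((xi > yi) + 0.5 * (xi == yi) for xi in x for yi in y)

a = [4, 4, 4, 4, 0, 0]
b = [3, 3, 3, 3, 3, 3]
c = [6, 6, 2, 2, 2, 2]

n_pairs = 36
print(u_statistic(a, b) / n_pairs)  # 0.667: a 'beats' b
print(u_statistic(b, c) / n_pairs)  # 0.667: b 'beats' c
print(u_statistic(c, a) / n_pairs)  # 0.556: c 'beats' a, completing a cycle
```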

The reason rank tests are non-transitive is related to one of their apparent advantages: they don’t depend on the scale of the data, and give the same result if you apply any increasing transformation to it.  That’s especially attractive for ordinal data, where the appropriate scale may not be obvious. However, if you think in terms of disease and treatment it becomes clearer why scale independence is actually a bug, not a feature.
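
A minimal sketch of that scale independence, with made-up numbers: the rank comparison is unchanged by a log transformation, while the comparison of means flips direction under the same transformation.

```python
import math

def u_statistic(x, y):
    """Pairs where x beats y, with ties counting 1/2."""
    return sum((xi > yi) + 0.5 * (xi == yi) for xi in x for yi in y)

x = [1.0, 1.0, 1.0, 100.0]   # a skewed sample (made-up numbers)
y = [4.0, 4.0, 4.0, 4.0]

logged = lambda v: [math.log(t) for t in v]
mean = lambda v: sum(v) / len(v)

# The rank statistic sees only the ordering, so the log transform changes nothing:
print(u_statistic(x, y), u_statistic(logged(x), logged(y)))  # 4.0 4.0

# The comparison of means reverses under the same transform:
print(mean(x) > mean(y))                  # True  (25.75 vs 4.0)
print(mean(logged(x)) > mean(logged(y)))  # False (1.15 vs 1.39)
```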

Suppose you have a treatment that makes some people better and other people worse, and you can’t work out in advance which people will benefit. Is this a good treatment? The answer has to depend on the tradeoffs: how much worse and how much better, not just on how many people are in each group. 

If you have a way of making the decision that doesn’t explicitly evaluate the tradeoffs, it can’t possibly be right.  The rank tests make the tradeoffs in a way that changes depending on what treatment you’re comparing to, and one extreme consequence is that they can be non-transitive. Much more often, though, they can just be misleading.

It’s possible to prove that every transitive test reduces each sample to a single number and then compares those numbers [equivalent to Debreu’s theorem in utility theory]. That is, if you want an internally consistent ordering over all possible results of your experiment, you can’t escape assigning numerical scores to each observation.  
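
Efron’s dice illustrate this: fix a score for each face (say, the face value itself), reduce each die to its mean score, and cycles become impossible, because real numbers are totally ordered. A minimal sketch:

```python
dice = {
    "A": (4, 4, 4, 4, 0, 0),
    "B": (3, 3, 3, 3, 3, 3),
    "C": (6, 6, 2, 2, 2, 2),
    "D": (5, 5, 5, 1, 1, 1),
}

# Reduce each die to a single number: the mean of its face scores.
mean_score = {name: sum(faces) / len(faces) for name, faces in dice.items()}

for name, score in sorted(mean_score.items(), key=lambda kv: kv[1]):
    print(name, round(score, 2))
# A 2.67, B 3.0, D 3.0, C 3.33: an ordinary (tied, but transitive)
# ordering, so no cycle is possible once the scores are fixed.
```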

In some scenarios, such as small-sample biological experiments where you expect the direction of the effect to be the same for every experimental unit, any reasonable test will give qualitatively the same (true) direction of change, and rank tests may be sensible because they are exact in small samples.

In most settings, though, it’s quite possible that different sets of scores will lead to different conclusions. For example, it’s entirely plausible, and does happen, that a medical treatment raises median medical costs (because it has to be paid for) but lowers mean medical costs (by preventing expensive complications). That’s sometimes used as an argument in favour of rank tests, but it’s actually an argument against them.
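
Here’s a toy version of that cost example, with entirely made-up numbers:

```python
# Hypothetical per-patient costs (made-up numbers).
# Untreated: most people are cheap, a few have expensive complications.
control = [100] * 7 + [50000] * 3
# Treated: everyone pays $100 baseline plus $2000 for the drug,
# and the expensive complications are prevented.
treated = [2100] * 10

def median(v):
    s = sorted(v)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

mean = lambda v: sum(v) / len(v)

print(median(treated), median(control))  # 2100.0 vs 100.0: treatment raises the median
print(mean(treated), mean(control))      # 2100.0 vs 15070.0: treatment lowers the mean
```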