As I wrote before, there’s a polymath (large-scale collaborative pure maths) project on transitivity of dice. Here’s the latest update from Timothy Gowers’s blog.
Suppose $X$, $Y$, and $Z$ are discrete distributions supported on $\{1, \dots, n\}$. We can ask about $P(X>Y)$ and $P(Y>Z)$ and $P(X>Z)$, which is what the Wilcoxon/Mann-Whitney rank test does.
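To make that comparison concrete, here’s a minimal sketch in Python. It treats each distribution as a fair die described by its list of faces, and counts face-against-face wins, which is the Mann-Whitney $U$ statistic divided by the number of pairs. The function name `prob_greater` and the dice `A`, `B`, `C` are made up for illustration: the dice are a classic intransitive set with equal means, not something taken from the polymath project or from its model.

```python
from fractions import Fraction
from itertools import product

def prob_greater(x, y):
    """P(X > Y) for independent uniform draws from the faces of x and y."""
    wins = sum(1 for a, b in product(x, y) if a > b)
    return Fraction(wins, len(x) * len(y))

# Illustrative dice (an assumption, not from the source); each has mean 5.
A = (2, 2, 4, 4, 9, 9)
B = (1, 1, 6, 6, 8, 8)
C = (3, 3, 5, 5, 7, 7)

p_ab = prob_greater(A, B)
p_bc = prob_greater(B, C)
p_ac = prob_greater(A, C)

print(p_ab, p_bc, p_ac)  # 5/9 5/9 4/9
```

Here the first die beats the second and the second beats the third, each with probability 5/9, but the first beats the third only with probability 4/9, so the pairwise comparison does not give a consistent ordering.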
The project has basically proved that under one model for randomly choosing distributions, if $X$, $Y$, and $Z$ have the same mean and $P(X>Y)>1/2$ and $P(Y>Z)>1/2$, the probability of $P(X>Z)>1/2$ is $1/2+o(1)$. That is, if three distributions have the same mean, and the Wilcoxon test says $X$ is bigger than $Y$ and $Y$ is bigger than $Z$, you’ve got essentially no information about whether it will say $X$ is bigger than $Z$.
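A rough way to get a feel for this is simulation. The sketch below makes several assumptions of mine rather than of the project: a die is a sequence of $n$ faces drawn uniformly from $1$ to $n$ and kept only if the faces sum to $n(n+1)/2$ (so every die has the same mean), one die "beats" another if it wins more face-against-face comparisons than it loses, and ties count as not beating. The names `random_die`, `beats`, and `estimate_transitivity` are hypothetical.

```python
import random

def random_die(n, rng):
    """Rejection-sample n faces from 1..n that sum to n(n+1)/2 (assumed model)."""
    target = n * (n + 1) // 2
    while True:
        faces = [rng.randint(1, n) for _ in range(n)]
        if sum(faces) == target:
            return faces

def beats(x, y):
    """True if x wins strictly more face-vs-face comparisons than it loses."""
    wins = sum(a > b for a in x for b in y)
    losses = sum(a < b for a in x for b in y)
    return wins > losses

def estimate_transitivity(n=10, triples=1500, seed=0):
    """Among triples where a beats b and b beats c, count how often a beats c."""
    rng = random.Random(seed)
    chains = transitive = 0
    for _ in range(triples):
        a, b, c = (random_die(n, rng) for _ in range(3))
        if beats(a, b) and beats(b, c):
            chains += 1
            transitive += beats(a, c)  # a tie between a and c counts as False
    return transitive, chains

transitive, chains = estimate_transitivity()
print(f"{transitive} of {chains} chains were transitive")
```

With dice this small, ties are common and the fraction need not be close to one half; as I understand it, the result is about large $n$, so this is only a sanity check and increasing `n` is slow with rejection sampling.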
Gowers also says they are close to showing a converse: if the means are different, then $P(X>Y)>1/2$, $P(Y>Z)>1/2$, and $P(X>Z)>1/2$ are true or false the way you’d assume from the ordering of the means.
That is, we knew the Wilcoxon test does not give a self-consistent ordering on all distributions. Now we know that (for this particular model of discrete distributions) when it does give an ordering, the ordering is typically the same as the ordering by means.