
Moving the goalposts?

There’s a paper in PNAS suggesting that lots of published scientific associations are likely to be false, and that on Bayesian grounds a p-value threshold of 0.005, instead of 0.05, would be an improvement. It’s had an impact outside the statistical world, e.g., with a post on the blog Ars Technica. The motivation for the PNAS paper is a statistics paper showing how to relate p-values to Bayes factors in some tests.
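For a feel for the style of argument, here’s a minimal sketch (in Python) of one standard way of relating p-values to Bayes factors: the Sellke-Bayarri-Berger bound BF ≥ −e·p·log(p), valid for p < 1/e. This is not necessarily the calibration the PNAS paper uses; it’s just the best-known calculation of this kind.

```python
import math

def min_bayes_factor(p):
    """Sellke-Bayarri-Berger lower bound on the Bayes factor in
    favour of the null hypothesis; valid for 0 < p < 1/e."""
    if not 0 < p < 1 / math.e:
        raise ValueError("bound requires 0 < p < 1/e")
    return -math.e * p * math.log(p)

for p in (0.05, 0.005):
    bf = min_bayes_factor(p)
    print(f"p = {p}: Bayes factor for the null is at least {bf:.3f} "
          f"(odds against the null at most {1 / bf:.1f} to 1)")
```

On this calibration, p = 0.05 corresponds to odds against the null of at most about 2.5 to 1, which is the sort of observation that motivates calling 0.05 weak evidence; p = 0.005 can reach about 14 to 1.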

Some people have asked me what I think. So. 

1. I much prefer the other approach (non-paywalled tech report) to getting classical p-values as part of an optimal Bayesian decision, because it’s based on estimation rather than on specifying arbitrary alternatives, and because it seems to correspond better to what my scientific colleagues are trying to do with p-values. OK, and because Ken is a friend.

2. The PNAS paper starts off by talking about reproducibility in terms of scientific fraud and slides into talking about publishing results that don’t meet the proposed new p<0.005 threshold. I’m not exaggerating: here’s the complete first paragraph:

Reproducibility of scientific research is critical to the scientific endeavor, so the apparent lack of reproducibility threatens the credibility of the scientific enterprise (e.g., refs. 1 and 2). Unfortunately, concern over the nonreproducibility of scientific studies has become so pervasive that a Web site, Retraction Watch, has been established to monitor the large number of retracted papers, and methodology for detecting flawed studies has developed nearly into a scientific discipline of its own 

That’s not a rhetorical device I’m happy with, to put it mildly. 

3. If you don’t use p-value thresholds as a publishing criterion, the change won’t have any impact. And if you think p-value thresholds should be a publishing criterion, you’ve got worse problems than reproducibility.

4. False negatives are errors, too. People already report “there was no association between X and Y” (or worse, “there was no effect of X on Y”) in subgroups where the p-value is greater than 0.05. If you have the same data and decrease the false positives, you have to increase the false negatives; the sketch below makes that concrete.
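Here’s a minimal power sketch. The effect size (0.35 standard deviations) and sample size (n = 50) are made up for illustration; the point is only that tightening the threshold with the same data pushes the false negative rate up.

```python
from scipy.stats import norm

def power_z(delta, n, alpha):
    """Approximate power of a two-sided one-sample z-test for a
    standardized effect size delta at sample size n."""
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = delta * n ** 0.5   # mean of the test statistic under the alternative
    # probability the statistic falls beyond either critical value
    return norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

n, delta = 50, 0.35          # hypothetical modest study
for alpha in (0.05, 0.005):
    pw = power_z(delta, n, alpha)
    print(f"alpha = {alpha}: power = {pw:.2f}, "
          f"false negative rate = {1 - pw:.2f}")
```

With these made-up numbers the power drops from about 0.70 at the 0.05 threshold to about 0.37 at 0.005: the false negative rate roughly doubles.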

5. The problem isn’t the threshold so much as the really weak data in a lot of research, especially small-sample experimental research [large-sample observational research has different problems].  Larger sample sizes or better experimental designs would actually reduce the error rate; moving the threshold only swaps which kind of error you make.
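Continuing the hypothetical numbers from the sketch above, a back-of-the-envelope sample-size calculation shows that a bigger study can hold the false negative rate down even at the stricter threshold, which moving the cutoff alone cannot.

```python
from math import ceil
from scipy.stats import norm

def n_required(delta, alpha, power):
    """Approximate sample size for a two-sided z-test to reach the
    given power at standardized effect size delta."""
    z_crit = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return ceil(((z_crit + z_power) / delta) ** 2)

delta = 0.35                 # same hypothetical effect size as above
for alpha in (0.05, 0.005):
    print(f"alpha = {alpha}: n = {n_required(delta, alpha, 0.80)} "
          f"for 80% power")
```

On these assumptions you need roughly 65 observations for 80% power at the 0.05 threshold and about 109 at 0.005: a real cost, but one that buys fewer errors of both kinds rather than trading one for the other.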

6. Standards are valuable in scientific writing, but only to the extent that they reduce communication costs. That applies to statistical terminology as much as it applies to structured abstracts. Changing standards imposes substantial costs and is only worth it if there are substantial benefits. 

7. And finally, why is it a disaster that a single study doesn’t always reach the correct answer? Why would any reasonable person expect it to? It’s not as if we have to ignore everything except the results of that one experiment when making decisions.