Comparing tests for generalised linear models in survey data

In ordinary statistics, there are three popular types of model-based tests: score tests, Wald tests, and likelihood ratio tests. They agree pretty well; they are locally asymptotically equivalent; the picture is here.

In survey data, things get more complicated because you’re not actually fitting by maximum likelihood. There are two branches to testing

‘working model’ tests, where you take the test statistic you would use under iid sampling and compute its actual distribution under the null hypothesis that the population data generating process satisfies the null hypothesis. In the survey literature these are well known for loglinear models, where they are called Rao-Scott tests, after J.N.K. Rao and the late Alastair Scott. Alastair and I extended them to a wider class of models including generalised linear models.
tests that don’t have a good name based on taking the denominator of the test statistic and standardising it by a consistent estimator of its variance

The ‘working’ or ‘Rao-Scott’ tests are asymptotically equivalent to each other. The null distribution for a single parameter is a multiple of $χ_{1}^{2}$ , but for multi-parameter tests it’s a linear combination of $χ_{1}^{2}$ and not (in general) a $χ_{p}^{2}$ . We can handle this. There are Wald, score, and likelihood ratio tests, though the Wald tests haven’t historically been used much.

The other tests have $χ_{p}^{2}$ distributions, so that’s easy. There are Wald and score tests, but not likelihood ratio tests.

So we have five tests. Actually, we have ten tests, because we can put in an ad-hoc-but-I’m-not-apologising denominator degrees of freedom and get an F test. Kuiyan Shao did a dissertation project with me comparing all these tests. We found that the working (Rao-Scott) tests control Type I error better than the others, and that the F test is helpful. I was a bit surprised that the working Wald test did as well as it did – the statistical folklore is that score and likelihood ratio tests do better than Wald-type tests.

Interestingly, the other tests, especially the Wald-type test, had better power than the working (Rao-Scott) tests, at honest level. That can’t be uniformly true, but we weren’t trying to rig it and that’s what we saw.

The Rao-Scott-type tests weight different directions of departure from the null according to their precision in the population and the other tests weight according to the precision in the sample. The advantage of the Rao-Scott-type tests is, thus, that the weighting of different directions doesn’t depend on the survey design; the disadvantage is that the weighting of different directions doesn’t depend on the survey designs. You could easily imagine that weighting according to precision in the sample would tend to be more powerful, I suppose.

In practice, the fact that the other tests (especially the Wald-type test) do not accurately control their Type I error makes it harder to take advantage of any increase in power. You can’t easily tell how much the low $p$ -value you see is due to departure from the null and how much is due to the test being poorly calibrated.

Anyway, survey::regTermTest does both sorts of Wald test and the Rao-Scott LRT, and in the next version of the package svyscoretest will do both sorts of score tests. And maybe we’ll get around to publishing an actual paper on Kuiyan’s Honours project.