Progress on svy2lme - Biased and Inefficient

The svy2lme package for linear mixed models under complex sampling may still contain nuts, but at least the user interface has settled down and it gives plausible answers for some toy examples. The recent change is to compute pairwise sampling probabilities from a survey design object, rather than some horrible set of separate specifications. It still doesn’t support complicated PPS designs, but since the survey package does, that should be feasible.

Here’s a toy example, with comparison to Stata’s mixed function using nested pseudolikelihood. First, simulate some data. We create a population with 1000 clusters of size 20, then take a biased sample of 177 of the clusters and a simple random sample of 5 observations from each cluster.

set.seed(2000-2-29)
df<-data.frame(x=rnorm(1000*20),g=rep(1:1000,each=20), t=rep(1:20,1000), id=1:20000)
df$u<-with(df, rnorm(1000)[g])
df$y<-with(df, x+u+rnorm(1000,s=2))

## oversample extreme `u` to bias random-intercept variance
pg<-exp(abs(df$u/2)-2.2)[df$t==1]
in1<-rbinom(1000,1,pg)==1
in2<-rep(1:5, length(in1))
sdf<-subset(df, (g %in% (1:1000)[in1]) & (t %in% in2))

p1<-rep(pg[in1],each=5)
N2<-rep(20,nrow(sdf))

sdf$w1<-1/p1
sdf$w2<-20/5
 
design<-survey::svydesign(id=~g+id, data=sdf, weights=~w1+w2)

Fitting the linear mixed model looks like it does using lmer

pair<-svylme::svy2lme(y~x+(1|g),design=design)
pair

## Linear mixed model fitted by pairwise likelihood
## Formula: y ~ x + (1 | g)
## Random effects:
##             Std.Dev.
## (Intercept)   1.0162
## Residual:     1.9312
##  Fixed effects:
##                beta     SE       t      p
## (Intercept) -0.0748 0.0988 -0.7572 0.4489
## x            1.0652 0.0792 13.4471 0.0000

Now, in Stata: the command is

mixed y x [pw=w2]|| g:,pweight(w1) pwscale(gk)

giving output

                                                Wald chi2(1)      =     173.51
Log pseudolikelihood = -44980.757               Prob > chi2       =     0.0000

                                    (Std. Err. adjusted for 177 clusters in g)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.046958   .0794822    13.17   0.000     .8911759     1.20274
       _cons |   -.075619    .098788    -0.77   0.444    -.2692398    .1180019
------------------------------------------------------------------------------

------------------------------------------------------------------------------
                             |               Robust           
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
g: Identity                  |
                  var(_cons) |   1.034899   .1671189      .7541251    1.420211
-----------------------------+------------------------------------------------
               var(Residual) |   3.728113   .2021218      3.352283    4.146078
------------------------------------------------------------------------------

So you don’t have to do square roots yourself, here are the R variance components on the variance scale

coef(pair, random=TRUE)

## $s2
##          [,1]
## [1,] 3.729575
## 
## $varb
##             (Intercept)
## (Intercept)    1.032578

And, for comparison, an unweighted model: the sampling is doing something

lme4::lmer(y~x+(1|g),data=sdf)

## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ x + (1 | g)
##    Data: sdf
## REML criterion at convergence: 3870.223
## Random effects:
##  Groups   Name        Std.Dev.
##  g        (Intercept) 1.165   
##  Residual             1.939   
## Number of obs: 885, groups:  g, 177
## Fixed Effects:
## (Intercept)            x  
##     -0.1024       1.0639

It would be really nice to have a proper example, but the ones I can find in the literature are like if you wade through all these files and extract these variables and do some additional calculations you’ll get the data we used. And, you know, eventually I probably will, but I’m still hoping for something more concrete.