The svy2lme
package for linear mixed models under complex sampling may still contain nuts, but at least the user interface has settled down and it gives plausible answers for some toy examples. The recent change is to compute pairwise sampling probabilities from a survey design object, rather than some horrible set of separate specifications. It still doesn’t support complicated PPS designs, but since the survey package does, that should be feasible.
Here’s a toy example, with comparison to Stata’s mixed
function using nested pseudolikelihood. First, simulate some data. We create a population with 1000 clusters of size 20, then take a biased sample of 177 of the clusters and a simple random sample of 5 observations from each cluster.
set.seed(2000-2-29)
df<-data.frame(x=rnorm(1000*20),g=rep(1:1000,each=20), t=rep(1:20,1000), id=1:20000)
df$u<-with(df, rnorm(1000)[g])
df$y<-with(df, x+u+rnorm(1000,s=2))
## oversample extreme `u` to bias random-intercept variance
pg<-exp(abs(df$u/2)-2.2)[df$t==1]
in1<-rbinom(1000,1,pg)==1
in2<-rep(1:5, length(in1))
sdf<-subset(df, (g %in% (1:1000)[in1]) & (t %in% in2))
p1<-rep(pg[in1],each=5)
N2<-rep(20,nrow(sdf))
sdf$w1<-1/p1
sdf$w2<-20/5
design<-survey::svydesign(id=~g+id, data=sdf, weights=~w1+w2)
Fitting the linear mixed model looks like it does using lmer
pair<-svylme::svy2lme(y~x+(1|g),design=design)
pair
## Linear mixed model fitted by pairwise likelihood
## Formula: y ~ x + (1 | g)
## Random effects:
## Std.Dev.
## (Intercept) 1.0162
## Residual: 1.9312
## Fixed effects:
## beta SE t p
## (Intercept) -0.0748 0.0988 -0.7572 0.4489
## x 1.0652 0.0792 13.4471 0.0000
Now, in Stata: the command is
mixed y x [pw=w2]|| g:,pweight(w1) pwscale(gk)
giving output
Wald chi2(1) = 173.51
Log pseudolikelihood = -44980.757 Prob > chi2 = 0.0000
(Std. Err. adjusted for 177 clusters in g)
------------------------------------------------------------------------------
| Robust
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | 1.046958 .0794822 13.17 0.000 .8911759 1.20274
_cons | -.075619 .098788 -0.77 0.444 -.2692398 .1180019
------------------------------------------------------------------------------
------------------------------------------------------------------------------
| Robust
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
g: Identity |
var(_cons) | 1.034899 .1671189 .7541251 1.420211
-----------------------------+------------------------------------------------
var(Residual) | 3.728113 .2021218 3.352283 4.146078
------------------------------------------------------------------------------
So you don’t have to do square roots yourself, here are the R variance components on the variance scale
coef(pair, random=TRUE)
## $s2
## [,1]
## [1,] 3.729575
##
## $varb
## (Intercept)
## (Intercept) 1.032578
And, for comparison, an unweighted model: the sampling is doing something
lme4::lmer(y~x+(1|g),data=sdf)
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ x + (1 | g)
## Data: sdf
## REML criterion at convergence: 3870.223
## Random effects:
## Groups Name Std.Dev.
## g (Intercept) 1.165
## Residual 1.939
## Number of obs: 885, groups: g, 177
## Fixed Effects:
## (Intercept) x
## -0.1024 1.0639
It would be really nice to have a proper example, but the ones I can find in the literature are like if you wade through all these files and extract these variables and do some additional calculations you’ll get the data we used. And, you know, eventually I probably will, but I’m still hoping for something more concrete.