In our last episode, we worried about the penalised least squares criterion for linear mixed models. The linear mixed model is \[Y=X\beta+Zb+\epsilon\] where \(b\sim N(0, \sigma^2V_\theta)\) and \(\epsilon\sim N(0,\sigma^2I)\), and where \(\theta\) are variance parameters. It’s convenient to write \(b=\Lambda_\theta u\) for iid standard Normal \(u\), where \(\Lambda_\theta\) is then a square root of \(\sigma^2V_\theta\). The penalised least squares approach says that for given \(\theta\), we choose \(b\) and \(\beta\) to minimise \[\|Y-X\beta-Zb\|_2^2+\|u\|_2^2.\]
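For fixed \(\theta\), this minimisation is just ordinary least squares on augmented data: stack penalty rows \([0\;\; I_q]\) with zero responses under the data rows \([X\;\; Z\Lambda_\theta]\), and the \(\|u\|_2^2\) term appears automatically. Here’s a minimal sketch with made-up data (none of these objects come from `svylme` or `lme4`):

```r
## Penalised least squares by augmentation: minimise
## ||Y - X beta - Z Lambda u||^2 + ||u||^2 via one ordinary lsfit() call.
set.seed(1)
n <- 30; q <- 3                         # observations and random effects
g <- rep(1:q, each = n/q)               # cluster membership
X <- cbind(1, rnorm(n))                 # fixed-effects design
Z <- model.matrix(~ 0 + factor(g))      # random-intercept design
Lambda <- diag(0.5, q)                  # a square root of V_theta, theta fixed
Y <- rnorm(n)

## Augmented system: data rows [X, Z Lambda], penalty rows [0, I_q]
Xa <- rbind(cbind(X, Z %*% Lambda),
            cbind(matrix(0, q, ncol(X)), diag(q)))
Ya <- c(Y, rep(0, q))
fit <- lsfit(Xa, Ya, intercept = FALSE)
beta_hat <- fit$coefficients[1:ncol(X)]     # fixed effects
u_hat    <- fit$coefficients[-(1:ncol(X))]  # spherical random effects
b_hat    <- Lambda %*% u_hat                # predicted random effects
```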

In the simple case of linear mixed models, the sampling design and model structure line up so that each \(u_j\) represents one sampling unit at some stage of sampling, and so \(u_j\) has a sampling probability \(\pi^{(u)}_j\). Each residual \(r_i\) will also have a sampling probability, the probability of sampling for that \(Y_i\), which we can write \(\pi^{(y)}_i\). I’ll write \(R^{(u)}_j\) and \(R^{(y)}_i\) for the corresponding indicator variables that observation \(i\) or random effect \(j\) is sampled.

We can easily estimate the population penalised residual sum of squares by \[\sum_i \frac{R^{(y)}_i}{\pi^{(y)}_i}(Y-X\beta-Zb)_i^2+\sum_j \frac{R^{(u)}_j}{\pi^{(u)}_j}u_j^2.\] More tidily, write \(w\) for weights that are the reciprocals of \(\pi\): \[\sum_{i\textrm{ in sample}} w^{(y)}_i(Y-X\beta-Zb)_i^2+\sum_{j\textrm{ in sample}} w^{(u)}_ju_j^2.\] This doesn’t work.
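In code, the weighted criterion is a direct transcription of the display above; here’s a sketch (the function name `w_pls` and all its inputs are made up, not part of `svylme`):

```r
## Weighted penalised sum of squares for given beta and u:
## the design-weighted data term plus the weighted penalty term.
w_pls <- function(Y, X, Z, Lambda, beta, u, w_y, w_u) {
  r <- Y - X %*% beta - Z %*% (Lambda %*% u)  # residuals for sampled Y
  sum(w_y * r^2) + sum(w_u * u^2)             # w^(y) and w^(u) terms
}
```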

The problem is that there’s a tradeoff between the two terms, and this tradeoff depends on the sample size. The first term wants to make \(b\) larger, to capture more of the residual; the second term wants to make \(b\) smaller, to fit with the penalising prior.

If a given tradeoff would be optimal in the population, with, say, 20 observations per cluster, it can’t be optimal in the sample with, say, 4 observations per cluster. Using weights this way will lead to \(b\) being too large, with larger variance than \(\sigma^2V_\theta\), so the variance components will be underestimated.
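A toy calculation shows the direction of the bias. Take a random-intercept model with \(\sigma^2_b=\sigma^2=1\), population clusters of 20, and 4 observations sampled per cluster, so \(w^{(y)}=5\): the weighted criterion applies the shrinkage factor that would be optimal for 20 observations to a cluster mean based on only 4, so the predicted \(b\) are under-shrunk. All the numbers here are hypothetical:

```r
## Shrinkage in a random-intercept model: the PLS predictor of a cluster
## effect is b_hat = f * ybar, with f = m*sb2/(m*sb2 + se2) for m
## (effective) observations. Weighting by w makes m = w*n instead of n.
sb2 <- 1; se2 <- 1                    # sigma_b^2 and sigma^2 (hypothetical)
N <- 20; n <- 4; w <- N/n             # population and sample cluster sizes
shrink <- function(m) m*sb2/(m*sb2 + se2)
f_weighted <- shrink(w*n)             # factor the weighted criterion applies
f_correct  <- shrink(n)               # factor appropriate for 4 observations
## variance of the predicted b across clusters: f^2 * Var(mean of n obs)
var_b <- function(f) f^2 * (sb2 + se2/n)
var_b(f_weighted)                     # exceeds sb2 = 1: b too variable
var_b(f_correct)                      # below sb2, as shrinkage should give
```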

You might think this couldn’t happen because the weighted penalised sum of squares is an unbiased estimator of the population penalised sum of squares. Being unbiased, though, is nowhere near enough when the number of parameters is increasing with the sample size. In a setting where the sample size in every cluster goes to infinity, maximising the weighted penalised sum of squares does work, but that’s not usually the relevant asymptotics.

You do get consistent estimation of \(\beta\) without cluster sizes going to infinity, if the mean model is correct, but if the mean model is correct you get consistent estimation of \(\beta\) just from ordinary weighted linear regression, so that’s not very impressive.

To make estimation work, we need some fix to the penalised sum of squares. The literature only provides *ad hoc* approaches. I’ve implemented one simple one: rescale \(w^{(u)}\) so that the sum of \(w^{(y)}\) over a cluster equals the sample size for the cluster times \(w^{(u)}\). Unfortunately, this rescaling means that even the estimators of \(\beta\) can be inconsistent under sufficiently perverse sampling schemes.
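One reading of that rescaling rule, sketched on made-up weights: cluster \(g\) gets \(w^{(u)}_g\) equal to the mean of its observation weights, so that \(\sum_{i\in g} w^{(y)}_i = n_g w^{(u)}_g\) within each cluster.

```r
## Rescale the random-effect weights so that, within each cluster,
## sum(w_y) = (cluster sample size) * w_u. Data are made up.
cluster <- c(1, 1, 1, 2, 2)           # cluster membership of sampled obs
w_y <- c(4, 6, 5, 10, 10)             # observation-level sampling weights
w_u <- tapply(w_y, cluster, mean)     # rescaled cluster-level weights
## the defining identity holds in each cluster:
tapply(w_y, cluster, sum) == table(cluster) * w_u
```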

I’m implementing this approach (for two-stage designs/two-level models so far) in a branch of `svylme`. As you might guess from the branch name, the estimator is the same as (a special case of) those developed by Sophia Rabe-Hesketh and Anders Skrondal in their foundational and widely-used package for Stata.

My implementation relies on the `lme4pureR` package, which provides a pure R implementation of `lmer`. Being pure R, it’s a lot easier to fiddle with than the full `lmer`, especially with the help of the JSS paper on `lme4`, and it turns out to be fast enough at the moment. I use `lme4::lmer` for starting values and data setup.

Here’s an example, from the PISA study of schools. I downloaded the data using the `PISA2012lite` package, and cut it down to just New Zealand data on maths. I’m going to fit a model looking at maths achievement in terms of gender (`ST04Q01`), some school characteristics, and some individual learning characteristics.

```
library(svylme)
```

```
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
##
## The following object is masked from 'package:graphics':
##
##     dotchart
```

```
data(nzmaths)
des <- svydesign(id=~SCHOOLID+STIDSTD, strata=~STRATUM, nest=TRUE,
                 weights=~W_FSCHWT+condwt, data=nzmaths)
options(survey.lonely.psu="average")
```

First, we specify the survey design using `survey::svydesign`. One key difference from many survey analyses is that we need probabilities or weights *at each stage of sampling*, not just overall. We need these to construct \(w^{(u)}\) – that’s not just a code limitation, it’s a feature of the problem. Also, because we have some strata with only a single school, we need to pick some approach to single-PSU variance estimation.
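The `condwt` variable above is the conditional second-stage weight. If the data supply an overall student weight (`W_FSTUWT` in PISA) along with the school weight `W_FSCHWT`, one way to construct it is as their ratio; this sketch uses made-up numbers:

```r
## Conditional (within-school) student weight as the ratio of the overall
## student weight to the first-stage school weight. Values are made up.
W_FSCHWT <- c(2, 2, 2, 5, 5)      # school-level weights (stage 1)
W_FSTUWT <- c(6, 8, 6, 10, 15)    # overall student weights
condwt <- W_FSTUWT / W_FSCHWT     # stage-2 weight, given the school
```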

Now, the model:

```
svyseqlme(PV1MATH~ (1+ ST04Q01 |SCHOOLID)+ST04Q01*(PCGIRLS+SMRATIO)+MATHEFF+OPENPS,
          design=des, scale="sample_size")
```

```
## Linear mixed model fitted by sequential pseudolikelihood
## Formula: NULL
## Random effects:
##             Std.Dev.
## (Intercept)  26.8449
## ST04Q01Male   4.5293
## Residual:    69.9843
## Fixed effects:
##                          beta      SE        t      p
## (Intercept)          465.5453 23.3111  19.9710 0.0000
## ST04Q01Male           52.7128 24.5357   2.1484 0.0317
## PCGIRLS               61.0298 18.5572   3.2887 0.0010
## SMRATIO                0.0640  0.1394   0.4590 0.6462
## MATHEFF               40.4074  2.6053  15.5100 0.0000
## OPENPS                16.6688  2.5697   6.4868 0.0000
## ST04Q01Male:PCGIRLS -100.5453 32.4310  -3.1003 0.0019
## ST04Q01Male:SMRATIO   -0.0099  0.1210  -0.0822 0.9345
```

The individual characteristics `MATHEFF` and `OPENPS` are indexes of mathematics self-efficacy and openness to problem-solving. Roughly speaking, if you score high on these, you think you can do maths and you like solving problems. The school characteristics are the percentage of girls at the school and the staff:student ratio in mathematics. `PV1MATH` has that strange name because it’s one of five ‘plausible values’ for the maths score – ‘plausible values’ are the education term for multiple imputations.

We see that girls do better in schools with more girls, and boys do worse. The staff:student ratio doesn’t seem to matter, but students with higher maths self-efficacy and openness to problem solving do better. There’s quite a lot of variability between schools (the random intercept standard deviation is 26.8) but not that much variability in the gender association.

For comparison, here’s the same model fitted by pairwise composite likelihood, the other approach we’ve been studying:

```
svy2lme(PV1MATH~ (1+ ST04Q01 |SCHOOLID)+ST04Q01*(PCGIRLS+SMRATIO)+MATHEFF+OPENPS,
        design=des)
```

```
## Linear mixed model fitted by pairwise likelihood
## Formula: PV1MATH ~ (1 + ST04Q01 | SCHOOLID) + ST04Q01 * (PCGIRLS + SMRATIO) +
##     MATHEFF + OPENPS
## Random effects:
##             Std.Dev.
## (Intercept)  22.9365
## ST04Q01Male  10.8911
## Residual:    69.4969
## Fixed effects:
##                          beta      SE        t      p
## (Intercept)          490.7900 16.7334  29.3299 0.0000
## ST04Q01Male           69.4271 20.6898   3.3556 0.0008
## PCGIRLS               54.2815 16.2410   3.3423 0.0008
## SMRATIO               -0.0287  0.0886  -0.3243 0.7457
## MATHEFF               46.4944  1.9748  23.5438 0.0000
## OPENPS                14.0187  2.6437   5.3027 0.0000
## ST04Q01Male:PCGIRLS -128.9796 32.7574  -3.9374 0.0001
## ST04Q01Male:SMRATIO   -0.0709  0.1144  -0.6194 0.5357
```