## The issue

A paper reporting the induction of autism-type behaviour in mice by fecal microbiome transplants from humans was recently published in Cell. Some people on Twitter were discussing subplots E, F, and G of Figure 1, which report behavioral comparisons of the mice between fecal donors with and without ASD.

The expressed view on Twitter was the the plots weren’t consistent with the \(p\)-values given. They didn’t entirely need to be, since the \(p\)-values weren’t from a simple two-group comparison, but even taking that into account I was surprised. I thought it was worth doing the actual comparison the researchers reported before going all ‘someone is wrong on the internet’ about it. That turned out to be relatively tricky.

## The data

The data are available from https://doi.org/10.17632/ngzmj4zkms. I downloaded the data for Figure 1, parts E–H.

`cell<-read.csv("https://data.mendeley.com/datasets/ngzmj4zkms/1/files/fb7927df-a1ea-4257-bf2f-33fecc4090e9/Fig1EFGH_subset8.csv?dl=1")`

First, check the plots to make sure the data look right

```
library(beeswarm)
cell$ASD_diagnosis<-relevel(cell$ASD_diagnosis, ref="NT")
beeswarm(MB_buried~ASD_diagnosis,data=cell,method="center",pch=19,pwcol=Donor)
```

`beeswarm(DSI_Social_duration~ASD_diagnosis,data=cell,pch=19,pwcol=Donor)`

`beeswarm(OFT_Distance~ASD_diagnosis,data=cell,pch=19,pwcol=Donor)`

These look like the ones in the paper. The first one shows a visually convincing difference; the other two do not. We can easily do formal tests that correspond to these visual comparisons

`t.test(MB_buried~ASD_diagnosis,data=cell)`

```
##
## Welch Two Sample t-test
##
## data: MB_buried by ASD_diagnosis
## t = -4.5452, df = 172.79, p-value = 1.028e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.279629 -1.688091
## sample estimates:
## mean in group NT mean in group ASD
## 5.776471 8.760331
```

`t.test(DSI_Social_duration~ASD_diagnosis,data=cell)`

```
##
## Welch Two Sample t-test
##
## data: DSI_Social_duration by ASD_diagnosis
## t = 2.1568, df = 88.93, p-value = 0.03372
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.7446561 18.1694410
## sample estimates:
## mean in group NT mean in group ASD
## 40.57806 31.12101
```

`t.test(OFT_Distance~ASD_diagnosis,data=cell)`

```
##
## Welch Two Sample t-test
##
## data: OFT_Distance by ASD_diagnosis
## t = 1.9731, df = 144.85, p-value = 0.05039
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5077242 592.3180052
## sample estimates:
## mean in group NT mean in group ASD
## 4548.462 4252.557
```

However, the aggregate comparison is wrong. The researchers say in the panel caption that *“Hypothesis testing for differences of the means were done by a mixed effects analysis using donor diagnosis and mouse sex as fixed effects and donor ID as a random effect.”*

That’s a plausible analysis. So let’s do it. I’ll use `lmer`

, with the `lmerTest`

package to get small-sample approximations.

`library(lmerTest)`

`## Loading required package: lme4`

`## Loading required package: Matrix`

```
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
```

```
##
## Attaching package: 'lmerTest'
```

```
## The following object is masked from 'package:lme4':
##
## lmer
```

```
## The following object is masked from 'package:stats':
##
## step
```

```
marbles<-lmer(MB_buried~ASD_diagnosis+Gender+(1|Donor),data=cell)
dsi<-lmer(DSI_Social_duration~ASD_diagnosis+Gender+(1|Donor),data=cell)
oft<-lmer(OFT_Distance~ASD_diagnosis+Gender+(1|Donor),data=cell)
summary(marbles)
```

```
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: MB_buried ~ ASD_diagnosis + Gender + (1 | Donor)
## Data: cell
##
## REML criterion at convergence: 1203.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8852 -0.8420 -0.1855 0.7049 2.8853
##
## Random effects:
## Groups Name Variance Std.Dev.
## Donor (Intercept) 0.8052 0.8973
## Residual 20.1949 4.4939
## Number of obs: 206, groups: Donor, 8
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 6.5928 0.7950 7.7089 8.293 4.18e-05 ***
## ASD_diagnosisASD 2.6777 0.9292 5.5842 2.882 0.0304 *
## GenderMale -1.2002 0.6315 201.1337 -1.901 0.0588 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) ASD_AS
## ASD_dgnsASD -0.718
## GenderMale -0.420 0.031
```

`summary(dsi)`

```
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: DSI_Social_duration ~ ASD_diagnosis + Gender + (1 | Donor)
## Data: cell
##
## REML criterion at convergence: 1109.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8408 -0.7024 -0.1325 0.6461 3.0087
##
## Random effects:
## Groups Name Variance Std.Dev.
## Donor (Intercept) 22.32 4.724
## Residual 369.96 19.234
## Number of obs: 128, groups: Donor, 8
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 27.064 4.303 8.919 6.290 0.000148 ***
## ASD_diagnosisASD -8.097 4.917 5.948 -1.647 0.151138
## GenderMale 24.082 3.418 120.475 7.045 1.24e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) ASD_AS
## ASD_dgnsASD -0.722
## GenderMale -0.435 0.029
```

`summary(oft)`

```
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: OFT_Distance ~ ASD_diagnosis + Gender + (1 | Donor)
## Data: cell
##
## REML criterion at convergence: 3375.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.5439 -0.6125 -0.1220 0.4504 3.3174
##
## Random effects:
## Groups Name Variance Std.Dev.
## Donor (Intercept) 166783 408.4
## Residual 867930 931.6
## Number of obs: 206, groups: Donor, 8
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 4658.881 267.832 7.003 17.395 5.08e-07 ***
## ASD_diagnosisASD -369.892 329.402 6.256 -1.123 0.303
## GenderMale -111.803 131.383 198.647 -0.851 0.396
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) ASD_AS
## ASD_dgnsASD -0.763
## GenderMale -0.259 0.017
```

`anova(marbles)`

```
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 167.713 167.713 1 5.584 8.3047 0.03037 *
## Gender 72.953 72.953 1 201.134 3.6124 0.05878 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(dsi)`

```
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 1003.3 1003.3 1 5.948 2.7119 0.1511
## Gender 18364.3 18364.3 1 120.475 49.6382 1.241e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(oft)`

```
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 1094412 1094412 1 6.256 1.2609 0.3027
## Gender 628519 628519 1 198.647 0.7242 0.3958
```

Taking account of within-donor correlation reduces the effective sample size and the standard errors get much bigger. There’s modest evidence for the marbles, but basically none for the other outcomes. We could redo with the (slightly improved) Kenward-Roger version of the anova tests.

`anova(marbles, ddf="Kenward-Roger")`

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 163.812 163.812 1 5.331 8.1115 0.03339 *
## Gender 72.224 72.224 1 201.044 3.5764 0.06005 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(dsi, ddf="Kenward-Roger")`

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 1002.7 1002.7 1 5.932 2.7103 0.1514
## Gender 18279.3 18279.3 1 120.459 49.4084 1.349e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(oft, ddf="Kenward-Roger")`

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 1090931 1090931 1 5.829 1.2569 0.3063
## Gender 626912 626912 1 198.326 0.7223 0.3964
```

Still unimpressive.

If we look at the statistical analysis section at the end of the paper it says *“Comparison of behavioral outcomes between TD Controls and ASD donors were tested using longitudinal linear mixed effects analyses, with test cycles and donors treated as repeated factors. Analyses were performed in SPSS (v 24);a priori alpha = 0.05. All outcomes were tested for normality and trans-formed as required. Diagonal covariance matrices were used so that intra-cycle and intra-donor correlations were accounted for inthe modeling.”*

I’m not sure quite what that means– mice don’t repeat over donors or cycles– but it would make sense to have a `Cycle`

effect for the marbles and OFT analyses; the DSI was only done in one cycle, so there can’t be a “cycle effect”. The impact shouldn’t be huge, since cycle is roughly orthogonal to ASD. With only two cycles, I would have made it a fixed effect, but we can try either way.

```
library(lmerTest)
marbles<-lmer(MB_buried~ASD_diagnosis+Gender+(1|Cycle)+(1|Donor),data=cell)
```

```
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.0219471
## (tol = 0.002, component 1)
```

```
oft<-lmer(OFT_Distance~ASD_diagnosis+Gender+(1|Cycle)+(1|Donor),data=cell)
anova(marbles, ddf="Kenward-Roger")
```

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 160.758 160.758 1 5.315 7.9603 0.03461 *
## Gender 70.849 70.849 1 201.039 3.5083 0.06252 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(oft, ddf="Kenward-Roger")`

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 1054021 1054021 1 5.808 1.2162 0.3137
## Gender 593327 593327 1 198.253 0.6846 0.4090
```

```
marbles2<-lmer(MB_buried~ASD_diagnosis+Gender+Cycle+(1|Donor),data=cell)
oft2<-lmer(OFT_Distance~ASD_diagnosis+Gender+Cycle+(1|Donor),data=cell)
anova(marbles2, ddf="Kenward-Roger")
```

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 156.791 156.791 1 5.138 7.7419 0.03766 *
## Gender 70.145 70.145 1 199.850 3.4636 0.06420 .
## Cycle 2.978 2.978 1 127.464 0.1471 0.70199
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(oft2, ddf="Kenward-Roger")`

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 949822 949822 1 5.741 1.0971 0.3370
## Gender 511565 511565 1 197.269 0.5909 0.4430
## Cycle 1054078 1054078 1 194.334 1.2175 0.2712
```

And it doesn’t matter much.

The Statistical analysis section also says *“Additional statistical analysis was done using R (3.4.1) or Python (3.6.4), using various packages to test mixed effects testing diagnosis (TD or ASD) as a fixed effect and donor and testing round as random effects.”*. I think ‘various packages’ is a bit unspecific, but let’s try `Round`

as a random effect.

```
library(lmerTest)
marbles<-lmer(MB_buried~ASD_diagnosis+Gender+(1|Round)+(1|Donor),data=cell)
```

`## boundary (singular) fit: see ?isSingular`

`dsi<-lmer(DSI_Social_duration~ASD_diagnosis+Gender+(1|Round)+(1|Donor),data=cell)`

`## boundary (singular) fit: see ?isSingular`

```
oft<-lmer(OFT_Distance~ASD_diagnosis+Gender+(1|Round)+(1|Donor),data=cell)
anova(marbles, ddf="Kenward-Roger")
```

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 156.844 156.844 1 5.189 7.7666 0.03705 *
## Gender 70.447 70.447 1 200.926 3.4884 0.06326 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(dsi, ddf="Kenward-Roger")`

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 738.4 738.4 1 5.597 1.9957 0.2109
## Gender 18228.0 18228.0 1 120.455 49.2689 1.419e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(oft, ddf="Kenward-Roger")`

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 853223 853223 1 4.847 1.0655 0.3507
## Gender 901440 901440 1 196.177 1.1257 0.2900
```

Again, not much change. *(The warning message is harmless: it jut means that that Round random effect is estimated to have zero variance.)*

There’s an Excel spreadsheet with test statistics and p-values as table S4. It doesn’t say exactly what the tests were, and I don’t have access to SPSS on a holiday long weekend, but there are two notable things about it. First, it has interaction p-values for Diagnosis by Gender. Second, the F statistics for Diagnosis have 169, 79, and 95 denominator degrees of freedom for the marbles, DSI, and OFT outcomes respectively.

Information for the Diagnosis contrast only comes from comparisons between the 5 donor humans with ASD and the 3 donor controls, so the degrees of freedom should be no more than 7, and probably less.

Let’s try one more time, though, with a sex by diagnosis interaction:

```
library(lmerTest)
marbles<-lmer(MB_buried~ASD_diagnosis*Gender+(1|Round)+(1|Donor),data=cell)
```

`## boundary (singular) fit: see ?isSingular`

`dsi<-lmer(DSI_Social_duration~ASD_diagnosis*Gender+(1|Round)+(1|Donor),data=cell)`

`## boundary (singular) fit: see ?isSingular`

```
oft<-lmer(OFT_Distance~ASD_diagnosis*Gender+(1|Round)+(1|Donor),data=cell)
anova(marbles, ddf="Kenward-Roger")
```

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 159.509 159.509 1 5.135 7.9685 0.03590 *
## Gender 92.203 92.203 1 200.360 4.6061 0.03306 *
## ASD_diagnosis:Gender 58.626 58.626 1 200.073 2.9287 0.08857 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(dsi, ddf="Kenward-Roger")`

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 736.2 736.2 1 5.595 1.9838 0.2121
## Gender 18289.0 18289.0 1 120.347 49.2836 1.416e-10 ***
## ASD_diagnosis:Gender 338.6 338.6 1 120.323 0.9126 0.3414
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(oft, ddf="Kenward-Roger")`

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_diagnosis 850027 850027 1 4.844 1.0565 0.3526
## Gender 823622 823622 1 195.277 1.0237 0.3129
## ASD_diagnosis:Gender 23723 23723 1 194.337 0.0295 0.8638
```

and, finally, using sum-to-zero contrasts instead of treatment contrasts

```
library(lmerTest)
cell$ASD_sum<-C(cell$ASD_diagnosis,"contr.sum")
cell$Gender_sum<-C(cell$Gender,"contr.sum")
marbles<-lmer(MB_buried~ASD_sum*Gender_sum+(1|Round)+(1|Donor),data=cell)
```

`## boundary (singular) fit: see ?isSingular`

`dsi<-lmer(DSI_Social_duration~ASD_sum*Gender_sum+(1|Round)+(1|Donor),data=cell)`

`## boundary (singular) fit: see ?isSingular`

```
oft<-lmer(OFT_Distance~ASD_sum*Gender_sum+(1|Round)+(1|Donor),data=cell)
anova(marbles, ddf="Kenward-Roger")
```

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_sum 159.509 159.509 1 5.135 7.9685 0.03590 *
## Gender_sum 92.203 92.203 1 200.360 4.6061 0.03306 *
## ASD_sum:Gender_sum 58.626 58.626 1 200.073 2.9287 0.08857 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(dsi, ddf="Kenward-Roger")`

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_sum 736.2 736.2 1 5.595 1.9838 0.2121
## Gender_sum 18289.0 18289.0 1 120.347 49.2836 1.416e-10 ***
## ASD_sum:Gender_sum 338.6 338.6 1 120.323 0.9126 0.3414
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(oft, ddf="Kenward-Roger")`

```
## Type III Analysis of Variance Table with Kenward-Roger's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## ASD_sum 850027 850027 1 4.844 1.0565 0.3526
## Gender_sum 823622 823622 1 195.277 1.0237 0.3129
## ASD_sum:Gender_sum 23723 23723 1 194.337 0.0295 0.8638
```

So, in the end I can’t reproduce anything close to the \(p\)-values given in the paper. I suspect based on the analyses I’ve done and the degrees of freedom given in the paper that the \(p\)-values in the paper are wrong. However, since I can’t come close to reproducing them, I can’t be sure.

Given that the authors were clearly computationally literate and familiar with mixed model software in R, it’s a pity these analyses weren’t done in R (or Python or something) so that code could be distributed with the data.

## A New Hope

I tried to work out what the SPSS commands reported in the paper might be doing, in particular the bit about “diagonal covariance matrices”. SPSS has an option for “Diagonal” in specifying within-subject covariance matrices, which models heteroscedasticity but not correlation. If you already had random effects in the model that would be what you wanted, but if you didn’t it would end up treating all the observations as independent.

It’s a new day, and I could try this with a copy of SPSS. Now I can duplicate the test statistics and p-values exactly with the following SPSS syntax (generated by the SPSS GUI)

```
MIXED OFT_Distance BY Gender ASD_diagnosis
/CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0,
ABSOLUTE) LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)
/FIXED=Gender ASD_diagnosis Gender*ASD_diagnosis | SSTYPE(3)
/METHOD=REML
/REPEATED=Donor | SUBJECT(V1) COVTYPE(DIAG)
```

and then replacing `OFT_Distance`

with `MB_buried`

or `DSI_Social_duration`

.

The output in SPSS is like this (only pretty)

```
Type III Tests of Fixed Effects a
Source Numerator df Denominator df F Sig.
Intercept 1 95.967 4336.047 .000
Gender 1 95.967 2.378 .126
ASD_diagnosis 1 95.967 15.346 .000
Gender * ASD_diagnosis 1 95.967 1.871 .175
a Dependent Variable: OFT_Distance.
```

This, as far as I can tell from the documentation, is a regression model with all observations independent, but with the variance depending on `Donor`

. Which is not an analysis I would consider to be appropriate. To get a random effect of donor in SPSS, you’d use a `/RANDOM = Donor`

line.

I can at least approximately get this analysis in R using `gls`

. There are a lot of details in how these models are fitted, and SPSS and R’s `gls`

have differences. But the model it fits is the one I think SPSS is fitting and the results for the \(F\)-statistics are almost identical. The degrees of freedom are off, and therefore the p-values aren’t the same, because `gls`

uses a much cruder df computation. However, the p-values are close because the analysis (incorrectly) specifies that there are lots and lots of denominator degrees of freedom: the difference between 95 and 202 doesn’t make much difference.

`library(nlme)`

```
##
## Attaching package: 'nlme'
```

```
## The following object is masked from 'package:lme4':
##
## lmList
```

```
cell$ASD_sum<-C(cell$ASD_diagnosis,"contr.sum")
cell$Gender_sum<-C(cell$Gender,"contr.sum")
marbles_gls <- gls(MB_buried~ASD_sum*Gender_sum, weights=varIdent(form=~1|Donor),data=cell)
dsi_gls <- gls(DSI_Social_duration~ASD_sum*Gender_sum, weights=varIdent(form=~1|Donor),data=cell,na.action=na.omit)
oft_gls <- gls(OFT_Distance~ASD_sum*Gender_sum, weights=varIdent(form=~1|Donor),data=cell)
marbles_gls
```

```
## Generalized least squares fit by REML
## Model: MB_buried ~ ASD_sum * Gender_sum
## Data: cell
## Log-restricted-likelihood: -600.7224
##
## Coefficients:
## (Intercept) ASD_sum1 Gender_sum1
## 7.2030998 -1.4353428 0.7632371
## ASD_sum1:Gender_sum1
## 0.6163910
##
## Variance function:
## Structure: Different standard deviations per stratum
## Formula: ~1 | Donor
## Parameter estimates:
## C1 A24-new A11-old N5 A3-old C4 A9-old A17-old
## 1.000000 0.994274 1.220736 1.209645 1.144691 1.360156 1.152140 1.111970
## Degrees of freedom: 206 total; 202 residual
## Residual standard error: 3.980873
```

`dsi_gls`

```
## Generalized least squares fit by REML
## Model: DSI_Social_duration ~ ASD_sum * Gender_sum
## Data: cell
## Log-restricted-likelihood: -550.6178
##
## Coefficients:
## (Intercept) ASD_sum1 Gender_sum1
## 33.704610 4.155542 -12.627194
## ASD_sum1:Gender_sum1
## -2.986802
##
## Variance function:
## Structure: Different standard deviations per stratum
## Formula: ~1 | Donor
## Parameter estimates:
## C1 A3-old N5 A24-new C4 A9-old A17-old
## 1.0000000 0.6792388 0.7606953 0.8916764 0.6296550 0.5832156 0.5887382
## A11-old
## 0.8586349
## Degrees of freedom: 128 total; 124 residual
## Residual standard error: 25.72352
```

`oft_gls`

```
## Generalized least squares fit by REML
## Model: OFT_Distance ~ ASD_sum * Gender_sum
## Data: cell
## Log-restricted-likelihood: -1682.108
##
## Coefficients:
## (Intercept) ASD_sum1 Gender_sum1
## 4479.80622 266.50266 104.90719
## ASD_sum1:Gender_sum1
## 93.06088
##
## Variance function:
## Structure: Different standard deviations per stratum
## Formula: ~1 | Donor
## Parameter estimates:
## C1 A24-new A11-old N5 A3-old C4 A9-old
## 1.0000000 0.5168937 0.7423241 0.8736908 0.7344225 0.5089632 0.4638039
## A17-old
## 0.6680891
## Degrees of freedom: 206 total; 202 residual
## Residual standard error: 1362.165
```

`anova(marbles_gls, type="marginal")`

```
## Denom. DF: 202
## numDF F-value p-value
## (Intercept) 1 515.6815 <.0001
## ASD_sum 1 20.4764 <.0001
## Gender_sum 1 5.7898 0.0170
## ASD_sum:Gender_sum 1 3.7762 0.0534
```

`anova(dsi_gls,type="marginal")`

```
## Denom. DF: 124
## numDF F-value p-value
## (Intercept) 1 385.0856 <.0001
## ASD_sum 1 5.8537 0.0170
## Gender_sum 1 54.0496 <.0001
## ASD_sum:Gender_sum 1 3.0241 0.0845
```

`anova(oft_gls,type="marginal")`

```
## Denom. DF: 202
## numDF F-value p-value
## (Intercept) 1 4336.040 <.0001
## ASD_sum 1 15.345 0.0001
## Gender_sum 1 2.378 0.1246
## ASD_sum:Gender_sum 1 1.871 0.1729
```

In conclusion, I’m fairly confident I understand the analysis, and I think it’s wrong, and arguably due in part to a poor GUI design from SPSS. On the other hand, the people who looked at the dotplots thinking them to be independent points, and said you couldn’t get those p-values without some sort of paired analysis were also wrong.

## The researchers write back

I reported my concerns to the first author on the paper, and got an email thanking me for my effort and saying they’d publish analysis code ‘soon’.

They now have

As you can see, this is basically the model I thought they were fitting. The `SUBJECT(Round*MouseID)`

specification specifies that `Round*MouseID`

identifies *independent* subjects. Since these subjects have only one observation each, there’s no modelling of correlation at all, but the variance is allowed to depend on donor.

They also gave a new description

Statistical analyses, part of Figure 1, specify linear mixed effects modeling of individual mice, nested within donor, and round of testing. Fixed effects included donor ASD diagnosis, mouse sex, and the interaction of these factors. You will note that we also ran models stratified by mouse sex, because DSI was only evident in male mice. Diagonal covariance matrices were specified so that mice sharing donors would be allowed to share variability while maintaining independence from other donors. Likewise, this was also used for testing Round. Because the number of donors was not large, more complex models specifying random effects did not converge and were not stable, and thus we decided to adhere to a parsimonious model to address our research question. We are providing the original SPSS output of the statistical analysis presented in Figures 1E –G for reference; the syntax use for analyses is included therein

Without the code, the description would still be potentially misleading. Since there **should be** a random effect for donor, it would be natural to interpreted *“mixed effects modeling of individual mice, nested within donor”* as a donor random effect. The same goes for *“mice sharing donors would be allowed to share variability while maintaining independence from other donors”*. The model **does not** have a random effect for donor. Mice sharing the donors are assumed to have the same mean for the behavioural outcomes, and only allowed to have different variance. The correlation between mice having the same donor is not incorporated in the model. Even assuming these were the only behavioural outcomes they looked at, the data provide very little support for an effect of donor ASD status.

I’m also dubious about *“more complex models specifying random effects did not converge and were not stable”*. A whole bunch of models like that are in this post, and converged perfectly well for me.

## The git clone wars

This sort of thing is much simpler when you distribute code as well as data, which is unfortunately not standard even in statistics.

The code you used will always specify the analysis you performed, and while I still would have had to work out what the SPSS code did, it would have been simpler than trying to reverse-engineer the analysis. The reviewers would have had more opportunities to notice something wrong: the surprisingly small p-value, the unreasonably large denominator degrees of freedom, and the code itself.