A short note on effect sizes - Biased and Inefficient

From time to time, I get asked about estimating effect sizes in the survey package.

I don’t really use effect sizes, because in the applied fields where I work, people are directly interested in the \(X\) variables they measure. They think about the effect of, say, differences in systolic blood pressure and heart disease risk in terms of blood pressure differences, measured in mmHg. They expect the impact of a 10mmHg difference to be similar in similar populations, and they prefer their \(\beta\)s to be in these units. And while it doesn’t necessarily follow from this, I tend to think of the residual variance in some standard units as being a more natural summary that the dimensionless correlation. The correlation will necessarily change depending on your inclusion criteria; the residual variance need not.

In some fields, people are interested in variables that you can’t measure directly, like conservatism or depression or consumer confidence. Different research projects might try to get a grip on these concepts using different tools, and so one analysis might have a 100-point scale and another might have a 15-point scale for the same concept. In this setting you obviously need some sort of standardisation to specify how many 100-point conservatism points correspond to one 15-point conservatism point. It’s natural to have some standard way to do standardisation, and so it’s not surprising that people tend to think in terms of standardised effect sizes rather than \(\beta\)s in natural units. There aren’t any natural units!

So, I don’t usually do effect sizes. Other people are allowed to want them, but they have to ask; I probably won’t think of implementing them unless you do.

This week I got asked about Cramér’s¹ \(V\). This is a omnibus measure of amount of association in a multiway table, based on Pearson’s \(X^2\). Now, to quote Goodman & Kruskal

The problem is that \(\chi^2\) is not a population quantity. The fact that an excellent test of independence may be based on \(\chi^2\) does not at all mean that \(\chi^2\), or some simple function of it, is an appropriate measure of degree of association. A discussion of this point is presented by R. A. Fisher. We have been unable to find any convincing published defense of \(\chi^2\)-like statistics as measures of association.

This matters even more when you’re trying to incorporate survey design, because the first step is to define a population quantity whose true value doesn’t depend on the design of the survey. But people want Cramér’s \(V\), and it’s interesting to think about how you might define it².

Cramér’s \(V\) for an \(r\times c\) table is usually defined in terms of the Pearson \(X^2\) statistic, as \[V=X^2/N/\min(r-1,c-1).\] It’s useful to consider the case of \(2\times 2\) tables, where at least the problem is one-dimensional. In this setting \(V\) is related to Udny’s \(\phi\), \(V=\phi^2\). If you write \(n_{ij}\) for the \(ij\) cell of the table and \(n_{\cdot j}\) for marginal totals \[\phi=\frac{n_{11}n_{00}-n_{01}n_{10}}{\sqrt{n_{\cdot 1}n_{\cdot 0}n_{1\cdot}n_{0\cdot}}}.\] This does estimate a population quantity under simple random sampling: the same thing with \(N\)s instead of \(n\)s. It’s even an otherwise-meaningful thing; it’s the Pearson correlation between the two indicator variables. You might argue that a tetrachoric correlation or an odds ratio or something might be a better summary, but \(\phi\) is at least a thing.

We can estimate \(\phi\) under complex sampling by putting hats on everything: \[\hat\phi = \frac{\hat N_{11}\hat N_{00}-\hat N_{01}\hat N_{10}}{\sqrt{\hat N_{\cdot 1}\hat N_{\cdot 0}\hat N_{1\cdot}\hat N_{0\cdot}}}.\] And since \(V=\phi^2\), we can estimate \(\hat V=\hat\phi^2\).

Now, the next question is what, if anything, \(V\) estimates when we go beyond \(2\times 2\) tables. Wikipedia ³ says \(V\) is “the mean square canonical correlation between the variables [citation needed]”. Stipulating that Wikipedia is right, \(V\) estimates a well-defined population quantity under simple random sampling and we can just try to estimate the same quantity under other sampling designs.

If so, we need to estimate the population contingency table and population size \(\hat N\), compute the Pearson \(X^2\) statistic on it (say \(\hat X^2\)), and then compute \(\hat X^2/\hat N/\min(r-1,c-1)\). Note that this doesn’t involve design-based association tests – eg the Rao-Scott tests– except to the extent that JNK Rao and Alastair Scott were also motivated by “how do we get useful inferences out of estimated population tables”.

That is:

svycramerV<-function(formula,design,...){
    tbl<-svytable(formula,design,...)
    chisq<-chisq.test(tbl, correct=FALSE)$statistic
    N<-sum(tbl)
    V<-chisq/N/min(dim(tbl)-1)
    names(V)<-"V"
    V
}

I had to look it up, but, yes, it is the same Cramér↩︎
or how you might argue that it’s intrinsically bogus, like a Shapiro-Wilk normality test under complex sampling↩︎
Two cheers for Wikipedia↩︎