Prerequisites - Biased and Inefficient

This week, John Myles White tweeted

One meme I wish would die off: the belief that we can teach high school students statistics without teaching them calculus.

Statistics Twitter was immediately divided between “Preach it, brother!” and “Not cool, dude.” I’m mostly, but not entirely, in the latter camp.

Personally, I did study calculus before taking up statistics, and it helped. In fact, I studied tensor calculus, functional analysis, measure theory, group theory, and differential topology before taking up statistics. They have all helped – but I’m not entirely typical.

I have never seen anyone suggest that topology should be a prerequisite for statistics, but a good understanding of the myriads of ways ‘closeness’ can be defined and the implications of these differences is a basic part of applied statistics. I have never seen anyone suggest that functional analysis should be a prerequisite for statistics, but learning about Hilbert and Banach spaces helped me understand why high-dimensional models and prior distributions are so much less robust than low-dimensional ones.

It goes the other way, too. In discussions of NIH training grants in biostatistics, one hears people arguing that a biostatistician needs a solid understanding of, say, physiology or immunology or genomics, to be really useful in medical research the way Famous Professor Lastname is. It frequently turns out on further investigation that Famous Professor Lastname did a PhD in, as it might be, martingale theory or the identifiability of certain semiparametric models, and that all their biomedical knowledge has been obtained on the job. Maybe it would be better if future biostatisticians were taught this stuff in classes, but the evidence being cited isn’t supportive. A surprisingly wide range of people can usefully study statistics.

If you’re teaching a course in statistics – high school statistics, or linear regression, or in survival analysis, or whatever – it is important to decide whether you are going to assume understanding of calculus or matrices or sigma notation for sums, or simple high-school algebra. To some extent, the more maths you can assume, the more statistics you can cover in a fixed amount of time. If everyone in the class is comfortable with matrices, you can use matrix notation for models. If everyone is comfortable with calculus, you can switch between maximum likelihood and the score equations without much comment. If everyone is comfortable with that level of abstraction, you can talk about causal graphs and define confounding in terms of graph substructures. If everyone is comfortable with algebra, you can write $Y$ and $X$ and talk about variables in general instead of needing to use the names of actual measured variables.

But none of this is necessary for statistics. You can teach more statistics than most scientists need to know without using calculus, matrices, or even summation notation. We did it, in biostatistics courses in Seattle aimed at the scarily smart but math-deprived PhD students in health sciences. And in high school and first-year courses in statistics in New Zealand, we teach statistics, using permutation tests and the bootstrap to substitute for integrals over the sampling distribution, just as we teach the ideas of ‘fair comparisons’ without using causal graphs. My mother didn’t ever study calculus, because at her high school A-level maths and A-level chemistry were in the same time slot. She subsequently learned a reasonable amount of statistics by being involved in evidence-based medicine before it was trendy.

The problem of prerequisites is primarily an opportunity-cost problem. Should students be forced to wait until after they’ve learned calculus to start learning statistics? Is it better to have two semesters of statistics, or one semester of statistics and one of calculus? Or linear algebra, or measure theory, or functional analysis? What about students who’d rather be found dead in a ditch than take calculus? Is statistics necessarily of no use to them?

My sound-bite position on this is “Statistics should be the last thing you learn”. That is, if you’re going to study maths, you should try to do it before you start statistics, because more maths will make statistics easier. But that’s not specific to maths. If you’re going to study genetics or computer science or the history of colonial revolutions, and you’re planning to use statistics, there’s a good chance you’d be better off waiting to learn statistics until after you know what you want to use it for.

But if you want to learn statistics, right now, we can find you a class that doesn’t assume you’re a mathematician or a computer scientist or an astronomer, or a doctor. Maybe someone with a different background and interests could learn faster, but a key part of good statistical practice is thinking carefully about what is the actual question at issue, and that’s not it.