There was a question on StackOverflow¹ about continuous data with extra zeroes, asking about comparing a two-part model to a continuous-only model using AIC. This doesn’t work, but the reason is interesting, and some other things that don’t work fail less obviously.
For a long time I vaguely wondered what was going on if you compared a Normal model and a binomial model using AIC. Mechanically, nothing horrible happens. You put the numbers in and you get a number out. The same sort of question arises if you want to compare a Beta and a (scaled) binomial model for proportions. All easy, except that the result doesn’t actually mean anything.
Suppose you have a Beta($\alpha$, $\beta$) distribution and a scaled binomial distribution $\mathrm{Binomial}(n,p)/n$ for empirical data $y_1,\dots,y_N$. If the binomial model is true, all the $y_i$ will be multiples of $1/n$ (with probability 1). If the Beta model is true, none of them will be multiples of $1/n$ (with probability 1). The likelihood ratio is infinite in favour of the binomial model in the first scenario and infinite in favour of the Beta model in the second. It’s not a difficult decision. The same is true for a Normal model vs binomial as long as none of the observations are outside the binomial range.
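If you want a sanity check in R rather than measure theory, here’s a minimal sketch (the Beta(2, 5) parameters and $n = 20$ are arbitrary choices for illustration, not anything from the question): draws from a continuous Beta essentially never land exactly on the lattice of multiples of $1/n$, while scaled binomial draws always do.

# continuous draws never hit the lattice of multiples of 1/n; discrete draws always do
n <- 20
lattice <- (0:n)/n
y_beta <- rbeta(1e5, 2, 5)
mean(y_beta %in% lattice)         # effectively 0
y_binom <- rbinom(1e5, n, 0.3)/n
mean(y_binom %in% lattice)        # exactly 1

(The exact floating-point comparison is fine here because both sides are computed as an integer divided by the same $n$.)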
But in R
set.seed(2025-7-22)
Y <- rbinom(100, 20, .3)                      # 100 counts from a Binomial(20, 0.3)
cts_model <- glm(Y ~ 1, family = gaussian)    # Normal model for the counts
disc_model <- glm(cbind(Y, 20 - Y) ~ 1, family = binomial)  # binomial model for the same counts
AIC(cts_model)
## [1] 424.3969
AIC(disc_model)
## [1] 420.7833
Or, instead of integer counts, we could go for proportions and weights
Y1 <- Y/20                                    # proportions instead of counts
cts_model1 <- glm(Y1 ~ 1, family = gaussian)  # Normal model for the proportions
disc_model1 <- glm(Y1 ~ 1, weights = rep(20, 100), family = binomial)  # same binomial model, via weights
AIC(cts_model1)
## [1] -174.7496
AIC(disc_model1)
## [1] 420.7833
The two discrete models are the same and so they give the same AIC, but there’s a problem with the continuous models. How can it be true that the two models are about equally good on the integer scale but the Normal model is way better on the proportion scale?
In discussions of computing AIC you will see the caution that “all the constants” need to be included. That is, when computing the loglikelihood in, say, a Normal linear regression model, a programmer might exclude factors of, say, $\sqrt{2\pi}$ that are the same for all models and drop out of any comparison. That’s fine when comparing Normal linear regression models, but when comparing to a lognormal or Gamma model those terms don’t cancel out, because they aren’t part of the lognormal or Gamma likelihood. You need the full honest-to-Fisher loglikelihood for AIC, not some simplification. The discrete vs continuous problem is exactly the same, only infinitely worse.
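As a concrete illustration of the “all the constants” point, here’s a sketch that writes out the full Gaussian loglikelihood for cts_model by hand; dropping the $\log(2\pi)$ term gives a different number, which wouldn’t matter for comparing Gaussian models with each other but would matter for comparing with anything else.

# full Gaussian loglikelihood at the MLE, constants and all
rss <- sum(resid(cts_model)^2)
full_loglik <- -(100/2) * (log(2*pi) + log(rss/100) + 1)
full_loglik                        # should match logLik(cts_model)
logLik(cts_model)
full_loglik + (100/2) * log(2*pi)  # a 'simplified' version with the constant dropped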
The loose way to talk about the problem is that the continuous model’s loglikelihood comes from a probability density function while the discrete model’s loglikelihood comes from a probability mass function, and these are just two different things. You can’t divide one by the other. They’re as different as apples and Wednesday. Give up now and keep your sanity.
To get more precise, we need a little measure theory, so we can represent the two distributions as basically the same sort of thing, and talk about their ratio intelligently². Let $P$ and $Q$ be two probability measures. These are just functions that assign probabilities to sets. What sets? All sets that you’ll ever care about.³ We can obviously talk about ratios: $P(A)/Q(A)$ for any⁴ set $A$. A bit less obviously, we can talk about limits of these ratios for sequences of sets that shrink to a single point $x$; call the limit $\frac{dP}{dQ}(x)$. If $P$ and $Q$ were both discrete random variables then $P(\{x\})/Q(\{x\})$ would make sense for any point $x$, and that’s what $\frac{dP}{dQ}(x)$ would be. If $P$ and $Q$ were both absolutely continuous variables with the same range, the ratio of their probability densities at $x$ would make sense and that’s what $\frac{dP}{dQ}(x)$ would be. If you allow for $P$ to put probability in some places $Q$ doesn’t, by calling the ratio $\infty$ there, then you can always get $\frac{dP}{dQ}$ and $\frac{dQ}{dP}$, where $\frac{dP}{dQ}$ describes the part of $P$ where the ratio is finite⁵. If you think about this for discrete variables it makes sense: $\frac{dP}{dQ}(x)$ is just $P(\{x\})/Q(\{x\})$ at points where $Q(\{x\}) > 0$.
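In R terms the discrete case is nothing mysterious: for two distributions on the same support, the limit is just the ratio of the probability mass functions (Poisson(3) versus Poisson(5) here is an arbitrary illustration).

# dP/dQ for two discrete distributions on the same support is the pmf ratio
x <- 0:6
dpois(x, 3)/dpois(x, 5)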
Ok, so now we can talk about probability distributions as the same basic kind of thing whether they are discrete or continuous or horribly worse than either. We can also talk about likelihood ratios $\frac{dP}{dQ}(y)$ evaluated at the data. The problem when we compared likelihoods for Normal and binomial models before was that we just used the density function and not the whole probability measure. If you have two discrete variables with the same support this is fine; the likelihood ratio reduces to the ratio of probability mass functions. If you have two continuous variables with the same support the likelihood ratio reduces to the ratio of probability density functions. In general, though, you need all the bits.
We’re now in a position to say what $\frac{dP}{dQ}$ is for Normal $P$ and binomial $Q$. It’s zero if all the $y_i$ are multiples of $1/n$ in $[0,1]$ and it’s infinity if they aren’t. This is not very productive⁶, but at least it’s correct, which is something.
And, finally, going back to the model for a continuous variable with a spike at zero, we can see there’s a problem. We can’t compare this to a continuous model, because $\frac{dP}{dQ}$ will be infinite at zero, from the spike. You might think we could compare it to a different mixed discrete/continuous model. We could, and we could define an AIC analogue that way,⁷ but the answer would depend on how the discrete and continuous parts were weighted, which turns out to be somewhat arbitrary. (Update) For example, suppose your data are medical costs, which is where I first saw this sort of model. If one person analyses the data in US dollars and another analyses the data in Japanese yen, or in US cents, the probability density for the continuous part of the model will change and so will the relative weighting of the discrete and continuous parts. You could pick one currency and stick in Jacobians to get consistent comparisons, but that’s what I mean by “somewhat arbitrary”.
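You can see the Jacobian issue directly in the output above: the two Gaussian AICs differ by (up to rounding) exactly $2\times 100\times\log 20$, the doubled log-Jacobian of the change from counts to proportions, while the binomial AIC doesn’t move at all.

# the gap between the two Gaussian AICs is the doubled log-Jacobian of Y1 = Y/20
AIC(cts_model) - AIC(cts_model1)    # about 599.15
2 * 100 * log(20)                   # the same number
AIC(disc_model) - AIC(disc_model1)  # essentially zero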
None of this problem arises for purely discrete two-part models such as the zero-inflated Poisson, which can straightforwardly be compared to the Poisson because they both put probability on all the non-negative integers. The likelihood ratios are just the ratios of these probabilities and are perfectly well defined.
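As a sketch of that sort of legitimate comparison (assuming the pscl package is available; the simulated counts and parameters are made up):

# both models put probability on the non-negative integers, so AIC compares like with like
library(pscl)                                  # for zeroinfl()
z <- rbinom(200, 1, 0.3)                       # extra-zero indicator
counts <- ifelse(z == 1, 0, rpois(200, 4))     # made-up zero-inflated counts
AIC(glm(counts ~ 1, family = poisson))
AIC(zeroinfl(counts ~ 1, dist = "poisson"))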
1. or perhaps StackUnderflow, these days
2. though, as it turns out, entirely unproductively
3. rhubarb, rhubarb, axiom of choice, rhubarb, rhubarb, Solovay’s model, rhubarb, inaccessible cardinal
4. reasonable
5. more precisely, the part that’s absolutely continuous with respect to $Q$. This is the Radon-Nikodym theorem plus the Lebesgue decomposition theorem. See Wikipedia
6. as I warned you
7. AIC analogues can be defined for lots of optimisation estimators, even for things like survey estimation