What’s the right proof of the Continuous Mapping Theorem?

The Continuous Mapping Theorem says that if \(X_n\stackrel{d}{\to}X\) and \(f\) is continuous except at a set of points with zero probability under \(X\), that \(f(X_n)\stackrel{d}{\to}f(X)\). As David Pollard points out, it should be called the almost-everywhere-continuous mapping theorem, because the ability to have discontinuities is important in applications and is the only thing making the proof non-trivial.

There are three proofs that I’m aware of

Mann and Wald used the ‘pointwise convergence of cdfs’ definition of convergence in distribution, which gives a painful proof
The modern standard is to use the definition of convergence in distribution in terms of expectations of bounded continuous functions. You then rewrite the integral in terms of horizontal rather than vertical slices and use the Portmanteau Lemma # to show the discontinuities don’t matter.
The easy way: use almost-sure representations to construct a sequence \(\tilde{X}_n \stackrel{a.s.}{\to} \tilde{X}\) with the same distributions.

Mathematically, approach 3 looks like overkill. Certainly, if you had to prove the representation theorem (beyond the univariate real case) it would take longer than just using approach 2, even if you had to do it barehanded without the Portmanteau Lemma.

From the statistics point of view I’m still attracted to approach 3. A lot of the time I’m happy to treat advanced probability theory as a black box and just use it to call in air strikes on obstacles in the proof. For example, I do have some clue how the bracketing-entropy Donsker theorem works because I learnt it from Jon Wellner, but the proof isn’t something I’d ever want to teach. The result, on the other hand, is quite useful.

As another example, I recently cited a result on the non-existence of uncountable ascending chains in a Borel preorder. My understanding of the proof is at the level “well, Borel, so some kind of countable transfinite induction thing, right?”, but since lots of actual mathematicians have cited the paper, I trust someone’s checked the details.

On the other hand, if we aren’t making the students prove the Continuous Mapping Theorem from scratch, why are we making them prove it at all? Why not just prove the continuous case, which follows trivially from the definition of weak convergence, and assert that it also works for almost-surely-continuous functions?

To some extent, my justification for approach 3 is that I want to introduce almost-sure representations. The Skorohod univariate proof is very easy to understand in terms of computational random number generation: Skorohod’s construction is that you use the same uniform random numbers to generate each element of the sequence. That computational heuristic makes it plausible that you should be able to do the same thing more generally (and Jakubowski notes that Skorohod’s construction still can be used to prove a slightly weaker result for separable metric spaces). I think proving the Skorohod representation theorem gives useful insight into how convergence in distribution is different from the other modes.

So, approach 2 is probably best in isolation, but in context I still like approach 3.

# which a surprisingly large number of internet posters seem to think takes an apostrophe-s attributing it to an eponymous presumably-French probabilist.