Biased and Inefficient

Biased and Inefficient http://notstatschat.netlify.com/ Recent content on Biased and Inefficient Hugo -- gohugo.io en-us Tue, 16 Jul 2024 00:00:00 +0000 A Bayesian t-test, again http://notstatschat.netlify.com/2024/07/16/a-bayesian-t-test-again/ Tue, 16 Jul 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/07/16/a-bayesian-t-test-again/ The term “t-test” is a bit of a troll here, since I don’t mean either a test or the Normal-based inference developed by Gosset. I’m interested in two-sample comparisons of means, effectively non-parametric in moderate to large samples. A frequentist in an elementary course would do this by saying $\bar X$ and $\bar Y$ are each (roughly) Normal by the Central Limit Theorem, so $\bar X-\bar Y$ is also (roughly) Normal, giving interval estimates and, if necessary, tail probabilities. Stage vs phase http://notstatschat.netlify.com/2024/06/28/stage-vs-phase/ Fri, 28 Jun 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/06/28/stage-vs-phase/ When two-phase study designs started being used in epidemiology and biostatistics there was a period of conflict. Survey statisticians insisted on the term “two-phase” and biostatisticians (following survey textbooks in some cases) wanted to call these “two-stage” designs. Like the correct pronunciation of ‘Scheveningen’, the terminology identified communities. In a $K$-stage survey design we have sampling units (clusters) at stage 1, smaller ones at stage 2, and so on. You can compute a probability $\pi_i=\pi_{i,1}\times \pi_{i,2|1}\times \cdots\times\pi_{i,K|K-1}$, where $\pi_{i,1}$ is the probability that unit $i$ is sampled at stage 1, $\pi_{i,2|1}$ is the probability that unit $i$ is sampled at stage 2 given that it is sampled at stage 1, and so on. 
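As a rough illustration of the product formula above, here is a minimal R sketch for a two-stage design. The probabilities are made-up numbers and the variable names are hypothetical; nothing here comes from a real survey.

```r
# Hypothetical two-stage design with three sampled units (made-up numbers).
# pi_stage1[i]: probability that unit i's cluster is sampled at stage 1.
# pi_stage2_given1[i]: probability that unit i is sampled at stage 2,
# given that its cluster was sampled at stage 1.
pi_stage1        <- c(0.10, 0.10, 0.25)
pi_stage2_given1 <- c(0.50, 0.20, 0.40)

# The overall sampling probability is the product across stages.
pi_overall <- pi_stage1 * pi_stage2_given1

# The usual sampling weight is the reciprocal of that probability.
w <- 1 / pi_overall

data.frame(pi_stage1, pi_stage2_given1, pi_overall, w)
```

With more than two stages the same idea applies: multiply one conditional probability per stage.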
Estimator vs estimate http://notstatschat.netlify.com/2024/06/27/estimator-vs-estimate/ Thu, 27 Jun 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/06/27/estimator-vs-estimate/ In sandwich variance estimators, the middle of the sandwich is the variance of the estimating function. If we have independent observations and estimate by solving \[\sum_i U_i(\theta)=0\] we want $\mathrm{var}[\sum_i U_i(\theta)]$, which we estimate by $\sum_i U_i(\hat\theta)U_i(\hat\theta)^T$. There are two issues here you might miss. First, some people (younger me, for example) worry that $\sum U_i(\hat\theta)$ is always zero, by construction, so that surely $\mathrm{var}\left[\sum U_i(\hat\theta)\right]$ will also be zero. Automatic transformation of standard errors? http://notstatschat.netlify.com/2024/06/15/automatic-transformation-of-standard-errors/ Sat, 15 Jun 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/06/15/automatic-transformation-of-standard-errors/ The survey package returns many results as svystat objects, which are numeric vectors with variance matrix as an attribute (and other optional attributes). Because they’re not made of magic, if you transform the point estimate the variance matrix doesn’t transform and is no longer appropriate. But what if they were made of magic? We have svycontrast to do delta-method transformations and we have the Math and Ops group generic functions, so it should be possible to just have the variances transform. S3 method dispatch on other arguments http://notstatschat.netlify.com/2024/06/04/s3-method-dispatch-on-other-arguments/ Tue, 04 Jun 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/06/04/s3-method-dispatch-on-other-arguments/ The S3 method system only lets you dispatch methods on one argument of the generic. Most people use the first argument, and it’s not unheard of for people to claim that only the first argument is allowed. Actually, other arguments can be used! 
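To make that concrete, here is a small sketch of a generic that dispatches on its second argument, using the optional second argument of UseMethod(). The generic total() and the class weighted_df are hypothetical names invented for this illustration; this is not the survey package's code.

```r
# A hypothetical generic that dispatches on 'data', its second argument.
# UseMethod(generic, object) dispatches on 'object' rather than on the
# first argument of the enclosing function.
total <- function(formula, data, ...) UseMethod("total", data)

# Method for plain data frames: unweighted column totals.
total.data.frame <- function(formula, data, ...) {
  vars <- all.vars(formula)
  vapply(vars, function(v) sum(data[[v]]), numeric(1))
}

# Method for a made-up class that carries a weight column 'w'.
total.weighted_df <- function(formula, data, ...) {
  vars <- all.vars(formula)
  vapply(vars, function(v) sum(data$w * data[[v]]), numeric(1))
}

d  <- data.frame(x = c(1, 2, 3), w = c(10, 10, 20))
wd <- structure(d, class = c("weighted_df", "data.frame"))

total(~x, d)   # uses total.data.frame
total(~x, wd)  # uses total.weighted_df
```

The first argument is a formula in both calls, so dispatching on it would not distinguish the two cases.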
What’s more, if you write functions using the old-school formula/data structure, there’s a genuine reason to dispatch on the second argument. Let’s look at the survey package and the simplest estimation function of all, svytotal Crossvalidation in complex survey data http://notstatschat.netlify.com/2024/05/21/crossvalidation-in-complex-survey-data/ Tue, 21 May 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/05/21/crossvalidation-in-complex-survey-data/ Background The current development of the survey package now has an experimental implementation of cross-validation using replicate-weight decompositions of the data. This is experimental. It is liable to change, and may contain nuts. The basic idea, as studied by Amaia Iparraguirre, is to decompose survey data in ways that respect the structure of the sampling. Complex survey data typically have strata and clusters. The strata are a partition of the population into groups that we hope are different. Choosing frame weights in dual-frame surveys http://notstatschat.netlify.com/2024/05/10/choosing-frame-weights-in-dual-frame-surveys/ Fri, 10 May 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/05/10/choosing-frame-weights-in-dual-frame-surveys/ In dual-frame sampling you take two samples from overlapping sampling frames and you need to downweight people who could have been chosen in either frame so the overlap of the two frames isn’t counted twice. Suppose you have some constant value $\theta$ to do the downweighting, so that people in the overlap who were sampled from frame $A$ get their weight multiplied by $\theta$ and people in the overlap who were sampled from frame $B$ get their weight multiplied by $(1-\theta)$. 
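Here is a minimal sketch of that downweighting in base R. The data, the column names, and the value of theta are all made up for illustration; this is not how the survey package implements it.

```r
# Made-up dual-frame sample: frame of selection, overlap membership,
# and the original sampling weight from that frame.
samp <- data.frame(
  frame   = c("A", "A", "B", "B", "B"),
  overlap = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  w       = c(100, 100, 50, 50, 50)
)
theta <- 0.4  # arbitrary value, chosen only for the example

# Multiplier: theta for overlap units sampled from A, (1 - theta) for
# overlap units sampled from B, and 1 for units outside the overlap.
mult <- ifelse(!samp$overlap, 1,
               ifelse(samp$frame == "A", theta, 1 - theta))
samp$w_adj <- samp$w * mult

samp
# Estimated size of the overlap, now counted once rather than twice.
sum(samp$w_adj[samp$overlap])
```

In this toy sample each frame on its own estimates an overlap of 100 people, and the adjusted weights combine them as theta * 100 + (1 - theta) * 100.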
Another update on non-transitive dice http://notstatschat.netlify.com/2024/04/29/another-update-on-non-transitive-dice-and-the-wilcoxon-test/ Mon, 29 Apr 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/04/29/another-update-on-non-transitive-dice-and-the-wilcoxon-test/ I’ve mentioned before that mathematician Tim Gowers had run a ‘polymath’ (massively collaborative maths research) project on non-transitive dice. There’s an arXiv preprint. There’s also a detailed write-up in Quanta, which is a magazine devoted to popular explanations of maths. As I’ve said before, this is statistically interesting (as well as being just interesting) because any instance of non-transitive dice is also an instance of a non-transitive Wilcoxon/Mann-Whitney test. So what do we now know about the Wilcoxon test? Multiple frame sampling http://notstatschat.netlify.com/2024/04/26/multiple-frame-sampling/ Fri, 26 Apr 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/04/26/multiple-frame-sampling/ [Updated: ran it with the right version of the code] I’m writing code for multiple-frame surveys in the survey package now, and it’s at the stage where the basic stuff works (Revision 337 from r-forge) though there’s quite a bit more to implement. The canonical references are the papers by Lohr and Rao. This post is just me thinking about it. If I had actual artistic talent or an ethically-trained AI I’d illustrate this post with friendly monsters in the style of Alison Horst, but you’ll just have to imagine them. Importance weights http://notstatschat.netlify.com/2024/04/19/importance-weights/ Fri, 19 Apr 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/04/19/importance-weights/ When I wrote about weights I mentioned that there was in some senses a fourth type of weights after sampling weights, precision weights, and frequency weights. 
The idea is that sometimes you have weights that you want to apply to an estimating function, but that they don’t have the same ontological commitments that any of the three sets of weights come with. I’m working on dual-frame (and maybe multiple-frame) estimators for the survey package, and they are an example. Assumptions http://notstatschat.netlify.com/2024/04/14/assumptions/ Sun, 14 Apr 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/04/14/assumptions/ One problem in teaching statistics and communicating statistics and so on is “assumptions”. In fact, there’s at least two problems: Necessary vs sufficient The first problem is in maths communication. In maths you write down some assumptions and show they imply a conclusion. It’s usual in statistics for the assumptions to be sufficient for the conclusion, but pretty unusual for them to be necessary. People are bad at explaining the distinction to statisticians. Quantitative graphics? http://notstatschat.netlify.com/2024/04/05/intuitive-graphics/ Fri, 05 Apr 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/04/05/intuitive-graphics/ Two interesting examples from my e-bike: The first is common in e-bikes. The charge indicator is a set of five little rectangles inside a battery outline, which makes sense. It’s very non-linear, though. The first little rectangle is almost half the battery charge. When I mentioned this on Twitter some years ago the response was that lithium batteries are non-linear and there’s nothing that can be done about it. This superficially makes sense, until you think about it a bit. Symbolically nested http://notstatschat.netlify.com/2024/04/01/symbolically-nested/ Mon, 01 Apr 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/04/01/symbolically-nested/ ‘Symbolically nested’ is a phrase I invented to distinguish two different types of nested model when writing my book about survey analysis. 
There has been at least one question on Stack Overflow about the phrase, so I think it’s worth explaining in a bit more detail. Often, in math-stat discussions of nested models, you see the smaller model written with predictors $X$ and coefficients $\beta_X$ and the larger model written with predictors $(X,Z)$ and coefficients $(\gamma_X, \gamma_Z)$. Factors as factors http://notstatschat.netlify.com/2024/03/22/factors-as-factors/ Fri, 22 Mar 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/03/22/factors-as-factors/ With the long-awaited demise of stringsAsFactors=TRUE it’s now easier to use text strings in R. It’s good that strings don’t automatically get turned into factors at read time, but the price is that strings don’t automatically get turned into factors at read time: if you have variables that need to be factors, you have to turn them into factors yourself. Factors are still an important data type in R. They are R’s enumerated type; a factor knows what its possible levels are. Small-area estimates by smoothing direct estimates http://notstatschat.netlify.com/2024/03/09/small-area-estimates-by-smoothing-direct-estimates/ Sat, 09 Mar 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/03/09/small-area-estimates-by-smoothing-direct-estimates/ If we have a domain or subpopulation ${\cal D}$ and want to estimate the mean of a variable $Y$ in that domain, the usual survey estimator is \[\hat \mu_{\cal D}=\frac{\sum_{R_i=1} w_i Y_i I(i\in {\cal D})}{\sum_{R_i=1} w_i I(i\in {\cal D})}.\] That is, it’s the estimated population total in the domain divided by the estimated population count in the domain. We’ll call this a direct estimator; it depends only on data in the domain and is (approximately, depending on weight details) unbiased for the true mean. 
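The direct estimator above is just a ratio of two weighted sums, which can be sketched in a few lines of base R. The weights, outcomes, and domain indicator below are made-up numbers, and the variable names are hypothetical.

```r
# Made-up sampled records (all rows have R_i = 1): weight, outcome,
# and an indicator of membership in the domain D.
w    <- c(10, 20, 30, 40)
y    <- c(1, 2, 3, 4)
in_d <- c(TRUE, FALSE, TRUE, TRUE)

total_hat <- sum(w * y * in_d)  # estimated population total of Y in D
count_hat <- sum(w * in_d)      # estimated population count in D
mu_hat    <- total_hat / count_hat
mu_hat

# Equivalent: a weighted mean using only the records in the domain.
weighted.mean(y[in_d], w[in_d])
```

This gives the point estimate only; a standard error needs the design information as well.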
New in the survey package http://notstatschat.netlify.com/2024/03/08/new-in-the-survey-package/ Fri, 08 Mar 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/03/08/new-in-the-survey-package/ Version 4.4-1 of the survey package for R is percolating through CRAN. There are some important additions, visible and invisible. The main invisible addition is from Ben Schneider, who has written a set of C++ routines that do the multistage stratified variance calculations previously done by svyrecvar. The compiled versions are the default; use options(survey.use_rcpp=FALSE) to disable them. The C++ code is faster; perhaps more important is that it gives the same answers independently and so is a check on the central routine of the package. Ordinal outcomes: the LOCT DOOR http://notstatschat.netlify.com/2024/01/25/ordinal-outcomes-the-loct-door/ Thu, 25 Jan 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/01/25/ordinal-outcomes-the-loct-door/ The DOOR outcome strategy – “Desirability Of Outcome Ranking” – is a relatively new approach to composite outcomes in clinical trials. Rather than collapsing multiple outcomes – death, heart attack, new-onset angina, bad hair day – into a single binary ‘bad thing’, the idea is to rank the trial participants by how bad their outcome is. DOOR is obviously attractive: these bad events are not all equally bad, so we would like to use an analysis that treats worse events as actually worse. Recurrent events: increased susceptibility or latent risk? http://notstatschat.netlify.com/2024/01/12/recurrent-events-increased-susceptibility-or-latent-risk/ Fri, 12 Jan 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/01/12/recurrent-events-increased-susceptibility-or-latent-risk/ Suppose you find, by analysis of crime data, that a house which has been burgled once is more likely to be burgled again in the following few months. 
This could happen because the house just has high burglary risk, due to the neighbourhood it’s in, the availability of easy escape routes, indicators of wealth, and so on. It could also happen because burglars know to come back a few months later when you’ve bought new stuff. Asymptotics for linear mixed models http://notstatschat.netlify.com/2024/01/09/asymptotics-for-linear-mixed-models/ Tue, 09 Jan 2024 00:00:00 +0000 http://notstatschat.netlify.com/2024/01/09/asymptotics-for-linear-mixed-models/ Attention Conservation Notice: This is probably well known in some circles. Suppose you have a (parametric, Normal) linear mixed model \[Y=X\beta+Zb+\epsilon\] where $\epsilon$ are iid $N(0,\sigma^2)$ and $b$ are $N(0, \sigma^2V(\theta))$. Write $\Xi$ for the marginal covariance matrix of $Y$: \[\Xi = \mathrm{cov}[Y]=\sigma^2(I+ZVZ^T)\] The loglikelihood can be written \[\ell=-\frac{1}{2}\log\left|\Xi\right|-\frac{1}{2}\sum_{i,j} (Y_i-X_i\beta)\left[\Xi^{-1}\right]_{ij}(Y_j-X_j\beta).\] We might want to treat this as an $M$-estimation problem and use the asymptotic behaviour of $\ell(\beta,\sigma^2,\theta)$ plus the smoothness of $\ell$ to deduce the asymptotic behaviour of the parameter estimates. Why do the Rao-Scott tests have good size? http://notstatschat.netlify.com/2023/12/18/why-do-the-rao-scott-tests-have-good-size/ Mon, 18 Dec 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/12/18/why-do-the-rao-scott-tests-have-good-size/ Attention Conservation Notice: this is about multiparameter hypothesis tests, which are intrinsically not very interesting. In regression models (and contingency tables) for survey data, there are two classes of tests based on a division that’s more or less orthogonal to the score/Wald/LRT division. Consider score tests, and for notational simplicity pretend that we’re interested in a test of the whole model rather than some sensible submodel. 
Suppose $\check{U}(\theta)$ is the weighted score vector, and $\check{I}(\theta)$ is the weighted Fisher information, \[\check{I}(\theta)=E\left[\frac{\partial \check{U}}{\partial \theta}\right],\] and define \[V = \mathrm{var}\left[\sum\check{U}(\theta)\right]. How good is the leading eigenvalue approximation to quadratic forms? http://notstatschat.netlify.com/2023/12/14/how-good-is-the-leading-eigenvalue-approximation-to-quadratic-forms/ Thu, 14 Dec 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/12/14/how-good-is-the-leading-eigenvalue-approximation-to-quadratic-forms/ A quadratic form in $n$ Gaussian variables is $Z^TAZ$ where $Z\sim N(0,B)$ is a Normal $n$-vector and $A$ is an $n\times n$ matrix. It has the distribution $Q=\sum_i^n \lambda_i \chi^2_1$ where $\lambda_i$ are the eigenvalues of $AB$ in decreasing order. If $n$ is large, this is a bit annoying to work out, so we approximate it by \[\sum_{i=1}^k \lambda_i\chi^2_1+a_k\chi^2_{d_k}\] with \[a_k=\frac{\sum_{i=k+1}^n\lambda_i^2}{\sum_{i=k+1}^n\lambda_i}\] and \[d_k=\frac{\left(\sum_{i=k+1}^n\lambda_i\right)^2}{\sum_{i=k+1}^n\lambda_i^2}.\] When $k=0$ this is the traditional Satterthwaite approximation, which isn’t at all bad over the middle of the distribution but falls apart in the tails; it is exponentially too small for small tail probabilities. Why not REML? http://notstatschat.netlify.com/2023/12/12/why-not-reml/ Tue, 12 Dec 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/12/12/why-not-reml/ The svylme package does maximum (weighted pairwise) likelihood. Linear mixed model software tends to also provide REML, either as an option or as the default. Why not here? I don’t think it would be too hard to take the definition of the REML criterion and weight and pairwise it. The question is whether that’s actually what is wanted. REML deals with the bias in variance components caused by using up degrees of freedom on the fixed effects. 
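The degrees-of-freedom point is easiest to see in the simplest special case, an ordinary linear model with no random effects, where ML divides the residual sum of squares by $n$ and REML divides it by $n-p$. The sketch below uses simulated data and base R only; it illustrates that special case and says nothing about what svylme does.

```r
# Simulated data for an ordinary linear model (no random effects).
set.seed(1)
n <- 20
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

fit <- lm(y ~ x)
rss <- sum(resid(fit)^2)  # residual sum of squares
p   <- length(coef(fit))  # number of fixed-effect coefficients

sigma2_ml   <- rss / n        # maximum likelihood: biased downwards
sigma2_reml <- rss / (n - p)  # REML: adjusts for the p fitted coefficients

c(ML = sigma2_ml, REML = sigma2_reml)
```

The REML value here is the same residual variance that summary(fit) reports.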
Sparse correlation and the Central Limit Theorem http://notstatschat.netlify.com/2023/12/04/sparse-correlation-and-the-central-limit-theorem/ Mon, 04 Dec 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/12/04/sparse-correlation-and-the-central-limit-theorem/ Back when I was a PhD student working on generalisations of GEE, I was interested in ‘sparse’ correlations, defined by ‘most’ small sets of observations being independent. One way to get this structure is from crossed clustering variables; another is for the basic units in your analysis to be pairs (or larger tuples) of observations. If you drew a graph with the variables as vertices, and connected correlated variables, then two subsets of the variables would be independent of each other if there was no edge between them. svy2lme: the preprint http://notstatschat.netlify.com/2023/11/24/svy2lme-the-preprint/ Fri, 24 Nov 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/11/24/svy2lme-the-preprint/ The announcement There’s now a preprint describing the svylme package and svylme::svy2lme function for fitting linear mixed models to complex samples. The models are of the form \[Y =X\beta+Zb+\epsilon\] where $\mathrm{var}[\epsilon]=\sigma^2$ and $\mathrm{var}[b]=\sigma^2 V(\nu)$ for parameters $\theta=(\beta,\sigma^2,\nu)$. The package allows $V$ to be a linear combination of known matrices. You can have random intercepts and random slopes where any two $b$s are either independent or identical. You can also have terms like \[V=\tau^2_e I_e+\tau^2_a G_a+\tau^2_dG_d\], where $I_e$ is a block diagonal indicator of shared environment, and $G_a$ and $G_d$ give the correlation of additive and dominant genetic effects. 
Linear mixed models with pairwise likelihood http://notstatschat.netlify.com/2023/11/21/linear-mixed-models-with-pairwise-likelihood/ Tue, 21 Nov 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/11/21/linear-mixed-models-with-pairwise-likelihood/ The most important part of this post library(svylme) It exists! Currently you have to get it from github remotes::install_github("tslumley/svylme") Let’s look at an example. This is data from the New Zealand component of the PISA educational survey data(nzmaths) There is only one school in one of the strata, so we’ll combine two strata: nzmaths$cSTRATUM<- nzmaths$STRATUM nzmaths$cSTRATUM[nzmaths$cSTRATUM=="NZL0102"]<-"NZL0202" We have weights both for the student and the school; the condwt variable is the implied weights for the student stage of sampling. Benchmark Archaeology http://notstatschat.netlify.com/2023/08/24/archaeology/ Thu, 24 Aug 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/08/24/archaeology/ Thanks to Luis Apiolaza on Mastodon, I was looking at some old R mailing list messages. I found one comparing the speed of R and S (and S compiled to C) for a loop-intensive program LMS <- function(M, N) { ### Pre-allocate result and filter. ### R <- matrix(0,nrow=M, ncol=5) W <- rep(0,5) for (i in 1:M) { ### Simulate MA(1) ### Z <- rnorm(N+1) X <- Z[2:(N+1)] + 0. Quoting and requoting http://notstatschat.netlify.com/2023/08/04/quoting-and-requoting/ Fri, 04 Aug 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/08/04/quoting-and-requoting/ Someone asked on StackOverflow how to compute the kurtosis and its standard error using the survey package. They were thinking of using svyrecvar, the variance estimation function that underlies most things in the package. 
That’s not actually the easiest approach: svyrecvar works with estimating functions, but for the kurtosis we don’t have estimating functions and instead have an explicit definition in terms of totals: \[\kappa = \frac{E[(X-\mu)^4]}{E[(X-\mu)^2]^2}.\] It’s likely to be easier to work with svycontrast and represent $\hat\kappa$ as a function of estimated moments. Blank-cheque inheritance and statistical methods objects http://notstatschat.netlify.com/2023/06/07/blank-cheque-inheritance-and-statistical-objects/ Wed, 07 Jun 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/06/07/blank-cheque-inheritance-and-statistical-objects/ One of the problems with object-oriented programming for statistical methods is that inheritance is backwards. Everything is fine for data structures, and Bioconductor has many examples of broad (often abstract) base classes for biological information or annotation that are specialised by inheritance to more detailed and useful classes. Statistical methods go the other way. In base R, glm for generalised linear models is a generalisation of lm for linear models, but the glm class inherits from the lm class, not the reverse. Pairwise likelihood and cluster sizes http://notstatschat.netlify.com/2023/05/05/pairwise-likelihood-and-cluster-sizes/ Fri, 05 May 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/05/05/pairwise-likelihood-and-cluster-sizes/ So, I’m working on svylme again, for linear mixed models under complex sampling. It uses pairwise likelihood, following the basic idea from the Canadians, but extending it to settings where the design structure and model structure are different. It’s always hard finding examples to check against when you’re doing something new. After getting quite reasonable results from simulations, I tried an example from the lme4qtl package, which is a subset of a dataset on milk production by dairy cows. 
New in the survey package http://notstatschat.netlify.com/2023/05/04/new-in-the-survey-package/ Thu, 04 May 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/05/04/new-in-the-survey-package/ Version 4.2-1 of the survey package has hit CRAN! There are three major changes. First, when you ask for influence functions for an estimate, in order to get comparisons between subpopulations or something, you get an influence function for each record in the dataset – if there are missing values, the influence function is zero (a bit like the way na.exclude works). Second, regTermTest now handles missing data better – it used to assume that the two models you were comparing were fitted to the same data, but now it checks. Ranks in survey data http://notstatschat.netlify.com/2023/04/18/ranks-in-survey-data/ Tue, 18 Apr 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/04/18/ranks-in-survey-data/ Someone asked on CrossValidated about signed rank tests for complex sample data. They had defined a signed rank function signrank<-function(x) rank(abs(x))*sign(x) It’s then easy to use svymean to estimate the mean signed rank and its standard error. If we write $f$ for the transformation from the variable to the signed rank, then this gives a valid point estimate of the population mean of $f(X)$ and its standard error. You can then get a (approximately) valid test for the null hypothesis that the mean of $f(X)$ is zero. Class imbalance: bug or feature? http://notstatschat.netlify.com/2023/03/27/class-imbalance-bug-or-feature/ Mon, 27 Mar 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/03/27/class-imbalance-bug-or-feature/ Medical training, of necessity, over-represents unusual medical conditions. There are lots of unusual medical conditions, and trainees need to see at least a reasonable sampling of them. 
It is proverbial that new doctors tend to over-classify in favour of unusual medical conditions: in the 1940’s a medical professor said “When you hear hoofbeats behind you, don’t expect to see a zebra.” In machine learning, there is a lot of concern about class imbalance. Which infinite sequence? http://notstatschat.netlify.com/2023/03/20/which-infinite-sequence/ Mon, 20 Mar 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/03/20/which-infinite-sequence/ I’m at ENAR, and there are talks with asymptotic theory. One thing that caught my attention is problems with two different sample sizes, eg, a main sample and a validation sample. Call the two sample sizes $m$ and $n$. Theorems are then proved under the assumption that $m/n$ converges to a finite non-zero constant $C$. What is the statistical content of this assumption? In an application, we have one data set, with one particular value of $m$ and $n$. The fourth-root thing http://notstatschat.netlify.com/2023/03/07/the-fourth-root-thing/ Tue, 07 Mar 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/03/07/the-fourth-root-thing/ This post is partly because I think the result is interesting and partly to see if anyone will tell me an original reference. Suppose we get $\hat\beta$ by solving $U(\beta;\alpha)=0$ and that $\alpha$ is a nuisance parameter we plug into the equation. Assume that for any fixed $\alpha$, \[E[U(\beta_0;\alpha)]=0.\] Assume \[U(\beta,\alpha)=\frac{1}{n}\sum_{i=1}^n U_i(\beta,\alpha)\] and that $U$ converges pointwise (and in mean, assuming finite moments) to its expected value. Also assume enough other regularity that this leads to \[\sqrt{n}(\hat\beta-\beta)\stackrel{d}{\to} N(0,\sigma^2(\alpha)). 
Determinant of correlation matrix http://notstatschat.netlify.com/2023/03/06/determinant-of-correlation-matrix/ Mon, 06 Mar 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/03/06/determinant-of-correlation-matrix/ Last week I needed to know how big determinants of correlation matrices could get, and I couldn’t immediately find a proof on Google, so I worked one out. Theorem: The determinant of a correlation matrix is at most 1 Proof: Let $p$ be the dimension of the matrix. The trace of the matrix is $p$, and that’s the sum of the eigenvalues, so the arithmetic mean of the eigenvalues is 1. Sandwiches and aggregation http://notstatschat.netlify.com/2023/02/21/sandwiches-and-aggregation/ Tue, 21 Feb 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/02/21/sandwiches-and-aggregation/ Demetri Pananos asked a question on Stack Overflow and then on Twitter about the behaviour of the sandwich package for aggregated count data, and Achim Zeileis pinged me. If you have a Poisson regression and you have $N_i$ observations $Y_{ij}$ that share the same covariates you can aggregate them into a single observation $Y_{i\cdot}=\sum_j Y_{ij}$ with an offset $\log N_i$, and note that \[\log E[Y_{ij}|X_{i}]=X_i\beta\] implies \[\log E[Y_{i\cdot}|X_{i},N_i]=X_i\beta+\log N_i.\] When is population mean rank a thing? http://notstatschat.netlify.com/2023/02/10/when-is-population-mean-rank-a-thing/ Fri, 10 Feb 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/02/10/when-is-population-mean-rank-a-thing/ You used to see a lot of the incorrect and misleading description of the Wilcoxon rank-sum and Kruskal-Wallis tests as comparing medians. I’ve tried from time to time, without success, to find where this idea originated. The motivation is clear, though: tests are much more useful if you know what they are testing. Increasingly, people know the ‘medians’ explanation of the Wilcoxon test isn’t true and recognise that it isn’t helpful. 
Checking proportionality of odds http://notstatschat.netlify.com/2023/02/09/checking-proportionality-of-odds/ Thu, 09 Feb 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/02/09/checking-proportionality-of-odds/ The proportional odds model The proportional odds model is a generalisation of the logistic model. If you have a $K$-level ordered factor, you could dichotomise it in $K-1$ different places, and get $K-1$ logistic regression models. The intercepts of these models have to be different, but the slopes could (in principle) be the same or same-ish. That’s the proportional odds model \[\mathrm{logit}\, P(Y>k)=\alpha_k+\beta X\] It’s quite attractive as a generative model for ordinal data; it enforces stochastic ordering and has lots of choices for link functions. Linkage and multiple imputation http://notstatschat.netlify.com/2023/01/06/linkage-and-multiple-imputation/ Fri, 06 Jan 2023 00:00:00 +0000 http://notstatschat.netlify.com/2023/01/06/linkage-and-multiple-imputation/ For a while, I’ve been thinking about multiple imputation and its extension to record linkage. Record linkage is about taking two (or more) data sets and saying that record A in data set 1 is probably the same person as record 23 in data set 2. The probabilistic framework for it dates back to (or is at least routinely attributed to) Fellegi & Sunter (eg). It says that two records will show the same name and year of birth, say, either if the data are correct and they are the same person, or if the data are wrong and just happen to agree, and all the possibilities can be modelled. Pairwise and joint independence http://notstatschat.netlify.com/2022/12/09/pairwise-and-joint-independence/ Fri, 09 Dec 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/12/09/pairwise-and-joint-independence/ It’s intuitively plausible that pairwise independence of a set of variables implies joint independence. Sadly, it’s not true. 
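The standard textbook counterexample is small enough to check by brute force: two fair coin flips and their exclusive-or. The R sketch below enumerates the four equally likely outcomes; the variable names are hypothetical and the construction is the usual textbook one, not something taken from this post.

```r
# The four equally likely outcomes of two fair coin flips x and y,
# together with z = x XOR y.
omega   <- expand.grid(x = 0:1, y = 0:1)
omega$z <- (omega$x + omega$y) %% 2
p       <- rep(1 / 4, nrow(omega))  # probability of each outcome

# Probability of an event, given as a logical vector over the outcomes.
prob <- function(event) sum(p[event])

# Pairwise: each joint probability is the product of the marginals.
p_xz <- prob(omega$x == 1 & omega$z == 1)
p_yz <- prob(omega$y == 1 & omega$z == 1)
p_xy <- prob(omega$x == 1 & omega$y == 1)
c(p_xy, p_xz, p_yz)  # each should equal 1/2 * 1/2

# Jointly: all three cannot equal 1 at once, so the joint probability
# is 0 rather than the product of the three marginals, (1/2)^3.
p_xyz <- prob(omega$x == 1 & omega$y == 1 & omega$z == 1)
p_xyz
```

For binary variables, checking the events "equals 1" is enough to establish pairwise independence.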
Not everyone seems to know my favourite simple example, a chessboard: (image credit) If you pick a point uniformly at random from the chessboard (or just pick a square uniformly at random), the row, column, and colour are pairwise independent. The row and column are independent because of uniform sampling; in any row, the colour is independent of the $x$-coordinate; in any column, the colour is independent of the $y$-coordinate. A short note on effect sizes http://notstatschat.netlify.com/2022/12/06/a-short-note-on-effect-sizes/ Tue, 06 Dec 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/12/06/a-short-note-on-effect-sizes/ From time to time, I get asked about estimating effect sizes in the survey package. I don’t really use effect sizes, because in the applied fields where I work, people are directly interested in the $X$ variables they measure. They think about the effect of, say, differences in systolic blood pressure and heart disease risk in terms of blood pressure differences, measured in mmHg. They expect the impact of a 10mmHg difference to be similar in similar populations, and they prefer their $\beta$s to be in these units. The sandwich and the t-test http://notstatschat.netlify.com/2022/12/01/the-sandwich-and-the-t-test/ Thu, 01 Dec 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/12/01/the-sandwich-and-the-t-test/ As every schoolchild knows, you can derive the Student $t$-test as a linear regression with a single binary predictor. How about the Welch/Satterthwaite unequal-variance $t$-test? We have a technique for handling linear regression with unequal variances in the responses, the ‘model-agnostic’ or ‘model-robust’ sandwich estimator. You might wonder what happens if you use the sandwich estimator on a linear regression with a single binary predictor. Let $X$ be binary, coded so it has zero mean (so that it’s orthogonal to the intercept) and fit a linear model with $Y$ as the outcome and $X$ as the predictor: \[E[Y]=\alpha+\beta X. 
Bus pruning http://notstatschat.netlify.com/2022/11/05/bus-pruning/ Sat, 05 Nov 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/11/05/bus-pruning/ As I said last time, there have been quite a lot of bus cancellations in Auckland recently. Auckland Transport have decided to (temporarily) give up on some of the more-frequently cancelled trips, so that people can plan their travel more sensibly, which is actually a good idea. So, which buses are we losing? The static GTFS data at AT currently describes the new timetable and the current timetable. Improving a graph http://notstatschat.netlify.com/2022/11/03/improving-a-graph/ Thu, 03 Nov 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/11/03/improving-a-graph/ A lot of buses are being cancelled in Auckland at the moment. This is partly due to Covid, but also due to difficulty in recruiting bus drivers because of poor pay and conditions. And probably other reasons, too. I’ve put about six weeks of daily cancellation data in a GitHub gist. Here’s a default graph: d<-read.table("https://gist.githubusercontent.com/tslumley/9ac8df14309ecc5936183de84b57c987/raw/9ebf665b2ff9a93c1dbc73caf5ff346909899827/busdata.txt",header=TRUE) d$date<-as.Date(paste(2022, d$mo, d$d,sep="-")) plot(cancels~date, data=d) There are a lot of cancellations, but otherwise it’s not all that clear. Code archaeology: polynomial distributed lags http://notstatschat.netlify.com/2022/10/14/code-archaeology-polynomial-distributed-lags/ Fri, 14 Oct 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/10/14/code-archaeology-polynomial-distributed-lags/ Back in the early 2000s, when I was working on air pollution epidemiology, I wrote some code to fit polynomial distributed lag models. These are a slightly primitive form of regularisation for when you want several lags of an exposure variable as predictors. Last week I was looking for an R package to fit these models, for a student working on Covid wastewater modelling. 
I didn’t find an R package, but my code was still on the University of Washington website, here. A plug-in uniform law of large numbers http://notstatschat.netlify.com/2022/09/28/uniform-law-of-large-numbers/ Wed, 28 Sep 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/09/28/uniform-law-of-large-numbers/ Attention Conservation Notice: I’m writing this down because I just spent too long trying to find a citation for it to give a student. A useful citation for many purposes is Newey WK (1989) “Uniform Convergence in Probability and Stochastic Equicontinuity” Princeton Econometric Research Memorandum No 342 There are lots of laws of large numbers: theorems whose conclusions are that an average $\bar X_n$ converges in some useful sense to an expected value $\mu$. Looking back http://notstatschat.netlify.com/2022/09/15/looking-back/ Thu, 15 Sep 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/09/15/looking-back/ Attention Conservation Notice: NZ alternative history. A week or so ago this would have been a timely response to two big news stories involving 50-year estimates of large sums of money to a purported two decimal place accuracy. With the new NZ Super reforms just enacted, it is a good time to look back at the nearly five decades of the controversial national savings scheme. Few would deny that there has been some benefit; Kiwis over 65 are now the wealthiest age group and old-age poverty is no longer an expectation, at least for the middle class. Tracking down a Real Data Set(tm) http://notstatschat.netlify.com/2022/08/14/tracking-down-a-real-data-set/ Sun, 14 Aug 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/08/14/tracking-down-a-real-data-set/ As you may have noticed, I’ve been writing software for handling multiple-response data: a package for data manipulation and another one for modelling. Ivy Liu sent me two data sets she had used in a paper. 
One was an interesting experiment from a linguist colleague in Wellington; the other was from a competing stats paper by Bilder and Loughin proposing a different analysis method. This data set was on the relative risk of urinary tract infection in 239 women at a university according to the type(s) of contraception they used. ASCII and beyond — will it play in Peoria? http://notstatschat.netlify.com/2022/07/26/ascii-and-beyond-in-packages/ Tue, 26 Jul 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/07/26/ascii-and-beyond-in-packages/ One of the checks in R CMD check is for non-ASCII strings in data objects. Another is for non-ASCII variable names. At first, this sounds wrong. Why shouldn’t you use non-ASCII strings in data objects? If you want to count the number of cafés in Ōtautahi serving guláš, R can handle the accent aigu and the tohutō and the háček perfectly well. There are two quite separate issues. First, can you, dear reader, use non-ASCII strings properly? Tidying rimu http://notstatschat.netlify.com/2022/07/23/tidying-rimu/ Sat, 23 Jul 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/07/23/tidying-rimu/ I’ve been working on the rimu package recently, which handles multiple-response categorical data. This has involved miscellaneous fixing, but also getting these new types to work happily with data frames, in preparation for the rata package that will support modelling and inference1. For example, consider this little six-observation pretend dataset: library(rimu) data(nzbirds) seen <- as.mr(nzbirds) seen ## [1] "kea+tui" "kea+ruru+kaki" "ruru" ## [4] "ruru+tui" "kea+ruru+tui+kaki" "kea+?ruru+tauhou" str(seen) ## 'mr' logi [1:6, 1:5] kea+tui kea+ruru+kaki ruru ruru+tui kea+ruru+tui+kaki kea+tauhou ## - attr(*, "dimnames")=List of 2 ## . 
Combining a survey and other data http://notstatschat.netlify.com/2022/07/15/combining-a-survey-and-other-data/ Fri, 15 Jul 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/07/15/combining-a-survey-and-other-data/ Sometimes you want to do inference that combines a real probability survey sample with some other data. I first looked at this when working on rank tests for survey data, where people had tried to compare data from convenience samples to something like NHANES as a population reference. If this was a $t$-test, the inference could just be based on the distribution of the estimated mean in the survey and in the lab data, but for rank tests the two samples need to be combined. Self-promotion: an actual multiwave two-phase design http://notstatschat.netlify.com/2022/07/06/self-promotion-an-actual-multiwave-two-phase-design/ Wed, 06 Jul 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/07/06/self-promotion-an-actual-multiwave-two-phase-design/ We have a paper coming out in Biometrics describing a reasonably complicated example of validation sampling from electronic health records. “We” here is a lot of people: Bryan Shepherd, Kyunghee Han, Tong Chen, Aihua Bian, Shannon Pugh, Stephany Duda, Thomas Lumley, William Heerman, Pamela Shaw. Pam and Bryan led the project from the statistical side, William Heerman was the clinical lead. The project was to see how mothers’ weight gain during pregnancy affected the risk of asthma and childhood obesity in kids. Getting strings into code in base R http://notstatschat.netlify.com/2022/06/23/getting-strings-into-code-in-base-r/ Thu, 23 Jun 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/06/23/getting-strings-into-code-in-base-r/ I’m reasonably often asked how to take the value of a character string variable and use it as a variable name in, eg, the survey package. This is the sort of quasiquotation that the tidyverse uses a lot. 
It’s needed much more often in the tidyverse because of the use of bare variable names as function arguments, but sometimes you need it in base R as well. I should first say that quasiquotation in base R should be a last resort. Design |> Data: Ihaka Lectures 2022 http://notstatschat.netlify.com/2022/06/21/design-data-ihaka-lectures-2022/ Tue, 21 Jun 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/06/21/design-data-ihaka-lectures-2022/ They’re back. Partly. We managed to have the Ihaka Lectures in person for 2020 and 2021, but we chickened out on 2022 and are going for internet-based talks. Covid permitting, we will still have an in-person event in Auckland to watch the talks and interact with the speakers. On the bright side, this means we can get great speakers who might not come all the way to Auckland. We have Tables with zeroes http://notstatschat.netlify.com/2022/04/17/tables-with-zeroes/ Sun, 17 Apr 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/04/17/tables-with-zeroes/ A survey package user asked me (and StackExchange) how to do tests of independence in contingency tables when there’s a zero cell in the table. I didn’t do the sensible thing and check data(api) dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc) svytable(~sch.wide+comp.imp, dclus1) ## comp.imp ## sch.wide No Yes ## No 778.4809 0.0000 ## Yes 913.8689 4501.6505 svychisq(~sch.wide+comp.imp, dclus1) ## ## Pearson's X^2: Rao & Scott adjustment ## ## data: svychisq(~sch. stringsAsFactors=do_you_feel_lucky http://notstatschat.netlify.com/2022/03/31/stringsasfactors-do-you-feel-lucky/ Thu, 31 Mar 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/03/31/stringsasfactors-do-you-feel-lucky/ Character string variables have suddenly become much more common in R, with the default stringsAsFactors=FALSE. That’s often good, but factors are actually an important data type. In particular, factors know what levels they have, but strings don’t. 
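To make the point about levels concrete, here is a small base-R sketch. The county names and the three-level set are invented for the illustration and are not taken from the API data.

```r
# Hypothetical toy data: three possible counties, but only two are observed
county_chr <- c("Alameda", "Alameda", "Fresno")
county_fct <- factor(county_chr, levels = c("Alameda", "Fresno", "Kern"))

# A character vector can only be tabulated over the values that happen to be present
tab_chr <- table(county_chr)

# A factor keeps its full set of levels, so the empty category shows up as a zero
tab_fct <- table(county_fct)

# Subsetting makes the difference more visible
table(county_chr[county_chr != "Fresno"])   # only Alameda appears
table(county_fct[county_fct != "Fresno"])   # Alameda, plus zero counts for Fresno and Kern
```

The zero cells matter as soon as tables from different subsets need to line up, since the character version silently changes shape from one subset to the next.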
Suppose, following a very helpful bug report I received today, you want to estimate the proportions of California schools in each county, and you want to do this separately for schools that do and don’t meet their improvement targets in standardised tests. Nine and sixty ways http://notstatschat.netlify.com/2022/02/20/relevant-asymptotics/ Sun, 20 Feb 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/02/20/relevant-asymptotics/ Attention Conservation Notice: there’s not actually anything new here A few years ago, in a (mostly positive) review of Proschan & Shaw’s Essentials of Probability Theory for Statisticians, I wrote As some statistics students notice, there’s a bit of a bait-and-switch when we talk about rigor for statistics but then prove theorems about infinite sequences of real-valued random variables. Actual random variables are available for only one $n$ in the infinite sequence, and are discrete and bounded. Comparing tests for generalised linear models in survey data http://notstatschat.netlify.com/2022/01/28/comparing-tests-for-generalised-linear-models-in-survey-data/ Fri, 28 Jan 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/01/28/comparing-tests-for-generalised-linear-models-in-survey-data/ In ordinary statistics, there are three popular types of model-based tests: score tests, Wald tests, and likelihood ratio tests. They agree pretty well; they are locally asymptotically equivalent; the picture is here. In survey data, things get more complicated because you’re not actually fitting by maximum likelihood. There are two branches to testing: ‘working model’ tests, where you take the test statistic you would use under iid sampling and compute its actual distribution under the null hypothesis that the population data generating process satisfies the null hypothesis. 
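For the ordinary (iid, non-survey) setting mentioned at the start of that post, the three tests can be put side by side in base R. This is a sketch on simulated data with made-up coefficients. It does not touch the survey versions, which need the design information.

```r
# Simulated logistic regression data (made-up coefficients, illustration only)
set.seed(1)
n <- 200
x <- rnorm(n)
z <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.4 * x + 0.3 * z))

null_fit <- glm(y ~ x, family = binomial)      # model without z
full_fit <- glm(y ~ x + z, family = binomial)  # model with z

# Wald test: based on the estimate and standard error from the full model
p_wald <- coef(summary(full_fit))["z", "Pr(>|z|)"]

# Likelihood ratio test: compares the deviances of the two fits
p_lrt <- anova(null_fit, full_fit, test = "LRT")[2, "Pr(>Chi)"]

# Rao score test: as provided by anova() for glm objects
p_score <- anova(null_fit, full_fit, test = "Rao")[2, "Pr(>Chi)"]

c(wald = p_wald, lrt = p_lrt, score = p_score)
```

With a moderate sample size and a single tested coefficient the three p-values would be expected to be close, which is the "agree pretty well" of the iid case.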
Optimal design for raking/AIPW estimation http://notstatschat.netlify.com/2022/01/06/optimal-design-for-raking-aipw-estimation/ Thu, 06 Jan 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/01/06/optimal-design-for-raking-aipw-estimation/ The second paper from Tong Chen’s PhD thesis with me has just been published at Statistics in Medicine. First, here’s what an AI thinks of it: As I mentioned back here, you can work out the optimal stratum allocations for the inverse-probability-weighted (IPW) version of any1 estimator by using the classical “Neyman Allocation” formula on the influence functions of the estimator. The estimator is approximately the sum of its influence functions (which is what influence functions are for), and Neyman allocation works for optimal estimation of sums. Per capita, in mice http://notstatschat.netlify.com/2022/01/02/per-capita-in-mice/ Sun, 02 Jan 2022 00:00:00 +0000 http://notstatschat.netlify.com/2022/01/02/per-capita-in-mice/ This was roughly my plenary talk at Herenga Delta, the 13th Southern Hemisphere Conference on the Teaching and Learning of Undergraduate Mathematics and Statistics. I was invited to give a talk around StatsChat. It’s only approximately my talk because it’s based on working notes rather than a transcript. I also gave a shorter talk with some of this material at Skepticon, the joint NZ and Australian Skeptics conference Top posts from 2021 http://notstatschat.netlify.com/2021/12/31/top-posts-from-2021/ Fri, 31 Dec 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/12/31/top-posts-from-2021/ From 2020:, the different ways the term “weights” is used in statistics and when it matters. See also, when is and isn’t it ok to just subset a survey data set by dropping records From 2019: What have I got against the Shapiro-Wilk test? From May: Why causal models are relevant to prediction: because they are relevant to generalisation. See also, mushrooms as an example From February: Co-linearity. 
“Collinearity diagnostics aren’t much help, because they don’t tell you whether you’re interested in $\beta_X$ or $\gamma_X$. Is it binary? http://notstatschat.netlify.com/2021/11/10/is-it-binary/ Wed, 10 Nov 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/11/10/is-it-binary/ As part of adding crossed correlations (and other sparse correlation) to the survey package, I was writing code to test whether a user-supplied adjacency matrix had only 0 and 1 values. Since this is just a validity test for a user-supplied argument, it will usually pass and needs to pass as fast as possible. How fast it fails is less important. Also, since the survey package is pure R, it needs to be pure R. Crossed clustering and parallel invention http://notstatschat.netlify.com/2021/09/18/crossed-clustering-and-parallel-invention/ Sat, 18 Sep 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/09/18/crossed-clustering-and-parallel-invention/ This week, I was prompted to find some old R code xeffect.glm<-function(glm.obj,g1,g2){ if (!exists("rowsum")) require(survival4) umat<-estfun.glm(glm.obj) usum1<-rowsum(umat,g1,reorder=F) usum2<-rowsum(umat,g2,reorder=F) g1a<-as.numeric(as.factor(g1)) g2a<-as.numeric(as.factor(g2)) g12<-g1a*(1+max(g2a))+g2a usum12<-rowsum(umat,g12,reorder=F) utu<-(t(usum1)%*%usum1)+t(usum2)%*%usum2-t(usum12)%*%usum12 modelv<-summary(glm.obj)$cov.unscaled modelv%*%utu%*%modelv } You can tell it’s old: the use of F rather than FALSE indicates I was still using S-PLUS reasonably often, and it was a time when rowsum wasn’t in base R (or was only recently). I could find the code because it’s been sitting on my old UW web page, which they kindly haven’t taken down yet. 
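For comparison, here is roughly how the same calculation might look in current base R. This is a sketch, not the original function: `xeffect_sketch` is a name I made up, the per-observation score contributions are computed from working weights and working residuals as a stand-in for `estfun.glm` (whose source isn't shown here), the intersection clusters are labelled with `paste()` rather than arithmetic on the codes, and missing data is assumed away.

```r
# Two-way (crossed) cluster sandwich variance for a glm, base R only.
# xeffect_sketch is a hypothetical name; it is not part of any package.
xeffect_sketch <- function(fit, g1, g2) {
  # Per-observation score contributions: x_i * working weight * working residual
  umat <- model.matrix(fit) * (weights(fit, "working") * residuals(fit, "working"))
  # Cluster totals for each clustering and for their intersection
  usum1  <- rowsum(umat, g1, reorder = FALSE)
  usum2  <- rowsum(umat, g2, reorder = FALSE)
  usum12 <- rowsum(umat, paste(g1, g2, sep = ":"), reorder = FALSE)
  # Add the two one-way terms and subtract the doubly-counted intersection term
  utu <- crossprod(usum1) + crossprod(usum2) - crossprod(usum12)
  modelv <- summary(fit)$cov.unscaled
  modelv %*% utu %*% modelv
}

# Simulated example with two crossed grouping variables (made-up data)
set.seed(42)
n <- 300
firm <- sample(1:20, n, replace = TRUE)
year <- sample(1:10, n, replace = TRUE)
x <- rnorm(n)
y <- rpois(n, exp(0.2 + 0.3 * x))
fit <- glm(y ~ x, family = poisson)

v_crossed <- xeffect_sketch(fit, firm, year)
v_oneway  <- xeffect_sketch(fit, firm, firm)  # same grouping twice
v_crossed
```

Passing the same grouping variable twice is a handy sanity check: the intersection term cancels one of the two identical one-way terms, leaving the ordinary one-way cluster sandwich.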
Score tests: surprisingly annoying http://notstatschat.netlify.com/2021/09/10/score-tests-surprisingly-annoying/ Fri, 10 Sep 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/09/10/score-tests-surprisingly-annoying/ I’ve just been adding score tests for generalised linear models to the survey package. The folklore on score tests is that they’re computationally easy because you only need to fit the restricted model, not the full model. This… turns out not to be the case. It’s true in some settings that score tests are easy. The big example is when the null model really is a null model and, eg, you get the Wilcoxon rank test instead of having to go and fit a proportional odds model, or when the score test in a bunch of different generalised linear models reduce to the same covariance statistic. Ordinal data, metadata, and models http://notstatschat.netlify.com/2021/09/03/ordinal-data-and-models/ Fri, 03 Sep 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/09/03/ordinal-data-and-models/ Ok, this is another attempt at clarifying my thinking about what is and isn’t problematic with ordinal data. We have: ordinal data, as a scale of measurement: data that has a finite (or, I suppose, infinite) set of possible values and where the metadata specifies a linear ordering on the values but nothing more. rank tests: tests that are functions only of the ranks of the observations in a dataset ordinal models: genuine parametric or semiparametric models that imply an ordering (usually, but not necessarily a linear order) of values but not on any other constraints on the values. 
Pictures of code are not code http://notstatschat.netlify.com/2021/08/21/pdf-is-not-a-repository/ Sat, 21 Aug 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/08/21/pdf-is-not-a-repository/ We’re in Covid lockdown again in Auckland, and it’s a weekend, so I happened to be on the Stats stackoverflow site.1 Someone was trying to use R to check MLE calculations for a few papers that propose new parametric models – the “overgeneralised Gamma distribution” genre of statistical literature. These papers typically have “real data sets”2, which might provide a good introduction to optimisation problems that are a bit harder than generalised linear models but don’t require any specialised knowledge. Wellington buses http://notstatschat.netlify.com/2021/08/13/wellington-buses/ Fri, 13 Aug 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/08/13/wellington-buses/ When there are problems with public transport it’s much more salient than when things are going well, so I think it’s helpful to have consistent data available on what the problems actually are. For a few years, the bot @tuureiti has been tweeting status information about the Auckland bus network: number of buses, a graph of delays, and now a summary of headway. Now it’s Wellington’s turn. Wellington used to have an open but non-standard information feed. 
The New Oil http://notstatschat.netlify.com/2021/08/07/the-new-oil/ Sat, 07 Aug 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/08/07/the-new-oil/ Data is the New Oil is a seriously underrated metaphor: started off as just a curiosity for nerds creates new companies and industries and vastly improves the productivity of some existing ones control over sources or extraction/transport infrastructure makes some individuals/companies/nations rich and powerful this has absolutely no tendency to be good for their character and ethics transporting and processing large quantities increases the risk of toxic leaks, which can cause a lot of local damage when successfully used, produces exhaust products that are relatively harmless in small amounts but cumulatively dangerous use becomes integrated into the economy and is hard for individuals to avoid which is taken as consent by those who control the extraction and exploitation Maintenance of Headway http://notstatschat.netlify.com/2021/08/04/maintenance-of-headway/ Wed, 04 Aug 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/08/04/maintenance-of-headway/ The Twitter bus bot, @tuureiti, which I wrote about before, is back! It now has a new feature. In addition to showing a dot plot of bus delays, and a summary of the percentage on time, it summarises headway. The details might change in the future1, but this post describes what it does now. Headway is the spacing between buses. Maintenance of headway is very important in running a high-frequency bus route. Subsets and subpopulations in survey inference http://notstatschat.netlify.com/2021/07/22/subsets-and-subpopulations-in-survey-inference/ Thu, 22 Jul 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/07/22/subsets-and-subpopulations-in-survey-inference/ A question I’ve now answered enough times that it’s worth turning into a blog post: when it is ok and not ok to take a subset of a probability sample? It matters Let’s look at a simple example. 
We can use the API dataset built into the survey package to estimate the number of school students in California in the year 2000. suppressMessages(library(survey)) data(api) dclus1<-svydesign(id=~dnum,weights=~pw, data=apiclus1, fpc=~fpc) svytotal(~enroll, dclus1) ## total SE ## enroll 3404940 932235 Suppose we wanted to estimate the number of students at high schools in California. What's new in the survey package http://notstatschat.netlify.com/2021/07/20/what-s-new-in-the-survey-package/ Tue, 20 Jul 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/07/20/what-s-new-in-the-survey-package/ There’s a new version of the survey package on CRAN (yay!). A lot of it is minor or relatively esoteric fixes. There’s one big change, which will break some people’s code. The svyquantile function has been completely rewritten. You might naively think quantiles are easy, but they aren’t The $p$th quantile is defined as the value where the estimated cumulative distribution function is equal to $p$. As with quantiles in unweighted data, this definition only pins down the quantile to an interval between two observations, and a rule is needed to interpolate. Not all strictly monotone functions are additive http://notstatschat.netlify.com/2021/07/16/not-all-strictly-monotone-functions-are-additive/ Fri, 16 Jul 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/07/16/not-all-strictly-monotone-functions-are-additive/ Lemma: Suppose $X_1,\dots,X_K$ are bounded non-negative random variables, $f_1,\dots, f_K$ are bounded non-negative functions of the corresponding $X_k$, and $g(X_1,\dots,X_K)=\sum_{k=1}^K f_k(X_k)$. Write ${\cal G}$ for the class of functions generated this way. Write ${\cal H}$ for the class of strictly monotone functions mapping the $X_k$ onto $[0,1]$. Then ${{\cal G}}$ is not dense in ${\cal H}$ Proof It’s obvious, innit. This post is actually a complaint about grading policies. 
When rating things there is increasing pressure to break the problem down into little pieces (two cheers for reductionism), assign points to each one (two cheers for maths), and then add them up. Housing unaffordability hexmaps http://notstatschat.netlify.com/2021/06/11/housing-unaffordability-hexmaps/ Fri, 11 Jun 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/06/11/housing-unaffordability-hexmaps/ A couple of days ago, Emma Vitz posted some depressing maps of NZ housing affordability on Twitter. Here’s one of them The map shows, for each NZ region, the difference between the median household income and the income required to buy the average house. ‘Income required’ is based on the recommendation to spend no more than 30% of your income on housing, together with assuming a 20% deposit and 4% mortgage interest. Generalisability, prediction, and causation http://notstatschat.netlify.com/2021/05/02/generalisability-prediction-and-causation/ Sun, 02 May 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/05/02/generalisability-prediction-and-causation/ One of the big steps forward in statistics over the past few decades is the widespread appreciation that regression modelling for causal inference and predictive inference are different1. In causal inference you choose your model so that one of the coefficients means what you want it to mean; in predictive inference you choose your model so that it predicts well, and you don’t care about the interpretations of the coefficients. A modest proposal for matrix multiplication http://notstatschat.netlify.com/2021/04/01/a-modest-proposal-for-matrix-multiplication/ Thu, 01 Apr 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/04/01/a-modest-proposal-for-matrix-multiplication/ Suppose you have a data frame containing a matrix. On tidy principles, it should probably be stored with the values in one column, the row ids in another, and the column ids in another. 
If you’ve got two matrices they could be in different data frames, or they could be in different rows of the same data frame, like this mat_mult <-function(.data, row, col,value, matrix_id){ .data %>% distinct({{matrix_id}}) -> ids . Phobos and Deimos and public speaking http://notstatschat.netlify.com/2021/03/19/phobos-and-deimos-and-public-speaking/ Fri, 19 Mar 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/03/19/phobos-and-deimos-and-public-speaking/ The moons of Mars are called Phobos and Deimos, named for twin sons of the war god Ares in Ancient Greek myth. In the past, you’d look for explanations of the names in astronomy books, and they would often give the translations ‘terror’ and ‘fear’, respectively, which is a bit unhelpful. It wasn’t until reading Brett Devereaux’s blog post on pre-battle speeches in history vs in the Lord of the Rings that I learned more about the distinction (though it is now well-described in Wikipedia). Two-phase sampling notation http://notstatschat.netlify.com/2021/02/15/two-phase-sampling-notation/ Mon, 15 Feb 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/02/15/two-phase-sampling-notation/ Attention conservation notice: an attempt to get a small number of other people, probably not including you, to adopt our notation. Together with groups at the University of Pennsylvania and Vanderbilt, we have been working on methods for the design and analysis of two-phase samples, samples taken from an existing cohort or database to measure new variables. The problem combines measurement-error, missing-data, and sampling ideas, so questions of notation can get fraught. 
Co-linearity http://notstatschat.netlify.com/2021/02/11/co-linearity/ Thu, 11 Feb 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/02/11/co-linearity/ Co-linearity of regression predictors (often ‘multicollinearity’) is one of those topics where a lot of regression textbooks are Unhelpful, in part because they don’t think about the reasons for fitting a model. Their idea is, often, that you should examine diagnostics that tell you whether there is multicollinearity, and use these diagnostics to remove variables from your model. Let us suppose we have two predictor variables $X$ and $Z$ and an outcome $Y$, and that the models \[E[Y|X=x]=\beta_0+\beta_Xx\] and \[E[Y|X=x,Z=z]=\gamma_0+\gamma_Xx+\gamma_Zz\] are (approximately) correctly specified. They're back! http://notstatschat.netlify.com/2021/02/02/they-re-back/ Tue, 02 Feb 2021 00:00:00 +0000 http://notstatschat.netlify.com/2021/02/02/they-re-back/ The Ihaka Lectures for 2021! The speakers are all local, because quarantine and borders and so on, and wonderful because just because. Dr Simon Urbanek is a member of R Core and has been on the academic staff in the Statistics department since just before the borders were closed last year. He’s talking about infrastructure for distributed statistical computing. Prof Rhema Vaithianathan is a health economist and is Director of the Centre for Social Data Analytics at AUT. Emma Lathen e-books http://notstatschat.netlify.com/2020/12/31/emma-lathen-e-books/ Thu, 31 Dec 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/12/31/emma-lathen-e-books/ Attention conservation notice: genre fiction from decades ago Emma Lathen wrote a long series of light mystery novels starring John Putnam Thatcher, the senior vice president of a large investment bank. This was back in the days when investment bankers didn’t have the same Gilded-Age-Vampire-Squid associations, and the contrast with modern views of bankers is probably the way they’ve dated most. 
As the books were written by two women with professional careers, they have been visited by the Suck Fairy less than one might fear1. Top posts in 2020 http://notstatschat.netlify.com/2020/12/31/top-posts-in-2020/ Thu, 31 Dec 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/12/31/top-posts-in-2020/ Attention Conservation Notice: these are the most popular posts, so if you cared you would already have seen them According to Google Analytics, these were the most popular pages on this site in 2020: What have I got against the Shapiro-Wilk test? This 2019 post was popular because it has been linked on StackOverflow and by Hadley Wickham, which is why it’s at the top. I still think it’s worth reading. Planning a new data management course http://notstatschat.netlify.com/2020/12/18/planning-a-new-data-management-course/ Fri, 18 Dec 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/12/18/planning-a-new-data-management-course/ We have an existing course at third-year undergraduate level that reprises the second-year regression course using SAS and also does a bit more SAS for data management. We’re going to get rid of it, and this is an early draft of my plan for its replacement (probably in 2022). The original working title for the replacement was “Ecumenical Data Wrangling”: that is, data management taught in a form that isn’t specific to SAS or specific to R (or, Lorde help us, Excel). 
Neyman Allocation, only exact http://notstatschat.netlify.com/2020/11/05/neyman-allocation-only-exact/ Thu, 05 Nov 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/11/05/neyman-allocation-only-exact/ If you have a fixed number of observations to allocate over a set of sampling strata and want to estimate the population total or mean of a variable $Y$, Neyman proved in 1934 that the optimal allocation is proportional to the population size in the stratum, proportional to the standard deviation of $Y$ in the stratum, and inversely proportional to the cost of sampling from the stratum. In math, the Neyman allocation rule is \[n_k\propto N_k\sigma_k/c_k\] You will probably not be eaten by a grue http://notstatschat.netlify.com/2020/10/30/you-will-probably-not-be-eaten-by-a-grue/ Fri, 30 Oct 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/10/30/you-will-probably-not-be-eaten-by-a-grue/ Michael Droste tweeted Should I use Stata, R, Matlab, Julia, etc etc for my research? What #econtwitter WON’T tell you is that all of these share a fatal flaw: you can’t play Oregon Trail on them… … At least, until now! Now you can play Oregon Trail (1978) in Stata. That’s clearly a challenge that any sensible R programmer would ignore. However, it does fit into an idea I’ve been thinking about for a while. When the sky didn't fall http://notstatschat.netlify.com/2020/09/26/when-the-sky-didn-t-fall/ Sat, 26 Sep 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/09/26/when-the-sky-didn-t-fall/ I’m hearing that, in contrast to basically every other political question ever, there might be some benefit in boring middle-aged white men saying what they think about the cannabis referendum. So: Vote Yes. The status quo is terrible. I’m not writing from personal experience here – as a relatively asocial and cautious nerd, I didn’t even get the opportunity to turn it down until I was 21, in the US as an exchange student. 
MOAR survey regression models http://notstatschat.netlify.com/2020/09/24/moar-survey-regression-models/ Thu, 24 Sep 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/09/24/moar-survey-regression-models/ The VGAM package’s vglm() function, like the survey package’s svymle() function, allows for maximum likelihood fitting where linear predictors are added to one or more parameters of a distribution — but vglm() is a lot faster and has many distributions already built in. So, I stuck a complex-sampling interface on the front of it, and made svyVGAM. It’s on github at the moment; I’m hoping to get it on CRAN soon. A Bayesian t-test? http://notstatschat.netlify.com/2020/09/17/a-bayesian-t-test/ Thu, 17 Sep 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/09/17/a-bayesian-t-test/ “How do you know the treatment has the same effect on everyone when you don’t even know whether it makes the outcome go up or down?” –Scott S. Emerson (possibly a paraphrase) What’s the Bayesian equivalent of a basic two-sample $t$-test? This looks like an easy question: $X_i\sim N(\mu_x,\sigma^2)$, $Y_i\sim N(\mu_y,\sigma^2)$, independent flat priors on $\mu_x$ and $\mu_y$ and some sort of reference prior on $\sigma^2$. Back 100 years ago that would have been a good answer (over and above being a revolutionary question). Weights in statistics http://notstatschat.netlify.com/2020/08/04/weights-in-statistics/ Tue, 04 Aug 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/08/04/weights-in-statistics/ There are roughly three and half distinct uses of the term weights in statistical methodology, and it’s a problem for software documentation and software development. Here, I want to distinguish the different uses and clarify when the differences are a problem. I also want to talk about the settings where we know how to use these sorts of weights, and the ones where we don’t. 
In the interests of doing one thing at a time, I’m going to assume the weights are the right weights and you do actually want to use them; we can have the other discussion some other time. Sourdough happens http://notstatschat.netlify.com/2020/06/23/sourdough-happens/ Tue, 23 Jun 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/06/23/sourdough-happens/ Ok, so everyone and their companion animal is baking sourdough at the moment. I’m still hoping for a bit of comparative advantage for a lazy and woo-free account. I like rye bread, and there are well-established biochemical reasons that it’s hard to make good rye bread with baker’s yeast – in the pH range where yeast grows well, rye amylases mess with the dough structure. Sourdough starter You don’t need any special “wild yeasts” in a sourdough starter – the yeasts are already present on the grains and in the flour. New in the survey package http://notstatschat.netlify.com/2020/04/03/new-in-the-survey-package/ Fri, 03 Apr 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/04/03/new-in-the-survey-package/ Version 4.0 of the survey package is on its way to CRAN. There are two main updates, which improve the estimation of contrasts First, a couple of improvements to the handling of replicates. When svycontrast is used on an object that includes replicate estimates, the estimates will now be transformed and then used to estimate a variance, rather than using the delta method. I think that’s the right thing to do, though you might also want to compute a confidence interval on the original scale and transform the interval. Changing strata mid-stream http://notstatschat.netlify.com/2020/03/27/changing-strata-mid-stream/ Fri, 27 Mar 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/03/27/changing-strata-mid-stream/ The problem We want to estimate the population (or cohort) total of a variable $X$ (actually, we don’t, we want to fit a regression model, but this part of the maths is the same). 
We’ve got some variables $Z$ that we know for everyone, and we want to do clever sampling. Thanks to Neyman, we know that if we divide the population into strata of size $N_k$ using $Z$, the optimal sample size $n_k$ in each stratum is proportional to $N_k\sigma_k$, where $\sigma_k$ is the standard deviation of $X$ in the stratum. Mapping NZ cases of COVID-19 http://notstatschat.netlify.com/2020/03/26/mapping-nz-cases-of-covid-19/ Thu, 26 Mar 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/03/26/mapping-nz-cases-of-covid-19/ The NZ Ministry of Health has been releasing information about COVID-19 cases, and Chris Knox (of the NZ Herald) has been collating it on github. Here, I’m going to map it by District Health Board, using the DHBins package. If you’re following along at home, you need the development version of the package, from github. library(DHBins) ## Loading required package: ggplot2 library(readxl) case_r<-read_xlsx("~/nz-covid19-data/data/dhb-cases.xlsx") case_r<-subset(case_r, `Case Status`=="Confirmed cases" & DHB!="Total") Next, we need population data. Not cross buns http://notstatschat.netlify.com/2020/03/23/not-cross-buns/ Mon, 23 Mar 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/03/23/not-cross-buns/ These are not the standard supermarket hot cross bun, which tends to a cinnamon and sugar flavour profile and a soft, moist crumb. They are denser, chewier, less sweet, and spicier. They go well with strongly flavoured cheese (mature cheddar, oude kaas, washed-rind cheeses). If you want the cross, a good icing mixture is lemon or lime juice and as much icing sugar as you can stir in. A dash of rose water or orange flower water is a nice addition. 
Quadratic trend tests in survey package http://notstatschat.netlify.com/2020/02/28/quadratic-trend-tests-in-survey-package/ Fri, 28 Feb 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/02/28/quadratic-trend-tests-in-survey-package/ Josh De La Rosa wanted to know how to do quadratic trend tests such as these in the R survey package. If you can’t read the image, the code extracts a set of means for years from 2005 to 2014 and then tests a particular linear contrast of them, with coefficients c(6, 2, -1, -3, -4, -4, -3, -1, 2, 6) I thought this would be easy: just do a linear model with a quadratic in year, and test the quadratic term. The Ihaka Lectures, Episode 4 http://notstatschat.netlify.com/2020/02/19/the-ihaka-lectures-episode-4/ Wed, 19 Feb 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/02/19/the-ihaka-lectures-episode-4/ The Ihaka Lectures in computationally-oriented statistics are back! This year the theme area is social science and policy. We’ve got three great speakers: Simon Jackman is a political scientist. He’s currently CEO of the United States Studies Centre at Sydney University. He was previously a Professor of Political Science and Statistics at Stanford. He’s interested in the increasingly-difficult area of empirical public opinion research, and will talk about recent successes and failures of predictive models of election outcomes. Survey package news http://notstatschat.netlify.com/2020/01/22/survey-package-news/ Wed, 22 Jan 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/01/22/survey-package-news/ Version 3.37 of the survey package is on CRAN now. New features svyquantile now takes account of design degrees of freedom in computing confidence intervals and in turning confidence intervals into standard error estimates. This means results will change (slightly, and for the better). svyivreg for two-stage least squares with instrumental variables. 
(described here) withPV for ‘plausible value’ analyses now supports replicate-weight designs. ‘Plausible values’ are how education people describe multiple imputation. Multifactor interventions and interactions http://notstatschat.netlify.com/2020/01/09/most-multifactor-intervetions-have-interactions/ Thu, 09 Jan 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/01/09/most-multifactor-intervetions-have-interactions/ The Multiphase Optimisation Strategy for designing multifactor behavioural interventions should be used more. The idea is that you have a lot of potentially good ideas for things that might work, alone or in combination. You don’t want to test them one at a time, because that takes forever. You don’t want to test all against none, because they might not all be compatible, and in any case you don’t want to be stuck doing them all if you don’t need to. Computer says no http://notstatschat.netlify.com/2019/12/30/computer-says-no/ Mon, 30 Dec 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/12/30/computer-says-no/ This was my reply to the call for comments on the proposed Algorithms Charter Introduction I am a statistician with an interest in the technical details of predictive models and in their social impact. I am interested from the points of view of a researcher, a university teacher, a science communicator, and an immigrant. The proposal The key points in the proposal are laid out as a set of bullets What is 'Data Science Practice'? http://notstatschat.netlify.com/2019/12/25/what-is-data-science-practice/ Wed, 25 Dec 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/12/25/what-is-data-science-practice/ Two years ago I started a 3rd-year (final-year) undergraduate course called ‘Data Science Practice’1. 
It’s the main new course in our undergraduate Data Science major – we already had a couple of courses in statistical computing, and already used R Markdown starting in second-year, and the Computer Science department teaches algorithms and data structures, and database theory and so on. This year, the course will be taught without me – I’m teaching a postgraduate translation of it. How many giraffes? http://notstatschat.netlify.com/2019/12/01/how-many-giraffes/ Sun, 01 Dec 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/12/01/how-many-giraffes/ Since it’s, if not Christmas, at least Advent, here’s a book review. I’ve been following Janelle Shane’s blog for years. Her book You Look Like a Thing and I Love You came out a few weeks ago; I bought it as my post-exam-grading reward. The blog is a series of examples of surrealist comedy from neural networks either getting things wrong (hallucinating giraffes in every photo) or generating text (the book title was one of the best neural-net pick-up lines). Hexmaps for NZ District Health Boards http://notstatschat.netlify.com/2019/11/07/hexmaps-for-nz-district-health-boards/ Thu, 07 Nov 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/11/07/hexmaps-for-nz-district-health-boards/ I’m involved in a research project that, among other things, will be comparing various health variables across NZ District Health Boards (DHBs). In order to make the outputs less boring and (hopefully!) more interpretable, I want some maps. This post is about ‘DHBins’, a set of hexmaps vaguely analogous to the square ‘statebins’ for US states. The code is in the DHBins package. I’ll illustrate with some data on immunisation coverage in NZ kids. 
Some things I don’t like about the Oxford-Munich Code of Conduct http://notstatschat.netlify.com/2019/10/01/some-things-i-don-t-like-about-the-oxford-munich-code-of-conduct/ Tue, 01 Oct 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/10/01/some-things-i-don-t-like-about-the-oxford-munich-code-of-conduct/ The Oxford-Munich Code of Conduct for Professional Data Scientists (http://www.code-of-ethics.org/code-of-conduct/) is worth reading. It’s fairly detailed and has some good features. There are also things I don’t like about it, which are why I didn’t include it in my Data Science Practice course. It’s a bit inconsistent in style at the moment, but (a) it’s a draft under development and (b) I may not have the moral high ground on this point, so that’s not what I’m complaining about. How to review a book http://notstatschat.netlify.com/2019/09/13/how-to-review-a-book/ Fri, 13 Sep 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/09/13/how-to-review-a-book/ The old-fashioned1 way to review a book, such as, say, Randall Munroe’s how to, involves reading it. On the positive side, you can learn about The Effects of Nuclear Explosions on Commercially-Packaged Beverages2 and find out what Colonel Chris Hadfield thinks should be treated as “a big angry hang glider”3. On the negative side, there are all those words and pictures and footnotes and index entries you have to read. (What’s up with the brackets?) http://notstatschat.netlify.com/2019/09/10/what-s-up-with-the-brackets/ Tue, 10 Sep 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/09/10/what-s-up-with-the-brackets/ In various places, from the R help pages to books to course materials, you see R code like (x = runif(10, 0, 10)) ## [1] 1.610466 2.662462 1.517036 1.372483 4.272460 6.402148 8.347196 ## [8] 8.775537 1.863104 8.912241 which displays the value of x. Without the brackets, it doesn’t. Harkanwal Singh, on Twitter, said “I would like to know more”. 
So, in case he’s not the only one, this is what’s going on. Why isn't rimu tidy? http://notstatschat.netlify.com/2019/09/10/why-isn-t-rimu-tidy/ Tue, 10 Sep 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/09/10/why-isn-t-rimu-tidy/ The rimu package, which I published last week, does not use the tidyverse. The operations that I do on multiple-response data would be easy using dplyr or purrrrrr with the data in long form: all responses stacked. The problem is that dplyr and rlang are not automatically type-safe for this sort of multiple-response data. It seems to be easier to define a multicolumn S3 class, which can then be put into a single column of a data frame, eg A package for multiple-response data http://notstatschat.netlify.com/2019/09/05/a-package-for-multiple-response-data/ Thu, 05 Sep 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/09/05/a-package-for-multiple-response-data/ Multiple-response data is like factor data, except that you can be in more than one category. Examples include what is your ethnicity? (or, in the US, race/ethnicity, sigh) what social media do you use? what countries have you been to? what birds did you see this week? I have the first version of a package to manipulate this sort of data, called rimu. The name stands for responses in multiplex, but rimu is also the name of a New Zealand tree, Dacrydium cupressinum, an attractive conifer with reddish wood. Adding new functions to the survey package http://notstatschat.netlify.com/2019/07/16/adding-new-functions-to-the-survey-package/ Tue, 16 Jul 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/07/16/adding-new-functions-to-the-survey-package/ I had an email request about making ivreg work with the survey package. That’s AER::ivreg, which does two-stage least-squares estimation with instrumental variables. 
The steps are See if it accepts weights and does the right thing for point estimation If so, work out how to get the complex-survey variances Test to make sure it’s getting the right answer In this case, the first step was fairly straightforward. The function accepts weights and passes them to lm. Denominator degrees of freedom in svyglm http://notstatschat.netlify.com/2019/06/26/denominator-degrees-of-freedom-in-svyglm/ Wed, 26 Jun 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/06/26/denominator-degrees-of-freedom-in-svyglm/ Attention Conservation Notice: This is a working note; when I understand it better, there will be changes in the survey package. The design degrees of freedom for a stratified, clustered design with $M$ clusters and $H$ strata is $d=M-H$. This is a straightforward definition, since the Horvitz–Thompson variance estimator for a mean or total is a variance of $M$ cluster summaries after subtracting off $H$ stratum means. While the definition is only straightforward for single-stage designs, the public-use versions of nearly all surveys are analysed as if they were single-stage designs. Wald, score, LRT: the picture http://notstatschat.netlify.com/2019/06/20/wald-score-lrt-the-picture/ Thu, 20 Jun 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/06/20/wald-score-lrt-the-picture/ One issue in teaching generalised linear models (or likelihood theory) is the relationship between the Wald, score, and likelihood ratio tests. I have a picture. Let’s make up a score function $U(\theta)$, in this case for a trivial binomial model, and draw it. 
logit <-function(p) log(p/(1-p)) expit <-function(x) exp(x)/(1+exp(x)) U<-function(theta) 11/12-expit(theta) thetahat<-logit(11/12) curve( U(x),from=0, to =3, xlab=expression(theta),ylab=expression(U(theta))) abline(h=0,lty=2) abline(v=0,lty=2) The likelihood ratio statistic is twice the area under the curve \[2(\ell(\hat\theta)-\ell(0))= 2 \int_0^{\hat\theta}U(\theta)\,d\theta\] Analysing the mouse microbiome autism data http://notstatschat.netlify.com/2019/06/16/analysing-the-mouse-autism-data/ Sun, 16 Jun 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/06/16/analysing-the-mouse-autism-data/ The issue A paper reporting the induction of autism-type behaviour in mice by fecal microbiome transplants from humans was recently published in Cell. Some people on Twitter were discussing subplots E, F, and G of Figure 1, which report behavioral comparisons of the mice between fecal donors with and without ASD. The expressed view on Twitter was that the plots weren’t consistent with the $p$-values given. They didn’t entirely need to be, since the $p$-values weren’t from a simple two-group comparison, but even taking that into account I was surprised. Confidence intervals: not a very strong property http://notstatschat.netlify.com/2019/06/11/confidence-intervals-not-a-very-strong-property/ Tue, 11 Jun 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/06/11/confidence-intervals-not-a-very-strong-property/ It’s important for non-Bayesians (or non-exclusively-Bayesians) to remember that being a 95% confidence interval procedure is a fairly weak property. It’s not that confidence intervals are necessarily bad, but if they aren’t, it’s because of other requirements. As an extreme case, consider the all-purpose data-free exact confidence interval procedure for any real quantity: roll a d20 and set the confidence interval to be the empty set if you roll 20, and otherwise to be $\mathbb{R}$. 
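The d20 procedure is easy to check by simulation. The following is a minimal sketch, not code from the post: the true value `theta`, the seed, and the number of rolls are all arbitrary choices made here for illustration, and the empty set is represented (as an assumption of this sketch) by an interval with lower bound `Inf` and upper bound `-Inf`.

```r
# Simulate the data-free "d20" confidence interval procedure:
# the empty set when the die shows 20, the whole real line otherwise.
set.seed(2019)                        # arbitrary seed, for reproducibility
theta <- 3.7                          # hypothetical true value; any real number gives the same answer
rolls <- sample(1:20, 1e5, replace = TRUE)

# Represent the real line as (-Inf, Inf) and the empty set as (Inf, -Inf)
lower <- ifelse(rolls == 20, Inf, -Inf)
upper <- ifelse(rolls == 20, -Inf, Inf)

# An interval covers theta when theta lies strictly between its bounds
covers <- lower < theta & theta < upper
coverage <- mean(covers)
coverage                              # should be close to 19/20
```

The coverage is 95% for every value of `theta`, even though no interval produced by the procedure says anything about the quantity of interest.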
Design degrees of freedom: brief note http://notstatschat.netlify.com/2019/06/08/design-degrees-of-freedom-brief-note/ Sat, 08 Jun 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/06/08/design-degrees-of-freedom-brief-note/ An important concept in multistage survey analysis is the design degrees of freedom, which describes (or estimates) how many independent observations go into calculating variances, in a similar way to error degrees of freedom in experimental designs. In straightforward multistage designs the design df is the number of primary sampling units minus the number of strata, because each PSU provides data to supply a degree of freedom and each stratum implies a constraint that removes a degree of freedom. Mean People Tweet http://notstatschat.netlify.com/2019/05/24/mean-people-tweet/ Fri, 24 May 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/05/24/mean-people-tweet/ There is discussion from time to time on Kiwi Twitter about which public figures get treated worse on Twitter. Eric Crampton suggested that it would be easy to answer this question empirically, by analysing tweet sentiment. I wasn’t convinced, but I tried it. This post is about what I found. First, we need some way of classifying sentiment. I’ve got lists of about 2000 positive and 5000 negative words, collected by Bing Liu. The Reeferendum http://notstatschat.netlify.com/2019/05/07/the-reeferendum/ Tue, 07 May 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/05/07/the-reeferendum/ Elections and opinion polls look a bit similar from a distance, but they’re very different beasts. An election is a decision mechanism, an opinion poll is an estimation procedure. If you want an election, the Electoral Commission does an excellent job; if you want an opinion survey, you might try Colmar Brunton or Stats NZ. Binding referendums1, in the New Zealand model, are like elections. 
There’s a proposed change in the law, which we hope has been carefully drafted, put out for public comment, and the whole bit, and which is then put up for vote. Local asymptotic minimax, and nearly-true models http://notstatschat.netlify.com/2019/04/30/regularity-local-minimax-and-nearly-true-models/ Tue, 30 Apr 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/04/30/regularity-local-minimax-and-nearly-true-models/ I’ve written a bunch of times about nearly-true models. The idea is that you have some regression model for $Y|X$ you’re trying to fit with data from a two-phase sample with known sampling probabilities $\pi_i$ for individual $i$. You know $Y$ and some auxiliary variables $A$ for everyone, but you know $X$ only for the subsample. If you had complete data, you’d fit a particular parametric model for $Y|X$, with parameters $\theta$ you’re interested in and nuisance parameters $\eta$, call it $P_{\theta,\eta}$. Survey package update http://notstatschat.netlify.com/2019/04/28/survey-package-update/ Sun, 28 Apr 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/04/28/survey-package-update/ Version 3.36 of the survey package and version 2.4 of mitools are up on CRAN. There’s one notable new feature in both of them: handling ‘plausible values’, where you have some sets of multiply-imputed variables just as additional columns in a largely non-imputed data set. There are two implementations behind withPV, controlled by the rewrite= option. You have variables PV1MATH, PV2MATH,…,PV5MATH and some code with a variable maths that you want to run with maths being each of the plausible values in turn. 
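To make the idea of running the same code with maths set to each plausible value in turn concrete, here is a rough base-R sketch. It is not how withPV is implemented and it ignores the survey design entirely; the simulated data, the covariate `ses`, and the use of `lm` are all assumptions made for illustration. The results are pooled with the usual multiple-imputation (Rubin) combining rules.

```r
# Hypothetical data: one covariate and five plausible values for a maths score
set.seed(1)
n <- 200
dat <- data.frame(ses = rnorm(n))
latent_maths <- 500 + 30 * dat$ses + rnorm(n, sd = 50)
for (k in 1:5) {
  dat[[paste0("PV", k, "MATH")]] <- latent_maths + rnorm(n, sd = 20)
}

# Run the same model with `maths` standing in for each plausible value in turn
fits <- lapply(1:5, function(k) {
  dat$maths <- dat[[paste0("PV", k, "MATH")]]   # modifies a local copy only
  lm(maths ~ ses, data = dat)
})

# Combine: average the estimates; total variance is the average within-fit
# variance plus (1 + 1/m) times the between-fit variance
est  <- sapply(fits, function(f) coef(f)[["ses"]])
vars <- sapply(fits, function(f) vcov(f)["ses", "ses"])
m <- length(fits)
pooled_est <- mean(est)
pooled_var <- mean(vars) + (1 + 1 / m) * var(est)
c(estimate = pooled_est, se = sqrt(pooled_var))
```

With real survey data the model fit inside the loop would be a design-based one, and the combining step would be left to the package.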
That’s for remembrance http://notstatschat.netlify.com/2019/04/24/that-s-for-remembrance/ Wed, 24 Apr 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/04/24/that-s-for-remembrance/ Rosmarinus officinalis, by Samules on Pixabay April 25th is not the anniversary of a victory, or an armistice, or a successful retreat like Dunkirk, or even a tragic last stand like Masada. It was the sort of badly-planned, poorly-executed debacle that typified the Great War. The day is remembered because it was the first day that large numbers of soldiers from Australia and New Zealand were killed. April 25th is an ideal setting for a military commemoration. Handling ‘plausible values’ in surveys http://notstatschat.netlify.com/2019/04/21/handling-plausible-values-in-surveys/ Sun, 21 Apr 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/04/21/handling-plausible-values-in-surveys/ Surveys (especially educational surveys) have a thing called ‘plausible values’, which are a form of multiple imputation, only by design rather than because of non-response. To reduce effort, not everyone answers every question. Often, there are a lot of variables that don’t need imputing, and a few that do. The data example I showed in the last post, for mixed models, has five plausible values for the maths score. I only used PV1MATH. Progress on linear mixed models for surveys http://notstatschat.netlify.com/2019/04/19/progress-on-linear-mixed-models-for-surveys/ Fri, 19 Apr 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/04/19/progress-on-linear-mixed-models-for-surveys/ In our last episode, we worried about the penalised least squares criterion for linear mixed models. The linear mixed model is \[Y=X\beta+Zb+\epsilon\] where $b\sim N(0, \sigma^2V_\theta)$ and $\epsilon\sim N(0,\sigma^2)$, and where $\theta$ are variance parameters. It’s convenient to write $b=\Lambda_\theta u$ for iid standard Normal $u$, where $\Lambda_\theta$ is then a square root of $V_\theta$. 
The penalised least squares approach says that for given $\theta$, we choose $b$ and $\beta$ to minimise \[\|Y-X\beta-Zb\|_2^2+\|u\|_2^2.\] Hypergraph network meta-analysis http://notstatschat.netlify.com/2019/03/26/hypergraph-network-meta-analysis/ Tue, 26 Mar 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/03/26/hypergraph-network-meta-analysis/ ‘Network meta-analysis’ is the only term I’ve coined that’s actually entered the general biostatistical vocabulary. In network meta-analysis we work with a network of randomised trials, where the nodes are interventions and the edges are trials. A single edge represents a direct randomised comparison; a path represents an indirect (but unconfounded) estimate from multiple trials; we can combine multiple paths between interventions using a linear model. For example, if $\log RR_{ij,k}$ is the log relative risk in the $k$th trial of treatments $i$ and $j$, and it has variance $\sigma^2_{ij,k}$ \[\log RR_{ij,k} = \beta_i-\beta_j+\epsilon_{ij,k}\] where $\epsilon_{ij,k}\sim N(0,\sigma^2_{ij,k})$. The school climate strike http://notstatschat.netlify.com/2019/03/12/the-school-climate-strike/ Tue, 12 Mar 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/03/12/the-school-climate-strike/ I have seen the school ‘climate strike’ in NZ being described as a publicity stunt that won’t provide any real solutions. No shit. That’s not a criticism, any more than it’s a criticism of Shaun Hendy’s no-fly-year in 2018. The point of public protest — everything from ‘peaceably assemble and petition the government’ to dumping manure on the streets of Paris — isn’t to solve a problem, it’s to get the problem on the agenda. Normal horizontiles http://notstatschat.netlify.com/2019/03/04/normal-horizontiles/ Mon, 04 Mar 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/03/04/normal-horizontiles/ From XKCD today How can we check this calculation? First, we need to know where the lines are on the y-axis. 
They are separated by 52.7% of the height of the distribution, and look as if they are meant to exclude the same height above and below. We don’t need to worry about the mean of the distribution (obviously), or the scale (less obviously). The reason the scale is not needed is that rescaling the x axis shrinks all three of the areas under the curve by the same factor, and since they add up to 1, they stay the same. Displaying bus punctuality http://notstatschat.netlify.com/2019/03/01/displaying-bus-punctuality/ Fri, 01 Mar 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/03/01/displaying-bus-punctuality/ A couple of years ago, I stored a lot of Auckland bus location data for what was going to be a news story. It’s about time I did something with it, so I’m updating the analysis and I’ll be using it as a class example. The data come from the Auckland Transport real-time API, for which Auckland Transport should be congratulated. Anyone can get an API key and use the data. Absolutely no warranty? http://notstatschat.netlify.com/2019/02/18/absolutely-no-warranty/ Mon, 18 Feb 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/02/18/absolutely-no-warranty/ Someone on Twitter (who I’m reluctant to out, considering) said they’d got reviews back from an epidemiology journal, and one reviewer had cautioned against the use of R, because “the opening sentence in an R output points out that basically you use it at your own risk and the contributors to R are not accountable for any errors.” It’s been a while since I’ve seen this one, which is a clear symptom of not having read licence agreements for other statistical software. What have I got against the Shapiro-Wilk test? 
http://notstatschat.netlify.com/2019/02/09/what-have-i-got-against-the-shapiro-wilk-test/ Sat, 09 Feb 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/02/09/what-have-i-got-against-the-shapiro-wilk-test/ The Shapiro-Wilk test is a test of the null hypothesis that data come from a Normal distribution, with power against a wide range of alternatives. So what do I have against it? Well, to start with, it’s a test of the null hypothesis that data come from a Normal distribution, with power against a wide range of alternatives. There are two reasons you might want a test of the hypothesis that data come from a particular distribution $P$ or a particular set of distributions (ie, model) ${\cal P}_\theta$. How do you tell what packages to trust? http://notstatschat.netlify.com/2019/02/04/how-do-you-tell-what-packages-to-trust/ Mon, 04 Feb 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/02/04/how-do-you-tell-what-packages-to-trust/ I’m not on that list, but I do have reckons anyway. First, there are really two questions: is the method useful to you, and is the implementation doing it the way you want? The first question is important. There’s no benefit in having a well-coded implementation of the Shapiro-Wilk test unless you have a good reason to use the Shapiro-Wilk test1. The first question is not a coding question, but the answer is similar to the answer for the coding question. Recognising when you don’t know http://notstatschat.netlify.com/2019/02/01/recognising-when-you-don-t-know/ Fri, 01 Feb 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/02/01/recognising-when-you-don-t-know/ Saying that you don’t know something can be hard for people; it’s also hard for prediction algorithms. There’s an example in the xgboost package for R involving the classification of mushrooms. The goal is to use information about the appearance of the mushrooms to decide if they are edible or not. 
It’s clear that this is a machine learning problem rather than a data science problem, because the version of the data in the xgboost package doesn’t say which output value means ‘edible’ and which one means ‘inedible’. Two quick survey items http://notstatschat.netlify.com/2019/01/26/two-quick-survey-items/ Sat, 26 Jan 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/01/26/two-quick-survey-items/ Can we invent the case-control design? Classical survey analysis is about means and totals, and the way to adapt it to more interesting parameters is to write the parameter as the mean of its influence functions (delta-betas, jackknife values, etc) Suppose we knew for everyone in a population (maybe an HMO) whether they had a disease ($Y=1$) or didn’t ($Y=0$) and we wanted to take a sample, measure a variable $X$, and do logistic regression. Another way to see why mixed models in survey data are hard: http://notstatschat.netlify.com/2019/01/18/another-way-to-see-why-mixed-models-in-survey-data-are-hard/ Fri, 18 Jan 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/01/18/another-way-to-see-why-mixed-models-in-survey-data-are-hard/ Suppose you have a (potentially unequal-probability) sample of schools, and within each school a (potentially unequal-probability) sample of students, and you want to fit a linear mixed model. In fact, let’s take the brutally simple example of a random intercept model: \[Y_{ij}=X_{ij}\beta+b_i+e_{ij}\] where $b_i\sim N(0,\tau^2)$. With population data, the penalised least squares formulation of this model (which Doug Bates likes) involves minimising \[\sum_i\sum_j (y_{ij}-\hat y_{ij})^2+\sum_i u_i^2\] where $u_i=b_i/\tau$. You can use the EM algorithm (if you have all week) or you can rewrite as a least-squares problem in augmented data; right now I don’t care how you do it. 
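For the record, the augmented-data rewrite is short when $\tau$ is taken as known. The sketch below uses simulated data and is only an illustration: the group sizes, the value of `tau`, and the variable names are assumptions made here, the residual standard deviation is set to 1 so that `tau` is the relative standard deviation, and the weighting that makes the survey case hard is left out completely. It stacks an identity block under $[X, \tau Z]$ so that ordinary least squares on the stacked data minimises the penalised criterion.

```r
# Simulated random-intercept data (all values are hypothetical)
set.seed(42)
n_groups <- 10
n_per <- 5
g <- rep(seq_len(n_groups), each = n_per)       # group (school) indicator
x <- rnorm(n_groups * n_per)
tau <- 1.5                                       # treated as known in this sketch
y <- 1 + 2 * x + rnorm(n_groups, sd = tau)[g] + rnorm(length(g))

X <- cbind(1, x)                                 # fixed-effects design matrix
Z <- model.matrix(~ factor(g) - 1)               # group indicator matrix

# Augmented data: residual rows on top, penalty rows (0, I) underneath,
# so the stacked sum of squares is ||y - X beta - tau Z u||^2 + ||u||^2
X_aug <- rbind(cbind(X, tau * Z),
               cbind(matrix(0, n_groups, ncol(X)), diag(n_groups)))
y_aug <- c(y, rep(0, n_groups))

sol <- qr.solve(X_aug, y_aug)                    # least-squares solution via QR
beta_hat <- sol[1:2]                             # fixed effects
u_hat <- sol[-(1:2)]                             # standardised random effects
b_hat <- tau * u_hat                             # random intercepts on the original scale
```

In practice $\tau$ is not known, and this solve sits inside an outer optimisation over the variance parameters.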
The Ihaka Lectures 3: Rise of the Machine Learners http://notstatschat.netlify.com/2019/01/11/the-ihaka-lectures-3-rise-of-the-machine-learners/ Fri, 11 Jan 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/01/11/the-ihaka-lectures-3-rise-of-the-machine-learners/ They’re back! On Wednesday evenings in March (and streaming on the internet) the University of Auckland Stats department will again be hosting the Ihaka Lectures. This year the theme is statistical learning/machine learning/predictive algorithms, and we have four speakers Bernhard Pfahringer is Professor of Computer Science at the University of Waikato. He is a member of the Weka project, New Zealand’s other famous open-source data science contribution. He will talk about the design and development of Weka and more recent projects. Bayesian Surprise — the Shiny app http://notstatschat.netlify.com/2019/01/04/bayesian-surprise-the-shiny-app/ Fri, 04 Jan 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/01/04/bayesian-surprise-the-shiny-app/ I wrote a while back about a toy case of the Bayesian surprise problem: what does Bayes Theorem tell you to believe when you get really surprising data. The one-dimensional case is a nice math-stat problem, if you like that sort of thing, but maybe you’d rather have the calculations done for you. Here’s an app The mathematical setup is that you have a prior distribution for a location parameter $\theta$ centered at zero, and you see a data point $x$ that’s a long way from zero. What are packages for? http://notstatschat.netlify.com/2018/12/17/what-are-packages-for/ Mon, 17 Dec 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/12/17/what-are-packages-for/ It’s an interesting question, but the implication of wasted depends not just on the actual statistics about package survival (which we’ll get to), but on why people write packages. And, I suppose, on why they should write packages. 
One reason people write packages is to improve other people’s data analysis, certainly. But it’s not the only reason, nor should it be. People write packages to provide reference implementations of new statistical methods. svycontrast http://notstatschat.netlify.com/2018/12/10/svycontrast/ Mon, 10 Dec 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/12/10/svycontrast/ I got asked for more detail about the svycontrast() function, so I thought I’d post it here too. The function is related to the CONTRASTS you get in SAS, but focused on estimation rather than testing. The input to svycontrast() is a $p$-vector of estimates $\hat\theta$ (which I’ll consider as a column vector) and an estimated $p\times p$ covariance matrix $\hat\Xi$ There are two main cases: Linear Given a $p$-vector of coefficients $b$, the function computes $b^T\hat\theta$ and $b^T\hat\Xi b$. Finding principal components without even looking? http://notstatschat.netlify.com/2018/11/26/finding-principal-components-without-even-looking/ Mon, 26 Nov 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/11/26/finding-principal-components-without-even-looking/ Via Scott Aaronson’s blog I found an arXiv abstract and then an early paper (PDF) about doing singular value decomposition of an $m\times n$ matrix in less than $O(mn)$ time. That is, you could estimate population structure with principal components of a genotype matrix or work out tail probabilities for a quadratic-form-based test in less time than it takes to actually look at the data. That’s obviously impossible, and so that’s not what the paper actually says. Come work with us http://notstatschat.netlify.com/2018/11/04/come-work-with-us/ Sun, 04 Nov 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/11/04/come-work-with-us/ We have four academic positions open in the Department of Statistics. Come work with us! 
(j.e.mcgowan on Flickr, CC-BY) There are three standard academic positions, at lecturer, senior lecturer, and associate professor level. In American these would approximately translate as hard-money tenure-track assistant, associate, and full professor. There is also one position for a professional teaching fellow – a full-time, permanent job that focuses on teaching. (Salman Javed on Flickr, CC-BY-SA) Progress on svy2lme http://notstatschat.netlify.com/2018/10/19/progress-on-svy2lme/ Fri, 19 Oct 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/10/19/progress-on-svy2lme/ The svy2lme package for linear mixed models under complex sampling may still contain nuts, but at least the user interface has settled down and it gives plausible answers for some toy examples. The recent change is to compute pairwise sampling probabilities from a survey design object, rather than some horrible set of separate specifications. It still doesn’t support complicated PPS designs, but since the survey package does, that should be feasible. Survey package update http://notstatschat.netlify.com/2018/10/12/survey-package-update/ Fri, 12 Oct 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/10/12/survey-package-update/ There’s a new version of the survey package on CRAN, version 3.34. Mostly this is bug fixes and minor enhancements accumulated over rather too long since the last update. There are a couple of things worth noting specifically, though. The first is a change to svyglm with replicate weights. When fitting generalised linear models with large weights (eg from US national surveys), you can run into numerical instabilities. I’ve handled this for a long time by rescaling the weights inside svyglm. The Kiwi PRNG http://notstatschat.netlify.com/2018/10/04/the-kiwi-prng/ Thu, 04 Oct 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/10/04/the-kiwi-prng/ As I’ve written before, New Zealand has a National Pseudo-Random Number Generator. 
It’s kept in Part 5 of Schedule 1A to the Local Electoral Regulations, Clauses 41-48. And there’s a bug in it. The generator is obviously intended to be the Wichmann-Hill generator, perhaps because of the paper by Hill, Wichmann, and Woodall (1987) presenting a Pascal program to count Single Transferable Vote. Translating the regulations to code gives How to write a racist AI in R without really trying http://notstatschat.netlify.com/2018/09/27/how-to-write-a-racist-ai-in-r-without-really-trying/ Thu, 27 Sep 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/09/27/how-to-write-a-racist-ai-in-r-without-really-trying/ Last year, Robyn Speer wrote a really great post How to make a racist AI without really trying. Go read it. The idea is to do sentiment analysis with obvious, off-the-shelf tools. As the post says So that’s what we’re going to do here, following the path of least resistance at every step, obtaining a classifier that should look very familiar to anyone involved in current NLP. The original post used Python and I’m teaching an undergraduate data science course using R at the moment, so I wanted an R version. Journalism and cyber-bullying http://notstatschat.netlify.com/2018/09/11/journalism-and-cyber-bullying/ Tue, 11 Sep 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/09/11/journalism-and-cyber-bullying/ Newsroom, an online-only New Zealand news site, has written a series of stories critical of Sir Ray Avery and his R&D efforts in medical devices. According to the most recent story, Sir Ray is attempting to use the Harmful Digital Communications Act to get these stories removed Avery has told Netsafe, the legal agent for considering complaints under the Act, the reports have caused him serious emotional distress and amount to a form of digital harm - and wants Newsroom to consider removing them and to agree not to write further news stories about him. What can data science add to statistics education? 
http://notstatschat.netlify.com/2018/08/28/what-can-data-science-add-to-statistics-education/ Tue, 28 Aug 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/08/28/what-can-data-science-add-to-statistics-education/ (for Deborah Nolan and Louise Ryan, ISCB/ASC 2018, after Henry Reed) Today we have naming of stats. Yesterday We had assumptions. And tomorrow morning We shall have testing of assumptions. But to-day Today we have naming of stats. Data sparkles and flashes through all of the students’ phones. And today we have naming of stats. This is the rank-sum Wilcoxon test. And this is the one-sample Wilcoxon test, whose use you will see ISCB/ASC talk http://notstatschat.netlify.com/2018/08/26/iscb-asc-talk/ Sun, 26 Aug 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/08/26/iscb-asc-talk/ Notes and references for my keynote at the joint International Society for Clinical Biostatistics and Australian Statistical Conference: Abstracts from the JSM on imputation and weighting for audit subsets: Shepherd & Giganti; Shaw & Oh Preprint on nearly-true models, and a couple of old blog posts Blog post on relative efficiency of weighted estimation in case-control studies The paper that coined “using the whole cohort” software for multiple imputation on large databases: missForest, MIDAS The relationship between AIPW, calibration of weights, and adjusting for baseline Paper by Peisong Han showing that multiple imputation gives the optimal double-robust estimator (and therefore the optimal weighted estimator when the sampling probabilities are correct) review paper on survey-weighted regression by me and Alastair Scott Photos from unsplash. Leaflet and buses http://notstatschat.netlify.com/2018/08/14/leaflet-and-buses/ Tue, 14 Aug 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/08/14/leaflet-and-buses/ Where are the buses? Wellington’s bus system has been the subject of negative attention in the news and on Twitter. 
Also, I’m teaching a course in Data Science Practice and we’re just getting to a lab on maps with Leaflet. So I thought I’d make a map of Wellington buses and their lateness – people do tend to overestimate problems with public transit, and if they aren’t overestimating it, that’s also important to know. Testing probability distribution generators http://notstatschat.netlify.com/2018/08/01/testing-probability-distribution-generators/ Wed, 01 Aug 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/08/01/testing-probability-distribution-generators/ In the ‘regression tests’ that are part of any change to the base-R source code, there’s a file called p-r-random-tests.R. People notice it from time to time because the tests sometimes fail. That’s what is supposed to happen. Testing random number generators is hard, because it’s hard to specify what the results should be: you need statistics. Fortunately, we have statistics, so it’s not impossible. The random tests check that, eg, pnorm() is not ruled out as the cumulative distribution function of numbers from rnorm(). Quoting and macros in R http://notstatschat.netlify.com/2018/07/30/quoting-and-macros-in-r/ Mon, 30 Jul 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/07/30/quoting-and-macros-in-r/ Miles McBain has a nice post about quoting in R and the tidyeval procedure. In it, there’s this footnote In truth there are other types of calls, and the ones Lisp nuts really bang on about are macro calls In this post I want to talk about the similarities between the tidyversatile approach to quasiquoting and the base-R approach, as an introduction to banging on about macro calls. e-bike: the reboot http://notstatschat.netlify.com/2018/07/17/e-bike-the-reboot/ Tue, 17 Jul 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/07/17/e-bike-the-reboot/ Q: What do you mean, “reboot”? Like, a new e-bike with only a vague brand-name resemblance to the original? A: Pretty much. I’m 200km into a SmartMotion Pacer GT. 
Q: How fast does it go? A: Haven’t we talked about that question? Q: A: Ok. It’s faster than the old one, because it has a wider range of gears. I get up to 45km/h down Manukau Rd toward Royal Oak. Interlingual http://notstatschat.netlify.com/2018/07/11/interlingual/ Wed, 11 Jul 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/07/11/interlingual/ I don’t normally explain jokes, but it’s clear from the useR conference that the name of the new R package reticulate divides people into two groups: amused or bemused. The word reticulate barely exists in modern English. It comes from the Latin reticulum, ‘small net’, the diminutive of rete, net. In NZ and Australian usage, mains water – supplied by a network of pipes – is called a ‘reticulated water supply’. Spell my name with a ‘v' http://notstatschat.netlify.com/2018/06/24/spell-my-name-with-a-v/ Sun, 24 Jun 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/06/24/spell-my-name-with-a-v/ Admittedly, it’s conceivable for parents to make a simple error in assigning a child’s name. In a fictional example from Kerry Greenwood’s ‘Phryne Fisher’ series, the protagonist’s hungover father accidentally named her after Phryne the Greek courtesan rather than Psyche. In the real world, Isaac Asimov’s father incorrectly transliterated the Cyrillic Азимов as ‘Asimov’ rather than ‘Azimov’. By and large, though, the idea that someone is simply incorrect about their name or their child’s name falls under “not even wrong”. Statistical software matters http://notstatschat.netlify.com/2018/06/09/statistical-software-matters/ Sat, 09 Jun 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/06/09/statistical-software-matters/ This is a picture of all the genetic associations found in genome-wide association studies, sorted by chromosome. You can find more detail at the NHGRI GWAS catalog There are two chromosomes with many fewer associations. One is the Y chromosome. There isn’t much there because there isn’t much there. 
The other is the X chromosome. There isn’t much there because GWAS took a lot longer to get started for the X chromosome, and that’s partly for software reasons. Survey analysis in SQL http://notstatschat.netlify.com/2018/06/09/survey-analysis-in-sql/ Sat, 09 Jun 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/06/09/survey-analysis-in-sql/ Charco Hui, as his Honours project in Statistics, has been writing a package for complex-survey analysis using dplyr and dbplyr. It’s here. At the moment it has only been tested with MonetDB, using the github version (0.5.2) of MonetDBlite, but it should work with many other databases (not SQLite, at the moment). I hope it’s still under development: the approach does seem to be useful for large survey data sets – and for smaller data sets the dplyr version is faster than the survey package, though more limited. New blog home http://notstatschat.netlify.com/2018/06/05/new-blog-home/ Tue, 05 Jun 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/06/05/new-blog-home/ One way to prove you can really keep up with the cool kids is to move your blog to GitHub the day after they get bought by Microsoft. I’m not actually worried by that: one of the key features of a git repository is that it doesn’t have the only copy of any of your stuff. The main motivation for switching was to use blogdown rather than Tumblr, because my blog is mostly text. Biased and Inefficient http://notstatschat.netlify.com/about/ Fri, 01 Jun 2018 00:00:00 +0000 http://notstatschat.netlify.com/about/ I’m a statistical researcher in Auckland. This blog is for things that don’t belong on my department’s blog, StatsChat. I also tweet as @tslumley and I’m on Mastodon as @tslumley@wandering.shop Graduation http://notstatschat.netlify.com/2018/05/14/graduation/ Mon, 14 May 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/05/14/graduation/ So, I gave one of the graduation addresses during Silly Hat Week last week. 
It’s not my usual style of writing, but since there’s no point being embarrassed about a speech you’ve already given in front of your boss, your boss’s boss, and 2000 other people, I thought I’d post it here. Chancellor, Vice-Chancellor, Members of Council, fellow Members of the University, Graduands, families and friends: kia ora tātou svylme http://notstatschat.netlify.com/2018/04/01/svylme/ Sun, 01 Apr 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/04/01/svylme/ I’m working on an R package for mixed models under complex sampling. It’s here. At the moment, it only tries to fit two-level linear mixed models to two-stage samples – for example, if you sample schools then students within schools, and want a model with school-level random effects. Also, it’s still experimental and not really tested and may very well contain nuts. The package uses pairwise composite likelihood, because that’s a lot easier to implement efficiently than the other approaches, and because it doesn’t have the problems with nonlinearity and weight scaling. Small p hacking http://notstatschat.netlify.com/2018/03/23/small-p-hacking/ Fri, 23 Mar 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/03/23/small-p-hacking/ The proposal to change p-value thresholds from 0.05 to 0.005 won’t die. I think it’s targeting the wrong question: many studies are too weak in various ways to provide the sort of reliable evidence they want to claim, and the choices available in analysis and publication process eat up too much of that limited information. If you use p-values to decide what to publish, that’s your problem, and that’s what you need to fix. 
Chebyshev’s inequality and `UCL’ http://notstatschat.netlify.com/2018/03/15/chebyshevs-inequality-and-ucl/ Thu, 15 Mar 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/03/15/chebyshevs-inequality-and-ucl/ Chebyshev’s inequality (or any of the other transliterations of Чебышёв) is a simple bound on the proportion of a distribution that can be far from the mean. The Wikipedia page, on the other hand, isn’t simple. I’m hoping this will be more readable. We have a random quantity $X$ with mean $\mu$ and variance $\sigma^2$, and – knowing nothing else – we want to say something about the probability that $X-\mu$ is large. Why pairwise likelihood? http://notstatschat.netlify.com/2018/03/13/why-pairwise-likelihood/ Tue, 13 Mar 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/03/13/why-pairwise-likelihood/ Xudong Huang and I are working on fitting mixed models using pairwise composite likelihood. JNK Rao and various co-workers have done this in the past, but only for the setting where the structure (clusters, etc) in the sampling is the same as in the model. That’s not always true. The example that made me interested in this was genetic analyses from the Hispanic Community Health Survey. The survey is a multistage sample: census block groups and then households. Faster generalised linear models in largeish data http://notstatschat.netlify.com/2018/03/05/faster-generalised-linear-models-in-largeish-data/ Mon, 05 Mar 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/03/05/faster-generalised-linear-models-in-largeish-data/ There basically isn’t an algorithm for generalised linear models that computes the maximum likelihood estimator in a single pass over the $N$ observations in the data. You need to iterate. The bigglm function in the biglm package does the iteration using bounded memory, by reading in the data in chunks, and starting again at the beginning for each iteration. 
That works, but it can be slow, especially if the database server doesn’t communicate that fast with your R process. Useful debugging trick http://notstatschat.netlify.com/2018/01/31/useful-debugging-trick/ Wed, 31 Jan 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/01/31/useful-debugging-trick/ If you have a thing with lots of indices, such as a fourth-order sampling probability $\pi_{ijk\ell}$ (the probability that individuals $i$, $j$, $k$ and $\ell$ are all sampled), there will likely be scenarios where it has lots and lots of symmetries. A useful trick is to write a wrapper that checks them: FourPi<-function(i,j,k,l){ answer <- FourPiInternal(i,j,k,l); sym <- FourPiInternal(j,i,k,l); if (abs((answer-sym)/(answer+sym))>EPSILON) stop(paste(i,j,k,l)); answer } Other useful tricks: The score (derivative of loglikelihood) has mean zero at the true parameters under sampling from the model, even in finite samples. Quite a few design-based variance estimators are unbiased for the sampling variance even in small samples. More tests for survey data http://notstatschat.netlify.com/2018/01/22/more-tests-for-survey-data/ Mon, 22 Jan 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/01/22/more-tests-for-survey-data/ If you know about design-based analysis of survey data, you probably know about the Rao-Scott tests, at least in contingency tables. The tests started off in the 1980s as “ok, people are going to keep doing Pearson $X^2$ tests on estimated population tables, can we work out how to get $p$-values that aren’t ludicrous?” Subsequently, they turned out to have better operating characteristics than the Wald-type tests that were the obvious thing to do – mostly by accident. The Ihaka Lectures http://notstatschat.netlify.com/2018/01/22/the-ihaka-lectures/ Mon, 22 Jan 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/01/22/the-ihaka-lectures/ They’re back! This year our theme is visualisation. 
The lectures will again run on Wednesday evenings in March. The three speakers work in different areas of data visualisation: collect the complete set! Paul Murrell is an Associate Professor in Statistics here in Auckland. He’s a member of the R Core Development Team, and responsible for a lot of graphics infrastructure in R. The ‘grid’ graphics system grew out of his PhD thesis with Ross Ihaka. As far as it goes http://notstatschat.netlify.com/2018/01/20/as-far-as-it-goes/ Sat, 20 Jan 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/01/20/as-far-as-it-goes/ I’ve been reading two somewhat depressing documents today. The American Statistical Association has put out a position paper titled “Overview of Statistics as a Scientific Discipline and Practical Implications for the Evaluation of Faculty Excellence“. It says, in the executive summary Statistics is at the same time a dynamic, stand-alone science with its own core research agenda and an inherently collaborative discipline, developing in response to scientific needs. In this sense, statistics fundamentally differs from many other domain-specific disciplines in science. breakInNamespace http://notstatschat.netlify.com/2018/01/15/breakinnamespace/ Mon, 15 Jan 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/01/15/breakinnamespace/ Attention Conservation Notice: I’m putting this in a blog post in the hope it makes it easier for other people to find when they encounter the problem. The !! and !!! quasiquotation syntax in R’s tidyverse will break if you run them through the parser and deparser. This means: Printing out the code of a function at the command line may give the wrong code Functions like fix(), fixInNamespace(), and edit() may break functions using quasiquotation. 
e-bike-onomics http://notstatschat.netlify.com/2017/12/30/e-bike-onomics/ Sat, 30 Dec 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/12/30/e-bike-onomics/ If you own an e-bike, you get used to certain questions: how fast does it go, how often does it need to be charged, how much did it cost? My e-bike was a bottom-end one two years ago – I didn’t know if I’d end up using it, so I didn’t spend more than I had to. Since then, the quality has generally gone up, and so has the price. Statistics on pairs http://notstatschat.netlify.com/2017/12/26/statistics-on-pairs/ Tue, 26 Dec 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/12/26/statistics-on-pairs/ I’m interested in estimation for complex samples from structured data — clustered, longitudinal, family, network — and so I’m interested in intuition for estimating statistics of pairs, triples, etc. This turns out to be surprisingly hard, so I want easy examples. One thing I want easy examples for is the relationship between design-weighted $U$-statistics and design-weighted versions of their Hoeffding projections. That is, if you write a statistic as a sum over all pairs of observations, you can usually rewrite it as a sum of a slightly more complicated statistic over single observations, and I want to think about whether the weighting should be done before or after you rewrite the statistic. How to add chi-squareds http://notstatschat.netlify.com/2017/12/06/how-to-add-chi-squareds/ Wed, 06 Dec 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/12/06/how-to-add-chi-squareds/ A quadratic form in Gaussian variables has the same distribution as a linear combination of independent $\chi^2_1$ variables – that’s obvious if the Gaussian variables are independent and the quadratic form is diagonal, and you can make that true by change of basis. 
The coefficients in the linear combination are the eigenvalues $\lambda_1,\dots,\lambda_m$ of $VA$, where $A$ is the matrix representing the quadratic form and $V$ is the covariance matrix of the Gaussians. Secret Santa collisions http://notstatschat.netlify.com/2017/11/25/secret-santa-collisions/ Sat, 25 Nov 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/11/25/secret-santa-collisions/ Attention Conservation Notice: while this probability question actually came up in real life, that’s just because I’m a nerd. “Secret Santa” is a Christmas tradition for taming the gift-giving problem in offices, groups of acquaintances, etc. Instead of everyone wondering which subset of people they should give a gift to, each person is randomly assigned one recipient and has to give a gift (with a relatively low upper bound on cost) to that one person. When all U-shaped curves look the same to you http://notstatschat.netlify.com/2017/11/23/when-all-u-shaped-curves-look-the-same-to-you/ Thu, 23 Nov 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/11/23/when-all-u-shaped-curves-look-the-same-to-you/ There was (as usual) controversy about some of the NCEA maths questions this year. Most of the controversy was about whether they assumed knowledge that the students hadn’t been told to know, but I’m going to worry about the pseudocontext problem One question had the set up and then (from the Herald), the question was The maths problem is fine as a quadratic equation, I suppose. But the physics is wrong and the maths isn’t how any sane person would answer the question in reality. 
Means of maximums http://notstatschat.netlify.com/2017/11/08/means-of-maximums/ Wed, 08 Nov 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/11/08/means-of-maximums/ From a math point of view, it’s an interesting example of how the mean of the maximum of a set of random variables is higher than the max of the individual means – Andrew Gelman Controlling the maximum of a set of random variables is an important problem in mathematical statistics, and it’s surprising how far a comparatively crude approach can be stretched. Suppose you have $m$ random variables $X_1$, . Haere mai, statistical computing folks http://notstatschat.netlify.com/2017/09/26/haere-mai-statistical-computing-folks/ Tue, 26 Sep 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/09/26/haere-mai-statistical-computing-folks/ Later this year, Auckland is hosting the Asian regional meeting of the International Association for Statistical Computing. For the benefit of conference-goers, here’s a brief introduction to the locale. Nomenclature: The Owen G. Glenn Building (OGGB, or building 260, in university abbreviations) is named after Owen G. Glenn. He’s a New Zealand businessman and philanthropist. Auckland is named after George Eden. The subantarctic Auckland Islands were not named after George but after his father William Eden. A genome analogy http://notstatschat.netlify.com/2017/09/25/a-genome-analogy/ Mon, 25 Sep 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/09/25/a-genome-analogy/ DNA looks like a zipper. If you scale it up to a fairly fine-toothed zipper with tooth spacing of about 2mm or 1/12in, the human genome would run about from Auckland to Hawai’i. On this scale, 1 Morgan is about 230km, so you inherit contiguous genome chunks from your grandparents of about 60km. The HLA region is about 6km long. A million-variant SNPchip has markers spaced several meters apart. 
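The scale in the genome analogy above can be checked with a little arithmetic. This is only a sketch: it assumes one zipper tooth per base pair, and the genome length of roughly 3.1 billion base pairs is a commonly quoted round figure, not a number taken from the post. The function name `km_to_bp` is made up for the illustration.

```r
# Assumption: one zipper tooth per base pair, teeth 2 mm apart (as in the analogy)
tooth_spacing_mm <- 2

# Convert a distance along the zipper (in km) to the number of base pairs it spans
# (hypothetical helper, not from the original post)
km_to_bp <- function(km) km * 1e6 / tooth_spacing_mm

# Distances quoted in the analogy
distances_km <- c(morgan = 230, grandparent_chunk = 60, hla_region = 6)
bp <- km_to_bp(distances_km)
print(bp / 1e6)  # implied sizes in megabases

# Assumption: a haploid human genome of about 3.1e9 base pairs
genome_bp <- 3.1e9
genome_km <- genome_bp * tooth_spacing_mm / 1e6
print(genome_km)  # length of the whole zipper in km

# Average spacing (in metres) between markers on a million-variant chip
marker_spacing_m <- genome_km * 1000 / 1e6
print(marker_spacing_m)
```

Under these assumptions the quoted distances correspond to about 115, 30 and 3 megabases, the whole zipper is about 6200 km long, and a million evenly spaced markers sit about 6 m apart, which is in line with "several meters".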
Bayesian surprise http://notstatschat.netlify.com/2017/09/22/bayesian-surprise/ Fri, 22 Sep 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/09/22/bayesian-surprise/ For reasons not entirely unconnected with NZ election polling, I’ve been thinking about surprise in Bayesian inference again: what happens when you get a result that’s a long way from what you expected in advance? Yes, your prior is badly calibrated and you should feel bad, but what should you believe? A toy version of the problem is inference for a location parameter. We have a prior $p_\theta(\theta)$ for the parameter, and a model $p_X(x|\theta)$. Visual design of diagnostics http://notstatschat.netlify.com/2017/09/06/visual-design-of-diagnostics/ Wed, 06 Sep 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/09/06/visual-design-of-diagnostics/ Q: Are these curves parallel? A: I mean, probably not? They look like they might be getting closer together, but if those big steps mean more uncertainty… Q: Ok, how about with confidence intervals. Now are they parallel? A: Um. I’m not sure that helped. Still a definite maybe Q: Is this curve horizontal? A: No. It slopes down. It crosses zero somewhere around 8 or 9 years. Causes and counterfactuals http://notstatschat.netlify.com/2017/08/23/causes-and-counterfactuals/ Wed, 23 Aug 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/08/23/causes-and-counterfactuals/ Attention Conservation Notice: this was on StatsChat four years ago, but I like it as a causation example. A story in the Herald illustrates a subtle technical and philosophical point about causation. A Lotto winner says “I realised I was starving, so stopped to grab a bacon and egg sandwich. “When I saw they had a Lotto kiosk, I decided to buy our Lotto tickets while I was there. 
Wilcoxon and polymath: another update http://notstatschat.netlify.com/2017/08/19/wilcoxon-and-polymath-another-update/ Sat, 19 Aug 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/08/19/wilcoxon-and-polymath-another-update/ As I wrote before, there’s a polymath (large-scale collaborative pure maths) project on transitivity of dice. Here’s the latest update from Timothy Gowers’s blog. Suppose $X$, $Y$, and $Z$ are discrete distributions supported on $1,2,\dots,n$. We can ask about $P(X<Y)$ and $P(Y<Z)$ and $P(X<Z)$, which is what the Wilcoxon/Mann-Whitney rank test does. The project has basically proved that under one model for randomly choosing distributions, if $X$, $Y$, and $Z$ have the same mean and $P(X>Y)>1/2$ and $P(Y>Z)>1/2$, the probability of $P(X>Z)>1/2$ is $1/2+o(1)$. The bus bot http://notstatschat.netlify.com/2017/08/10/the-bus-bot/ Thu, 10 Aug 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/08/10/the-bus-bot/ Back in January, I spent a few hours hacking together a script to tweet summaries of the Auckland bus system, on the account @tuureiti. People seemed to like it: the bot has 110 followers, many of whom appear to be actual people (or at least actual organisations). A few times I’ve been asked for the source code and hadn’t gotten around to it, because the code is ugly, includes my API key, and is ugly. Psychoactive substances and Peter Dunne http://notstatschat.netlify.com/2017/07/26/psychoactive-substances-and-peter-dunne/ Wed, 26 Jul 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/07/26/psychoactive-substances-and-peter-dunne/ New Zealand, like a lot of places, has a problem with illegal sales of potent synthetic cannabinoid receptor agonists (aka ‘synthetic cannabis’, ‘synthetic marijuana’, ‘Spice’, ‘K2′, etc, etc). Peter Dunne, as the responsible Minister, is getting a lot of criticism. I don’t think Peter Dunne should be an MP. His party got 0.22% of the vote at the last election. 
In theory it’s conceivable he got in because he provides astonishingly good constituency service to Ohariu rather than as an edge case in the MMP voting system, but I find that hard to believe. Tail bounds under sparse correlation http://notstatschat.netlify.com/2017/07/26/tail-bounds-under-sparse-correlation/ Wed, 26 Jul 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/07/26/tail-bounds-under-sparse-correlation/ Attention Conservation Notice: Very long and involves a proof that hasn’t been published, though the paper was rejected for unrelated reasons. Basically everything in statistics is a sum, and the basic useful fact about sums is the Law of Large Numbers: the sum is close to its expected value. Sometimes you need more, and there are lots of uses for a good bound on the probability of medium to large deviations from the expected value. Information and control http://notstatschat.netlify.com/2017/07/25/information-and-control/ Tue, 25 Jul 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/07/25/information-and-control/ There were delays on the Auckland rail system this morning, apparently due to a train hitting a person in south Auckland. It seems unreasonable to complain about the delays; Auckland Transport doesn’t have a warehouse of magic inflatable replacement trains, and owing to historic underfunding of trains, there isn’t a lot of redundancy in the physical track network. They actually did a pretty good job of moving around the trains they have, and I was only delayed about twenty minutes. Probabilities not bounded away from zero http://notstatschat.netlify.com/2017/07/09/probabilities-not-bounded-away-from-zero/ Sun, 09 Jul 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/07/09/probabilities-not-bounded-away-from-zero/ We have a population or cohort of $N$ people divided into $H$ sampling strata, with a sample of size $n_h$ taken from the population $N_h$ in stratum $h$. 
Let $\pi_{ih}$ be the sampling probability for person $i$ in stratum $h$. When we do asymptotics we usually assume $\pi_{ih}$ are bounded away from zero. That’s not ideal for, say, case-control studies of rare diseases, where we might want asymptotic approximations based on the case incidence being small (ie, converging to zero). Two-day course: survival analysis http://notstatschat.netlify.com/2017/07/05/two-day-course-survival-analysis/ Wed, 05 Jul 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/07/05/two-day-course-survival-analysis/ Tuesday 12 and Wednesday 13 September 2017, 9am-5pm. This two-day workshop will cover data exploration, data summaries, and regression modelling for time-to-event data. There will be both lecture and practical sessions. Topics: Concepts: censoring, truncation, competing risks, choice of time scale Summaries: the Kaplan–Meier curve; mean, median, and proportion surviving; the hazard rate; graphical exploration Two-sample testing: the logrank test and its strengths and weaknesses The proportional hazards model: right censoring, left truncation, Time-varying predictors A possibly unsurprising bootstrap observation http://notstatschat.netlify.com/2017/06/11/a-possibly-unsurprising-bootstrap-observation/ Sun, 11 Jun 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/06/11/a-possibly-unsurprising-bootstrap-observation/ Suppose you have a finite population modelled as a realisation of some probability model with potentially complicated spatial structure, and a multistage sample taken with some different structure. For example, suppose you have a genetic linear mixed model with ancestry and relatedness structure, but you sample people by census block group and household. It is either blindingly obvious or really surprising (or both?) that the sampling component of the standard error doesn’t depend on the structure of the model. 
Stupid word games http://notstatschat.netlify.com/2017/06/05/stupid-word-games/ Mon, 05 Jun 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/06/05/stupid-word-games/ Today, Jeroen Ooms announced the appearance on CRAN of an R package for language detection, wrapping the “CLD2″ compact language detector. Obviously, given a tool like that on a holiday long weekend, my first reaction was to try to confuse it. Two fun games to play with a language detector: Find an obviously English sentence (ideally a quote) that it doesn’t recognise as English, and a very non-obviously English sentence that it does Pipeable survey analysis in R http://notstatschat.netlify.com/2017/05/29/pipeable-survey-analysis-in-r/ Mon, 29 May 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/05/29/pipeable-survey-analysis-in-r/ Today, I accidentally found out about the ‘srvyr’ package, which is a wrapper for my ‘survey’ package to make it work with %>% pipes and dplyr and so on. Yay! R has a package discovery problem. I wouldn’t say I’m the most plugged-in of R users, but there must be a reasonable fraction who would be even less likely than me to find out about it. Even though the ‘survey’ package design sticks fairly close to ‘tidy data’ principles, the fact that it uses different conventions from the `tidyverse’ packages means that there’s a whole lot of adaptor code needed. Peer review and community endorsement http://notstatschat.netlify.com/2017/05/22/peer-review-and-community-endorsement/ Mon, 22 May 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/05/22/peer-review-and-community-endorsement/ In most academic fields there are journals it’s easy to publish in. Some of these are outright scams, but some are just not that fussy about the importance of results. In the experimental sciences, being able to publish negative or otherwise uninteresting results can be very important. 
Even in fields where ideas, rather than data, are important, being able to get research out into libraries is valuable – though preprint servers such as arXiv are now filling that niche. A ‘polymath’ project on the Wilcoxon test? http://notstatschat.netlify.com/2017/05/12/a-polymath-project-on-the-wilcoxon-test/ Fri, 12 May 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/05/12/a-polymath-project-on-the-wilcoxon-test/ ‘Polymath’ is a set of projects in massive collaborative proof of mathematical results; Terry Tao and Timothy Gowers are two of the famous mathematicians involved. There’s a new potential project on Gowers’s blog, which he describes as being related to intransitive dice. As you know, if you read this blog, (a) I prefer the term non-transitive, and (b) this means it’s about the Wilcoxon test. The idea of the conjecture is that you define an $n$-sided die by sampling uniformly with replacement $n$ numbers from $1, 2,3,\dots,n$ as the numbers on the sides, with the constraint that the numbers have to add up to $n(n+1)/2$. Value of a degree http://notstatschat.netlify.com/2017/05/01/value-of-a-degree/ Mon, 01 May 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/05/01/value-of-a-degree/ Today was graduation day for Science students at the University of Auckland. At each graduation, the Chancellor of the University gives an introduction that includes (for example, here) We know that, compared to those whose formal education ends in high school, graduates have lower unemployment rates, higher salaries, better career prospects, and better health outcomes. I’d hope that a university degree would give students the tools to think about claims like that. 
Prerequisites http://notstatschat.netlify.com/2017/03/29/prerequisites/ Wed, 29 Mar 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/03/29/prerequisites/ This week, John Myles White tweeted One meme I wish would die off: the belief that we can teach high school students statistics without teaching them calculus. Statistics Twitter was immediately divided between “Preach it, brother!” and “Not cool, dude.” I’m mostly, but not entirely, in the latter camp. Personally, I did study calculus before taking up statistics, and it helped. In fact, I studied tensor calculus, functional analysis, measure theory, group theory, and differential topology before taking up statistics. Come work with us http://notstatschat.netlify.com/2017/03/28/come-work-with-us/ Tue, 28 Mar 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/03/28/come-work-with-us/ The Statistics Department at the University of Auckland is looking for three new academics. We have two entry-level positions[1], and one mid-level to senior position[2]. Formal ad here: (https://www.flickr.com/photos/yaranaika/5612799116) The department has a fairly broad view of what statistics is about: we have probabilists (both theoretical and applied), researchers in mainstream statistical methods, in astrostatistics, in genomics, in stats education, in statistical ecology, in forensic statistics, and (famously) in statistical computing. Flat Earthers http://notstatschat.netlify.com/2017/03/27/flat-earthers/ Mon, 27 Mar 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/03/27/flat-earthers/ The world isn’t a flat rectangle. We’ve got to the stage where most people accept this. It’s especially easy in New Zealand, where we know you can fly in a wide variety of directions and still end up in Europe after about the same time in the air. Since the world isn’t a flat rectangle, all flat rectangular maps have to be badly wrong somehow. 
Recently, Boston public schools have shifted from the badly-wrong Mercator projection to the differently-wrong Gall-Peters projection. Why I like the Convolution Theorem http://notstatschat.netlify.com/2017/03/27/why-i-like-the-convolution-theorem/ Mon, 27 Mar 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/03/27/why-i-like-the-convolution-theorem/ The convolution theorem (or theorems: it has versions that some people would call distinct species and others would describe as mere subspecies) is another almost obviously almost true result, this time about asymptotic efficiency. It’s an asymptotic version of the Cramér–Rao bound. Suppose $\hat\theta$ is an efficient estimator of $\theta$ and $\tilde\theta$ is another, not fully efficient, estimator. The convolution theorem says that if you rule out stupid exceptions, asymptotically $\tilde\theta=\hat\theta+e$ where $e$ is pure noise, independent of $\hat\theta$. Case-control efficiency http://notstatschat.netlify.com/2017/03/18/case-control-efficiency/ Sat, 18 Mar 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/03/18/case-control-efficiency/ The basic story about sampling weights and regression is fairly simple: if you don’t need the weights, using them will add noise. The standard error increase is basically proportional to the coefficient of variation of the weights, and doesn’t depend on the regression coefficients or the covariate distribution. Logistic regression in a case-control sample looks superficially as if it should be the same. The maximum likelihood estimator is unweighted logistic regression, ignoring the weights, and it’s more efficient than the estimator using sampling weights. 
Order and quotient topologies http://notstatschat.netlify.com/2017/03/14/order-and-quotient-topologies/ Tue, 14 Mar 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/03/14/order-and-quotient-topologies/ Over the years when I was intermittently working on the rock-paper-scissors (transitivity) problem in statistical testing, one of the confusing things was the difference between order and quotient topologies. I thought I’d write about why. Suppose you have two-dimensional Euclidean space, with points $(x,y)$, and you decide to order points on the first coordinate, so $(x,y)\prec(z,w)$ iff $x<z$. This gives you equivalence classes $(x,y)\sim(z,w)$ iff $x=z$. There are two obvious topologies on the set of equivalence classes. “Meritocracy” and “public good” http://notstatschat.netlify.com/2017/03/11/meritocracy-and-public-good/ Sat, 11 Mar 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/03/11/meritocracy-and-public-good/ Sometimes a word coined with one intended meaning ends up with a very different one, and after you have fought the good fight and run the race to the finish, you need to just give up. My favourite example is “meritocracy”, a word coined like “truthiness” and “factoid” to satirically attack a social trend. It failed completely: while “truthiness” worked, and “factoid” still has some of its negative connotation, “meritocracy” now means exactly the concept that it attacked: the supposed ideal of stack-ranking based on a straightforward one-dimensional metric for merit. Hearing things http://notstatschat.netlify.com/2017/03/05/hearing-things/ Sun, 05 Mar 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/03/05/hearing-things/ I spend more time than I like on aeroplanes, so I thought I’d write something about my experiences with headphones. 1. Having something to cut out the engine noise makes a noticeable difference to how much air travel sucks. Maybe as much as 10%. 2. 
When I first started to teach R courses for money, in 2001, I bought a pair of Bose noise-cancelling headphones. These make a big difference, and it’s easy to listen to music with them on. The Ihaka Lectures http://notstatschat.netlify.com/2017/02/02/the-ihaka-lectures/ Thu, 02 Feb 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/02/02/the-ihaka-lectures/ The Stats department at the University of Auckland is inaugurating a public lecture series, named to honour Ross Ihaka, who is planning to retire this year. We’re having four lectures, with speakers chosen to represent a wide range of areas where statistical computing and graphics is important. Wednesday, March 8: Hadley Wickham (Chief Scientist, RStudio; (honorary) Associate Professor, University of Auckland). Hadley did an MSc in Statistics here in Auckland and a PhD with Di Cook’s statistical graphics group at Iowa State University. When the bootstrap doesn’t work http://notstatschat.netlify.com/2017/02/01/when-the-bootstrap-doesnt-work/ Wed, 01 Feb 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/02/01/when-the-bootstrap-doesnt-work/ The bootstrap always works, except sometimes. By ‘works’ here, I mean in the weakest senses that the large-sample bootstrap variance correctly estimates the variance of the statistic, or that the large-scale percentile bootstrap intervals have their nominal coverage. I don’t mean the stronger sense that someone like Peter Hall might use, that the bootstrap gives higher-order accurate confidence intervals. So the bootstrap ‘works’ for the median, even though not as well as for smooth functions of the mean. Te Reo Māori in schools http://notstatschat.netlify.com/2017/01/31/te-reo-m%C4%81ori-in-schools/ Tue, 31 Jan 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/01/31/te-reo-m%C4%81ori-in-schools/ Having Te Reo Māori taught as part of the standard curriculum in NZ schools seems like a reasonable idea to me. A few reasons: 1. 
Learning more than one language is good for understanding grammar and pronunciation, and it doesn’t matter a lot which one. “Grammar” is, almost by definition, the set of rules for correct sentences that native speakers follow most of the time without thinking, so it’s hard to talk and think about grammar sensibly if you’ve never tried to produce correct sentences in another language. Case-control sampling and pseudo-Rsquareds http://notstatschat.netlify.com/2017/01/27/case-control-sampling-and-pseudo-rsquareds/ Fri, 27 Jan 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/01/27/case-control-sampling-and-pseudo-rsquareds/ So, I have been asked a few times how to compute $R^2$ for models fitted to survey data. Initially the questions were about the ordinary linear-regression $R^2$, which is easy because it’s the ratio of two variances, and we can estimate variances. More recently, people have been asking about the Nagelkerke pseudo-$R^2$ in logistic regression. It’s not immediately obvious how to define the Nagelkerke $R^2$ under complex sampling. My approach was to consider the Cox–Snell $R^2$ that precedes it, which is an estimate of a well-defined population quantity: $\log (1-R^2)$ is the mutual information between the predictors and outcome. A bus-watching bot http://notstatschat.netlify.com/2017/01/17/a-bus-watching-bot/ Tue, 17 Jan 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/01/17/a-bus-watching-bot/ When it’s up, the account @tuureiti on Twitter tweets a summary of the state of Auckland buses – at the moment, every 15 minutes. Q: Can you explain that picture? A: Every bus that the Auckland Transport GTFS feed knows about has a dot on the graph. The GTFS feed has a ‘delay’ field that says how far ahead or behind schedule the bus is, separated by whether the next event is ‘arrival’ or ‘departure’. 
Mature and premature optimisation http://notstatschat.netlify.com/2017/01/12/mature-and-premature-optimisation/ Thu, 12 Jan 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/01/12/mature-and-premature-optimisation/ Earlier this week I wrote some code that wasted 90% of its time moving data around in memory, because I just ‘grew’ a long vector with the idiom > stuff<-c(stuff, morestuff) Here’s the github commit that changed the code. I’m writing about it because it illustrates a few useful points. First, the inefficient code was absolutely the right choice initially. I didn’t know how long each additional vector would be, and while I could have worked it out in principle, in practice I would quite likely have got it wrong. Fixing an infelicity in ‘leaps’ http://notstatschat.netlify.com/2017/01/09/fixing-an-infelicity-in-leaps/ Mon, 09 Jan 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/01/09/fixing-an-infelicity-in-leaps/ The leaps package for R is ancient – this is its twentieth year on CRAN. It uses old Fortran code by the Australian computational statistician Alan Miller. The Fortran 90 versions are on the web, but Fortran 90 compilation with R wasn’t portable back then, so I used the older Fortran 77 version. The main point back in 1997 was to provide a version of the leaps() function in S, which uses a branch-and-bound algorithm to do exhaustive search for the best (smallest residual-sum-of-squares) model of each size. Learning the Monty Hall problem http://notstatschat.netlify.com/2017/01/03/learning-the-monty-hall-problem/ Tue, 03 Jan 2017 00:00:00 +0000 http://notstatschat.netlify.com/2017/01/03/learning-the-monty-hall-problem/ As Wikipedia gives it Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. 
He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice? The ‘iris’ data http://notstatschat.netlify.com/2016/12/30/the-iris-data/ Fri, 30 Dec 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/12/30/the-iris-data/ Fisher’s famous ‘iris’ data set is a convenient example because it’s small and low-dimensional and has very marked differences between groups. These characteristics also make it a bad example (edit: at least for modern machine learning), because the behaviour of small, low-dimensional classification problems is a very poor guide to the behaviour of large or high-dimensional ones. That’s all obvious. What’s less well known is that the data set is an example of pseudocontext in education. Making survey statistics boring and inefficient http://notstatschat.netlify.com/2016/11/23/making-survey-statistics-boring-and-inefficient/ Wed, 23 Nov 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/11/23/making-survey-statistics-boring-and-inefficient/ Last night, Alastair Scott was awarded the Jones Medal by the Royal Society of New Zealand. The medal, named after Vaughan Jones, is for lifetime achievement in the mathematical sciences. Alastair made contributions to both theoretical and applied statistics in two main areas, as the title of this post indicates. With Jon Rao and others (including me), he worked on making design-based inference boring, and with Chris Wild and others, on making it inefficient. Brief quake summary for overseas people http://notstatschat.netlify.com/2016/11/14/brief-quake-summary-for-overseas-people/ Mon, 14 Nov 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/11/14/brief-quake-summary-for-overseas-people/ There was a pair of big earthquakes in New Zealand last night (late Sunday morning UTC, just after midnight Monday NZ time). They were about half-way between Wellington and Canterbury, in the northeast of the South Island. 
There have been a lot of smaller related quakes as well. Auckland and Dunedin are unharmed. Christchurch was shaken but not seriously damaged; some people were evacuated because of the potential for a big tsunami. Changes in turnout and preference http://notstatschat.netlify.com/2016/11/10/changes-in-turnout-and-preference/ Thu, 10 Nov 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/11/10/changes-in-turnout-and-preference/ So, as you know, Hillary Clinton narrowly lost the Electoral College and probably narrowly won the popular vote. And there’s lots of theorising about how these huge swings came about and what they mean. An important first step is to think about how big the swings really were. Here are some graphs of county-level votes in 2012 and 2016. In all the graphs, the number of votes for the candidate is scaled by the 2012 total for the county, and is then weighted by that same 2012 total. Cuts to ‘Growing Up in New Zealand’ http://notstatschat.netlify.com/2016/10/18/cuts-to-growing-up-in-new-zealand/ Tue, 18 Oct 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/10/18/cuts-to-growing-up-in-new-zealand/ The NZ cohort study ‘Growing Up in New Zealand’ is being cut from 7000 children to 2000, according to a story on Stuff today. That’s unfortunate – birth cohort studies are something New Zealand has done well in the past, and this is a cohort for the modern New Zealand. Obviously, the top priority for the study will have been to fight the cuts or at least try to moderate them. Terms to eschew http://notstatschat.netlify.com/2016/10/12/terms-to-eschew/ Wed, 12 Oct 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/10/12/terms-to-eschew/ “I have discovered something else,” I continued. “By flipping the pages at random, and putting my finger in and reading the sentences on that page, I can show you what’s the matter – how it’s not science, but memorizing, in every circumstance. 
Therefore I am brave enough to flip through the pages now, in front of this audience, to put my finger in, to read, and to show you. Large quadratic forms http://notstatschat.netlify.com/2016/09/27/large-quadratic-forms/ Tue, 27 Sep 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/09/27/large-quadratic-forms/ Attention Conservation Notice: there probably aren’t as many as half a dozen groups in the world who actually have this much genome sequencing data. Everyone else could wait to see if something better comes up. If you follow me on Twitter, you will have seen various comments about eigenvalues, matrices, and other linear algebra over the past months. Here, finally, is the Sekrit Eigenvalue Project. A quadratic form in Normally distributed variables is of the form $z^TAz$ where $z$ is a vector of $n$ standard Normals and $A$ is an $n\times n$ matrix. The hard problem of AI and other stories http://notstatschat.netlify.com/2016/09/22/the-hard-problem-of-ai-and-other-stories/ Thu, 22 Sep 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/09/22/the-hard-problem-of-ai-and-other-stories/ Another occasional SF/F post. Amazon now has a lot of older Melissa Scott novels in Kindle format. In the old days, Melissa Scott was known for forthrightly LBGTQ fiction. After a decade or two, there’s been enough social progress for that to not be the most obvious thing about her writing. The ‘hard problem of consciousness’ is a term of art in philosophy of mind. It’s either the most important question about intelligence, or a purely linguistic distraction from the real issues. Come work with us http://notstatschat.netlify.com/2016/09/07/come-work-with-us/ Wed, 07 Sep 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/09/07/come-work-with-us/ The Statistics department at the University of Auckland is advertising for a Professional Teaching Fellow, following the retirement of existing staff. 
This is a full-time, permanent, academic staff position in a department that understands how much its success depends on high-quality teaching. Aerial view of Auckland by Flickr user Craig, annotated to show Stats Dept The formal ad is on Seek. The department is seeking to appoint a highly organised, energetic and collegial person for the role of Professional Teaching Fellow. On permuting all the things http://notstatschat.netlify.com/2016/09/06/on-permuting-all-the-things/ Tue, 06 Sep 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/09/06/on-permuting-all-the-things/ I wanted to list all the numbers whose digits were some permutation of 2,2,5,5,9,9, and find how many of them were multiples of 11, and how many were prime. (Because of Evelyn Lamb’s comment on the prime number 295259 produced by the prime numbers twitter bot) It takes some thought to work out how to list those numbers exactly once (because of the duplicated digits) but no thought at all to work out how to generate a random sample of them and discard duplicates. The lithium-powered space bike http://notstatschat.netlify.com/2016/09/04/the-lithium-powered-space-bike/ Sun, 04 Sep 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/09/04/the-lithium-powered-space-bike/ Q: So, it’s been about 11 months since you got your fancy electric-assist bike A: Yes, that’s right Q: Have you given up yet? A: No, it’s still fun. Q: Even with the rain? A: Combining Doppler radar and the detailed weather forecasts has mostly kept me dry Q: And getting killed by cars? A: So far, still at less than 1 event. Q: How do you feel about busy two-lane roundabouts? “The” multiple comparisons problem http://notstatschat.netlify.com/2016/08/27/the-multiple-comparisons-problem/ Sat, 27 Aug 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/08/27/the-multiple-comparisons-problem/ Andrew Gelman posted recently with the title “Bayesian inference completely solves the multiple comparisons problem”. 
Bayesians have been making a claim that sounds like this for many years, so it would be easy to misunderstand and think he was making a much weaker claim than he actually is. There are at least two multiple comparisons problems, and I’d like to suggest some terminology: The first-person multiple comparisons problem: I have data relevant to a collection of parameters $\{\theta_i\}_{i=1}^N$ and I want to make sure I arrive at sensible beliefs or take sensible decisions even if $N$ is quite large Like a crossword http://notstatschat.netlify.com/2016/08/20/like-a-crossword/ Sat, 20 Aug 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/08/20/like-a-crossword/ The philosopher of science Susan Haack has a lovely analogy for the interconnectedness of scientific ideas: the crossword puzzle. We’re talking something along the lines of the New York Times crossword, not a British-style cryptic: the clues for each entry are often insufficient taken one at a time, but a false answer is likely to be revealed by its failure to fit with crossing answers. Chris McDowall recently reminded me of the Phantom Time Hypothesis, my favourite engagingly batshit historical theory. Simulations and modes of convergence http://notstatschat.netlify.com/2016/08/14/simulations-and-modes-of-convergence/ Sun, 14 Aug 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/08/14/simulations-and-modes-of-convergence/ We often have theory that says \[\sqrt{n}(\hat\theta_n-\theta)\stackrel{d}{\to}N(0,\sigma^2),\] and then do simulations to see how well the asymptotic approximation applies. After doing so, we often present tables of the empirical mean and standard deviation of $\hat\theta_n.$ This doesn’t make a lot of sense. Knowing that $\sqrt{n}(\hat\theta_n-\theta)\stackrel{d}{\to}N(0,\sigma^2)$ doesn’t tell us anything about the moments of $\hat\theta_n$ for any finite $n$. Convergence in distribution does not imply convergence in mean. 
For example, $\hat\theta_n$ could be maximum likelihood estimates in a logistic regression model. Etymology http://notstatschat.netlify.com/2016/08/02/etymology/ Tue, 02 Aug 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/08/02/etymology/ Penguin: the name is supposed to come from the Welsh pen gwyn, meaning ‘white head’. Since penguins have black heads, and do not live within 10,000 km of Wales, it is difficult to see how this theory arose. A modest proposal: Lazy Ambiguous Single Transferable Vote http://notstatschat.netlify.com/2016/07/29/a-modest-proposal-lazy-ambiguous-single-transferable-vote/ Fri, 29 Jul 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/07/29/a-modest-proposal-lazy-ambiguous-single-transferable-vote/ We’re about to have another outbreak of voting here in NZ as well. The local government elections use STV, and Graeme Edgeler explains it here. In particular, he explains how indicating preferences for all the candidates, even ones you don’t want to win, is desirable. Because Twitter is Twitter, a discussion of this came up with Rob Salmond’s proposal that you should be able to vote 1,2, 3, , 35,36 for, say, the District Health Board elections where there are a few good candidates who are worth voting for, a couple of antifluoride or antivax extremists who need to be voted against, and a lot of boring and irrelevant candidates. One scoRe years http://notstatschat.netlify.com/2016/07/28/one-score-years/ Thu, 28 Jul 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/07/28/one-score-years/ It’s always nice when even imperfect metrics make you look good. The new programming-language rankings in IEEE Spectrum are out. I don’t think I believe their weighting system, but it has R in 5th place! Since last year, we’ve just edged out C#. Bob Muenchen looks at statistical software citations on Google Scholar, and finds that R is narrowly in front of SAS – and though SPSS is well ahead, it’s headed down. 
How do we prove the Central Limit Theorem? http://notstatschat.netlify.com/2016/07/04/how-do-we-prove-the-central-limit-theorem/ Mon, 04 Jul 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/07/04/how-do-we-prove-the-central-limit-theorem/ More precisely, in a course in mathematical statistics that’s trying not to assume more mathematics than necessary, how do we prove it? A (Weak) Law of Large Numbers is easy: Markov’s inequality, then Chebyshev’s Inequality, not needing anything more than the simplest manipulation of expectations. The CLT is hard. The standard approach is to use characteristic functions: prove Levy’s Continuity Theorem, work out what the characteristic function of an iid sum looks like, and then work out the characteristic function of a Normal. Computing the (simplest) sandwich estimator incrementally http://notstatschat.netlify.com/2016/06/04/computing-the-simplest-sandwich-estimator-incrementally/ Sat, 04 Jun 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/06/04/computing-the-simplest-sandwich-estimator-incrementally/ The biglm package in R does {incremental, online, streaming} linear regression for data potentially larger than memory. This isn’t rocket science: accumulating $X^TX$ and $X^TY$ is trivial; the package just goes one step better than this by using Alan Miller’s incremental $QR$ decomposition code to reduce rounding error in ill-conditioned problems. The code also computes the Huber/White heteroscedasticity-consistent variance estimator (sandwich estimator). Someone wants a reference for this. There isn’t one, because it’s too minor to publish, and I didn’t have a blog ten years ago. Are there any news? http://notstatschat.netlify.com/2016/06/03/are-there-any-news/ Fri, 03 Jun 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/06/03/are-there-any-news/ I’ve written before about treating ‘data’ as a plural count noun or a mass noun. 
In most settings I’m happy with either: ‘this data’ or ‘these data’; ‘data is’ or ‘data are’. There are settings where the count version doesn’t feel right to me. I might write “We don’t have much data on that issue”, but never “We don’t have many data on that issue”. Perhaps the most extreme is the opposite of ‘more data’: ‘fewer data’ just seems wrong. Size matters http://notstatschat.netlify.com/2016/04/14/size-matters/ Thu, 14 Apr 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/04/14/size-matters/ There’s a lovely demonstration of simple neural networks at playground.tensorflow.org, which I recommend to anyone interested in teaching or studying them. It shows the inputs, the hidden nodes, and the output classification, and how they change with training. You can add more neurons or more layers interactively, and fiddle with the training parameters. I wish something like this had been available in the early 90s when I was learning about neural networks. Sufficiently advanced technology http://notstatschat.netlify.com/2016/04/10/sufficiently-advanced-technology/ Sun, 10 Apr 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/04/10/sufficiently-advanced-technology/ As I said on Twitter, I found out this week (1) that there are cheap variable resistors sensitive to acetone (or ethanol) and (2) that many people, even scientists, don’t think this is amazing. Multi-atom molecules are hugely bigger than electrons, but hugely smaller than the bulk semiconductor. They shouldn’t be able to affect resistance. Other technologies for interfacing chemistry and electronics tend to be based on light absorption (breathalysers, blood oxygen detectors, some DNA sequencers) or on electrons/protons released in chemical reactions (other breathalysers, other DNA sequencers) or in past days on formation of ions in solution. 
The Great Kiwi Cherry Ripe Scandal http://notstatschat.netlify.com/2016/03/29/the-great-kiwi-cherry-ripe-scandal/ Tue, 29 Mar 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/03/29/the-great-kiwi-cherry-ripe-scandal/ In which I unnecessarily calculate a simple probability by maths when I’ve already done it by simulation. You can just see it on a maths teaching blog as a Bad Example A company is making packs of eight chocolate bars chosen independently and with equal probability from five types: “Cherry Ripe”,“Dairy Milk”,“Crunchie”, “Caramello”, and “Flake”. What is the probability that a pack will contain seven or more Cherry Ripes? Mostly dead http://notstatschat.netlify.com/2016/03/28/mostly-dead/ Mon, 28 Mar 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/03/28/mostly-dead/ Inigo Montoya: He’s dead. He can’t talk. Miracle Max: Whoo-hoo-hoo, look who knows so much. It just so happens that your friend here is only MOSTLY dead. There’s a big difference between mostly dead and all dead. Mostly dead is slightly alive. In the cardiovascular-research trade, there’s a minor but persistent issue of nomenclature. When your heart stops beating and you fall over dead, should that be called “Sudden Cardiac Death” or “Sudden Cardiac Arrest”? Artistic verisimilitude http://notstatschat.netlify.com/2016/03/24/artistic-verisimilitude/ Thu, 24 Mar 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/03/24/artistic-verisimilitude/ Scene: the back room of a gas station not far from a Research 1 university, late twentieth century. “Chris? You got a minute” It’s Chris’s ‘lunch’ break, which by natural justice and state law should be the time for daydreaming about attractive members of the appropriate sex while trying to start an assignment on Banach spaces. “The boss read an in-flight magazine again” Chris sighs. 
Having the owner away for a week was good, but he always seemed to get…ideas The conservative Bonferroni correction http://notstatschat.netlify.com/2016/03/20/the-conservative-bonferroni-correction/ Sun, 20 Mar 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/03/20/the-conservative-bonferroni-correction/ It seems to be a surprise to most people (certainly to me) how sharp the Bonferroni correction is when the number of tests is large. Unless the correlation between tests is really high, the actual family-wise Type I error rate is very close to the nominal rate $\alpha/k$. Part of the issue is confusing prior distributions on effect sizes (which can be quite strongly correlated) with null sampling distributions (which tend to be weakly correlated in the extreme tails). Trace estimators and impact factors http://notstatschat.netlify.com/2016/03/15/trace-estimators-and-impact-factors/ Tue, 15 Mar 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/03/15/trace-estimators-and-impact-factors/ For a Secret Project™, I needed a quick estimator of the trace of a matrix. To be precise, I have a rectangular matrix $A$ and I needed $\mathop{tr}(B)$ and $\mathop{tr}(B^2)$ where $B=A^TA$. That sounds easy, but $A$ is big enough that you don’t want to compute $A^TA$. The first one actually is easy: \[\mathop{tr}(B)=\sum_{ij}(A_{ij})^2.\] The second one is harder. I tried a sampling approach: estimating a sample of the entries of $B$ and using \[\mathop{tr}(B^2)=\sum_{ij} (B_{ij})^2.\] A gene for celibacy? http://notstatschat.netlify.com/2016/03/13/a-gene-for-celibacy/ Sun, 13 Mar 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/03/13/a-gene-for-celibacy/ Tyson is, unusually for him, completely wrong here. I’ll ignore the use of “gene” to mean “allele”, since that’s a plain-English abuse of notation as harmless as calling Pluto a planet. 
Put more precisely, he’s saying “if you have a genetic variant (allele) that substantially increases your tendency to celibacy, you didn’t inherit it”. The first problem is a statistical one. It would be surprising if you inherited a ‘celibacy gene’, but it would also be surprising if you got it by de novo mutation. Truthy and Sciency http://notstatschat.netlify.com/2016/03/02/truthy-and-sciency/ Wed, 02 Mar 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/03/02/truthy-and-sciency/ There was a story at the New Zealand Herald, republished without attribution from The Conversation, under the headline “People in their 90s reveal secret to ageing well”. By and large, it’s a pretty good example of what The Conversation is trying to do, but there are some strange bits, such as “Regular exercise changes our epigenome, activating genes that improve muscle function” and “Few participants smoked, avoiding the known epigenetic effects of cigarette smoke including lung damage, increased risk of dementia and cancer.” Coding linear splines http://notstatschat.netlify.com/2016/02/29/coding-linear-splines/ Mon, 29 Feb 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/02/29/coding-linear-splines/ Attention conservation notice: anyone who would actually use this could just sit down and do the algebra almost as quickly. The best-known splines are cubic: a cubic spline with knots $x_1,\;x_2,\dots,\;x_m$ is a piecewise-cubic polynomial $f(x)$ where $f$, $f'$, and $f''$ are continuous at the knots. The name is from the engineer’s drafting tool, a flexible metal strip that – in the infinitely-thin, uniformly flexible asymptote – will form a curve held down at the knots and otherwise minimising bending energy $\int f''(x)^2\,dx$ to give a cubic spline. 
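As a rough illustration of the linear analogue of that definition (a sketch only, not the code from the post): a linear spline with knots $x_1,\dots,x_m$ is piecewise linear and continuous at the knots, and one common way to code it is the truncated-line basis $x,\ (x-x_1)_+,\dots,(x-x_m)_+$. The function names, knots, and coefficients below are hypothetical, and Python is used only for convenience.

```python
def linear_spline_basis(x, knots):
    """Return the row [x, (x-k1)+, ..., (x-km)+] for a single value x.

    Assumed coding (truncated-line basis); not necessarily the post's.
    """
    return [x] + [max(x - k, 0.0) for k in knots]


def linear_spline(x, intercept, coefs, knots):
    """Evaluate intercept + sum_j coefs[j] * basis_j(x)."""
    basis = linear_spline_basis(x, knots)
    return intercept + sum(c * b for c, b in zip(coefs, basis))


if __name__ == "__main__":
    knots = [1.0, 3.0]        # hypothetical knots
    coefs = [1.0, -2.0, 3.0]  # initial slope, then the slope change at each knot
    for x in (0.0, 1.0, 2.0, 3.0, 4.0):
        print(x, linear_spline(x, 0.0, coefs, knots))
```

With this coding, each coefficient after the first is the change in slope at its knot, so the curve is continuous by construction and the slopes on the three pieces here are 1, -1, and 2.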
Cheap tricks http://notstatschat.netlify.com/2016/02/28/cheap-tricks/ Sun, 28 Feb 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/02/28/cheap-tricks/ If you’re interested in thinking about evidence and belief and rhetoric, it’s convenient to have uncontroversial examples of ‘cheap tricks’ that directly affect attitudes. Ian Gordon has provided one, with his translation of the ‘Imperial March’ from Star Wars into a major key. The tune is recognisably still film music by John Williams in a military style, but it’s happy, and lively, and on the edge of self-parody: somewhere between ‘The Great Escape’ and ‘Chicken Run’. Two cheers for crowdfunding http://notstatschat.netlify.com/2016/02/26/two-cheers-for-crowdfunding/ Fri, 26 Feb 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/02/26/two-cheers-for-crowdfunding/ A successful large crowdfunding effort in a medium-sized community is, ipso facto, widely popular. If roughly 40,000 people have donated to buy a beach and sandbar, they’re going to be proud of themselves and not want the effort criticised. And it’s hard to argue that the two million dollars spent on Awaroa beach is a worse use of money than, say, the twenty million spent in a typical week on the lottery. No-one’s forcing you to read the Herald http://notstatschat.netlify.com/2016/02/07/no-ones-forcing-you-to-read-the-herald/ Sun, 07 Feb 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/02/07/no-ones-forcing-you-to-read-the-herald/ “No-one’s forcing you to read the Herald” (various sources on the internet) It’s true. No-one is forcing me to read the NZ Herald. In fact, I went forty years without reading it and no-one criticised at all. No-one forced me to move to New Zealand, either. You probably wouldn’t want the Herald to be your only source of news, but for someone living, working, and voting in Auckland, the Herald is the best available paper. 
Stochastic SVD http://notstatschat.netlify.com/2016/02/05/stochastic-svd/ Fri, 05 Feb 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/02/05/stochastic-svd/ Suppose you have an $m\times n$ matrix $A$ of rank $k$. If $\Omega$ is an $n\times k$ matrix with iid standard Gaussian entries, then $\Omega$ will have rank $k$ with probability 1, $A\Omega$ will have rank $k$ with probability 1, and so $A\Omega$ spans the range of $A$. That’s all easy. More impressively, if $A=\tilde A+\epsilon$ where $\tilde A$ has rank $k$ and $\epsilon$ has small norm, and if $\Omega$ has $k+p$ columns, $A\Omega$ spans the range of $\tilde A$ with high probability, for surprisingly small values of $p$. Is it that time of day? http://notstatschat.netlify.com/2016/01/20/is-it-that-time-of-day/ Wed, 20 Jan 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/01/20/is-it-that-time-of-day/ Wade at Minding Data wrote about a local NZ radio station One of the main criticisms of The Rock, is that even if it doesn’t play the same song between 9 – 5, it still plays the same song everyday, often at the same time. To be fair to them, it’s probably no different to the criticism hurled at any popular radio station really. Anecdotal, I used to listen to the radio as I was getting up in the morning, and I used to swear that for weeks on end, I would be getting up to the same song. Another view of the ‘nearly true’ model http://notstatschat.netlify.com/2016/01/13/another-view-of-the-nearly-true-model/ Wed, 13 Jan 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/01/13/another-view-of-the-nearly-true-model/ Ok, so to recap, we have a large model (such as ‘we know the marginal sampling probabilities’) and a small model (such as the subset of the large model with $\mathrm{logit}\,P[Y=1]=x\beta$). Under the large model, we would use the estimator $\hat\beta_{L}$, but under the small model there is a more efficient estimator $\hat\beta_S$. 
That is, under the small model \[\sqrt{n}(\hat\beta_S-\beta_0)\stackrel{d}{\to}N(0,\sigma^2)\] and \[\sqrt{n}(\hat\beta_L-\beta_0)\stackrel{d}{\to}N(0,\sigma^2+\omega^2)\] We’re worried that the small model might be slightly misspecified. What does ‘design-consistent’ even mean? http://notstatschat.netlify.com/2016/01/13/what-does-design-consistent-even-mean/ Wed, 13 Jan 2016 00:00:00 +0000 http://notstatschat.netlify.com/2016/01/13/what-does-design-consistent-even-mean/ In classical survey statistics you have a fixed finite population of size $N$ and a (possibly unequal-probability, multistage) sample of size $n$. Useful asymptotics requires an infinite sequence of populations and samples chosen so that approximation errors from neglecting terms that decrease in $n$ and $N$ are practically unimportant in the real data when they are asymptotically negligible in the infinite sequence. For ‘model consistency’ this is easy. An estimator $\hat\theta_n$ is model consistent if $\hat\theta_n\stackrel{p}{\to}\theta_0$ when the population of size $N$ is a sample from a model $P_\theta$ with parameter $\theta=\theta_0$, for all designs obeying regularity conditions to be described in the proof. Circumspice http://notstatschat.netlify.com/2015/12/31/circumspice/ Thu, 31 Dec 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/12/31/circumspice/ Norman Breslow died early this month. If you’ve had any involvement with medical statistics you have used his work. There isn’t really any need to expound on his contributions. I have a few Norm memories. In my first quarter at the University of Washington, I took BIOST 570 (generalised linear models) from Norm. One day, about halfway through the quarter, he appeared with a copy of ‘Science’ and asked me why I hadn’t been a co-author on a paper from the Sydney Blood Bank. 
Superfood sourcing http://notstatschat.netlify.com/2015/12/30/superfood-sourcing/ Wed, 30 Dec 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/12/30/superfood-sourcing/ Because reasons, I ended up looking at a website for a new superfood, “The Hawaiian Coffeeberry ®”. “The Hawaiian Coffeeberry ®”, of course, isn’t a Hawaiian plant – it’s coffee, originally from Ethiopia – but at least the marketing is Hawaiian. Well, the parts that aren’t unsourced copying from various internet sites. Here’s a description of what they think the dominant nutrients are: The chlorogenic acid bit is ok, though they don’t mention it’s also found in, eg, peaches, and eggplant, and potatoes. The Muntab Question Strikes Back http://notstatschat.netlify.com/2015/12/24/the-muntab-question-strikes-back/ Thu, 24 Dec 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/12/24/the-muntab-question-strikes-back/ Public Policy Polling, which is known for adding questions in surveys to exploit Republicans who are less informed, recently found that 30% of Republican voters would support bombing Agrabah, a fictional country in the Disney film Aladdin. On December 20, 2015, WPA Research fielded a national survey of 1,132 registered voters that found 44% of Democrats would support ….. As you know, I think the Agrabah bombing question was misleading and unhelpful to public political discourse. Potential energy and kinetic energy http://notstatschat.netlify.com/2015/12/22/potential-energy-and-kinetic-energy/ Tue, 22 Dec 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/12/22/potential-energy-and-kinetic-energy/ Q: So, I hear you have a new bike? A: Yes, it’s electric. Q: One you don’t need to pedal? A: No, you do still need to pedal, but the motor helps Q: Doesn’t that defeat the purpose of cycling? A: Well, that rather depends on what you think the purpose of cycling is. Q: Do you want to expound? A: Why, yes, thank you! 
Cycling is a great way to travel moderate distances on flat ground. Case-control estimation is more complicated than you think http://notstatschat.netlify.com/2015/12/20/case-control-estimation-is-more-complicated-than-you-think/ Sun, 20 Dec 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/12/20/case-control-estimation-is-more-complicated-than-you-think/ Well, obviously I don’t know how complicated you think it is, but it’s more complicated than I thought, and more complicated than my colleagues thought. In a case-control design you sample all the cases ($Y=1$) and a fraction $\pi_0$ of the controls ($Y=0$) from a cohort. You could fit a logistic regression with sampling weights ($1$ for cases, $1/\pi_0$ for controls). Or, you can fit an unweighted logistic regression, which should be biased except that all the bias ends up in the intercept term and doesn’t affect the regression coefficients. A simple probability problem http://notstatschat.netlify.com/2015/12/14/a-simple-probability-problem/ Mon, 14 Dec 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/12/14/a-simple-probability-problem/ Amy Hogan, a stats and maths teacher who blogs at A Little Stats, posted the following quiz on twitter: (Assuming fair dice) which has the highest probability: 1 six from 6 dice 2 sixes from 12 dice 3 sixes from 18 dice The calculations aren’t too hard even by hand, and we have pbinom() available (if we remember to check $<$ vs $\le$ conditions). In that sense the question is easy, but I was looking for an intuitive argument. The Muntab Question http://notstatschat.netlify.com/2015/12/14/the-muntab-question/ Mon, 14 Dec 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/12/14/the-muntab-question/ In a survey by Public Policy Polling, 30% of Republican-leaning and 20% of Democratic-leaning people said they supported bombing the fictional country of Agrabah. 
I’ve written on StatsChat why I think these are deliberately misleading percentages and asking the question amounts to pissing in the swimming pool of public discourse. What I want to say here is that the “Don’t Know”s aren’t getting nearly enough stick. Now, some of the “Don’t Knows” will have been successfully trolled by PPP and will think Agrabah is actually the name of somewhere in Syria or Iraq where bombing is a genuine question, and a subset of these will legitimately not have given enough attention to the question to be sure. Serious tongue-twister http://notstatschat.netlify.com/2015/11/27/serious-tongue-twister/ Fri, 27 Nov 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/11/27/serious-tongue-twister/ This is a flow-chart I put together, that may go in the documentation for the Health Research Council data monitoring committee. Obviously the questions in boxes are simplified and would need expansion in the text. Most of the work of the data monitoring committee, and the work that the study does on our behalf, is concentrated at the twice-yearly meetings. Some things, though, can’t safely be left to accumulate for six months, so there’s urgent reporting of sufficiently noteworthy clinical events to a clinician on the data monitoring committee. Poetry visualisation http://notstatschat.netlify.com/2015/11/14/poetry-visualisation/ Sat, 14 Nov 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/11/14/poetry-visualisation/ So, I was trying to write something serious last night about communicating probabilities for my talk to journalists next weekend, but it was a depressing day, so instead I did something frivolous that was vaguely related. Wisława Szymborska won the 1996 Nobel Prize for Literature. I first encountered this poem when mathematician Evelyn Lamb linked to it at JoAnne Growney’s blog Poetry with Mathematics. Should SPRINT have stopped? 
http://notstatschat.netlify.com/2015/11/10/should-sprint-have-stopped/ Tue, 10 Nov 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/11/10/should-sprint-have-stopped/ The SPRINT trial comparing standard blood pressure treatment (to 140mmHg) with intensive treatment (to 120mmHg) stopped early in September and just published this week. Hilda Bastian wrote about the problems of early stopping back in September. Now the results are out, we can see a lot more detail on what was going on. I think the right decision was made, but it’s not completely straightforward. Also, I’m only a simple country statistician, so I may be missing some issues. Prefiltering very large numbers of tests http://notstatschat.netlify.com/2015/10/19/prefiltering-very-large-numbers-of-tests/ Mon, 19 Oct 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/10/19/prefiltering-very-large-numbers-of-tests/ Genome-wide association studies involve lots of analyses. Nearly always they involve lots of tests. Also, in contrast to gene expression studies or to state-specific estimates of political attitudes or small-area disease rate estimates, a lot of the null hypotheses are effectively true. That is, most single-nucleotide polymorphisms are so close to not having any effect on anything that we might as well call it zero. Most people express this in terms of the need for stringent Type I error control; Bayesians like Matthew Stephens might talk in terms of the very low prior probability of a non-negligible effect. Double robustness http://notstatschat.netlify.com/2015/10/18/double-robustness/ Sun, 18 Oct 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/10/18/double-robustness/ “Double robust” estimation in a regression problem uses a model for the outcome $Y$ given available data $Z$ and a model for the exposure $X$ given available data $Z$. The estimates are consistent if either model is correct and efficient if they both are correct[1]. 
Described that way, double robustness doesn’t sound very useful. “All models are wrong; many models are useless”, as we can deduce from Box’s familiar aphorism, so the chance of one of two models being correct is no more than twice the chance of one model being correct: two times $4/5$ of $5/8$ of not very much[2]. Convergent evolution and NZ Bird of the Year http://notstatschat.netlify.com/2015/10/05/convergent-evolution-and-nz-bird-of-the-year/ Mon, 05 Oct 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/10/05/convergent-evolution-and-nz-bird-of-the-year/ Forest & Bird is the New Zealand equivalent of the UK’s Royal Society for the Protection of Birds, or the USA’s Audubon Society. Each year, they hold a “Bird of the Year” competition, to get more publicity for NZ birds. The competition is made possible by the relatively small number of bird species in New Zealand, partly because it’s an isolated set of islands, and partly because a depressing number of the birds are ex-species. NZ Flag Referendum pseudorandom numbers http://notstatschat.netlify.com/2015/09/22/nz-flag-referendum-pseudorandom-numbers/ Tue, 22 Sep 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/09/22/nz-flag-referendum-pseudorandom-numbers/ The counting process for the NZ Flag Referendum needs some way to break ties. The Act defines a way to generate pseudo-random numbers (Schedule 4, clauses 14 to 22). Anyone in computational statistics who reads this will recognise some of the magic numbers; for the rest of you, here’s what’s going on. The Act almost defines the Wichmann-Hill PRNG, a respectable, if old-fashioned algorithm that was the original RNG in R. Oranges and lemons http://notstatschat.netlify.com/2015/09/21/oranges-and-lemons/ Mon, 21 Sep 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/09/21/oranges-and-lemons/ One of the basic principles of applied statistics is that the data don’t tell you what the question is. 
For example, the distribution of a variable doesn’t tell you what summary statistic you are interested in. For mean vs median, a good example is binary variables. If a variable (like the indicator variable for dying in a car crash) is 0 for most people and 1 for a few people, the variable is very highly skewed but the mean (probability) is a much more useful summary statistic than the median (zero). (high-dimensional) Space is Big. http://notstatschat.netlify.com/2015/09/14/high-dimensional-space-is-big./ Mon, 14 Sep 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/09/14/high-dimensional-space-is-big./ Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist, but that’s just peanuts to space. Hitchhikers Guide to the Galaxy There’s a simple simulation that I used in stat computing class last year and for Rob Hyndman’s working group last week. Simulate data uniformly on a $p$-dimensional hypercube $[0,\,1]^p$ and compute nearest-neighbour distances. Good reasons for assuming a spherical cow http://notstatschat.netlify.com/2015/09/14/good-reasons-for-assuming-a-spherical-cow/ Mon, 14 Sep 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/09/14/good-reasons-for-assuming-a-spherical-cow/ Talks and papers in statistics often have what purports to be an application but with assumptions that look implausible. That can be fine, but you need to know, and tell us, why you’re making those assumptions. If I ask “Why are you assuming a spherical cow?”, here are some possible good answers: Honest theory: “It’s not really about cows. Greeble’s Conjecture is the leading open question in heterotrophic morphon theory. Net Reclassification Index: surprisingly weird. 
http://notstatschat.netlify.com/2015/08/29/net-reclassification-index-surprisingly-weird./ Sat, 29 Aug 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/08/29/net-reclassification-index-surprisingly-weird./ Attention Conservation Notice: Long. Really long. No, longer than that. Here: read the original instead. The Net Reclassification Index (NRI) is a summary of improvement in prediction when new information is added, and an intuitively plausible one. Suppose that we’re trying to predict $Y=1$ vs $Y=0$, and that for person $i$ we have an old predicted probability $\hat p_{\textrm{old}}(i)$ and a new predicted probability $\hat p_{\textrm{new}}(i)$. We’d hope that the probabilities for cases ($Y=1$) go up and the probabilities for controls ($Y=0$) go down when more information is used. A conservation tragedy http://notstatschat.netlify.com/2015/08/20/a-conservation-tragedy/ Thu, 20 Aug 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/08/20/a-conservation-tragedy/ The NZ Herald is reporting that hunters taking part in a pūkeko cull on one of the islands near Auckland killed four takahē. Pūkeko (Porphyrio porphyrio) are the closest relatives of takahē (Porphyrio hochstetteri), but they aren’t all that close. Takahē are New Zealand natives, which were forced out of their wetland habitat to alpine grasslands by the Māori, and then nearly wiped out by the stoats and red deer introduced by Europeans. Colour names from XKCD in R http://notstatschat.netlify.com/2015/08/20/colour-names-from-xkcd-in-r/ Thu, 20 Aug 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/08/20/colour-names-from-xkcd-in-r/ Randall Munroe at XKCD did a color names survey a few years ago, and published a list of about a thousand colour names whose RGB values (averaged across his readers’ monitors) could be fairly reliably estimated. I have finally got around to turning them into an R package. It’s only on GitHub so far. 
The functions are xcolors(max_rank=-1): List the top (most commonly given) max_rank color names; analogous to colors() Fox fails statistics; does NYT? http://notstatschat.netlify.com/2015/08/06/fox-fails-statistics-does-nyt/ Thu, 06 Aug 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/08/06/fox-fails-statistics-does-nyt/ From Fox Each poll has a different margin of error, and averaging requires a distinct test of statistical significance. Given the over 2,400 interviews contained within the five polls, from a purely statistical perspective it is at least 90% likely that the tenth place Kasich is ahead the eleventh place Perry. The Upshot blog at the New York Times correctly points out that if this is a p-value, that’s not the way to interpret it. JSM2015: notes on Seattle from an ex-resident http://notstatschat.netlify.com/2015/08/05/jsm2015-notes-on-seattle-from-an-ex-resident/ Wed, 05 Aug 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/08/05/jsm2015-notes-on-seattle-from-an-ex-resident/ Getting from the airport: Public transport. No question. Well, unless you have significant mobility problems, in which case why are you looking at travel advice from random strangers on the internet? Take the light rail to the end of the line (Westlake), then catch any bus from the same platform one stop to Convention Place. The alternatives are much more expensive, and have a fair chance of being slower. 
Pianos, heaps, and ethics of randomisation http://notstatschat.netlify.com/2015/08/01/pianos-heaps-and-ethics-of-randomisation/ Sat, 01 Aug 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/08/01/pianos-heaps-and-ethics-of-randomisation/ Suppose you could make the following observations: Ling (零) was a pianist Every pianist has a favourite student Different pianists have different favourite students Ling was not the favourite student of any pianist1 Anything that Ling knew and that every pianist teaches to his favourite student ends up known by everyone in the Ling School of Piano (it’s like martial arts) If the first three observations continue to be true, the Ling School of Piano will obviously go on forever. Te Wiki o Te Reo Māori http://notstatschat.netlify.com/2015/07/27/te-wiki-o-te-reo-m%C4%81ori/ Mon, 27 Jul 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/07/27/te-wiki-o-te-reo-m%C4%81ori/ Scene: A wetland by high-country stream, Te Wai Pounamu. Rangers from Te Papa Atawhai pick up a bedraggled hunter. “How did you get up here? Where’s your boat” “Not a boat. Over there.” He gestures “Yeah? So why are you out here in the swamp?” “Can’t go back.” “What’s the matter, bro?” asks the bigger ranger, who’s obviously been chosen as the good cop. “They got out of the net,” the hunter said, showing his bleeding hand is missing a finger stringsAsFactors = <sigh> http://notstatschat.netlify.com/2015/07/25/stringsasfactors-sigh/ Sat, 25 Jul 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/07/25/stringsasfactors-sigh/ Problems with R can be divided into several groups: R has the defects of its virtues: pass-by-value and deep copying make the language easy to learn, but waste a lot of memory. R is old: it’s not written in C++ and doesn’t have a 64-bit integer type because those weren’t things in 1992 Base R (and S before it) was developed for interactive use and then got extended into computational infrastructure. 
Pi day http://notstatschat.netlify.com/2015/07/02/pi-day/ Thu, 02 Jul 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/07/02/pi-day/ Pi day is celebrated on March 14 in all the countries that use the MM/DD/YYYY date format (ie, the USA). Pi Approximation day is celebrated in the rest of the world on July 22. I’m proposing today for another one: π continued fraction day. Like the 22/7 festival it doesn’t depend on using base 10, and like American Pi day it is extensible when the stars align correctly. The continued fraction expansion of π is A much-needed gap http://notstatschat.netlify.com/2015/06/20/a-much-needed-gap/ Sat, 20 Jun 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/06/20/a-much-needed-gap/ There are a surprisingly large number of research papers that use the Shapiro-Wilk normality test on data from NHANES or the British Household Panel Survey, two large multi-stage surveys. This is a bad idea for multiple reasons Testing for normality is typically a bad idea. It’s unusual for Normal/non-Normal to be an interesting question. That’s in contrast to testing for a power law in skewed data, where apparently many people are interested in the question, though fewer of them in how to answer it. Countermatching http://notstatschat.netlify.com/2015/06/03/countermatching/ Wed, 03 Jun 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/06/03/countermatching/ Countermatching is a simple case-control sampling mechanism that makes people uncomfortable when they first encounter it. Get ready. Suppose you want to study the effect of a relatively rare exposure (sufficiently high dose radiation to the heart) on a relatively rare outcome (heart failure in breast cancer survivors). If you just took a random sample of the population there would be very few breast cancer survivors, so you work with a cohort of breast-cancer survivors. 
Zero-inflated Poisson from complex samples http://notstatschat.netlify.com/2015/05/26/zero-inflated-poisson-from-complex-samples/ Tue, 26 May 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/05/26/zero-inflated-poisson-from-complex-samples/ A very long post about how to add models to the survey package; specifically, the zero-inflated Poisson. The Zero-Inflated Poisson model is a model for count data with excess zeros. The response distribution is a mixture of a point mass at zero and a Poisson distribution: if $Z$ is Bernoulli with probability $1-p_0$ and $P$ is Poisson with mean $\lambda$ then \[Y=ZP\] is zero-inflated Poisson. The ZIP is a latent-class model; we can have $Y=0$ either because $Z=0$ or because $P=0$. Call me, Ishmael http://notstatschat.netlify.com/2015/05/20/call-me-ishmael/ Wed, 20 May 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/05/20/call-me-ishmael/ Making small changes in text to escape plagiarism-detection software is challenging but not really difficult. The relationship between your text and the urtext in the software is precise and syntactic; the software doesn’t know about ideas. Making small changes in a text that change or obscure the meaning is also easy, and is why copyeditors exist. Making small changes in a text that give an interesting new meaning is enormously harder, as in the opening of Peter de Vries’ “The Vale of Laughter”, quoted as the title of this post. Superefficiency http://notstatschat.netlify.com/2015/05/12/superefficiency/ Tue, 12 May 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/05/12/superefficiency/ If you have $X_1,\ldots,X_n$ independent from an $N(\mu,1)$ distribution you don’t have to think too hard to work out that $\bar X_n$, the sample mean, is the right estimator of $\mu$ (unless you have quite detailed prior knowledge). As people who have taken an advanced course in mathematical statistics will know, there is a famous estimator that appears to do better. 
Hodges’ estimator is given by $H_n=\bar X_n$ if $|\bar X_n|>n^{-1/4}$, and $H_n=0$ if $|\bar X_n|\leq n^{-1/4}$. Precise answers, but not necessarily to the right question http://notstatschat.netlify.com/2015/05/04/precise-answers-but-not-necessarily-to-the-right-question/ Mon, 04 May 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/05/04/precise-answers-but-not-necessarily-to-the-right-question/ Nicholas Schork has a commentary at Nature about precision medicine, arguing in favour of n-of-1 trials. These are the extreme version of crossover trials: you randomise each individual to a long sequence of periods on each of two treatments and see which they do better on. The idea makes sense: you get genuinely individual-specific results for people in the study, and the ability to aggregate them to generalise to people not in the study. What’s the right proof of the Continuous Mapping Theorem? http://notstatschat.netlify.com/2015/05/03/whats-the-right-proof-of-the-continuous-mapping-theorem/ Sun, 03 May 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/05/03/whats-the-right-proof-of-the-continuous-mapping-theorem/ The Continuous Mapping Theorem says that if $X_n\stackrel{d}{\to}X$ and $f$ is continuous except at a set of points with zero probability under $X$, then $f(X_n)\stackrel{d}{\to}f(X)$. As David Pollard points out, it should be called the almost-everywhere-continuous mapping theorem, because the ability to have discontinuities is important in applications and is the only thing making the proof non-trivial. 
There are three proofs that I’m aware of Mann and Wald used the ‘pointwise convergence of cdfs’ definition of convergence in distribution, which gives a painful proof Eppur si muove http://notstatschat.netlify.com/2015/04/02/eppur-si-muove/ Thu, 02 Apr 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/04/02/eppur-si-muove/ “The rules for winning the science competition focus on a small number of measures that incentivize poor practice” Hilda Bastian quoting Ottoline Leyser. It’s all true, and more and worse besides. Researchers are driven by the incentives for high-impact publication; p-value hacking makes results seem more convincing than they are; trials use surrogate outcomes; glamour journals publish insufficiently-checked linkbait; predatory online journals will do anything for money; change and decay in all around we see. Pharmacy ethics http://notstatschat.netlify.com/2015/03/29/pharmacy-ethics/ Sun, 29 Mar 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/03/29/pharmacy-ethics/ ‘You have heard that it was said to those of ancient times, “You shall not murder”; and “whoever murders shall be liable to judgement.” But I say to you that if you are angry with a brother or sister, you will be liable to judgement. Matthew 5:21-22 In practice, we have to distinguish. Whoever murders is liable to judgement, but being angry isn’t enough. In the same way, formal codes of professional ethics come in two versions: the aspirational code that describes the way we want the profession to be, and the legalistic code that describes what will get you kicked out. 
Paper helicopters at a science fair http://notstatschat.netlify.com/2015/03/28/paper-helicopters-at-a-science-fair/ Sat, 28 Mar 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/03/28/paper-helicopters-at-a-science-fair/ Today, we ran Box’s paper helicopter experimental design example at the Science Street Fair sponsored by the NZ Association of Scientists and hosted by the Museum of Transport and Technology. It went fairly well. In particular, the younger kids really liked dropping paper helicopters and comparing different designs and we got in a few useful discussions of experimental design with adults – mostly school teachers. Things to note: Use a photocopier and pre-printed design template, such as the one from the SixSigma package for R This lets you also produce a 2-up version, giving an extra interesting design factor. What does measurability mean? http://notstatschat.netlify.com/2015/03/07/what-does-measurability-mean/ Sat, 07 Mar 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/03/07/what-does-measurability-mean/ Attention conservation notice: A long, meandering, and inconclusive attempt to explain why you perhaps shouldn’t worry about a technical issue you almost certainly weren’t worrying about already. Mathematical proofs in statistics are, in some formal sense, useless. That is, they formally have conditions such as finite moments, boundedness, differentiability or stochastic equicontinuity that either apply to all things in the real world or to none. The proofs are also often formally about infinite sequences; these don’t crop up all that often in data analysis. 
How hard did you look: equivalence and non-inferiority http://notstatschat.netlify.com/2015/02/27/how-hard-did-you-look-equivalence-and-non-inferiority/ Fri, 27 Feb 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/02/27/how-hard-did-you-look-equivalence-and-non-inferiority/ I usually don’t read nutripharma articles outside the mainstream media, but someone tweeted a link about saffron, which apparently cures everything. The last straw was a line beginning “Saffron, a major component of the Mediterranean diet…” Saffron can’t really be described as a major component of anything, even risotto milanese, and it’s not unique to the Mediterranean region: it’s a well-known spice in India, Pakistan, Iran. And it’s not just the well-known places: England produced saffron before Italy produced tomatoes. Clinically proven ingredients http://notstatschat.netlify.com/2015/02/26/clinically-proven-ingredients/ Thu, 26 Feb 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/02/26/clinically-proven-ingredients/ NZ golf prodigy Lydia Ko has a sponsorship deal with a company that sells special jetlag-reducing water. She obviously knows how this sort of thing works, and what she said to the Herald was nicely crafted Ms Ko said she was excited to have the support of 1Above. “I haven’t really taken it for a long-haul flight before but I’ve seen some of the results and everything that comes with it and I have heard great things,” she said. Science and statistical inference http://notstatschat.netlify.com/2015/02/17/science-and-statistical-inference/ Tue, 17 Feb 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/02/17/science-and-statistical-inference/ Q: There have been a lot of papers recently about the spike in p-values just below 0.05, haven’t there? A: Yes. A bit depressing. But there’s a new analysis that says it’s ok. Q: Really? That’s great! 
A: Yes, Daniel Lakens shows that you can explain the recent increase in just-significant p-values, and that “data does not provide any indication of an increase in questionable research practices” Q: And is his modelling correct? Assumptions and testing http://notstatschat.netlify.com/2015/01/15/assumptions-and-testing/ Thu, 15 Jan 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/01/15/assumptions-and-testing/ My attention was drawn on Twitter to an old (1999) paper in The American Statistician, “Different Outcomes of the Wilcoxon-Mann-Whitney Test from Different Statistics Packages.” The authors looked at 11 statistics packages and found they didn’t always give the same result for the Wilcoxon/Mann-Whitney test. The big problem was handling of tied observations. Here are their example data: The authors say “It is obvious that the data resulting from the experiment could not be analyzed by the Student’s t-test. A transitive test is a test for a univariate parameter http://notstatschat.netlify.com/2015/01/14/a-transitive-test-is-a-test-for-a-univariate-parameter/ Wed, 14 Jan 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/01/14/a-transitive-test-is-a-test-for-a-univariate-parameter/ As you know, rank tests can be non-transitive: they can have the rock-paper-scissors property. Tests that are for a single real-valued summary statistic (eg a test comparing means or medians or variance) are always transitive, because they are just comparing a single number, and ordering on numbers is transitive. The converse is almost obviously almost true: if you have a transitive test, it almost has to be a test for a single real-valued summary statistic. New header picture http://notstatschat.netlify.com/2015/01/12/new-header-picture/ Mon, 12 Jan 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/01/12/new-header-picture/ As you can see, there’s a new header picture to replace the generic tumblr theme. 
It’s a pair of Superb Fairywrens, aka blue wrens (Malurus cyaneus), one of my favourite Melbourne birds. They’re now quite common along the Melbourne coastline from St Kilda down along the bay. For people in the Northern hemisphere: these are completely unrelated to the wrens you are familiar with, and are also unrelated to the New Zealand wrens. Tomato, tomato http://notstatschat.netlify.com/2015/01/12/tomato-tomato/ Mon, 12 Jan 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/01/12/tomato-tomato/ There are two* great commandments for conference session chairs: (1) You shall adhere to the schedule with all your heart and all your mind and all your strength; (2) You shall pronounce the speakers’ names approximately as they do themselves. For show-biz award ceremonies the first commandment doesn’t apply, but the second still does. In order to pronounce someone’s name correctly, you need to ask for the correct pronunciation and have some way of remembering it, such as writing it down phonetically. Different questions can have different answers http://notstatschat.netlify.com/2015/01/11/different-questions-can-have-different-answers/ Sun, 11 Jan 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/01/11/different-questions-can-have-different-answers/ The Slate Money podcast [1] had an item on Manhattan apartment prices. The mean price last quarter was \$1.7 million and the median was \$0.98 million. Firstly, that’s a lot of money. Secondly, the mean is a lot bigger than the median. The real point, though, is that the mean is a record, up on the previous peak (in 2008) by \$120,000. The median is down from 2008, by \$15,000.
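As a small illustration of how the mean and the median in the item above can move in opposite directions, here is a minimal sketch. The prices are invented for the example (they are not the Manhattan figures from the podcast), and the variable names are hypothetical: a few very expensive sales pull the mean up even while the typical sale gets slightly cheaper.

```python
from statistics import mean, median

# Hypothetical sale prices in millions of dollars; NOT the real Manhattan data.
prices_2008 = [0.6, 0.9, 1.0, 1.1, 3.0]
prices_now = [0.5, 0.8, 0.98, 1.2, 5.0]

# The mean is sensitive to the single very expensive sale at the top end.
mean_2008, mean_now = mean(prices_2008), mean(prices_now)

# The median only depends on the middle of the ordered prices.
median_2008, median_now = median(prices_2008), median(prices_now)

print(f"mean:   {mean_2008:.2f} -> {mean_now:.2f}")
print(f"median: {median_2008:.2f} -> {median_now:.2f}")
```

In this made-up example the mean rises while the median falls, so "are prices at a record?" has a different answer depending on which summary the question is really about.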
Variation explained and log transformation http://notstatschat.netlify.com/2015/01/03/variation-explained-and-log-transformation/ Sat, 03 Jan 2015 00:00:00 +0000 http://notstatschat.netlify.com/2015/01/03/variation-explained-and-log-transformation/ This post is technical details for one at StatsChat on the Johns Hopkins “two-thirds of cancer is bad luck” paper. I don’t have any real opinions on the conclusion: it’s clear that unforced errors in DNA copying will cause some cancers, and it’s not obvious how many. The technical problem with the paper (or at least with its publicity) is that the ‘proportion of variation explained’ was estimated for log risk and quoted as “two-thirds of cancers are due to bad luck”. How not to treat Ebola http://notstatschat.netlify.com/2014/12/23/how-not-to-treat-ebola/ Tue, 23 Dec 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/12/23/how-not-to-treat-ebola/ From the Guardian, via Mark Henderson of the Wellcome Trust: Ebola patients at a treatment centre in Sierra Leone have been given a heart drug that is untested against the virus in animals and humans, a move that has been deemed reckless by one senior scientist and has prompted UK medical staff at the centre to leave. Ebola is a problem for drug testing. You don’t want to leave people untreated, but you do want to find out as fast as possible what works. Citations: credit or blame http://notstatschat.netlify.com/2014/12/14/citations-credit-or-blame/ Sun, 14 Dec 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/12/14/citations-credit-or-blame/ Katie Hinde at ‘Mammals Suck’ writes: Only cite papers that you have read! DO NOT cite papers based on another publication’s report of them. Because every time that happens, a science fairy dies. That’s an excellent principle. So why do I have a paper in press that cites a paper I haven’t read? There are two reasons to cite a paper: as evidence for a claim, or to give credit to the authors for their research.
What science should everyone know? http://notstatschat.netlify.com/2014/12/08/what-science-should-everyone-know/ Mon, 08 Dec 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/12/08/what-science-should-everyone-know/ In response to the question “How much science knowledge should the average person have or should we just encourage people to ask questions?” @NaomiShadbolt @petergnz Basics: Atoms; Evolution; “the lights in the sky are suns”; Randomisation; Conservation laws. And ask questions. — Thomas Lumley (@tslumley) December 8, 2014 Expanding on this: Atoms: everything is made of a very large but not infinite number of definite, basically indivisible, pieces, and there are very few different types (about 100). It depends on what you mean by 'cost' http://notstatschat.netlify.com/2014/11/30/it-depends-on-what-you-mean-by-cost/ Sun, 30 Nov 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/11/30/it-depends-on-what-you-mean-by-cost/ The Tufts Center for the Study of Drug Development has a new cost estimate out: Cost to Develop and Win Marketing Approval for a New Drug Is $2.6 Billion. The figure is probably fairly accurate as an estimate of what it’s trying to estimate, but it gets quoted in other contexts, so I think it’s worth looking at the number a piece at a time. The Tufts researchers haven’t provided enough information to do this, so I’m relying on estimates from Bruce Booth (who also has a spreadsheet that you can use for sensitivity analyses). This is just to say http://notstatschat.netlify.com/2014/11/06/this-is-just-to-say/ Thu, 06 Nov 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/11/06/this-is-just-to-say/ The plums, which you stored there on ice, I have eaten; they went in a trice. If you meant them to last For a morning repast Then I’m sorry, but boy were they nice. 
or Some say the plums will end in tarts Some say on ice From what I’ve eaten ’round these parts I hold with those who favor tarts But if they had to vanish first I think I know enough of guilt A people set apart http://notstatschat.netlify.com/2014/11/05/a-people-set-apart/ Wed, 05 Nov 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/11/05/a-people-set-apart/ There’s a conference coming up at Deakin University in Melbourne, on energy drinks. The unusual aspect of the conference is that no-one who has received industry funding is welcome. Obviously the energy drink industry aren’t happy about this. I couldn’t give a fsck about their hurt feelings, but I hope this sort of policy doesn’t spread. Now, I’m not completely naïve about the sorts of things some industry groups will do when there’s a lot of money at stake. Miasma and Contagion http://notstatschat.netlify.com/2014/10/25/miasma-and-contagion/ Sat, 25 Oct 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/10/25/miasma-and-contagion/ Scientists have a nasty habit of taking ordinary English words, turning them into technical terms, and then insisting that the ordinary use is Just Wrong. ‘Organic’, which I’ve written about before, is a good example. On the other hand, sometimes the scientists are right. I complained on Twitter last night about the phrase ‘meningococcal virus’ in a Herald opinion piece on state housing, and I have previously complained about the ‘Psa virus’ for the bacterium Pseudomonas syringae pv. Semiparametric efficiency and nearly-true models http://notstatschat.netlify.com/2014/10/25/semiparametric-efficiency-and-nearly-true-models/ Sat, 25 Oct 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/10/25/semiparametric-efficiency-and-nearly-true-models/ Suppose you have $N$ people with some variables measured, and you choose a subset of $n$ to measure additional variables. 
I’m going to assume the probability $\pi_i$ that you measure the additional variables on person $i$ is known, so it has to be a setting where non-response isn’t an issue – eg, choosing which frozen blood samples to analyse, or which free-text questionnaire responses to code, or which medical records to pull for abstraction. Broman's Socks and the Nature of Scientific Reporting http://notstatschat.netlify.com/2014/10/20/bromans-socks-and-the-nature-of-scientific-reporting/ Mon, 20 Oct 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/10/20/bromans-socks-and-the-nature-of-scientific-reporting/ Rasmus Bååth wrote a post using Approximate Bayesian Computation to estimate a posterior distribution for Karl’s socks. What he didn’t consider was the impact of publication bias. In order for us to see the tweet, it was not only necessary that Karl’s first 11 socks were distinct, it was also necessary that he found this remarkable, and, probably, that no-one he follows on Twitter had made a similar laundry-related observation at any recent time. Is it good or bad when confounding adjustment makes no difference? http://notstatschat.netlify.com/2014/09/24/is-it-good-or-bad-when-confounding-adjustment-makes-no-difference/ Wed, 24 Sep 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/09/24/is-it-good-or-bad-when-confounding-adjustment-makes-no-difference/ There’s a new paper out in J Epi Community Health, looking at the relationship between perceived job insecurity and incident asthma. NHS ‘Behind the Headlines’ covers it well. One of the interesting things about the paper is that the crude relative risk between above/below 50% estimated risk of losing your job is 1.61, and the relative risks after adjustment in three increasingly-complex models are 1.58, 1.62, and 1.61. That is, the adjustment for confounding has no impact at all. 
On dialect http://notstatschat.netlify.com/2014/08/29/on-dialect/ Fri, 29 Aug 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/08/29/on-dialect/ In New Zealand, ‘radiata’ and ‘macrocarpa’ are accepted common names for two widely planted non-native conifers: Pinus radiata and Cupressus macrocarpa, known in their native US as ‘Monterey pine’ and ‘Monterey cypress’ respectively. It’s unusual for the specific epithet of a plant to become the common name. There are plenty of examples of the generic name becoming the common name, from ‘bougainvillea’ to ‘wisteria’. There are even plenty of examples where a former generic name has stuck as the common name after the botanists have renamed the plant to, eg, Pelargonium, Hippeastrum, or Corymbia. Rhetorical sensitivity analysis http://notstatschat.netlify.com/2014/08/29/rhetorical-sensitivity-analysis/ Fri, 29 Aug 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/08/29/rhetorical-sensitivity-analysis/ “The ethanol in alcohol is a group one carcinogen, like asbestos,” Prof. Doug Sellman, Otago University (July 2013). Professor Sellman is correct, of course. What’s more, alcohol is even an important cause of cancer. From the viewpoint of rhetoric and risk communication it’s still interesting to see how the effect of the sentence changes when other familiar IARC Group 1 carcinogens are substituted for ‘asbestos’: alcohol is a group one carcinogen, like sunlight; alcohol is a group one carcinogen, like birth-control pills; alcohol is a group one carcinogen, like plutonium; alcohol is a group one carcinogen, like tobacco; alcohol is a group one carcinogen, like arsenic; alcohol is a group one carcinogen, like wood dust. None of these really has quite the same rhetorical impact; the only one that comes close is ‘tobacco’.
O necessary sinpi http://notstatschat.netlify.com/2014/08/27/o-necessary-sinpi/ Wed, 27 Aug 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/08/27/o-necessary-sinpi/ The R help page for sin, cos, and tan, mentions functions sinpi, cospi, tanpi, “accurate for x which are multiples of a half.” This struck someone I know as strange. I’ve been thinking about this sort of thing recently while teaching Stat Computing, so here’s some background. If you’re a mathematician, $\sin x$ is given by a power series \[\sin x = x - \frac{x^3}{3!}+\frac{x^5}{5!} -\frac{x^7}{7!} +-\cdots\] This series converges for all $x$, and so converges uniformly on any finite interval. Taking meta-analysis heterogeneity seriously http://notstatschat.netlify.com/2014/08/24/taking-meta-analysis-heterogeneity-seriously/ Sun, 24 Aug 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/08/24/taking-meta-analysis-heterogeneity-seriously/ In fixed-effects meta-analysis of a set of trials the goal is to find a weighted average of the true treatment effects in those trials (whatever they might be). The results are summarised by the weighted average and a confidence interval reflecting its sampling uncertainty. In random-effects meta-analysis the trials are modelled as an exchangeable sample, implying that they can be treated as coming independently from some latent distribution of true treatment effects. Survey package update http://notstatschat.netlify.com/2014/08/15/survey-package-update/ Fri, 15 Aug 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/08/15/survey-package-update/ There’s a new version, 3.30-3, of the ‘survey’ package for R. It’s got quite a lot of new stuff: AIC and BIC for generalised linear models Rank tests for more than two groups Logrank and generalised logrank tests Since I’m known for a lack of enthusiasm about any of these techniques, why are they in the package? Am I just enabling? Well, AIC and BIC are interesting, and I’ll say more below. 
Feynman and the Suck Fairy http://notstatschat.netlify.com/2014/07/12/feynman-and-the-suck-fairy/ Sat, 12 Jul 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/07/12/feynman-and-the-suck-fairy/ There’s been a bit of…discussion…about Richard Feynman recently. In one Twitter conversation, Richard Easther said he had been thinking of using Feynman’s commencement address “Cargo Cult Science” with a first-year physics class, and had decided against. I was a bit surprised. It’s been a long time since I read that piece, but I couldn’t remember anything objectionable in it. So I re-read it. It’s still really good in a lot of ways. Herd Immunity simulations http://notstatschat.netlify.com/2014/06/01/herd-immunity-simulations/ Sun, 01 Jun 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/06/01/herd-immunity-simulations/ Especially for vaccines that are not 100% effective, a large chunk of the benefit comes from ‘herd immunity’, the fact that incomplete vaccination makes it harder for an epidemic to get started and spread. Increasing the proportion of people vaccinated helps those people, and it also helps the people who aren’t vaccinated. Here’s a set of simulations (code, needs FNN package and R) that show the effect. There is a simulated population of 10,000 people living on a square (actually, a doughnut, since it wraps around). Monotonicity and smoothness http://notstatschat.netlify.com/2014/05/22/monotonicity-and-smoothness/ Thu, 22 May 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/05/22/monotonicity-and-smoothness/ Andrew Gelman has an interesting discussion of monotonicity as a modelling constraint. I basically agree with what he says, but since my first real statistical research (my M.Sc. thesis) was on order restrictions I thought I’d write about a related aspect of the problem. Assuming that a relationship is monotone sounds like a very strong assumption, and therefore one that you’d expect to gain a lot by making. 
Asymptotically, this isn’t true. Anchoring bias http://notstatschat.netlify.com/2014/05/18/anchoring-bias/ Sun, 18 May 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/05/18/anchoring-bias/ Anchoring bias: high school students asked to add up the digits in their phone number and to estimate how many countries there are in Africa. (phew, it worked) (I did delete one data point as non-responsive: estimated number of countries in Africa was 1) (with adults I’d use last two digits of phone number, but with teenage girls I thought a bit more information-hiding was appropriate) Randomisation without consent http://notstatschat.netlify.com/2014/05/14/randomisation-without-consent/ Wed, 14 May 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/05/14/randomisation-without-consent/ The issue of randomisation without consent has come up in New Zealand. Because I’m on the HRC Data Monitoring Core Committee, which monitors some NZ clinical trials I don’t want to say much about any current NZ clinical trials, even ones we’re not monitoring. I do want to talk about the principle. The always-useful NZ Science Media Centre has rounded up a couple of bioethicists on the topic, and you should read what they say. Einstein, Wikiquote, and fact checking http://notstatschat.netlify.com/2014/03/14/einstein-wikiquote-and-fact-checking/ Fri, 14 Mar 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/03/14/einstein-wikiquote-and-fact-checking/ It’s not only Pi Day in the USA (3/14, they write dates backwards), it’s Einstein’s 135th birthday. Einstein, like Mark Twain, 孔夫子, Churchill, Disraeli, and the Chinese proverbs, is a quote magnet. He said many quotable things, and even more are attributed to him. The NZ Herald has a list of ten Einstein quotes. Annoyingly, none of them say where or when they were said. So I did the absolutely minimal level of fact checking. 
My likelihood depends on your frequency properties http://notstatschat.netlify.com/2014/03/04/my-likelihood-depends-on-your-frequency-properties/ Tue, 04 Mar 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/03/04/my-likelihood-depends-on-your-frequency-properties/ The likelihood principle states that given two hypotheses $H_0$ and $H_1$ and data $X$, all the evidence regarding which hypothesis is true is contained in the likelihood ratio \[LR=\frac{P[X|H_1]}{P[X|H_0]}.\] One of the fundamentals of scientific research is the idea of scientific publication, which allows other researchers to form their own conclusions based on your results and those of others. The data available to other researchers, and thus the likelihood on which they rely for inference, depends on your publication behaviour. Chemical nerdview http://notstatschat.netlify.com/2014/02/25/chemical-nerdview/ Tue, 25 Feb 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/02/25/chemical-nerdview/ One of Stephen J. Gould’s essays contains the admission I confess that I have always been greatly amused by the term primate, used in its ecclesiastical sense as “an archbishop … holding the first place among the bishops of a province.” My merriment must be shared by all zoologists, for primates, to us, are monkeys and apes–members of the order Primates. … But this amusement is silly, parochial, and misguided. This is a wug. Now you have two of them. http://notstatschat.netlify.com/2014/02/09/this-is-a-wug.-now-you-have-two-of-them./ Sun, 09 Feb 2014 00:00:00 +0000 http://notstatschat.netlify.com/2014/02/09/this-is-a-wug.-now-you-have-two-of-them./ Three words that used to be plurals, and are changing in three different ways: Candelabra used to be the plural of candelabrum, a multiple-armed candlestick holder. There are very few other English words ending in ‘brum’, and most of the words ending in ‘bra’ are singular (e.g. vertebra, penumbra, cobra, zebra, sabra, bra). 
Over time, candelabra has been used more and more often as the singular, perhaps most famously in the biographical movie “Behind the Candelabra” about Liberace; the corresponding plural is candelabras. At risk of vanishing http://notstatschat.netlify.com/2013/12/14/at-risk-of-vanishing/ Sat, 14 Dec 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/12/14/at-risk-of-vanishing/ A degree in science, in addition to specific facts about squid, neutrinos, or palladium-catalysed cross-couplings, should teach students what to do with questions about the world. In particular, they should learn to think about what the implications would be of each answer to the question, and know how we might use these implications to rule out some of the answers and reduce our uncertainty about others. A degree in the humanities, in addition to specific facts about tenses in French, resource-allocation procedures in village societies, or the development of the Sangam literature, should teach students what to do with questions about the world. Moving the goalposts? http://notstatschat.netlify.com/2013/11/15/moving-the-goalposts/ Fri, 15 Nov 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/11/15/moving-the-goalposts/ There’s a paper in PNAS suggesting that lots of published scientific associations are likely to be false, and that Bayesian considerations imply a p-value threshold of 0.005 instead of 0.05 would be good. It’s had an impact outside the statistical world, eg, with a post on the blog Ars Technica. The motivation for the PNAS paper is a statistics paper showing how to relate p-values to Bayes Factors in some tests. From labhacks: the $25 scrunchable scientific poster http://notstatschat.netlify.com/2013/11/04/from-labhacks-the-25-scrunchable-scientific-poster/ Mon, 04 Nov 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/11/04/from-labhacks-the-25-scrunchable-scientific-poster/ labhacks: image Printed on Spoonflower performance knit at 300 dpi. 
36” x 56”, vivid colors, no unraveling, and minimal wrinkling, even after being stuffed in a backpack. Hangs straight with about 8 pins. Print cost is $22 with $3 shipping. A diversity of gifts, but the same spirit http://notstatschat.netlify.com/2013/10/30/a-diversity-of-gifts-but-the-same-spirit/ Wed, 30 Oct 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/10/30/a-diversity-of-gifts-but-the-same-spirit/ Peter Green used this line (from I Corinthians) for his Royal Statistical Society Presidential Address in 2003, which anyone interested in the future of statistics should read. I’ve been planning to steal it ever since then, and the time seems right. Roger, Jeff, and Rafa at Simply Statistics are holding an unconference on the future of statistics, some time before dawn tomorrow morning New Zealand time. I probably won’t be attending, but if you’re in a more compatible time zone it promises to be interesting. Interaction: 'real' and statistical http://notstatschat.netlify.com/2013/10/27/interaction-real-and-statistical/ Sun, 27 Oct 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/10/27/interaction-real-and-statistical/ Confounding is a model-independent property of nature: if doing A has a particular effect on Y, it is objectively either true or untrue that the conditional distributions of Y given A and not A match that particular effect. Interaction or effect modification is scale-dependent: you ask “is the effect of A on X in the presence of B the same as the effect of A on X in the absence of B?” Barren proxies http://notstatschat.netlify.com/2013/10/20/barren-proxies/ Sun, 20 Oct 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/10/20/barren-proxies/ In causal inference it is often the case that you can’t obtain a confounding variable directly; you can only measure something that it affects.
Judea Pearl correctly points out the danger of conditioning on a ‘barren proxy’ for a confounder, in situations like this one: A confounds the effect of B on C. D is affected by A but does not directly affect either B or C, so it is a ‘barren proxy’ for A. Google completions and sexism http://notstatschat.netlify.com/2013/10/18/google-completions-and-sexism/ Fri, 18 Oct 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/10/18/google-completions-and-sexism/ The new ads produced by UN Women illustrating widespread sexism using Google autocomplete are pretty chilling, eg, The ads are convincing and what they imply is true, but I’m less sure that they are actually good evidence for what they imply. Typing whole phrases into Google is not how I or people I’ve watched usually search. I type key words. The only reason I would search for a phrase such as “Women should not speak in church” would be to find the source. Do you know where it's been? http://notstatschat.netlify.com/2013/10/10/do-you-know-where-its-been/ Thu, 10 Oct 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/10/10/do-you-know-where-its-been/ Again this week on the bus I passed the annoying Phoenix Organics delivery van that says Don’t drink science, you don’t know where it’s been [ Phoenix are also notable for their aspartame scare page. ] One of the things they don’t write up as glowingly is that they add the synthetic version of a natural antioxidant to their juices and their (naturally high-fructose) juice drinks, in unnaturally high concentrations. Rock, paper, scissors, Wilcoxon test http://notstatschat.netlify.com/2013/10/06/rock-paper-scissors-wilcoxon-test/ Sun, 06 Oct 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/10/06/rock-paper-scissors-wilcoxon-test/ Based on my nerdnite talk last week. Transitivity is a basic property of orderings: if A is better than B and B is better than C, then A must be better than C. 
For example, if the All Blacks beat Tonga and Tonga beats Japan, we would expect the All Blacks to beat Japan. Rock-paper-scissors is interesting because it is the opposite: if A beats B and B beats C then A must lose to C. Today we have shaming of prats http://notstatschat.netlify.com/2013/10/06/today-we-have-shaming-of-prats/ Sun, 06 Oct 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/10/06/today-we-have-shaming-of-prats/ Background: Revenge, ego, and corruption in Wikipedia Today we have shaming of prats. Yesterday, we had unclear sources, and tomorrow morning we may have original research. But today, Today we have shaming of prats. The Flame Robin is a small passerine bird native to Australia, and today we have shaming of prats This is the anonymous editor. And this is the revenge edit, whose use you will see Auckland's top news story http://notstatschat.netlify.com/2013/10/03/aucklands-top-news-story/ Thu, 03 Oct 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/10/03/aucklands-top-news-story/ Background: The Herald has had a whole sequence of stories, including two on the front page, about the decision of the city council to stop mowing the ‘berms’, the strips of grass between the road and sidewalk. Additional background: the Mayor of Auckland is one Len Brown. O Lenny Boy, the berms, the berms need mowing From Glen to Lynn, and down Mt Eden side The winter’s gone, and all the grass is growing Statins and the causal Markov property http://notstatschat.netlify.com/2013/09/23/statins-and-the-causal-markov-property/ Mon, 23 Sep 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/09/23/statins-and-the-causal-markov-property/ A real example: there is developing uncertainty that the statin class of cholesterol-lowering drugs really works by lowering LDL cholesterol. 
This is partly because other drugs (eg, ezetimibe) that lower LDL cholesterol don’t have the same impact on heart attacks, and also because statins seem to have beneficial effects on too many other conditions. In principle, you could control for achieved cholesterol levels and see if statin use was then conditionally independent of heart disease [adjust and see if the effect goes away]. PBRF consultation response consultation http://notstatschat.netlify.com/2013/09/22/pbrf-consultation-response-consultation/ Sun, 22 Sep 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/09/22/pbrf-consultation-response-consultation/ Friday week (October 4) is the deadline for providing input to the consultation on changes in the PBRF process (for foreigners: the national research evaluation program that allocates a chunk of long-term research funding to universities). Here’s the consultation document, if you haven’t read it yet. This is what I’m planning to say. It’s also open for public feedback. Background: I was a member of the PBRF MIST review panel, and also submitted a portfolio. An absolutely minimal way to increase invited speaker diversity http://notstatschat.netlify.com/2013/09/13/an-absolutely-minimal-way-to-increase-invited-speaker-diversity/ Fri, 13 Sep 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/09/13/an-absolutely-minimal-way-to-increase-invited-speaker-diversity/ The low proportion of women among invited speakers at conferences has finally become an issue in biology and computing and science fiction (at least for the people in my Twitter feed). You might worry, if you were running a conference, that having some sort of minimum standard for diversity might lead to suboptimal speakers, or if you had a bunch of small sessions, might be difficult to ensure in each session. 
What I said on StatsChat only shorter and with more swearing http://notstatschat.netlify.com/2013/09/04/what-i-said-on-statschat-only-shorter-and-with-more-swearing/ Wed, 04 Sep 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/09/04/what-i-said-on-statschat-only-shorter-and-with-more-swearing/ Stealing a Keith Ng title, to do a post motivated by his criticism of my StatsChat post as a ‘generous interpretation’ of Bill English. English said that households with income below $110000 collectively paid no net income tax. This assumes that all benefits are paid solely from income tax, not GST, and even then has to lump together people who receive more in benefits than they pay in income taxes with a lot of people who pay much more in income tax than they receive in benefits. On the persistence of variation in horn size among Soay sheep http://notstatschat.netlify.com/2013/08/23/on-the-persistence-of-variation-in-horn-size-among-soay-sheep/ Fri, 23 Aug 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/08/23/on-the-persistence-of-variation-in-horn-size-among-soay-sheep/ (BBC News) The small-horned rams are fitter, but the big-horned rams are phatter And though we deemed it sweeter To dally with the latter, The small horns still stay with us To Scottish boffins’ wonder; In flocks the nerds still pull the birds When the jocks are six feet under (After Peacock) A layperson's view of a science communication problem http://notstatschat.netlify.com/2013/08/13/a-laypersons-view-of-a-science-communication-problem/ Tue, 13 Aug 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/08/13/a-laypersons-view-of-a-science-communication-problem/ There’s a story in one of the NZ papers saying that Fonterra and the government are completely wrong about the source of the botulism contamination in milk products and about how to fix it. This is a field I know very little about, so it’s interesting to look at the story just from the point of view of an educated consumer. 
There are some stylistic points that make the story look like it could be bogus: the claim that this one guy is right and everyone else is wrong, the reference to “sitting on material that will embarrass Fonterra further”, blaming the problem on glyphosate (evil Monsanto’s evil Roundup herbicide), the lack of any links or details for the research, and the lack of any independent scientific opinion. SPEED sessions at JSM 2013 http://notstatschat.netlify.com/2013/08/09/speed-sessions-at-jsm-2013/ Fri, 09 Aug 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/08/09/speed-sessions-at-jsm-2013/ This year, the Joint Statistical Meetings introduced a combined poster/short presentation session. The sessions took up a half-day, twice as long as the typical session. In the first half, each presenter gave a 5-minute talk, and the second session was an electronic poster session. I signed up for one of these sessions primarily because I think any form of innovation at the Joint Statistical Meetings should be supported. For people who haven’t experienced the JSM, it’s the largest gathering of statisticians in the world, but it is also characterised by rigid and apparently inexplicable rules (eg, if the chair for your session doesn’t show, you are supposed to find a replacement who isn’t chairing any other session in the whole conference) and a significant minority of astoundingly awful talks. In defense of theory http://notstatschat.netlify.com/2013/08/08/in-defense-of-theory/ Thu, 08 Aug 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/08/08/in-defense-of-theory/ The statistical community, and even more so the statistical curriculum, hasn’t yet adapted fully to the improvements in computing over the past few decades, and so still gives too much priority to mathematical approaches and too little to computational approaches to many problems. 
That’s one reason for the popularity of the term ‘data science’, and for the mixed feelings about Nate Silver’s comment at his JSM address that ‘data scientist is just a sexed-up term for statistician’. Some failure modes of statistics research talks http://notstatschat.netlify.com/2013/08/04/some-failure-modes-of-statistics-research-talks/ Sun, 04 Aug 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/08/04/some-failure-modes-of-statistics-research-talks/ Written before #JSM2013 actually starts, so it’s not about your talk there. Also, this is about deliberate choices by the presenter, and specifically about statistics research talks. “The Overgeneralized Beta Distribution”. There is a place for new parametric distributions, but it’s a fairly small place and mostly occupied by distributions derived from underlying substantive knowledge. “Asymptotics of an uninteresting estimator”. If there were a novel mathematical idea this would be fine, but otherwise we know its asymptotic behavior and roughly why it happens, and we can’t read your notation fast enough anyway. Graphs and counterfactuals http://notstatschat.netlify.com/2013/07/15/graphs-and-counterfactuals/ Mon, 15 Jul 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/07/15/graphs-and-counterfactuals/ The two main ways of reasoning about cause and effect in statistics are causal graphs and counterfactuals. With causal graphs, you write down variables and draw arrows representing direct effects of one variable on another, and then work with a set of axioms that summarise what it means for one variable to affect another. With counterfactuals, you talk about the effect of a variable in terms of the difference between the actual outcome with the variable set one way and the ‘potential outcome’ if it had been set another way.
Welfare as an addictive drug http://notstatschat.netlify.com/2013/07/15/welfare-as-an-addictive-drug/ Mon, 15 Jul 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/07/15/welfare-as-an-addictive-drug/ From the NZ Herald today Doctors have been told that putting patients on welfare is akin to putting them on “an addictive debilitating drug … not dissimilar to smoking”. Smoking is a really, really bad analogy here, since doctors would absolutely never recommend a patient starts smoking. It’s hard to imagine how someone with a medical degree could come up with that analogy. Welfare is hard to get off and probably bad for your health, but a better comparison would be something like sleeping pills or opioid analgesics: drugs that are risky, potentially dependence-inducing, and should be taken for the shortest possible time period, but that are absolutely medically necessary at times. Big data linear models http://notstatschat.netlify.com/2013/07/08/big-data-linear-models/ Mon, 08 Jul 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/07/08/big-data-linear-models/ The biglm package for R currently uses incremental QR decomposition, which fits linear models to big data in linear time and bounded memory, but doesn’t parallelize. It turns out that parallel computation is easy (and has been studied by Dongarra and the LAPACK folks). If you have two data chunks reduced to $R_1$ and $Q_1^TY_1$, and $R_2$ and $Q_2^TY_2$, just treat each $R$ as an $X$ and each $Q^TY$ as a $Y$ to merge the QR decompositions. Sparse linear systems and calibration of weights http://notstatschat.netlify.com/2013/07/08/sparse-linear-systems-and-calibration-of-weights/ Mon, 08 Jul 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/07/08/sparse-linear-systems-and-calibration-of-weights/ Diego Zardetto (Italian national stats agency) wants to be able to calibrate sampling weights to population totals for regions. 
This leads to a very large number of calibration variables and solving large linear systems. Using the Matrix package in R, we can compute sparse QR decompositions instead of the dense ones used in the survey package. Alternatively, using block-diagonal sparse matrices from the bdsmatrix package we can represent the linear system as a set of separate systems for each region. Problems with faithfulness and the causal Markov property (II) http://notstatschat.netlify.com/2013/07/06/problems-with-faithfulness-and-the-causal-markov-property-ii/ Sat, 06 Jul 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/07/06/problems-with-faithfulness-and-the-causal-markov-property-ii/ This one I got from reading Nancy Cartwright’s Hunting Causes, and using them, though it isn’t exactly the point she’s making. It’s also related to points made by Hofstadter, Dennett, and others about reductionist reasoning. The idea of causal graphs is that you have variables and some prior knowledge of possible causal relationships between them – the prior knowledge could be as weak as ‘future cannot cause past’ or could incorporate a lot of domain-specific knowledge. Problems with faithfulness and the causal Markov property (I) http://notstatschat.netlify.com/2013/07/02/problems-with-faithfulness-and-the-causal-markov-property-i/ Tue, 02 Jul 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/07/02/problems-with-faithfulness-and-the-causal-markov-property-i/ The causal Markov property says that you can write down causal relationships between variables in a directed acyclic graph so that each variable is affected only by its parents in the graph. The faithfulness property says that the variables will have exactly the conditional independence properties required by the graph. The first problem with these properties is measurement error. If the only causal relations are that A affects B and C, then B and C are conditionally independent given A. 
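The measurement-error point in the faithfulness post above can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the original post: linear effects, standard Normal noise, and partial correlation as a stand-in for conditional independence. The names `partial_corr` and `a_noisy` are made up for the illustration. When A affects both B and C, adjusting for the true A removes the association between B and C; adjusting for a noisy measurement of A does not.

```python
import random


def residuals(y, x):
    """Residuals from the least-squares regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [(yi - my) - slope * (xi - mx) for xi, yi in zip(x, y)]


def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    suu = sum((ui - mu) ** 2 for ui in u)
    svv = sum((vi - mv) ** 2 for vi in v)
    return suv / (suu * svv) ** 0.5


def partial_corr(b, c, a):
    """Correlation of b and c after linearly adjusting each for a."""
    return corr(residuals(b, a), residuals(c, a))


rng = random.Random(2013)
n = 20000

# Assumed toy model: A affects B and C, and nothing else connects B and C.
a = [rng.gauss(0, 1) for _ in range(n)]
b = [ai + rng.gauss(0, 1) for ai in a]
c = [ai + rng.gauss(0, 1) for ai in a]

# What we actually get to observe: A plus independent measurement error.
a_noisy = [ai + rng.gauss(0, 1) for ai in a]

pc_true = partial_corr(b, c, a)         # close to 0
pc_noisy = partial_corr(b, c, a_noisy)  # clearly non-zero

print(f"adjusting for true A:     {pc_true:.3f}")
print(f"adjusting for measured A: {pc_noisy:.3f}")
```

With these unit variances the partial correlation given the mismeasured A is 1/3 in expectation, so an analysis based on the measured variable would see B and C as dependent even though the graph on the true variables says they are conditionally independent.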
Upcoming talks and stuff http://notstatschat.netlify.com/2013/06/30/upcoming-talks-and-stuff/ Sun, 30 Jun 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/06/30/upcoming-talks-and-stuff/ Two modules (on intermediate and advanced R) with Ken Rice at the Seattle Summer Institute in Statistical Genetics Joint Statistical Meetings: analyzing large data with SQL generated by R, with Hannes Muhleisen. MonetDB is a database optimised for analysis tasks, and controlling it from R gives more flexibility and programmability. Two simple notes on error in regression models http://notstatschat.netlify.com/2013/06/28/two-simple-notes-on-error-in-regression-models/ Fri, 28 Jun 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/06/28/two-simple-notes-on-error-in-regression-models/ In regression, we often talk about the difference between the population line and the observations as “errors.” In some introductory texts these are even called “measurement errors” in Y. Sometimes they are errors in Y, and sometimes they are even measurement errors in Y, but much more often Y is the truth and the ‘error’ is the error in predicting Y by a straight line. As Dan Davies observed (from memory) “The Great Depression really happened; it wasn’t just an unusually inaccurate observation of an underlying 4% return on equities” When is Bayesian introductory statistics better? http://notstatschat.netlify.com/2013/06/27/when-is-bayesian-introductory-statistics-better/ Thu, 27 Jun 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/06/27/when-is-bayesian-introductory-statistics-better/ For the sort of statistics taught in introductory courses, competent Bayesian and frequentist analysis are going to agree – point and interval estimates will be similar, and similar conclusions will be drawn. Computation isn’t seriously hard for either approach, though prepackaged pointy-clicky software is more available for frequentist inference. There are going to be pedagogical differences. 
The big one, in favour of Bayesian statistics, is not having to explain p-values. My Setup http://notstatschat.netlify.com/2013/06/08/my-setup/ Sat, 08 Jun 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/06/08/my-setup/ For UsesThis.com, prompted by Luis Apiolaza. Who are you, and what do you do? I’m a statistics professor at the University of Auckland. I teach, do research in statistics and in epidemiology, and contribute to R. What hardware do you use? I’ve been using Mac laptops since 2001. I currently have an aluminium MacBook from 2009 and am waiting on delivery of an 11in MacBook Air. They are nicely solid, have reasonable keyboards, run Unix, and support the Microsoft software that my collaborators use. Hello World http://notstatschat.netlify.com/2013/06/07/hello-world/ Fri, 07 Jun 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/06/07/hello-world/ Hello, world This will be my not-StatsChat blog, for things that are too technical, too political, or simply not relevant to StatsChat. Talks in the near future http://notstatschat.netlify.com/2013/06/07/talks-in-the-near-future/ Fri, 07 Jun 2013 00:00:00 +0000 http://notstatschat.netlify.com/2013/06/07/talks-in-the-near-future/ “Filtering for rare variant tests” CHARGE consortium workshop, Rotterdam. Basic message: it’s total count of rare alleles that matters. “R: an environment for statistical computing and graphics”. CWI, Amsterdam. Talking about the history, design, and applications of R to a scary computer-science audience. “Testing rare DNA variants in unrelated individuals: experience from the CHARGE consortium”, IARC, Lyon. On unidirectional and omnidirectional ‘burden of mutation’ tests using rare DNA variants.