Biased and Inefficient
Wed, 22 Jan 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/01/22/survey-package-news/
Version 3.37 of the survey package is on CRAN now.

New features

- svyquantile now takes account of design degrees of freedom in computing confidence intervals and in turning confidence intervals into standard error estimates. This means results will change (slightly, and for the better).
- svyivreg for two-stage least squares with instrumental variables (described here).
- withPV for ‘plausible value’ analyses now supports replicate-weight designs. ‘Plausible values’ are how education people describe multiple imputation.

Multifactor interventions and interactions
Thu, 09 Jan 2020 00:00:00 +0000 http://notstatschat.netlify.com/2020/01/09/most-multifactor-intervetions-have-interactions/
The Multiphase Optimisation Strategy for designing multifactor behavioural interventions should be used more. The idea is that you have a lot of potentially good ideas for things that might work, alone or in combination. You don’t want to test them one at a time, because that takes forever. You don’t want to test all against none, because they might not all be compatible, and in any case you don’t want to be stuck doing them all if you don’t need to.

Computer says no
Mon, 30 Dec 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/12/30/computer-says-no/
This was my reply to the call for comments on the proposed Algorithms Charter.

Introduction
I am a statistician with an interest in the technical details of predictive models and in their social impact. I am interested from the points of view of a researcher, a university teacher, a science communicator, and an immigrant.

The proposal
The key points in the proposal are laid out as a set of bullets

What is 'Data Science Practice'?
Wed, 25 Dec 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/12/25/what-is-data-science-practice/
Two years ago I started a 3rd-year (final-year) undergraduate course called ‘Data Science Practice’. It’s the main new course in our undergraduate Data Science major – we already had a couple of courses in statistical computing, and already used R Markdown starting in second-year, and the Computer Science department teaches algorithms and data structures, and database theory and so on. This year, the course will be taught without me – I’m teaching a postgraduate translation of it.

How many giraffes?
Sun, 01 Dec 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/12/01/how-many-giraffes/
Since it’s, if not Christmas, at least Advent, here’s a book review. I’ve been following Janelle Shane’s blog for years. Her book You Look Like a Thing and I Love You came out a few weeks ago; I bought it as my post-exam-grading reward.
The blog is a series of examples of surrealist comedy from neural networks either getting things wrong (hallucinating giraffes in every photo) or generating text (the book title was one of the best neural-net pick-up lines).

Hexmaps for NZ District Health Boards
Thu, 07 Nov 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/11/07/hexmaps-for-nz-district-health-boards/
I’m involved in a research project that, among other things, will be comparing various health variables across NZ District Health Boards (DHBs). In order to make the outputs less boring and (hopefully!) more interpretable, I want some maps.
This post is about ‘DHBins’, a set of hexmaps vaguely analogous to the square ‘statebins’ for US states. The code is in the DHBins package. I’ll illustrate with some data on immunisation coverage in NZ kids.

Some things I don’t like about the Oxford-Munich Code of Conduct
Tue, 01 Oct 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/10/01/some-things-i-don-t-like-about-the-oxford-munich-code-of-conduct/
The Oxford-Munich Code of Conduct for Professional Data Scientists (http://www.code-of-ethics.org/code-of-conduct/) is worth reading. It’s fairly detailed and has some good features. There are also things I don’t like about it, which are why I didn’t include it in my Data Science Practice course. It’s a bit inconsistent in style at the moment, but (a) it’s a draft under development and (b) I may not have the moral high ground on this point, so that’s not what I’m complaining about.

How to review a book
Fri, 13 Sep 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/09/13/how-to-review-a-book/
The old-fashioned way to review a book, such as, say, Randall Munroe’s how to, involves reading it.
On the positive side, you can learn about The Effects of Nuclear Explosions on Commercially-Packaged Beverages and find out what Colonel Chris Hadfield thinks should be treated as “a big angry hang glider”. On the negative side, there are all those words and pictures and footnotes and index entries you have to read.
You could ask an AI to do it.

(What’s up with the brackets?)
Tue, 10 Sep 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/09/10/what-s-up-with-the-brackets/
In various places, from the R help pages to books to course materials, you see R code like

```r
(x = runif(10, 0, 10))
## [1] 1.610466 2.662462 1.517036 1.372483 4.272460 6.402148 8.347196
## [8] 8.775537 1.863104 8.912241
```

which displays the value of x. Without the brackets, it doesn’t. Harkanwal Singh, on Twitter, said “I would like to know more”. So, in case he’s not the only one, this is what’s going on.

Why isn't rimu tidy?
Tue, 10 Sep 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/09/10/why-isn-t-rimu-tidy/
The rimu package, which I published last week, does not use the tidyverse.
The operations that I do on multiple-response data would be easy using dplyr or purrrrrr with the data in long form: all responses stacked. The problem is that dplyr and rlang are not automatically type-safe for this sort of multiple-response data.
It seems to be easier to define a multicolumn S3 class, which can then be put into a single column of a data frame, eg

A package for multiple-response data
Thu, 05 Sep 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/09/05/a-package-for-multiple-response-data/
Multiple-response data is like factor data, except that you can be in more than one category. Examples include

- what is your ethnicity? (or, in the US, race/ethnicity, sigh)
- what social media do you use?
- what countries have you been to?
- what birds did you see this week?

I have the first version of a package to manipulate this sort of data, called rimu. The name stands for responses in multiplex, but rimu is also the name of a New Zealand tree, Dacrydium cupressinum, an attractive conifer with reddish wood.

Adding new functions to the survey package
Tue, 16 Jul 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/07/16/adding-new-functions-to-the-survey-package/
I had an email request about making ivreg work with the survey package. That’s AER::ivreg, which does two-stage least-squares estimation with instrumental variables.
The steps are

1. See if it accepts weights and does the right thing for point estimation
2. If so, work out how to get the complex-survey variances
3. Test to make sure it’s getting the right answer

In this case, the first step was fairly straightforward. The function accepts weights and passes them to lm.

Denominator degrees of freedom in svyglm
Wed, 26 Jun 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/06/26/denominator-degrees-of-freedom-in-svyglm/
Attention Conservation Notice: This is a working note; when I understand it better, there will be changes in the survey package.
The design degrees of freedom for a stratified, clustered design with \(M\) clusters and \(H\) strata is \(d=M-H\). This is a straightforward definition, since the Horvitz–Thompson variance estimator for a mean or total is a variance of \(M\) cluster summaries after subtracting off \(H\) stratum means. While the definition is only straightforward for single-stage designs, the public-use versions of nearly all surveys are analysed as if they were single-stage designs.

Wald, score, LRT: the picture
Thu, 20 Jun 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/06/20/wald-score-lrt-the-picture/
One issue in teaching generalised linear models (or likelihood theory) is the relationship between the Wald, score, and likelihood ratio tests. I have a picture.
Let’s make up a score function \(U(\theta)\), in this case for a trivial binomial model, and draw it.
```r
# logit and its inverse
logit <- function(p) log(p/(1-p))
expit <- function(x) exp(x)/(1+exp(x))
# made-up score function; it crosses zero at theta = logit(11/12)
U <- function(theta) 11/12 - expit(theta)
thetahat <- logit(11/12)
# draw the score function, with dashed reference lines at U = 0 and theta = 0
curve(U(x), from=0, to=3, xlab=expression(theta), ylab=expression(U(theta)))
abline(h=0, lty=2)
abline(v=0, lty=2)
```

The likelihood ratio statistic is twice the area under the curve \[2(\ell(\hat\theta)-\ell(0))= 2 \int_0^{\hat\theta}U(\theta)\,d\theta\]

Analysing the mouse microbiome autism data
Sun, 16 Jun 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/06/16/analysing-the-mouse-autism-data/
The issue
A paper reporting the induction of autism-type behaviour in mice by fecal microbiome transplants from humans was recently published in Cell. Some people on Twitter were discussing subplots E, F, and G of Figure 1, which report behavioral comparisons of the mice between fecal donors with and without ASD.
The expressed view on Twitter was that the plots weren’t consistent with the \(p\)-values given. They didn’t entirely need to be, since the \(p\)-values weren’t from a simple two-group comparison, but even taking that into account I was surprised.

Confidence intervals: not a very strong property
Tue, 11 Jun 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/06/11/confidence-intervals-not-a-very-strong-property/
It’s important for non-Bayesians (or non-exclusively-Bayesians) to remember that being a 95% confidence interval procedure is a fairly weak property. It’s not that confidence intervals are necessarily bad, but if they aren’t, it’s because of other requirements.
As an extreme case, consider the all-purpose data-free exact confidence interval procedure for any real quantity: roll a d20 and set the confidence interval to be the empty set if you roll 20, and otherwise to be \(\mathbb{R}\).

Design degrees of freedom: brief note
Sat, 08 Jun 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/06/08/design-degrees-of-freedom-brief-note/
An important concept in multistage survey analysis is the design degrees of freedom, which describes (or estimates) how many independent observations go into calculating variances, in a similar way to error degrees of freedom in experimental designs.
In straightforward multistage designs the design df is the number of primary sampling units minus the number of strata, because each PSU provides data to supply a degree of freedom and each stratum implies a constraint that removes a degree of freedom.

Mean People Tweet
Fri, 24 May 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/05/24/mean-people-tweet/
There is discussion from time to time on Kiwi Twitter about which public figures get treated worse on Twitter. Eric Crampton suggested that it would be easy to answer this question empirically, by analysing tweet sentiment. I wasn’t convinced, but I tried it. This post is about what I found.
First, we need some way of classifying sentiment. I’ve got lists of about 2000 positive and 5000 negative words, collected by Bing Liu.

The Reeferendum
Tue, 07 May 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/05/07/the-reeferendum/
Elections and opinion polls look a bit similar from a distance, but they’re very different beasts. An election is a decision mechanism, an opinion poll is an estimation procedure. If you want an election, the Electoral Commission does an excellent job; if you want an opinion survey, you might try Colmar Brunton or Stats NZ.
Binding referendums, in the New Zealand model, are like elections. There’s a proposed change in the law, which we hope has been carefully drafted, put out for public comment, and the whole bit, and which is then put up for vote.

Local asymptotic minimax, and nearly-true models
Tue, 30 Apr 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/04/30/regularity-local-minimax-and-nearly-true-models/
I’ve written a bunch of times about nearly-true models. The idea is that you have some regression model for \(Y|X\) you’re trying to fit with data from a two-phase sample with known sampling probabilities \(\pi_i\) for individual \(i\). You know \(Y\) and some auxiliary variables \(A\) for everyone, but you know \(X\) only for the subsample. If you had complete data, you’d fit a particular parametric model for \(Y|X\), with parameters \(\theta\) you’re interested in and nuisance parameters \(\eta\), call it \(P_{\theta,\eta}\).

Survey package update
Sun, 28 Apr 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/04/28/survey-package-update/
Version 3.36 of the survey package and version 2.4 of mitools are up on CRAN.
There’s one notable new feature in both of them: handling ‘plausible values’, where you have some sets of multiply-imputed variables just as additional columns in a largely non-imputed data set.
There are two implementations behind withPV, controlled by the rewrite= option. You have variables PV1MATH, PV2MATH,…,PV5MATH and some code with a variable maths that you want to run with maths being each of the plausible values in turn.

That’s for remembrance
Wed, 24 Apr 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/04/24/that-s-for-remembrance/
(Image: Rosmarinus officinalis, by Samules on Pixabay)
April 25th is not the anniversary of a victory, or an armistice, or a successful retreat like Dunkirk, or even a tragic last stand like Masada. It was the sort of badly-planned, poorly-executed debacle that typified the Great War. The day is remembered because it was the first day that large numbers of soldiers from Australia and New Zealand were killed.
April 25th is an ideal setting for a military commemoration.

Handling ‘plausible values’ in surveys
Sun, 21 Apr 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/04/21/handling-plausible-values-in-surveys/
Surveys (especially educational surveys) have a thing called ‘plausible values’, which are a form of multiple imputation, only by design rather than because of non-response. To reduce effort, not everyone answers every question. Often, there are a lot of variables that don’t need imputing, and a few that do.
The data example I showed in the last post, for mixed models, has five plausible values for the maths score. I only used PV1MATH.

Progress on linear mixed models for surveys
Fri, 19 Apr 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/04/19/progress-on-linear-mixed-models-for-surveys/
In our last episode, we worried about the penalised least squares criterion for linear mixed models. The linear mixed model is \[Y=X\beta+Zb+\epsilon\] where \(b\sim N(0, \sigma^2V_\theta)\) and \(\epsilon\sim N(0,\sigma^2)\), and where \(\theta\) are variance parameters. It’s convenient to write \(b=\Lambda_\theta u\) for iid standard Normal \(u\), where \(\Lambda_\theta\) is then a square root of \(V_\theta\). The penalised least squares approach says that for given \(\theta\), we choose \(b\) and \(\beta\) to minimise \[\|Y-X\beta-Zb\|_2^2+\|u\|_2^2.\]

Hypergraph network meta-analysis
Tue, 26 Mar 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/03/26/hypergraph-network-meta-analysis/
‘Network meta-analysis’ is the only term I’ve coined that’s actually entered the general biostatistical vocabulary. In network meta-analysis we work with a network of randomised trials, where the nodes are interventions and the edges are trials. A single edge represents a direct randomised comparison; a path represents an indirect (but unconfounded) estimate from multiple trials; we can combine multiple paths between interventions using a linear model.
For example, if \(\log RR_{ij,k}\) is the log relative risk in the \(k\)th trial of treatments \(i\) and \(j\), and it has variance \(\sigma^2_{ij,k}\), then \[\log RR_{ij,k} = \beta_i-\beta_j+\epsilon_{ij,k}\] where \(\epsilon_{ij,k}\sim N(0,\sigma^2_{ij,k})\).

The school climate strike
Tue, 12 Mar 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/03/12/the-school-climate-strike/
I have seen the school ‘climate strike’ in NZ being described as a publicity stunt that won’t provide any real solutions. No shit.
That’s not a criticism, any more than it’s a criticism of Shaun Hendy’s no-fly-year in 2018. The point of public protest — everything from ‘peaceably assemble and petition the government’ to dumping manure on the streets of Paris — isn’t to solve a problem, it’s to get the problem on the agenda.

Normal horizontiles
Mon, 04 Mar 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/03/04/normal-horizontiles/
From XKCD today
How can we check this calculation?
First, we need to know where the lines are on the y-axis. They are separated by 52.7% of the height of the distribution, and look as if they are meant to exclude the same height above and below. We don’t need to worry about the mean of the distribution (obviously), or the scale (less obviously). The reason the scale is not needed is that rescaling the x axis shrinks all three of the areas under the curve by the same factor, and since they add up to 1, they stay the same.

Displaying bus punctuality
Fri, 01 Mar 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/03/01/displaying-bus-punctuality/
A couple of years ago, I stored a lot of Auckland bus location data for what was going to be a news story. It’s about time I did something with it, so I’m updating the analysis and I’ll be using it as a class example.
The data come from the Auckland Transport real-time API, for which Auckland Transport should be congratulated. Anyone can get an API key and use the data.

Absolutely no warranty?
Mon, 18 Feb 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/02/18/absolutely-no-warranty/
Someone on Twitter (who I’m reluctant to out, considering) said they’d got reviews back from an epidemiology journal, and one reviewer had cautioned against the use of R, because “the opening sentence in an R output points out that basically you use it at your own risk and the contributors to R are not accountable for any errors.” It’s been a while since I’ve seen this one, which is a clear symptom of not having read licence agreements for other statistical software.

What have I got against the Shapiro-Wilk test?
Sat, 09 Feb 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/02/09/what-have-i-got-against-the-shapiro-wilk-test/
The Shapiro-Wilk test is a test of the null hypothesis that data come from a Normal distribution, with power against a wide range of alternatives. So what do I have against it?
Well, to start with, it’s a test of the null hypothesis that data come from a Normal distribution, with power against a wide range of alternatives.
There are two reasons you might want a test of the hypothesis that data come from a particular distribution \(P\) or a particular set of distributions (ie, model) \({\cal P}_\theta\).

How do you tell what packages to trust?
Mon, 04 Feb 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/02/04/how-do-you-tell-what-packages-to-trust/
I’m not on that list, but I do have reckons anyway.
First, there are really two questions: is the method useful to you, and is the implementation doing it the way you want?
The first question is important. There’s no benefit in having a well-coded implementation of the Shapiro-Wilk test unless you have a good reason to use the Shapiro-Wilk test. The first question is not a coding question, but the answer is similar to the answer for the coding question.

Recognising when you don’t know
Fri, 01 Feb 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/02/01/recognising-when-you-don-t-know/
Saying that you don’t know something can be hard for people; it’s also hard for prediction algorithms.
There’s an example in the xgboost package for R involving the classification of mushrooms. The goal is to use information about the appearance of the mushrooms to decide if they are edible or not. It’s clear that this is a machine learning problem rather than a data science problem, because the version of the data in the xgboost package doesn’t say which output value means ‘edible’ and which one means ‘inedible’.

Two quick survey items
Sat, 26 Jan 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/01/26/two-quick-survey-items/
Can we invent the case-control design?
Classical survey analysis is about means and totals, and the way to adapt it to more interesting parameters is to write the parameter as the mean of its influence functions (delta-betas, jackknife values, etc).
Suppose we knew for everyone in a population (maybe an HMO) whether they had a disease (\(Y=1\)) or didn’t (\(Y=0\)) and we wanted to take a sample, measure a variable \(X\), and do logistic regression.

Another way to see why mixed models in survey data are hard:
Fri, 18 Jan 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/01/18/another-way-to-see-why-mixed-models-in-survey-data-are-hard/
Suppose you have a (potentially unequal-probability) sample of schools, and within each school a (potentially unequal-probability) sample of students, and you want to fit a linear mixed model. In fact, let’s take the brutally simple example of a random intercept model: \[Y_{ij}=X_{ij}\beta+b_i+e_{ij}\] where \(b_i\sim N(0,\tau^2)\).
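To make the model concrete, here is a minimal sketch that simulates a whole population from a random intercept model. Everything specific in it is an assumption for illustration: the numbers of schools and students, the parameter values, a single covariate, and Normal errors \(e_{ij}\sim N(0,\sigma^2)\) (the text above only gives the distribution of \(b_i\)).

```r
set.seed(1)

# Hypothetical population sizes and parameter values (not from the post)
n_schools  <- 50    # number of schools (clusters)
n_students <- 20    # students per school
beta  <- 2          # regression coefficient for the single covariate
tau   <- 1.5        # standard deviation of the school random intercepts
sigma <- 1          # residual standard deviation (assumed Normal errors)

school <- rep(seq_len(n_schools), each = n_students)  # school index i for each student
x <- rnorm(n_schools * n_students)                    # covariate X_ij
b <- rnorm(n_schools, mean = 0, sd = tau)             # b_i ~ N(0, tau^2)
e <- rnorm(length(x), mean = 0, sd = sigma)           # e_ij ~ N(0, sigma^2)
y <- x * beta + b[school] + e                         # Y_ij = X_ij beta + b_i + e_ij

pop <- data.frame(school, x, y)
head(pop)
```

Every student in a school shares that school's \(b_i\), which is what makes observations within a school correlated.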
With population data, the penalised least squares formulation of this model (which Doug Bates likes) involves minimising \[\sum_i\sum_j (y_{ij}-\hat y_{ij})^2+\sum_i u_i^2\] where \(u_i=b_i/\tau\). You can use the EM algorithm (if you have all week) or you can rewrite as a least-squares problem in augmented data; right now I don’t care how you do it.

The Ihaka Lectures 3: Rise of the Machine Learners
Fri, 11 Jan 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/01/11/the-ihaka-lectures-3-rise-of-the-machine-learners/
They’re back!
On Wednesday evenings in March (and streaming on the internet) the University of Auckland Stats department will again be hosting the Ihaka Lectures. This year the theme is statistical learning/machine learning/predictive algorithms, and we have four speakers
Bernhard Pfahringer is Professor of Computer Science at the University of Waikato. He is a member of the Weka project, New Zealand’s other famous open-source data science contribution. He will talk about the design and development of Weka and more recent projects.

Bayesian Surprise — the Shiny app
Fri, 04 Jan 2019 00:00:00 +0000 http://notstatschat.netlify.com/2019/01/04/bayesian-surprise-the-shiny-app/
I wrote a while back about a toy case of the Bayesian surprise problem: what does Bayes’ Theorem tell you to believe when you get really surprising data? The one-dimensional case is a nice math-stat problem, if you like that sort of thing, but maybe you’d rather have the calculations done for you.
Here’s an app
The mathematical setup is that you have a prior distribution for a location parameter \(\theta\) centered at zero, and you see a data point \(x\) that’s a long way from zero.

What are packages for?
Mon, 17 Dec 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/12/17/what-are-packages-for/
It’s an interesting question, but the implication of wasted depends not just on the actual statistics about package survival (which we’ll get to), but on why people write packages. And, I suppose, on why they should write packages.
One reason people write packages is to improve other people’s data analysis, certainly. But it’s not the only reason, nor should it be. People write packages to provide reference implementations of new statistical methods.

svycontrast
Mon, 10 Dec 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/12/10/svycontrast/
I got asked for more detail about the svycontrast() function, so I thought I’d post it here too. The function is related to the CONTRASTS you get in SAS, but focused on estimation rather than testing.
The input to svycontrast() is a \(p\)-vector of estimates \(\hat\theta\) (which I’ll consider as a column vector) and an estimated \(p\times p\) covariance matrix \(\hat\Xi\).
There are two main cases:

Linear
Given a \(p\)-vector of coefficients \(b\), the function computes \(b^T\hat\theta\) and \(b^T\hat\Xi b\).

Finding principal components without even looking?
Mon, 26 Nov 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/11/26/finding-principal-components-without-even-looking/
Via Scott Aaronson’s blog I found an arXiv abstract and then an early paper (PDF) about doing singular value decomposition of an \(m\times n\) matrix in less than \(O(mn)\) time. That is, you could estimate population structure with principal components of a genotype matrix or work out tail probabilities for a quadratic-form-based test in less time than it takes to actually look at the data. That’s obviously impossible, and so that’s not what the paper actually says.

Come work with us
Sun, 04 Nov 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/11/04/come-work-with-us/
We have four academic positions open in the Department of Statistics. Come work with us!
(j.e.mcgowan on Flickr, CC-BY)
There are three standard academic positions, at lecturer, senior lecturer, and associate professor level. In American these would approximately translate as hard-money tenure-track assistant, associate, and full professor. There is also one position for a professional teaching fellow – a full-time, permanent job that focuses on teaching.
(Salman Javed on Flickr, CC-BY-SA)

Progress on svy2lme
Fri, 19 Oct 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/10/19/progress-on-svy2lme/
The svy2lme package for linear mixed models under complex sampling may still contain nuts, but at least the user interface has settled down and it gives plausible answers for some toy examples. The recent change is to compute pairwise sampling probabilities from a survey design object, rather than some horrible set of separate specifications. It still doesn’t support complicated PPS designs, but since the survey package does, that should be feasible.

Survey package update
Fri, 12 Oct 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/10/12/survey-package-update/
There’s a new version of the survey package on CRAN, version 3.34. Mostly this is bug fixes and minor enhancements accumulated over rather too long since the last update. There are a couple of things worth noting specifically, though.
The first is a change to svyglm with replicate weights. When fitting generalised linear models with large weights (eg from US national surveys), you can run into numerical instabilities. I’ve handled this for a long time by rescaling the weights inside svyglm.

The Kiwi PRNG
Thu, 04 Oct 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/10/04/the-kiwi-prng/
As I’ve written before, New Zealand has a National Pseudo-Random Number Generator. It’s kept in Part 5 of Schedule 1A to the Local Electoral Regulations, Clauses 41-48. And there’s a bug in it.
The generator is obviously intended to be the Wichmann-Hill generator, perhaps because of the paper by Hill, Wichmann, and Woodall (1987) presenting a Pascal program to count Single Transferable Vote.
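For reference, here is a minimal sketch of the original Wichmann-Hill generator (Algorithm AS 183): three small multiplicative congruential generators whose scaled outputs are added and reduced modulo 1. The function name make.wh and the seed values are illustrative choices, not anything from the regulations.

```r
# Sketch of the original Wichmann-Hill generator (AS 183).
# make.wh() and the seeds used below are hypothetical, for illustration only.
make.wh <- function(x, y, z) {
  function() {
    # three multiplicative congruential generators with prime moduli
    x <<- (171 * x) %% 30269
    y <<- (172 * y) %% 30307
    z <<- (170 * z) %% 30323
    # add the three uniform fractions and keep the fractional part
    (x / 30269 + y / 30307 + z / 30323) %% 1
  }
}

wh <- make.wh(1, 2, 3)   # seeds should be integers between 1 and 30000
u <- replicate(5, wh())
u
```

Each call returns a number in [0, 1), which makes it easy to set side by side with a translation of the regulations.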
Translating the regulations to code gives

```r
make.prng <- function(candidates, vacancies, votes){
  # seeds as specified in the regulations
  x <- candidates + 5
  y <- vacancies
  z <- (votes + 1000*(votes %% 10)) %% 30323
  function(){
    # update the three component generators
    x <<- (171*x) %% 30269
    y <<- (172*y) %% 30307
    z <<- (170*z) %% 30323
    # combine them using integer division, as the regulations specify
    (10000*x) %/% 30269 + (10000*y) %/% 30307 + (10000*z) %/% 30323
  }
}
```

There are two differences from the original generator.

How to write a racist AI in R without really trying
Thu, 27 Sep 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/09/27/how-to-write-a-racist-ai-in-r-without-really-trying/
Last year, Rob Speer wrote a really great post How to make a racist AI without really trying. Go read it.
The idea is to do sentiment analysis with obvious, off-the-shelf tools. As the post says
So that’s what we’re going to do here, following the path of least resistance at every step, obtaining a classifier that should look very familiar to anyone involved in current NLP.
The original post used Python and I’m teaching an undergraduate data science course using R at the moment, so I wanted an R version.

Journalism and cyber-bullying
Tue, 11 Sep 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/09/11/journalism-and-cyber-bullying/
Newsroom, an online-only New Zealand news site, has written a series of stories critical of Sir Ray Avery and his R&D efforts in medical devices. According to the most recent story, Sir Ray is attempting to use the Harmful Digital Communications Act to get these stories removed
Avery has told Netsafe, the legal agent for considering complaints under the Act, the reports have caused him serious emotional distress and amount to a form of digital harm - and wants Newsroom to consider removing them and to agree not to write further news stories about him.

What can data science add to statistics education?
Tue, 28 Aug 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/08/28/what-can-data-science-add-to-statistics-education/
(for Deborah Nolan and Louise Ryan, ISCB/ASC 2018, after Henry Reed)
Today we have naming of stats. Yesterday
We had assumptions. And tomorrow morning
We shall have testing of assumptions. But to-day
Today we have naming of stats. Data
sparkles and flashes through all of the students’ phones.
And today we have naming of stats.
This is the rank-sum Wilcoxon test. And this
is the one-sample Wilcoxon test, whose use you will see

ISCB/ASC talk
Sun, 26 Aug 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/08/26/iscb-asc-talk/
Notes and references for my keynote at the joint International Society for Clinical Biostatistics and Australian Statistical Conference:

- Abstracts from the JSM on imputation and weighting for audit subsets: Shepherd & Giganti; Shaw & Oh
- Preprint on nearly-true models, and a couple of old blog posts
- Blog post on relative efficiency of weighted estimation in case-control studies
- The paper that coined “using the whole cohort”
- Software for multiple imputation on large databases: missForest, MIDAS
- The relationship between AIPW, calibration of weights, and adjusting for baseline
- Paper by Peisong Han showing that multiple imputation gives the optimal double-robust estimator (and therefore the optimal weighted estimator when the sampling probabilities are correct)
- Review paper on survey-weighted regression by me and Alastair Scott
- Photos from unsplash.

Leaflet and buses
Tue, 14 Aug 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/08/14/leaflet-and-buses/
Where are the buses? Wellington’s bus system has been the subject of negative attention in the news and on Twitter. Also, I’m teaching a course in Data Science Practice and we’re just getting to a lab on maps with Leaflet. So I thought I’d make a map of Wellington buses and their lateness – people do tend to overestimate problems with public transit, and if they aren’t overestimating it, that’s also important to know.

Testing probability distribution generators
Wed, 01 Aug 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/08/01/testing-probability-distribution-generators/
In the ‘regression tests’ that are part of any change to the base-R source code, there’s a file called p-r-random-tests.R. People notice it from time to time because the tests sometimes fail. That’s what is supposed to happen.
Testing random number generators is hard, because it’s hard to specify what the results should be: you need statistics. Fortunately, we have statistics, so it’s not impossible. The random tests check that, eg, pnorm() is not ruled out as the cumulative distribution function of numbers from rnorm().

Quoting and macros in R
Mon, 30 Jul 2018 00:00:00 +0000 http://notstatschat.netlify.com/2018/07/30/quoting-and-macros-in-r/
Miles McBain has a nice post about quoting in R and the tidyeval procedure. In it, there’s this footnote
In truth there are other types of calls, and the ones Lisp nuts really bang on about are macro calls
In this post I want to talk about the similarities between the tidyversatile approach to quasiquoting and the base-R approach, as an introduction to banging on about macro calls.

e-bike: the reboot
Tue, 17 Jul 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/07/17/e-bike-the-reboot/Q: What do you mean, “reboot”? Like, a new e-bike with only a vague brand-name resemblance to the original?
A: Pretty much. I’m 200km into a SmartMotion Pacer GT.
Q: How fast does it go?
A: Haven’t we talked about that question?
Q:
A: Ok. It’s faster than the old one, because it has a wider range of gears. I get up to 45km/h down Manukau Rd toward Royal Oak. Which is actually faster than I’m happy going on Auckland roads.Interlingual
Wed, 11 Jul 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/07/11/interlingual/I don’t normally explain jokes, but it’s clear from the useR conference that the name of the new R package reticulate divides people into two groups: amused or bemused.
The word reticulate barely exists in modern English. It comes from the Latin reticulum, ‘small net’, the diminutive of rete, net. In NZ and Australian usage, mains water – supplied by a network of pipes – is called a ‘reticulated water supply.Spell my name with a ‘v'
Sun, 24 Jun 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/06/24/spell-my-name-with-a-v/Admittedly, it’s conceivable for parents to make a simple error in assigning a child’s name.
In a fictional example from Kerry Greenwood’s ‘Phryne Fisher’ series, the protagonist’s hungover father accidentally named her after Phryne the Greek courtesan rather than Psyche. In the real world, Isaac Asimov’s father incorrectly transliterated the Cyrillic Азимов as ‘Asimov’ rather than ‘Azimov’.
By and large, though, the idea that someone is simply incorrect about their name or their child’s name falls under “not even wrong”.Statistical software matters
Sat, 09 Jun 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/06/09/statistical-software-matters/This is a picture of all the genetic associations found in genome-wide association studies, sorted by chromosome. You can find more detail at the NHGRI GWAS catalog
There are two chromosomes with many fewer associations. One is the Y chromosome. There isn’t much there because there isn’t much there. The other is the X chromosome. There isn’t much there because GWAS took a lot longer to get started for the X chromosome, and that’s partly for software reasons.Survey analysis in SQL
Sat, 09 Jun 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/06/09/survey-analysis-in-sql/Charco Hui, as his Honours project in Statistics, has been writing a package for complex-survey analysis using dplyr and dbplyr. It’s here. At the moment it has only been tested with MonetDB, using the github version (0.5.2) of MonetDBlite, but it should work with many other databases (not SQLite, at the moment). I hope it’s still under development: the approach does seem to be useful for large survey data sets – and for smaller data sets the dplyr version is faster than the survey package, though more limited.New blog home
Tue, 05 Jun 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/06/05/new-blog-home/One way to prove you can really keep up with the cool kids is to move your blog to GitHub the day after they get bought by Microsoft.
I’m not actually worried by that: one of the key features of a git repository is that it doesn’t have the only copy of any of your stuff. The main motivation for switching was to use blogdown rather than Tumblr, because my blog is mostly text.Biased and Inefficient
Fri, 01 Jun 2018 00:00:00 +0000http://notstatschat.netlify.com/about/I’m a statistical researcher in Auckland. This blog is for things that don’t fit on my department’s blog, StatsChat.
I also tweet as @tslumleyGraduation
Mon, 14 May 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/05/14/graduation/So, I gave one of the graduation addresses during Silly Hat Week last week. It’s not my usual style of writing, but since there’s no point being embarrassed about a speech you’ve already given in front of your boss, your boss’s boss, and 2000 other people, I thought I’d post it here. Chancellor, Vice-Chancellor, Members of Council, fellow Members of the University, Graduands, families and friends: kia ora tātou
It’s five o’clock on a Friday.svylme
Sun, 01 Apr 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/04/01/svylme/I’m working on an R package for mixed models under complex sampling. It’s here. At the moment, it only tries to fit two-level linear mixed models to two-stage samples – for example, if you sample schools then students within schools, and want a model with school-level random effects. Also, it’s still experimental and not really tested and may very well contain nuts.
The package uses pairwise composite likelihood, because that’s a lot easier to implement efficiently than the other approaches, and because it doesn’t have the problems with nonlinearity and weight scaling.Small p hacking
Fri, 23 Mar 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/03/23/small-p-hacking/The proposal to change p-value thresholds from 0.05 to 0.005 won’t die. I think it’s targeting the wrong question: many studies are too weak in various ways to provide the sort of reliable evidence they want to claim, and the choices available in analysis and publication process eat up too much of that limited information. If you use p-values to decide what to publish, that’s your problem, and that’s what you need to fix.Chebyshev’s inequality and `UCL’
Thu, 15 Mar 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/03/15/chebyshevs-inequality-and-ucl/Chebyshev’s inequality (or any of the other transliterations of Чебышёв) is a simple bound on the proportion of a distribution that can be far from the mean. The Wikipedia page, on the other hand, isn’t simple. I’m hoping this will be more readable.
We have a random quantity \(X\) with mean \(\mu\) and variance \(\sigma^2\), and – knowing nothing else – we want to say something about the probability that \(X-\mu\) is large.Why pairwise likelihood?
Tue, 13 Mar 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/03/13/why-pairwise-likelihood/Xudong Huang and I are working on fitting mixed models using pairwise composite likelihood. JNK Rao and various co-workers have done this in the past, but only for the setting where the structure (clusters, etc) in the sampling is the same as in the model. That’s not always true. The example that made me interested in this was genetic analyses from the Hispanic Community Health Survey. The survey is a multistage sample: census block groups and then households.Faster generalised linear models in largeish data
Mon, 05 Mar 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/03/05/faster-generalised-linear-models-in-largeish-data/There basically isn’t an algorithm for generalised linear models that computes the maximum likelihood estimator in a single pass over the \(N\) observations in the data. You need to iterate. The bigglm function in the biglm package does the iteration using bounded memory, by reading in the data in chunks, and starting again at the beginning for each iteration. That works, but it can be slow, especially if the database server doesn’t communicate that fast with your R process.Useful debugging trick
Wed, 31 Jan 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/01/31/useful-debugging-trick/If you have a thing with lots of indices, such as a fourth-order sampling probability \(\pi_{ijk\ell}\) (the probability that individuals \(i\), \(j\), \(k\) and \(\ell\) are all sampled), there will likely be scenarios where it has lots and lots of symmetries. A useful trick is to write a wrapper that checks them:
FourPi <- function(i, j, k, l) {
  answer <- FourPiInternal(i, j, k, l)
  # swapping the first two indices should not change the probability
  sym <- FourPiInternal(j, i, k, l)
  # stop, reporting the indices, if the relative difference exceeds EPSILON
  if (abs((answer - sym) / (answer + sym)) > EPSILON) stop(paste(i, j, k, l))
  answer
}
Other useful tricks:
The score (derivative of loglikelihood) has mean zero at the true parameters under sampling from the model, even in finite samples
Quite a few design-based variance estimators are unbiased for the sampling variance even in small samples.More tests for survey data
Mon, 22 Jan 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/01/22/more-tests-for-survey-data/If you know about design-based analysis of survey data, you probably know about the Rao-Scott tests, at least in contingency tables. The tests started off in the 1980s as “ok, people are going to keep doing Pearson \(X^2\) tests on estimated population tables, can we work out how to get \(p\)-values that aren’t ludicrous?” Subsequently, they turned out to have better operating characteristics than the Wald-type tests that were the obvious thing to do – mostly by accident.The Ihaka Lectures
Mon, 22 Jan 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/01/22/the-ihaka-lectures/They’re back!
This year our theme is visualisation. The lectures will again run on Wednesday evenings in March. The three speakers work in different areas of data visualisation: collect the complete set!
Paul Murrell is an Associate Professor in Statistics here in Auckland. He’s a member of the R Core Development Team, and responsible for a lot of graphics infrastructure in R. The ‘grid’ graphics system grew out of his PhD thesis with Ross Ihaka.As far as it goes
Sat, 20 Jan 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/01/20/as-far-as-it-goes/I’ve been reading two somewhat depressing documents today.
The American Statistical Association has put out a position paper titled “Overview of Statistics as a Scientific Discipline and Practical Implications for the Evaluation of Faculty Excellence”. It says, in the executive summary
Statistics is at the same time a dynamic, stand-alone science with its own core research agenda and an inherently collaborative discipline, developing in response to scientific needs. In this sense, statistics fundamentally differs from many other domain-specific disciplines in science.breakInNamespace
Mon, 15 Jan 2018 00:00:00 +0000http://notstatschat.netlify.com/2018/01/15/breakinnamespace/Attention Conservation Notice: I’m putting this in a blog post in the hope it makes it easier for other people to find when they encounter the problem.
The !! and !!! quasiquotation syntax in R’s tidyverse will break if you run them through the parser and deparser. This means:
Printing out the code of a function at the command line may give the wrong code
Functions like fix(), fixInNamespace(), and edit() may break functions using quasiquotation.e-bike-onomics
Sat, 30 Dec 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/12/30/e-bike-onomics/If you own an e-bike, you get used to certain questions: how fast does it go, how often does it need to be charged, how much did it cost?
My e-bike was a bottom-end one two years ago – I didn’t know if I’d end up using it, so I didn’t spend more than I had to. Since then, the quality has generally gone up, and so has the price. Andrew Chen, who just bought one, says that reliable entry-level bikes are $2500-$3000.Statistics on pairs
Tue, 26 Dec 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/12/26/statistics-on-pairs/I’m interested in estimation for complex samples from structured data — clustered, longitudinal, family, network — and so I’m interested in intuition for estimating statistics of pairs, triples, etc. This turns out to be surprisingly hard, so I want easy examples. One thing I want easy examples for is the relationship between design-weighted \(U\)-statistics and design-weighted versions of their Hoeffding projections. That is, if you write a statistic as a sum over all pairs of observations, you can usually rewrite it as a sum of a slightly more complicated statistic over single observations, and I want to think about whether the weighting should be done before or after you rewrite the statistic.How to add chi-squareds
Wed, 06 Dec 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/12/06/how-to-add-chi-squareds/A quadratic form in Gaussian variables has the same distribution as a linear combination of independent \(\chi^2_1\) variables – that’s obvious if the Gaussian variables are independent and the quadratic form is diagonal, and you can make that true by change of basis. The coefficients in the linear combination are the eigenvalues \(\lambda_1,\dots,\lambda_m\) of \(VA\), where \(A\) is the matrix representing the quadratic form and \(V\) is the covariance matrix of the Gaussians.Secret Santa collisions
Sat, 25 Nov 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/11/25/secret-santa-collisions/Attention Conservation Notice: while this probability question actually came up in real life, that’s just because I’m a nerd.
“Secret Santa” is a Christmas tradition for taming the gift-giving problem in offices, groups of acquaintances, etc. Instead of everyone wondering which subset of people they should give a gift to, each person is randomly assigned one recipient and has to give a gift (with a relatively low upper bound on cost) to that one person.When all U-shaped curves look the same to you
Thu, 23 Nov 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/11/23/when-all-u-shaped-curves-look-the-same-to-you/There was (as usual) controversy about some of the NCEA maths questions this year. Most of the controversy was about whether they assumed knowledge that the students hadn’t been told to know, but I’m going to worry about the pseudocontext problem
One question had the set up
and then (from the Herald), the question was
The maths problem is fine as a quadratic equation, I suppose. But the physics is wrong and the maths isn’t how any sane person would answer the question in reality.Means of maximums
Wed, 08 Nov 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/11/08/means-of-maximums/From a math point of view, it’s an interesting example of how the mean of the maximum of a set of random variables is higher than the max of the individual means – Andrew Gelman
Controlling the maximum of a set of random variables is an important problem in mathematical statistics, and it’s surprising how far a comparatively crude approach can be stretched.
Suppose you have \(m\) random variables \(X_1\), .Haere mai, statistical computing folks
Tue, 26 Sep 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/09/26/haere-mai-statistical-computing-folks/Later this year, Auckland is hosting the Asian regional meeting of the International Association for Statistical Computing. For the benefit of conference-goers, here’s a brief introduction to the locale. Nomenclature:
The Owen G. Glenn Building (OGGB, or building 260, in university abbreviations) is named after Owen G. Glenn. He’s a New Zealand businessman and philanthropist. Auckland is named after George Eden. The subantarctic Auckland Islands were not named after George but after his father William Eden.A genome analogy
Mon, 25 Sep 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/09/25/a-genome-analogy/DNA looks like a zipper. If you scale it up to a fairly fine-toothed zipper with tooth spacing of about 2mm or 1/12in, the human genome would run about from Auckland to Hawai’i.
On this scale, 1 Morgan is about 230km, so you inherit contiguous genome chunks from your grandparents of about 60km. The HLA region is about 6km long.
A million-variant SNPchip has markers spaced several meters apart.
A typical gene is maybe 20 meters from start to finish, but a lot of it is introns, and most exons are less than half a meter long.Bayesian surprise
Fri, 22 Sep 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/09/22/bayesian-surprise/For reasons not entirely unconnected with NZ election polling, I’ve been thinking about surprise in Bayesian inference again: what happens when you get a result that’s a long way from what you expected in advance? Yes, your prior is badly calibrated and you should feel bad, but what should you believe?
A toy version of the problem is inference for a location parameter. We have a prior \(p_\theta(\theta)\) for the parameter, and a model \(p_X(x|\theta)\).Visual design of diagnostics
Wed, 06 Sep 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/09/06/visual-design-of-diagnostics/Q: Are these curves parallel?
A: I mean, probably not? They look like they might be getting closer together, but if those big steps mean more uncertainty…
Q: Ok, how about with confidence intervals. Now are they parallel?
A: Um. I’m not sure that helped. Still a definite maybe
Q: Is this curve horizontal?
A: No. It slopes down. It crosses zero somewhere around 8 or 9 years.Causes and counterfactuals
Wed, 23 Aug 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/08/23/causes-and-counterfactuals/Attention Conservation Notice: this was on StatsChat four years ago, but I like it as a causation example.
A story in the Herald illustrates a subtle technical and philosophical point about causation. A Lotto winner says
“I realised I was starving, so stopped to grab a bacon and egg sandwich.
“When I saw they had a Lotto kiosk, I decided to buy our Lotto tickets while I was there.Wilcoxon and polymath: another update
Sat, 19 Aug 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/08/19/wilcoxon-and-polymath-another-update/As I wrote before, there’s a polymath (large-scale collaborative pure maths) project on transitivity of dice. Here’s the latest update from Timothy Gowers’s blog.
Suppose \(X\), \(Y\), and \(Z\) are discrete distributions supported on \(1,2,\dots,n\). We can ask about \(P(X<Y)\) and \(P(Y<Z)\) and \(P(X<Z)\), which is what the Wilcoxon/Mann-Whitney rank test does. The project has basically proved that under one model for randomly choosing distributions, if \(X\), \(Y\), and \(Z\) have the same mean and \(P(X>Y)>1/2\) and \(P(Y>Z)>1/2\), the probability of \(P(X>Z)>1/2\) is \(1/2+o(1)\).The bus bot
Thu, 10 Aug 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/08/10/the-bus-bot/Back in January, I spent a few hours hacking together a script to tweet summaries of the Auckland bus system, on the account @tuureiti. People seemed to like it: the bot has 110 followers, many of whom appear to be actual people (or at least actual organisations).
A few times I’ve been asked for the source code and hadn’t gotten around to it, because the code is ugly, includes my API key, and is ugly.Psychoactive substances and Peter Dunne
Wed, 26 Jul 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/07/26/psychoactive-substances-and-peter-dunne/New Zealand, like a lot of places, has a problem with illegal sales of potent synthetic cannabinoid receptor agonists (aka ‘synthetic cannabis’, ‘synthetic marijuana’, ‘Spice’, ‘K2’, etc, etc). Peter Dunne, as the responsible Minister, is getting a lot of criticism. I don’t think Peter Dunne should be an MP. His party got 0.22% of the vote at the last election. In theory it’s conceivable he got in because he provides astonishingly good constituency service to Ohariu rather than as an edge case in the MMP voting system, but I find that hard to believe.Tail bounds under sparse correlation
Wed, 26 Jul 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/07/26/tail-bounds-under-sparse-correlation/Attention Conservation Notice: Very long and involves a proof that hasn’t been published, though the paper was rejected for unrelated reasons.
Basically everything in statistics is a sum, and the basic useful fact about sums is the Law of Large Numbers: the sum is close to its expected value. Sometimes you need more, and there are lots of uses for a good bound on the probability of medium to large deviations from the expected value.Information and control
Tue, 25 Jul 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/07/25/information-and-control/There were delays on the Auckland rail system this morning, apparently due to a train hitting a person in south Auckland. It seems unreasonable to complain about the delays; Auckland Transport doesn’t have a warehouse of magic inflatable replacement trains, and owing to historic underfunding of trains, there isn’t a lot of redundancy in the physical track network. They actually did a pretty good job of moving around the trains they have, and I was only delayed about twenty minutes.Probabilities not bounded away from zero
Sun, 09 Jul 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/07/09/probabilities-not-bounded-away-from-zero/We have a population or cohort of \(N\) people divided into \(H\) sampling strata, with a sample of size \(n_h\) taken from the population \(N_h\) in stratum \(h\). Let \(\pi_{ih}\) be the sampling probability for person \(i\) in stratum \(h\). When we do asymptotics we usually assume \(\pi_{ih}\) are bounded away from zero. That’s not ideal for, say, case-control studies of rare diseases, where we might want asymptotic approximations based on the case incidence being small (ie, converging to zero).Two-day course: survival analysis
Wed, 05 Jul 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/07/05/two-day-course-survival-analysis/Tuesday 12 and Wednesday 13 September 2017, 9am-5pm. This two-day workshop will cover data exploration, data summaries, and regression modelling for time-to-event data. There will be both lecture and practical sessions.
Topics:
Concepts: censoring, truncation, competing risks, choice of time scale
Summaries: the Kaplan–Meier curve; mean, median, and proportion surviving; the hazard rate; graphical exploration
Two-sample testing: the logrank test and its strengths and weaknesses
The proportional hazards model: right censoring, left truncation, time-varying predictorsA possibly unsurprising bootstrap observation
Sun, 11 Jun 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/06/11/a-possibly-unsurprising-bootstrap-observation/Suppose you have a finite population modelled as a realisation of some probability model with potentially complicated spatial structure, and a multistage sample taken with some different structure. For example, suppose you have a genetic linear mixed model with ancestry and relatedness structure, but you sample people by census block group and household. It is either blindingly obvious or really surprising (or both?) that the sampling component of the standard error doesn’t depend on the structure of the model.Stupid word games
Mon, 05 Jun 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/06/05/stupid-word-games/Today, Jeroen Ooms announced the appearance on CRAN of an R package for language detection, wrapping the “CLD2” compact language detector. Obviously, given a tool like that on a holiday long weekend, my first reaction was to try to confuse it.
Two fun games to play with a language detector:
Find an obviously English sentence (ideally a quote) that it doesn’t recognise as English, and a very non-obviously English sentence that it doesPipeable survey analysis in R
Mon, 29 May 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/05/29/pipeable-survey-analysis-in-r/Today, I accidentally found out about the ‘srvyr’ package, which is a wrapper for my ‘survey’ package to make it work with %>% pipes and dplyr and so on. Yay!
R has a package discovery problem. I wouldn’t say I’m the most plugged-in of R users, but there must be a reasonable fraction who would be even less likely than me to find out about it. Even though the ‘survey’ package design sticks fairly close to ‘tidy data’ principles, the fact that it uses different conventions from the `tidyverse’ packages means that there’s a whole lot of adaptor code needed.Peer review and community endorsement
Mon, 22 May 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/05/22/peer-review-and-community-endorsement/In most academic fields there are journals it’s easy to publish in. Some of these are outright scams, but some are just not that fussy about the importance of results. In the experimental sciences, being able to publish negative or otherwise uninteresting results can be very important. Even in fields where ideas, rather than data, are important, being able to get research out into libraries is valuable – though preprint servers such as arXiv are now filling that niche.A ‘polymath’ project on the Wilcoxon test?
Fri, 12 May 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/05/12/apolymath-project-on-the-wilcoxon-test/`Polymath’ is a set of projects in massive collaborative proof of mathematical results; Terry Tao and Timothy Gowers are two of the famous mathematicians involved. There’s a new potential project on Gowers’s blog, which he describes as being related to intransitive dice. As you know, if you read this blog, (a) I prefer the term non-transitive, and (b) this means it’s about the Wilcoxon test.
The idea of the conjecture is that you define an \(n\)-sided die by sampling uniformly with replacement \(n\) numbers from \(1, 2,3,\dots,n\) as the numbers on the sides, with the constraint that the numbers have to add up to \(n(n+1)/2\).Value of a degree
Mon, 01 May 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/05/01/value-of-a-degree/Today was graduation day for Science students at the University of Auckland. At each graduation, the Chancellor of the University gives an introduction that includes (for example, here)
We know that, compared to those whose formal education ends in high school, graduates have lower unemployment rates, higher salaries, better career prospects, and better health outcomes.
I’d hope that a university degree would give students the tools to think about claims like that.Prerequisites
Wed, 29 Mar 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/03/29/prerequisites/This week, John Myles White tweeted
One meme I wish would die off: the belief that we can teach high school students statistics without teaching them calculus.
Statistics Twitter was immediately divided between “Preach it, brother!” and “Not cool, dude.” I’m mostly, but not entirely, in the latter camp. Personally, I did study calculus before taking up statistics, and it helped. In fact, I studied tensor calculus, functional analysis, measure theory, group theory, and differential topology before taking up statistics.Come work with us
Tue, 28 Mar 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/03/28/come-work-with-us/The Statistics Department at the University of Auckland is looking for three new academics. We have two entry-level positions[1], and one mid-level to senior position[2]. Formal ad here:
(https://www.flickr.com/photos/yaranaika/5612799116)
The department has a fairly broad view of what statistics is about: we have probabilists (both theoretical and applied), researchers in mainstream statistical methods, in astrostatistics, in genomics, in stats education, in statistical ecology, in forensic statistics, and (famously) in statistical computing.Flat Earthers
Mon, 27 Mar 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/03/27/flat-earthers/The world isn’t a flat rectangle. We’ve got to the stage where most people accept this. It’s especially easy in New Zealand, where we know you can fly in a wide variety of directions and still end up in Europe after about the same time in the air. Since the world isn’t a flat rectangle, all flat rectangular maps have to be badly wrong somehow. Recently, Boston public schools have shifted from the badly-wrong Mercator projection to the differently-wrong Gall-Peters projection.Why I like the Convolution Theorem
Mon, 27 Mar 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/03/27/why-i-like-the-convolution-theorem/The convolution theorem (or theorems: it has versions that some people would call distinct species and others would describe as mere subspecies) is another almost obviously almost true result, this time about asymptotic efficiency. It’s an asymptotic version of the Cramér–Rao bound. Suppose \(\hat\theta\) is an efficient estimator of \(\theta\) and \(\tilde\theta\) is another, not fully efficient, estimator. The convolution theorem says that if you rule out stupid exceptions, asymptotically \(\tilde\theta=\hat\theta+e\) where \(e\) is pure noise, independent of \(\hat\theta\).Case-control efficiency
Sat, 18 Mar 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/03/18/case-control-efficiency/The basic story about sampling weights and regression is fairly simple: if you don’t need the weights, using them will add noise. The standard error increase is basically proportional to the coefficient of variation of the weights, and doesn’t depend on the regression coefficients or the covariate distribution. Logistic regression in a case-control sample looks superficially as if it should be the same. The maximum likelihood estimator is unweighted logistic regression, ignoring the weights, and it’s more efficient than the estimator using sampling weights.Order and quotient topologies
Tue, 14 Mar 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/03/14/order-and-quotient-topologies/Over the years when I was intermittently working on the rock-paper-scissors (transitivity) problem in statistical testing, one of the confusing things was the difference between order and quotient topologies. I thought I’d write about why.
Suppose you have two-dimensional Euclidean space, with points \((x,y)\), and you decide to order points on the first coordinate, so \((x,y)\prec(z,w)\) iff \(x<z\). This gives you equivalence classes \((x,y)\sim(z,w)\) iff \(x=z\). There are two obvious topologies on the set of equivalence classes.“Meritocracy” and “public good”
Sat, 11 Mar 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/03/11/meritocracy-and-public-good/Sometimes a word coined with one intended meaning ends up with a very different one, and after you have fought the good fight and run the race to the finish, you need to just give up. My favourite example is “meritocracy”, a word coined like “truthiness” and “factoid” to satirically attack a social trend. It failed completely: while “truthiness” worked, and “factoid” still has some of its negative connotation, “meritocracy” now means exactly the concept that it attacked: the supposed ideal of stack-ranking based on a straightforward one-dimensional metric for merit.Hearing things
Sun, 05 Mar 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/03/05/hearing-things/I spend more time than I like on aeroplanes, so I thought I’d write something about my experiences with headphones.
1. Having something to cut out the engine noise makes a noticeable difference to how much air travel sucks. Maybe as much as 10%.
2. When I first started to teach R courses for money, in 2001, I bought a pair of Bose noise-cancelling headphones. These make a big difference, and it’s easy to listen to music with them on.The Ihaka Lectures
Thu, 02 Feb 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/02/02/the-ihaka-lectures/The Stats department at the University of Auckland is inaugurating a public lecture series, named to honour Ross Ihaka, who is planning to retire this year. We’re having four lectures, with speakers chosen to represent a wide range of areas where statistical computing and graphics is important. Wednesday, March 8: Hadley Wickham (Chief Scientist, RStudio; (honorary) Associate Professor, University of Auckland). Hadley did an MSc in Statistics here in Auckland and a PhD with Di Cook’s statistical graphics group at Iowa State University.When the bootstrap doesn’t work
Wed, 01 Feb 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/02/01/when-the-bootstrap-doesnt-work/The bootstrap always works, except sometimes. By ‘works’ here, I mean in the weakest senses that the large-sample bootstrap variance correctly estimates the variance of the statistic, or that the large-scale percentile bootstrap intervals have their nominal coverage. I don’t mean the stronger sense that someone like Peter Hall might use, that the bootstrap gives higher-order accurate confidence intervals. So the bootstrap ‘works’ for the median, even though not as well as for smooth functions of the mean.Te Reo Māori in schools
Tue, 31 Jan 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/01/31/te-reo-m%C4%81ori-in-schools/Having Te Reo Māori taught as part of the standard curriculum in NZ schools seems like a reasonable idea to me. A few reasons:
1. Learning more than one language is good for understanding grammar and pronunciation, and it doesn’t matter a lot which one. “Grammar” is, almost by definition, the set of rules for correct sentences that native speakers follow most of the time without thinking, so it’s hard to talk and think about grammar sensibly if you’ve never tried to produce correct sentences in another language.Case-control sampling and pseudo-Rsquareds
Fri, 27 Jan 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/01/27/case-control-sampling-and-pseudo-rsquareds/So, I have been asked a few times how to compute \(R^2\) for models fitted to survey data. Initially the questions were about the ordinary linear-regression \(R^2\), which is easy because it’s the ratio of two variances, and we can estimate variances. More recently, people have been asking about the Nagelkerke pseudo-\(R^2\) in logistic regression. It’s not immediately obvious how to define the Nagelkerke \(R^2\) under complex sampling. My approach was to consider the Cox–Snell \(R^2\) that precedes it, which is an estimate of a well-defined population quantity: \(\log (1-R^2)\) is the mutual information between the predictors and outcome.A bus-watching bot
Tue, 17 Jan 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/01/17/a-bus-watching-bot/When it’s up, the account @tuureiti on Twitter tweets a summary of the state of Auckland buses – at the moment, every 15 minutes.
Q: Can you explain that picture?
A: Every bus that the Auckland Transport GTFS feed knows about has a dot on the graph. The GTFS feed has a ‘delay’ field that says how far ahead or behind schedule the bus is, separated by whether the next event is ‘arrival’ or ‘departure’.Mature and premature optimisation
Thu, 12 Jan 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/01/12/mature-and-premature-optimisation/Earlier this week I wrote some code that wasted 90% of its time moving data around in memory, because I just ‘grew’ a long vector with the idiom > stuff<-c(stuff, morestuff) Here’s the github commit that changed the code.
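A rough Python analogue of that idiom, offered as an illustrative sketch and not as the R code from the commit (the function names and the chunk sizes are made up for the example): rebuilding a list with `stuff = stuff + more` copies everything accumulated so far on every step, much as `c(stuff, morestuff)` does, whereas `extend` only copies the new elements.

```python
def grow_by_copying(chunks):
    """Rebuild the whole list on every step: total work is quadratic."""
    stuff = []
    for more in chunks:
        stuff = stuff + more   # allocates a new list and copies all of stuff
    return stuff


def grow_in_place(chunks):
    """Append to the existing list: amortised linear total work."""
    stuff = []
    for more in chunks:
        stuff.extend(more)     # only the new elements are copied
    return stuff


# Hypothetical input: 2000 chunks whose lengths are not known in advance.
chunks = [list(range(i % 7)) for i in range(2000)]
assert grow_by_copying(chunks) == grow_in_place(chunks)
```

Both versions give the same answer; the copying version is the simpler one to write when the chunk lengths are unknown, and it only becomes a problem once the accumulated vector is long.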
I’m writing about it because it illustrates a few useful points. First, the inefficient code was absolutely the right choice initially. I didn’t know how long each additional vector would be, and while I could have worked it out in principle, in practice I would quite likely have got it wrong.Fixing an infelicity in ‘leaps’
Mon, 09 Jan 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/01/09/fixing-an-infelicity-inleaps/The leaps package for R is ancient – this is its twentieth year on CRAN. It uses old Fortran code by the Australian computational statistician Alan Miller. The Fortran 90 versions are on the web, but Fortran 90 compilation with R wasn’t portable back then, so I used the older Fortran 77 version. The main point back in 1997 was to provide a version of the leaps() function in S, which uses a branch-and-bound algorithm to do exhaustive search for the best (smallest residual-sum-of-squares) model of each size.Learning the Monty Hall problem
Tue, 03 Jan 2017 00:00:00 +0000http://notstatschat.netlify.com/2017/01/03/learning-the-monty-hall-problem/As Wikipedia gives it
Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?The ‘iris’ data
Fri, 30 Dec 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/12/30/theiris-data/Fisher’s famous ‘iris’ data set is a convenient example because it’s small and low-dimensional and has very marked differences between groups. These characteristics also make it a bad example (edit: at least for modern machine learning), because the behaviour of small, low-dimensional classification problems is a very poor guide to the behaviour of large or high-dimensional ones. That’s all obvious. What’s less well known is that the data set is an example of pseudocontext in education.Making survey statistics boring and inefficient
Wed, 23 Nov 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/11/23/making-survey-statistics-boring-and-inefficient/Last night, Alastair Scott was awarded the Jones Medal by the Royal Society of New Zealand. The medal, named after Vaughan Jones, is for lifetime achievement in the mathematical sciences. Alastair made contributions to both theoretical and applied statistics in two main areas, as the title of this post indicates. With Jon Rao and others (including me), he worked on making design-based inference boring, and with Chris Wild and others, on making it inefficient.Brief quake summary for overseas people
Mon, 14 Nov 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/11/14/brief-quake-summary-for-overseas-people/There was a pair of big earthquakes in New Zealand last night (late Sunday morning UTC, just after midnight Monday NZ time). They were about half-way between Wellington and Canterbury, in the northeast of the South Island. There have been a lot of smaller related quakes as well.
Auckland and Dunedin are unharmed. Christchurch was shaken but not seriously damaged; some people were evacuated because of the potential for a big tsunami.Changes in turnout and preference
Thu, 10 Nov 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/11/10/changes-in-turnout-and-preference/So, as you know, Hillary Clinton narrowly lost the Electoral College and probably narrowly won the popular vote. And there’s lots of theorising about how these huge swings came about and what they mean. An important first step is to think about how big the swings really were.
Here are some graphs of county-level votes in 2012 and 2016. In all the graphs, the number of votes for the candidate is scaled by the 2012 total for the county, and is then weighted by that same 2012 total.Cuts to ‘Growing Up in New Zealand’
Tue, 18 Oct 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/10/18/cuts-togrowing-up-in-new-zealand/The NZ cohort study ‘Growing Up in New Zealand’ is being cut from 7000 children to 2000, according to a story on Stuff today. That’s unfortunate – birth cohort studies are something New Zealand has done well in the past, and this is a cohort for the modern New Zealand. Obviously, the top priority for the study will have been to fight the cuts or at least try to moderate them.Terms to eschew
Wed, 12 Oct 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/10/12/terms-to-eschew/“I have discovered something else,” I continued. “By flipping the pages at random, and putting my finger in and reading the sentences on that page, I can show you what’s the matter – how it’s not science, but memorizing, in every circumstance. Therefore I am brave enough to flip through the pages now, in front of this audience, to put my finger in, to read, and to show you.” Richard Feynman, at a public lecture in BrazilLarge quadratic forms
Tue, 27 Sep 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/09/27/large-quadratic-forms/Attention Conservation Notice: there probably aren’t as many as half a dozen groups in the world who actually have this much genome sequencing data. Everyone else could wait to see if something better comes up.
If you follow me on Twitter, you will have seen various comments about eigenvalues, matrices, and other linear algebra over the past months. Here, finally, is the Sekrit Eigenvalue Project. A quadratic form in Normally distributed variables is of the form \(z^TAz\) where \(z\) is a vector of \(n\) standard Normals and \(A\) is an \(n\times n\) matrix.The hard problem of AI and other stories
Thu, 22 Sep 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/09/22/the-hard-problem-of-ai-and-other-stories/Another occasional SF/F post.
Amazon now has a lot of older Melissa Scott novels in Kindle format. In the old days, Melissa Scott was known for forthrightly LGBTQ fiction. After a decade or two, there’s been enough social progress for that to not be the most obvious thing about her writing. The ‘hard problem of consciousness’ is a term of art in philosophy of mind. It’s either the most important question about intelligence, or a purely linguistic distraction from the real issues.Come work with us
Wed, 07 Sep 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/09/07/come-work-with-us/The Statistics department at the University of Auckland is advertising for a Professional Teaching Fellow, following the retirement of existing staff. This is a full-time, permanent, academic staff position in a department that understands how much its success depends on high-quality teaching.
Aerial view of Auckland by Flickr user Craig, annotated to show Stats Dept
The formal ad is on Seek. The department is seeking to appoint a highly organised, energetic and collegial person for the role of Professional Teaching Fellow.On permuting all the things
Tue, 06 Sep 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/09/06/on-permuting-all-the-things/I wanted to list all the numbers whose digits were some permutation of 2,2,5,5,9,9, and find how many of them were multiples of 11, and how many were prime. (Because of Evelyn Lamb’s comment on the prime number 295259 produced by the prime numbers twitter bot)
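One lazy way to attack this question, sketched in Python under assumptions of my own (the helper names, the seed, and the number of draws are all hypothetical, and this is not the code from the post), is to shuffle the digits many times and keep the distinct numbers that turn up. Since there are 6!/(2! 2! 2!) = 90 distinct arrangements, the sample is complete once it contains 90 numbers.

```python
import random


def is_prime(n):
    """Trial division; fine for six-digit numbers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def sample_arrangements(digits, draws, seed=1):
    """Shuffle the digits `draws` times and keep the distinct numbers seen."""
    rng = random.Random(seed)
    digits = list(digits)
    seen = set()
    for _ in range(draws):
        rng.shuffle(digits)
        seen.add(int("".join(digits)))   # the set discards duplicates
    return seen


numbers = sample_arrangements("225599", draws=20000)
multiples_of_11 = sorted(n for n in numbers if n % 11 == 0)
primes = sorted(n for n in numbers if is_prime(n))

# Number of arrangements found, multiples of 11, and primes.
print(len(numbers), len(multiples_of_11), len(primes))
```

With 20000 draws the chance of missing any one of the 90 arrangements is negligible, and checking `len(numbers)` against 90 confirms that nothing was missed.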
It takes some thought to work out how to list those numbers exactly once (because of the duplicated digits) but no thought at all to work out how to generate a random sample of them and discard duplicates.The lithium-powered space bike
Sun, 04 Sep 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/09/04/the-lithium-powered-space-bike/Q: So, it’s been about 11 months since you got your fancy electric-assist bike
A: Yes, that’s right
Q: Have you given up yet?
A: No, it’s still fun.
Q: Even with the rain?
A: Combining Doppler radar and the detailed weather forecasts has mostly kept me dry
Q: And getting killed by cars?
A: So far, still at less than 1 event.
Q: How do you feel about busy two-lane roundabouts?“The” multiple comparisons problem
Sat, 27 Aug 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/08/27/the-multiple-comparisons-problem/Andrew Gelman posted recently with the title “Bayesian inference completely solves the multiple comparisons problem”. Bayesians have been making a claim that sounds like this for many years, so it would be easy to misunderstand and think he was making a much weaker claim than he actually is. There are at least two multiple comparisons problems, and I’d like to suggest some terminology:
The first-person multiple comparisons problem: I have data relevant to a collection of parameters \(\{\theta_i\}_{i=1}^N\) and I want to make sure I arrive at sensible beliefs or take sensible decisions even if \(N\) is quite largeLike a crossword
Sat, 20 Aug 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/08/20/like-a-crossword/The philosopher of science Susan Haack has a lovely analogy for the interconnectedness of scientific ideas: the crossword puzzle. We’re talking something along the lines of the New York Times crossword, not a British-style cryptic: the clues for each entry are often insufficient taken one at a time, but a false answer is likely to be revealed by its failure to fit with crossing answers.
Chris McDowall recently reminded me of the Phantom Time Hypothesis, my favourite engagingly batshit historical theory.Simulations and modes of convergence
Sun, 14 Aug 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/08/14/simulations-and-modes-of-convergence/We often have theory that says \[\sqrt{n}(\hat\theta_n-\theta)\stackrel{d}{\to}N(0,\sigma^2),\]
and then do simulations to see how well the asymptotic approximation applies. After doing so, we often present tables of the empirical mean and standard deviation of \(\hat\theta_n.\) This doesn’t make a lot of sense.
Knowing that \(\sqrt{n}(\hat\theta_n-\theta)\stackrel{d}{\to}N(0,\sigma^2)\) doesn’t tell us anything about the moments of \(\hat\theta_n\) for any finite \(n\). Convergence in distribution does not imply convergence in mean. For example, \(\hat\theta_n\) could be maximum likelihood estimates in a logistic regression model.Etymology
Tue, 02 Aug 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/08/02/etymology/Penguin: the name is supposed to come from the Welsh pen gwyn, meaning ‘white head’. Since penguins have black heads, and do not live within 10,000 km of Wales, it is difficult to see how this theory arose.A modest proposal: Lazy Ambiguous Single Transferable Vote
Fri, 29 Jul 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/07/29/a-modest-proposal-lazy-ambiguous-single-transferable-vote/We’re about to have another outbreak of voting here in NZ as well. The local government elections use STV, and Graeme Edgeler explains it here. In particular, he explains how indicating preferences for all the candidates, even ones you don’t want to win, is desirable.
Because Twitter is Twitter, a discussion of this came up with Rob Salmond’s proposal that you should be able to vote 1,2, 3, , 35,36 for, say, the District Health Board elections where there are a few good candidates who are worth voting for, a couple of antifluoride or antivax extremists who need to be voted against, and a lot of boring and irrelevant candidates.One scoRe years
Thu, 28 Jul 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/07/28/one-score-years/It’s always nice when even imperfect metrics make you look good. The new programming-language rankings in IEEE Spectrum are out. I don’t think I believe their weighting system, but it has R in 5th place! Since last year, we’ve just edged out C#.
Bob Muenchen looks at statistical software citations on Google Scholar, and finds that R is narrowly in front of SAS – and though SPSS is well ahead, it’s headed down.How do we prove the Central Limit Theorem?
Mon, 04 Jul 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/07/04/how-do-we-prove-the-central-limit-theorem/More precisely, in a course in mathematical statistics that’s trying not to assume more mathematics than necessary, how do we prove it? A (Weak) Law of Large Numbers is easy: Markov’s inequality, then Chebyshev’s Inequality, not needing anything more than the simplest manipulation of expectations. The CLT is hard.
The standard approach is to use characteristic functions: prove Levy’s Continuity Theorem, work out what the characteristic function of an iid sum looks like, and then work out the characteristic function of a Normal.Computing the (simplest) sandwich estimator incrementally
Sat, 04 Jun 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/06/04/computing-the-simplest-sandwich-estimator-incrementally/The biglm package in R does {incremental, online, streaming} linear regression for data potentially larger than memory. This isn’t rocket science: accumulating \(X^TX\) and \(X^TY\) is trivial; the package just goes one step better than this by using Alan Miller’s incremental \(QR\) decomposition code to reduce rounding error in ill-conditioned problems. The code also computes the Huber/White heteroscedasticity-consistent variance estimator (sandwich estimator). Someone wants a reference for this. There isn’t one, because it’s too minor to publish, and I didn’t have a blog ten years ago.Are there any news?
Fri, 03 Jun 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/06/03/are-there-any-news/I’ve written before about treating ‘data’ as a plural count noun or a mass noun. In most settings I’m happy with either: ‘this data’ or ‘these data’; ‘data is’ or ‘data are’. There are settings where the count version doesn’t feel right to me. I might write “We don’t have much data on that issue”, but never “We don’t have many data on that issue”. Perhaps the most extreme is the opposite of ‘more data’: ‘fewer data’ just seems wrong.Size matters
Thu, 14 Apr 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/04/14/size-matters/There’s a lovely demonstration of simple neural networks at playground.tensorflow.org, which I recommend to anyone interested in teaching or studying them. It shows the inputs, the hidden nodes, and the output classification, and how they change with training. You can add more neurons or more layers interactively, and fiddle with the training parameters. I wish something like this had been available in the early 90s when I was learning about neural networks.Sufficiently advanced technology
Sun, 10 Apr 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/04/10/sufficiently-advanced-technology/As I said on Twitter, I found out this week (1) that there are cheap variable resistors sensitive to acetone (or ethanol) and (2) that many people, even scientists, don’t think this is amazing. Multi-atom molecules are hugely bigger than electrons, but hugely smaller than the bulk semiconductor. They shouldn’t be able to affect resistance.
Other technologies for interfacing chemistry and electronics tend to be based on light absorption (breathalysers, blood oxygen detectors, some DNA sequencers) or on electrons/protons released in chemical reactions (other breathalysers, other DNA sequencers) or in past days on formation of ions in solution.The Great Kiwi Cherry Ripe Scandal
Tue, 29 Mar 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/03/29/the-great-kiwi-cherry-ripe-scandal/In which I unnecessarily calculate a simple probability by maths when I’ve already done it by simulation.
You can just see it on a maths teaching blog as a Bad Example
A company is making packs of eight chocolate bars chosen independently and with equal probability from five types: “Cherry Ripe”,“Dairy Milk”,“Crunchie”, “Caramello”, and “Flake”. What is the probability that a pack will contain seven or more Cherry Ripes?Mostly dead
Mon, 28 Mar 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/03/28/mostly-dead/Inigo Montoya: He’s dead. He can’t talk.
Miracle Max: Whoo-hoo-hoo, look who knows so much. It just so happens that your friend here is only MOSTLY dead. There’s a big difference between mostly dead and all dead. Mostly dead is slightly alive. In the cardiovascular-research trade, there’s a minor but persistent issue of nomenclature. When your heart stops beating and you fall over dead, should that be called “Sudden Cardiac Death” or “Sudden Cardiac Arrest”?Artistic verisimilitude
Thu, 24 Mar 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/03/24/artistic-verisimilitude/Scene: the back room of a gas station not far from a Research 1 university, late twentieth century. “Chris? You got a minute”
It’s Chris’s ‘lunch’ break, which by natural justice and state law should be the time for daydreaming about attractive members of the appropriate sex while trying to start an assignment on Banach spaces. “The boss read an in-flight magazine again” Chris sighs. Having the owner away for a week was good, but he always seemed to get…ideasThe conservative Bonferroni correction
Sun, 20 Mar 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/03/20/the-conservative-bonferroni-correction/It seems to be a surprise to most people (certainly to me) how sharp the Bonferroni correction is when the number of tests is large. Unless the correlation between tests is really high, the actual family-wise Type I error rate is very close to the nominal rate \(\alpha/k\).
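As a benchmark, here is a small Python sketch for the simplest case only: \(k\) independent tests, every null hypothesis true, each test run at level \(\alpha/k\). The function name is hypothetical, and independence is an assumption of the sketch, not a claim about correlated tests. In that case the family-wise error rate is \(1-(1-\alpha/k)^k\), which never falls below \(1-e^{-\alpha}\approx\alpha-\alpha^2/2\), so the Bonferroni bound gives away very little.

```python
import math


def fwer_independent(alpha, k):
    """Family-wise error rate for k independent tests, each at level alpha/k,
    when every null hypothesis is true."""
    # 1 - (1 - alpha/k)^k, computed stably for very large k.
    return -math.expm1(k * math.log1p(-alpha / k))


alpha = 0.05
for k in (1, 10, 1000, 10**6):
    print(k, round(fwer_independent(alpha, k), 5))
```

Even with a million independent tests the achieved rate stays within about \(\alpha^2/2\) of the target \(\alpha\).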
Part of the issue is confusing prior distributions on effect sizes (which can be quite strongly correlated) with null sampling distributions (which tend to be weakly correlated in the extreme tails).Trace estimators and impact factors
Tue, 15 Mar 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/03/15/trace-estimators-and-impact-factors/For a Secret Project™, I needed a quick estimator of the trace of a matrix. To be precise, I have a rectangular matrix \(A\) and I needed \(\mathop{tr}(B)\) and \(\mathop{tr}(B^2)\) where \(B=A^TA\). That sounds easy, but \(A\) is big enough that you don’t want to compute \(A^TA\). The first one actually is easy: \[\mathop{tr}(B)=\sum_{ij}(A_{ij})^2.\] The second one is harder. I tried a sampling approach: estimating a sample of the entries of \(B\) and using \[\mathop{tr}(B^2)=\sum_{ij} (B_{ij})^2.\]A gene for celibacy?
Sun, 13 Mar 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/03/13/a-gene-for-celibacy/Tyson is, unusually for him, completely wrong here. I’ll ignore the use of “gene” to mean “allele”, since that’s a plain-English abuse of notation as harmless as calling Pluto a planet. Put more precisely, he’s saying “if you have a genetic variant (allele) that substantially increases your tendency to celibacy, you didn’t inherit it” The first problem is a statistical one. It would be surprising if you inherited a ‘celibacy gene’, but it would also be surprising if you got it by de novo mutation.Truthy and Sciency
Wed, 02 Mar 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/03/02/truthy-and-sciency/There was a story at the New Zealand Herald, republished without attribution from The Conversation, under the headline “People in their 90s reveal secret to ageing well”. By and large, it’s a pretty good example of what The Conversation is trying to do, but there are some strange bits, such as
Regular exercise changes our epigenome, activating genes that improve muscle function and
Few participants smoked, avoiding the known epigenetic effects of cigarette smoke including lung damage, increased risk of dementia and cancer.Coding linear splines
Mon, 29 Feb 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/02/29/coding-linear-splines/Attention conservation notice: anyone who would actually use this could just sit down and do the algebra almost as quickly.
The best-known splines are cubic: a cubic spline with knots \(x_1,\;x_2,\dots,\;x_m\) is a piecewise-cubic polynomial \(f(x)\) where \(f\), \(f'\), and \(f''\) are continuous at the knots. The name is from the engineer’s drafting tool, a flexible metal strip that – in the infinitely-thin, uniformly flexible asymptote – will form a curve held down at the knots and otherwise minimising bending energy \(\int f''(x)^2\,dx\) to give a cubic spline.Cheap tricks
Sun, 28 Feb 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/02/28/cheap-tricks/If you’re interested in thinking about evidence and belief and rhetoric, it’s convenient to have uncontroversial examples of ‘cheap tricks’ that directly affect attitudes. Ian Gordon has provided one, with his translation of the ‘Imperial March’ from Star Wars into a major key.
The tune is recognisably still film music by John Williams in a military style, but it’s happy, and lively, and on the edge of self-parody: somewhere between ‘The Great Escape’ and ‘Chicken Run’.Two cheers for crowdfunding
Fri, 26 Feb 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/02/26/two-cheers-for-crowdfunding/A successful large crowdfunding effort in a medium-sized community is, ipso facto, widely popular. If roughly 40,000 people have donated to buy a beach and sandbar, they’re going to be proud of themselves and not want the effort criticised. And it’s hard to argue that the two million dollars spent on Awaroa beach is a worse use of money than, say, the twenty million spent in a typical week on the lottery.No-one’s forcing you to read the Herald
Sun, 07 Feb 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/02/07/no-ones-forcing-you-to-read-the-herald/“No-one’s forcing you to read the Herald”
(various sources on the internet)
It’s true. No-one is forcing me to read the NZ Herald. In fact, I went forty years without reading it and no-one criticised at all. No-one forced me to move to New Zealand, either.
You probably wouldn’t want the Herald to be your only source of news, but for someone living, working, and voting in Auckland, the Herald is the best available paper.Stochastic SVD
Fri, 05 Feb 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/02/05/stochastic-svd/Suppose you have an \(m\times n\) matrix \(A\) of rank \(k\). If \(\Omega\) is an \(n\times k\) matrix with iid standard Gaussian entries, then \(\Omega\) will have rank \(k\) with probability 1, \(A\Omega\) will have rank \(k\) with probability 1, and so \(A\Omega\) spans the range of \(A\). That’s all easy.
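The exact-rank claim can be checked numerically. The following is a minimal pure-Python sketch under assumptions of my own (the helper names, the dimensions, and the seed are hypothetical, and a real implementation would use a linear algebra library): build a rank-\(k\) matrix \(A\), multiply by a Gaussian \(\Omega\), orthonormalise the columns of \(A\Omega\), and confirm that projecting \(A\) onto that basis leaves essentially nothing behind.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of lists of rows."""
    Yt = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in Yt] for row in X]


def gaussian(rows, cols, rng):
    """rows x cols matrix of iid standard Gaussian entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def orthonormal_columns(Y):
    """Modified Gram-Schmidt on the columns of Y; returns the basis vectors."""
    basis = []
    for col in zip(*Y):
        v = list(col)
        for q in basis:
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        basis.append([a / norm for a in v])
    return basis


def projection_residual(A, basis):
    """Largest absolute entry of A - Q Q^T A, where the columns of Q are `basis`."""
    worst = 0.0
    for col in zip(*A):
        fitted = [0.0] * len(col)
        for q in basis:
            dot = sum(a * b for a, b in zip(q, col))
            fitted = [f + dot * b for f, b in zip(fitted, q)]
        worst = max(worst, max(abs(a - f) for a, f in zip(col, fitted)))
    return worst


rng = random.Random(2016)
m, n, k = 30, 20, 3
A = matmul(gaussian(m, k, rng), gaussian(k, n, rng))   # m x n matrix of rank k
Omega = gaussian(n, k, rng)                            # n x k Gaussian test matrix
Q = orthonormal_columns(matmul(A, Omega))              # basis for span(A Omega)
residual = projection_residual(A, Q)
print(residual)   # rounding-error sized if A Omega spans the range of A
```

The same projection step is what a randomised SVD would apply before decomposing the much smaller matrix \(Q^TA\).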
More impressively, if \(A=\tilde A+\epsilon\) where \(\tilde A\) has rank \(k\) and \(\epsilon\) has small norm, and if \(\Omega\) has \(k+p\) columns, \(A\Omega\) spans the range of \(\tilde A\) with high probability, for surprisingly small values of \(p\).Is it that time of day?
Wed, 20 Jan 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/01/20/is-it-that-time-of-day/Wade at Minding Data wrote about a local NZ radio station
One of the main criticisms of The Rock, is that even if it doesn’t play the same song between 9 – 5, it still plays the same song everyday, often at the same time. To be fair to them, it’s probably no different to the criticism hurled at any popular radio station really. Anecdotal, I used to listen to the radio as I was getting up in the morning, and I used to swear that for weeks on end, I would be getting up to the same song.Another view of the ‘nearly true’ model
Wed, 13 Jan 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/01/13/another-view-of-thenearly-true-model/Ok, so to recap, we have a large model (such as ‘we know the marginal sampling probabilities’) and a small model (such as the subset of the large model with \(\mathrm{logit}\,P[Y=1]=x\beta\)). Under the large model, we would use the estimator \(\hat\beta_{L}\), but under the small model there is a more efficient estimator \(\hat\beta_S\). That is, under the small model
\[\sqrt{n}(\hat\beta_S-\beta_0)\stackrel{d}{\to}N(0,\sigma^2)\]
and
\[\sqrt{n}(\hat\beta_L-\beta_0)\stackrel{d}{\to}N(0,\sigma^2+\omega^2)\]
We’re worried that the small model might be slightly misspecified.What does ‘design-consistent’ even mean?
Wed, 13 Jan 2016 00:00:00 +0000http://notstatschat.netlify.com/2016/01/13/what-does-design-consistent-even-mean/In classical survey statistics you have a fixed finite population of size \(N\) and a (possibly unequal-probability, multistage) sample of size \(n\). Useful asymptotics requires an infinite sequence of populations and samples chosen so that approximation errors from neglecting terms that decrease in \(n\) and \(N\) are practically unimportant in the real data when they are asymptotically negligible in the infinite sequence. For ‘model consistency’ this is easy. An estimator \(\hat\theta_n\) is model consistent if \(\hat\theta_n\stackrel{p}{\to}\theta_0\) when the population of size \(N\) is a sample from a model \(P_\theta\) with parameter \(\theta=\theta_0\), for all designs obeying regularity conditions to be described in the proof.Circumspice
Thu, 31 Dec 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/12/31/circumspice/Norman Breslow died early this month. If you’ve had any involvement with medical statistics you have used his work. There isn’t really any need to expound on his contributions. I have a few Norm memories.
In my first quarter at the University of Washington, I took BIOST 570 (generalised linear models) from Norm. One day, about halfway through the quarter, he appeared with a copy of ‘Science’ and asked me why I hadn’t been a co-author on a paper from the Sydney Blood Bank.Superfood sourcing
Wed, 30 Dec 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/12/30/superfood-sourcing/Because reasons, I ended up looking at a website for a new superfood, “The Hawaiian Coffeeberry ®”. “The Hawaiian Coffeeberry ®”, of course, isn’t a Hawaiian plant – it’s coffee, originally from Ethiopia – but at least the marketing is Hawaiian. Well, the parts that aren’t unsourced copying from various internet sites.
Here’s a description of what they think the dominant nutrients are:
The chlorogenic acid bit is ok, though they don’t mention it’s also found in, eg, peaches, and eggplant, and potatoes.The Muntab Question Strikes Back
Thu, 24 Dec 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/12/24/the-muntab-question-strikes-back/Public Policy Polling, which is known for adding questions in surveys to exploit Republicans who are less informed, recently found that 30% of Republican voters would support bombing Agrabah, a fictional country in the Disney film Aladdin. On December 20, 2015, WPA Research fielded a national survey of 1,132 registered voters that found 44% of Democrats would support …..
As you know, I think the Agrabah bombing question was misleading and unhelpful to public political discourse.Potential energy and kinetic energy
Tue, 22 Dec 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/12/22/potential-energy-and-kinetic-energy/Q: So, I hear you have a new bike?
A: Yes, it’s electric.
Q: One you don’t need to pedal?
A: No, you do still need to pedal, but the motor helps
Q: Doesn’t that defeat the purpose of cycling?
A: Well, that rather depends on what you think the purpose of cycling is.
Q: Do you want to expound?
A: Why, yes, thank you! Cycling is a great way to travel moderate distances on flat ground.Case-control estimation is more complicated than you think
Sun, 20 Dec 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/12/20/case-control-estimation-is-more-complicated-than-you-think/Well, obviously I don’t know how complicated you think it is, but it’s more complicated than I thought, and more complicated than my colleagues thought.
In a case-control design you sample all the cases (\(Y=1\)) and a fraction \(\pi_0\) of the controls (\(Y=0\)) from a cohort. You could fit a logistic regression with sampling weights (\(1\) for cases, \(1/\pi_0\) for controls). Or, you can fit an unweighted logistic regression, which should be biased except that all the bias ends up in the intercept term and doesn’t affect the regression coefficients.A simple probability problem
Mon, 14 Dec 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/12/14/a-simple-probability-problem/Amy Hogan, a stats and maths teacher who blogs at A Little Stats, posted the following quiz on twitter:
(Assuming fair dice) which has the highest probability:
1 six from 6 dice
2 sixes from 12 dice
3 sixes from 18 dice
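The three options can be compared directly with a short Python sketch. It assumes each option means "at least that many sixes", which the quiz wording leaves open, and the function name is hypothetical.

```python
from math import comb


def prob_at_least(successes, dice, p=1 / 6):
    """P(at least `successes` sixes among `dice` fair dice)."""
    # Subtract the binomial probabilities of 0, 1, ..., successes - 1 sixes.
    below = sum(comb(dice, j) * p**j * (1 - p) ** (dice - j) for j in range(successes))
    return 1 - below


for k in (1, 2, 3):
    print(k, 6 * k, round(prob_at_least(k, 6 * k), 4))
```

Under the "at least" reading the probabilities fall as the number of dice grows, so one six from six dice is the most likely of the three.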
The calculations aren’t too hard even by hand, and we have pbinom() available (if we remember to check \(<\) vs \(\le\) conditions). In that sense the question is easy, but I was looking for an intuitive argument.The Muntab Question
Mon, 14 Dec 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/12/14/the-muntab-question/In a survey by Public Policy Polling, 30% of Republican-leaning and 20% of Democratic-leaning people said they supported bombing the fictional country of Agrabah. I’ve written on StatsChat why I think these are deliberately misleading percentages and asking the question amounts to pissing in the swimming pool of public discourse. What I want to say here is that the “Don’t Know”s aren’t getting nearly enough stick.
Now, some of the “Don’t Knows” will have been successfully trolled by PPP and will think Agrabah is actually the name of somewhere in Syria or Iraq where bombing is a genuine question, and a subset of these will legitimately not have given enough attention to the question to be sure.Serious tongue-twister
Fri, 27 Nov 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/11/27/serious-tongue-twister/This is a flow-chart I put together, that may go in the documentation for the Health Research Council data monitoring committee. Obviously the questions in boxes are simplified and would need expansion in the text.
Most of the work of the data monitoring committee, and the work that the study does on our behalf, is concentrated at the twice-yearly meetings. Some things, though, can’t safely be left to accumulate for six months, so there’s urgent reporting of sufficiently noteworthy clinical events to a clinician on the data monitoring committee.Poetry visualisation
Sat, 14 Nov 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/11/14/poetry-visualisation/So, I was trying to write something serious last night about communicating probabilities for my talk to journalists next weekend, but it was a depressing day, so instead I did something frivolous that was vaguely related.
Wisława Szymborska won the 1996 Nobel Prize for Literature. I first encountered this poem when mathematician Evelyn Lamb linked to it at JoAnne Growney’s blog Poetry with Mathematics.Should SPRINT have stopped?
Tue, 10 Nov 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/11/10/should-sprint-have-stopped/The SPRINT trial comparing standard blood pressure treatment (to 140mmHg) with intensive treatment (to 120mmHg) stopped early in September, and the results were published this week. Hilda Bastian wrote about the problems of early stopping back in September.
Now the results are out, we can see a lot more detail on what was going on. I think the right decision was made, but it’s not completely straightforward. Also, I’m only a simple country statistician, so I may be missing some issues.Prefiltering very large numbers of tests
Mon, 19 Oct 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/10/19/prefiltering-very-large-numbers-of-tests/Genome-wide association studies involve lots of analyses. Nearly always they involve lots of tests. Also, in contrast to gene expression studies or to state-specific estimates of political attitudes or small-area disease rate estimates, a lot of the null hypotheses are effectively true. That is, most single-nucleotide polymorphisms are so close to not having any effect on anything that we might as well call it zero. Most people express this in terms of the need for stringent Type I error control; Bayesians like Matthew Stephens might talk in terms of the very low prior probability of a non-negligible effect.Double robustness
Sun, 18 Oct 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/10/18/double-robustness/“Double robust” estimation in a regression problem uses a model for the outcome \(Y\) given available data \(Z\) and a model for the exposure \(X\) given available data \(Z\). The estimates are consistent if either model is correct and efficient if they both are correct[1].
Described that way, double robustness doesn’t sound very useful. “All models are wrong; many models are useless”, as we can deduce from Box’s familiar aphorism, so the chance of one of two models being correct is no more than twice the chance of one model being correct: two times \(4/5\) of \(5/8\) of not very much[2].Convergent evolution and NZ Bird of the Year
Mon, 05 Oct 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/10/05/convergent-evolution-and-nz-bird-of-the-year/Forest & Bird is the New Zealand equivalent of the UK’s Royal Society for the Protection of Birds, or the USA’s Audubon Society. Each year, they hold a “Bird of the Year” competition, to get more publicity for NZ birds.
The competition is made possible by the relatively small number of bird species in New Zealand, partly because it’s an isolated set of islands, and partly because a depressing number of the birds are ex-species.NZ Flag Referendum pseudorandom numbers
Tue, 22 Sep 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/09/22/nz-flag-referendum-pseudorandom-numbers/The counting process for the NZ Flag Referendum needs some way to break ties. The Act defines a way to generate pseudo-random numbers (Schedule 4, clauses 14 to 22). Anyone in computational statistics who reads this will recognise some of the magic numbers; for the rest of you, here’s what’s going on.
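For readers who have not met the magic numbers before, here is a sketch of the textbook three-stream Wichmann-Hill generator (algorithm AS 183). It is the published algorithm as I understand it, not a transcription of the Act's clauses, and the function name and seed are purely illustrative.

```python
def wichmann_hill(seed):
    """Yield uniforms on [0, 1) from the textbook Wichmann-Hill (AS 183) scheme.

    `seed` is a tuple of three integers, each nonzero modulo its own modulus.
    """
    s1, s2, s3 = seed
    while True:
        # three small multiplicative congruential generators
        s1 = (171 * s1) % 30269
        s2 = (172 * s2) % 30307
        s3 = (170 * s3) % 30323
        # combine them: fractional part of the sum of the three uniforms
        yield (s1 / 30269 + s2 / 30307 + s3 / 30323) % 1.0

gen = wichmann_hill((1, 2, 3))   # illustrative seed only
first = next(gen)
print(first)
```

Each stream on its own has a short period; combining three with different prime moduli gives a much longer one.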
The Act almost defines the Wichmann-Hill PRNG, a respectable, if old-fashioned algorithm that was the original RNG in R.Oranges and lemons
Mon, 21 Sep 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/09/21/oranges-and-lemons/One of the basic principles of applied statistics is that the data don’t tell you what the question is. For example, the distribution of a variable doesn’t tell you what summary statistic you are interested in.
For mean vs median, a good example is binary variables. If a variable (like the indicator variable for dying in a car crash) is 0 for most people and 1 for a few people, the variable is very highly skewed but the mean (probability) is a much more useful summary statistic than the median (zero).(high-dimensional) Space is Big.
Mon, 14 Sep 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/09/14/high-dimensional-space-is-big./Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist, but that’s just peanuts to space. Hitchhikers Guide to the Galaxy
There’s a simple simulation that I used in stat computing class last year and for Rob Hyndman’s working group last week. Simulate data uniformly on a \(p\)-dimensional hypercube \([0,\,1]^p\) and compute nearest-neighbour distances.Good reasons for assuming a spherical cow
Mon, 14 Sep 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/09/14/good-reasons-for-assuming-a-spherical-cow/Talks and papers in statistics often have what purports to be an application but with assumptions that look implausible. That can be fine, but you need to know, and tell us, why you’re making those assumptions.
If I ask “Why are you assuming a spherical cow?”, here are some possible good answers:
Honest theory: “It’s not really about cows. Greeble’s Conjecture is the leading open question in heterotrophic morphon theory.Net Reclassification Index: surprisingly weird.
Sat, 29 Aug 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/08/29/net-reclassification-index-surprisingly-weird./Attention Conservation Notice: Long. Really long. No, longer than that. Here: read the original instead.
The Net Reclassification Index (NRI) is a summary of improvement in prediction when new information is added, and an intuitively plausible one. Suppose that we’re trying to predict \(Y=1\) vs \(Y=0\), and that for person \(i\) we have an old predicted probability \(\hat p_{\textrm{old}}(i)\) and a new predicted probability \(\hat p_{\textrm{new}}(i)\). We’d hope that the probabilities for cases (\(Y=1\)) go up and the probabilities for controls (\(Y=0\)) go down when more information is used.A conservation tragedy
Thu, 20 Aug 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/08/20/a-conservation-tragedy/The NZ Herald is reporting that hunters taking part in a pūkeko cull on one of the islands near Auckland killed four takahē. Pūkeko (Porphyrio porphyrio) are the closest relatives of takahē (Porphyrio hochstetteri), but they aren’t all that close. Takahē are New Zealand natives, which were forced out of their wetland habitat to alpine grasslands by the Māori, and then nearly wiped out by the stoats and red deer introduced by Europeans.Colour names from XKCD in R
Thu, 20 Aug 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/08/20/colour-names-from-xkcd-in-r/Randall Munroe at XKCD did a color names survey a few years ago, and published a list of about a thousand colour names whose RGB values (averaged across his readers’ monitors) could be fairly reliably estimated.
I have finally got around to turning them into an R package. It’s only on GitHub so far. The functions are
xcolors(max_rank=-1): List the top (most commonly given) max_rank color names; analogous to colors()Fox fails statistics; does NYT?
Thu, 06 Aug 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/08/06/fox-fails-statistics-does-nyt/From Fox
Each poll has a different margin of error, and averaging requires a distinct test of statistical significance. Given the over 2,400 interviews contained within the five polls, from a purely statistical perspective it is at least 90% likely that the tenth place Kasich is ahead the eleventh place Perry.
The Upshot blog at the New York Times correctly points out that if this is a p-value, that’s not the way to interpret it.JSM2015: notes on Seattle from an ex-resident
Wed, 05 Aug 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/08/05/jsm2015-notes-on-seattle-from-an-ex-resident/Getting from the airport: Public transport. No question. Well, unless you have significant mobility problems, in which case why are you looking at travel advice from random strangers on the internet? Take the light rail to the end of the line (Westlake), then catch any bus from the same platform one stop to Convention Place. The alternatives are much more expensive, and have a fair chance of being slower.
Trolls: The Fremont Troll is under the highway bridge in Fremont.Pianos, heaps, and ethics of randomisation
Sat, 01 Aug 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/08/01/pianos-heaps-and-ethics-of-randomisation/Suppose you could make the following observations:
Ling (零) was a pianist
Every pianist has a favourite student
Different pianists have different favourite students
Ling was not the favourite student of any pianist [1]
Anything that Ling knew and that every pianist teaches to his favourite student ends up known by everyone in the Ling School of Piano (it’s like martial arts)
If the first three observations continue to be true, the Ling School of Piano will obviously go on forever.Te Wiki o Te Reo Māori
Mon, 27 Jul 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/07/27/te-wiki-o-te-reo-m%C4%81ori/Scene: A wetland by high-country stream, Te Wai Pounamu. Rangers from Te Papa Atawhai pick up a bedraggled hunter.
“How did you get up here? Where’s your boat?”
“Not a boat. Over there.” He gestures.
“Yeah? So why are you out here in the swamp?”
“Can’t go back.”
“What’s the matter, bro?” asks the bigger ranger, who’s obviously been chosen as the good cop.
“They got out of the net,” the hunter said, showing his bleeding hand is missing a fingerstringsAsFactors = <sigh>
Sat, 25 Jul 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/07/25/stringsasfactors--sigh/Problems with R can be divided into several groups:
R has the defects of its virtues: pass-by-value and deep copying make the language easy to learn, but waste a lot of memory.
R is old: it’s not written in C++ and doesn’t have a 64-bit integer type because those weren’t things in 1992.
Base R (and S before it) was developed for interactive use and then got extended into computational infrastructure.Pi day
Thu, 02 Jul 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/07/02/pi-day/Pi day is celebrated on March 14 in all the countries that use the MM/DD/YYYY date format (ie, the USA). Pi Approximation day is celebrated in the rest of the world on July 22. I’m proposing today for another one: π continued fraction day. Like the 22/7 festival it doesn’t depend on using base 10, and like American Pi day it is extensible when the stars align correctly. The continued fraction expansion of π isA much-needed gap
Sat, 20 Jun 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/06/20/a-much-needed-gap/There are a surprisingly large number of research papers that use the Shapiro-Wilk normality test on data from NHANES or the British Household Panel Survey, two large multi-stage surveys. This is a bad idea for multiple reasons
Testing for normality is typically a bad idea. It’s unusual for Normal/non-Normal to be an interesting question. That’s in contrast to testing for a power law in skewed data, where apparently many people are interested in the question, though fewer of them in how to answer it.Countermatching
Wed, 03 Jun 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/06/03/countermatching/Countermatching is a simple case-control sampling mechanism that makes people uncomfortable when they first encounter it. Get ready.
Suppose you want to study the effect of a relatively rare exposure (sufficiently high dose radiation to the heart) on a relatively rare outcome (heart failure in breast cancer survivors). If you just took a random sample of the population there would be very few breast cancer survivors, so you work with a cohort of breast-cancer survivors.Zero-inflated Poisson from complex samples
Tue, 26 May 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/05/26/zero-inflated-poisson-from-complex-samples/A very long post about how to add models to the survey package; specifically, the zero-inflated Poisson.
The Zero-Inflated Poisson model is a model for count data with excess zeros. The response distribution is a mixture of a point mass at zero and a Poisson distribution: if \(Z\) is Bernoulli with probability \(1-p_0\) and \(P\) is Poisson with mean \(\lambda\) then
\[Y=ZP\]
is zero-inflated Poisson. The ZIP is a latent-class model; we can have \(Y=0\) either because \(Z=0\) or because \(P=0\).Call me, Ishmael
Wed, 20 May 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/05/20/call-me-ishmael/Making small changes in text to escape plagiarism-detection software is challenging but not really difficult. The relationship between your text and the urtext in the software is precise and syntactic; the software doesn’t know about ideas. Making small changes in a text that change or obscure the meaning is also easy, and is why copyeditors exist. Making small changes in a text that give an interesting new meaning is enormously harder, as in the opening of Peter de Vries’ “The Vale of Laughter”, quoted as the title of this post.Superefficiency
Tue, 12 May 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/05/12/superefficiency/If you have \(X_1,\ldots,X_n\) independent from an \(N(\mu,1)\) distribution you don’t have to think too hard to work out that \(\bar X_n\), the sample mean, is the right estimator of \(\mu\) (unless you have quite detailed prior knowledge). As people who have taken an advanced course in mathematical statistics will know, there is a famous estimator that appears to do better. Hodges’ estimator is given by \(H_n=\bar X_n\) if \(|\bar X_n|>n^{-1/4}\), and \(H_n=0\) if \(|\bar X_n|\leq n^{-1/4}\).Precise answers, but not necessarily to the right question
Mon, 04 May 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/05/04/precise-answers-but-not-necessarily-to-the-right-question/Nicholas Schork has a commentary at Nature about precision medicine, arguing in favour of n-of-1 trials. These are the extreme version of crossover trials: you randomise each individual to a long sequence of periods on each of two treatments and see which they do better on. The idea makes sense: you get genuinely individual-specific results for people in the study, and the ability to aggregate them to generalise to people not in the study.What’s the right proof of the Continuous Mapping Theorem?
Sun, 03 May 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/05/03/whats-the-right-proof-of-the-continuous-mapping-theorem/The Continuous Mapping Theorem says that if \(X_n\stackrel{d}{\to}X\) and \(f\) is continuous except at a set of points with zero probability under \(X\), then \(f(X_n)\stackrel{d}{\to}f(X)\). As David Pollard points out, it should be called the almost-everywhere-continuous mapping theorem, because the ability to have discontinuities is important in applications and is the only thing making the proof non-trivial. There are three proofs that I’m aware of
Mann and Wald used the ‘pointwise convergence of cdfs’ definition of convergence in distribution, which gives a painful proofEppur si muove
Thu, 02 Apr 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/04/02/eppur-si-muove/“The rules for winning the science competition focus on a small number of measures that incentivize poor practice” Hilda Bastian quoting Ottoline Leyser.
It’s all true, and more and worse besides. Researchers are driven by the incentives for high-impact publication; p-value hacking makes results seem more convincing than they are; trials use surrogate outcomes; glamour journals publish insufficiently-checked linkbait; predatory online journals will do anything for money; change and decay in all around we see.Pharmacy ethics
Sun, 29 Mar 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/03/29/pharmacy-ethics/‘You have heard that it was said to those of ancient times, “You shall not murder”; and “whoever murders shall be liable to judgement.” But I say to you that if you are angry with a brother or sister, you will be liable to judgement.’ Matthew 5:21-22
In practice, we have to distinguish. Whoever murders is liable to judgement, but being angry isn’t enough. In the same way, formal codes of professional ethics come in two versions: the aspirational code that describes the way we want the profession to be, and the legalistic code that describes what will get you kicked out.Paper helicopters at a science fair
Sat, 28 Mar 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/03/28/paper-helicopters-at-a-science-fair/Today, we ran Box’s paper helicopter experimental design example at the Science Street Fair sponsored by the NZ Association of Scientists and hosted by the Museum of Transport and Technology.
It went fairly well. In particular, the younger kids really liked dropping paper helicopters and comparing different designs and we got in a few useful discussions of experimental design with adults – mostly school teachers.
Things to note:
Use a photocopier and pre-printed design template, such as the one from the SixSigma package for R. This lets you also produce a 2-up version, giving an extra interesting design factor.What does measurability mean?
Sat, 07 Mar 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/03/07/what-does-measurability-mean/Attention conservation notice: A long, meandering, and inconclusive attempt to explain why you perhaps shouldn’t worry about a technical issue you almost certainly weren’t worrying about already.
Mathematical proofs in statistics are, in some formal sense, useless. That is, they formally have conditions such as finite moments, boundedness, differentiability or stochastic equicontinuity that either apply to all things in the real world or to none. The proofs are also often formally about infinite sequences; these don’t crop up all that often in data analysis.How hard did you look: equivalence and non-inferiority
Fri, 27 Feb 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/02/27/how-hard-did-you-look-equivalence-and-non-inferiority/I usually don’t read nutripharma articles outside the mainstream media, but someone tweeted a link about saffron, which apparently cures everything. The last straw was a line beginning “Saffron, a major component of the Mediterranean diet…”
Saffron can’t really be described as a major component of anything, even risotto milanese, and it’s not unique to the Mediterranean region: it’s a well-known spice in India, Pakistan, Iran. And it’s not just the well-known places: England produced saffron before Italy produced tomatoes.Clinically proven ingredients
Thu, 26 Feb 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/02/26/clinically-proven-ingredients/NZ golf prodigy Lydia Ko has a sponsorship deal with a company that sells special jetlag-reducing water. She obviously knows how this sort of thing works, and what she said to the Herald was nicely crafted
Ms Ko said she was excited to have the support of 1Above.
“I haven’t really taken it for a long-haul flight before but I’ve seen some of the results and everything that comes with it and I have heard great things,” she said.Science and statistical inference
Tue, 17 Feb 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/02/17/science-and-statistical-inference/Q: There have been a lot of papers recently about the spike in p-values just below 0.05, haven’t there?
A: Yes. A bit depressing. But there’s a new analysis that says it’s ok.
Q: Really? That’s great!
A: Yes, Daniel Lakens shows that you can explain the recent increase in just-significant p-values, and that “data does not provide any indication of an increase in questionable research practices”
Q: And is his modelling correct?Assumptions and testing
Thu, 15 Jan 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/01/15/assumptions-and-testing/My attention was drawn on Twitter to an old (1999) paper in The American Statistician, “Different Outcomes of the Wilcoxon-Mann-Whitney Test from Different Statistics Packages.” The authors looked at 11 statistics packages and found they didn’t always give the same result for the Wilcoxon/Mann-Whitney test. The big problem was handling of tied observations.
Here are their example data:
The authors say “It is obvious that the data resulting from the experiment could not be analyzed by the Student’s t-test.A transitive test is a test for a univariate parameter
Wed, 14 Jan 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/01/14/a-transitive-test-is-a-test-for-a-univariate-parameter/As you know, rank tests can be non-transitive: they can have the rock-paper-scissors property. Tests that are for a single real-valued summary statistic (eg a test comparing means or medians or variance) are always transitive, because they are just comparing a single number, and ordering on numbers is transitive.
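The rock-paper-scissors property is easy to see in a small example. The sketch below uses three made-up samples (a standard nontransitive-dice construction, not data from any study) and computes the exact Mann-Whitney probability P(X > Y) for each pair. The helper name `prob_greater` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def prob_greater(x, y):
    """Exact P(X > Y) when X and Y are drawn uniformly from samples x and y."""
    wins = sum(1 for a, b in product(x, y) if a > b)
    return Fraction(wins, len(x) * len(y))

# Three illustrative samples with no ties between them
A = [2, 2, 4, 4, 9, 9]
B = [1, 1, 6, 6, 8, 8]
C = [3, 3, 5, 5, 7, 7]

print(prob_greater(A, B), prob_greater(B, C), prob_greater(C, A))
# prints 5/9 5/9 5/9
```

Each sample tends to be larger than the next one round the cycle, so a rank test would prefer A to B, B to C, and C to A. No single real-valued summary can order them that way.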
The converse is almost obviously almost true: if you have a transitive test, it almost has to be a test for a single real-valued summary statistic.New header picture
Mon, 12 Jan 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/01/12/new-header-picture/As you can see, there’s a new header picture to replace the generic tumblr theme. It’s a pair of Superb Fairywrens, aka blue wrens (Malurus cyaneus), one of my favourite Melbourne birds. They’re now quite common along the Melbourne coastline from St Kilda down along the bay.
For people in the Northern hemisphere: these are completely unrelated to the wrens you are familiar with, and are also unrelated to the New Zealand wrens.Tomato, tomato
Mon, 12 Jan 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/01/12/tomato-tomato/There are two* great commandments for conference session chairs
You shall adhere to the schedule with all your heart and all your mind and all your strength
You shall pronounce the speakers’ names approximately as they do themselves
For show-biz award ceremonies the first commandment doesn’t apply, but the second still does.
In order to pronounce someone’s name correctly, you need to ask for the correct pronunciation and have some way of remembering it, such as writing it down phonetically.Different questions can have different answers
Sun, 11 Jan 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/01/11/different-questions-can-have-different-answers/The Slate Money podcast [1] had an item on Manhattan apartment prices. The mean price last quarter was $1.7 million and the median was $0.98 million. Firstly, that’s a lot of money. Secondly, the mean is a lot bigger than the median. The real point, though, is that the mean is a record, up on the previous peak (in 2008) by $120,000. The median is down from 2008, by $15,000.Variation explained and log transformation
Sat, 03 Jan 2015 00:00:00 +0000http://notstatschat.netlify.com/2015/01/03/variation-explained-and-log-transformation/This post is technical details for one at StatsChat on the Johns Hopkins “two-thirds of cancer is bad luck” paper.
I don’t have any real opinions on the conclusion: it’s clear that unforced errors in DNA copying will cause some cancers, and it’s not obvious how many. The technical problem with the paper (or at least with its publicity) is that the ‘proportion of variation explained’ was estimated for log risk and quoted as “two-thirds of cancers are due to bad luck”.How not to treat Ebola
Tue, 23 Dec 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/12/23/how-not-to-treat-ebola/From the Guardian, via Mark Henderson of the Wellcome Trust
Ebola patients at a treatment centre in Sierra Leone have been given a heart drug that is untested against the virus in animals and humans, a move that has been deemed reckless by one senior scientist and has prompted UK medical staff at the centre to leave.
Ebola is a problem for drug testing. You don’t want to leave people untreated, but you do want to find out as fast as possible what works.Citations: credit or blame
Sun, 14 Dec 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/12/14/citations-credit-or-blame/Katie Hinde at ‘Mammals Suck’ writes
Only cite papers that you have read! DO NOT cite papers based on another publication’s report of them. Because every time that happens, a science fairy dies.
That’s an excellent principle. So why do I have a paper in press that cites a paper I haven’t read?
There are two reasons to cite a paper: as evidence for a claim, or to give credit to the authors for their research.What science should everyone know?
Mon, 08 Dec 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/12/08/what-science-should-everyone-know/In response to the question “How much science knowledge should the average person have or should we just encourage people to ask questions?”
@NaomiShadbolt @petergnz Basics: Atoms; Evolution; “the lights in the sky are suns”; Randomisation; Conservation laws. And ask questions.
— Thomas Lumley (@tslumley) December 8, 2014
Expanding on this:
Atoms: everything is made of a very large but not infinite number of definite, basically indivisible, pieces, and there are very few different types (about 100).It depends on what you mean by 'cost'
Sun, 30 Nov 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/11/30/it-depends-on-what-you-mean-by-cost/The Tufts Center for the Study of Drug Development has a new cost estimate out: Cost to Develop and Win Marketing Approval for a New Drug Is $2.6 Billion.
The figure is probably fairly accurate as an estimate of what it’s trying to estimate, but it gets quoted in other contexts, so I think it’s worth looking at the number a piece at a time. The Tufts researchers haven’t provided enough information to do this, so I’m relying on estimates from Bruce Booth (who also has a spreadsheet that you can use for sensitivity analyses).This is just to say
Thu, 06 Nov 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/11/06/this-is-just-to-say/The plums, which you stored there on ice,
I have eaten; they went in a trice.
If you meant them to last
For a morning repast
Then I’m sorry, but boy were they nice.
or
Some say the plums will end in tarts
Some say on ice
From what I’ve eaten ’round these parts
I hold with those who favor tarts
But if they had to vanish first
I think I know enough of guiltA people set apart
Wed, 05 Nov 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/11/05/a-people-set-apart/There’s a conference coming up at Deakin University in Melbourne, on energy drinks. The unusual aspect of the conference is that no-one who has received industry funding is welcome. Obviously the energy drink industry aren’t happy about this. I couldn’t give a fsck about their hurt feelings, but I hope this sort of policy doesn’t spread.
Now, I’m not completely naïve about the sorts of things some industry groups will do when there’s a lot of money at stake.Miasma and Contagion
Sat, 25 Oct 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/10/25/miasma-and-contagion/Scientists have a nasty habit of taking ordinary English words, turning them into technical terms, and then insisting that the ordinary use is Just Wrong. ‘Organic’, which I’ve written about before, is a good example.
On the other hand, sometimes the scientists are right. I complained on Twitter last night about the phrase ‘meningococcal virus’ in a Herald opinion piece on state housing, and I have previously complained about the ‘Psa virus’ for the bacterium Pseudomonas syringae pv.Semiparametric efficiency and nearly-true models
Sat, 25 Oct 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/10/25/semiparametric-efficiency-and-nearly-true-models/Suppose you have \(N\) people with some variables measured, and you choose a subset of \(n\) to measure additional variables. I’m going to assume the probability \(\pi_i\) that you measure the additional variables on person \(i\) is known, so it has to be a setting where non-response isn’t an issue – eg, choosing which frozen blood samples to analyse, or which free-text questionnaire responses to code, or which medical records to pull for abstraction.Broman's Socks and the Nature of Scientific Reporting
Mon, 20 Oct 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/10/20/bromans-socks-and-the-nature-of-scientific-reporting/Rasmus Bååth wrote a post using Approximate Bayesian Computation to estimate a posterior distribution for Karl’s socks. What he didn’t consider was the impact of publication bias. In order for us to see the tweet, it was not only necessary that Karl’s first 11 socks were distinct, it was also necessary that he found this remarkable, and, probably, that no-one he follows on Twitter had made a similar laundry-related observation at any recent time.Is it good or bad when confounding adjustment makes no difference?
Wed, 24 Sep 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/09/24/is-it-good-or-bad-when-confounding-adjustment-makes-no-difference/There’s a new paper out in J Epi Community Health, looking at the relationship between perceived job insecurity and incident asthma. NHS ‘Behind the Headlines’ covers it well.
One of the interesting things about the paper is that the crude relative risk between above/below 50% estimated risk of losing your job is 1.61, and the relative risks after adjustment in three increasingly-complex models are 1.58, 1.62, and 1.61. That is, the adjustment for confounding has no impact at all.On dialect
Fri, 29 Aug 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/08/29/on-dialect/In New Zealand, ‘radiata’ and ‘macrocarpa’ are accepted common names for two widely planted non-native conifers: Pinus radiata and Cupressus macrocarpa, known in their native US as ‘Monterey pine’ and ‘Monterey cypress’ respectively.
It’s unusual for the specific epithet of a plant to become the common name. There are plenty of examples of the generic name becoming the common name, from ‘bougainvillea’ to ‘wisteria’. There are even plenty of examples where a former generic name has stuck as the common name after the botanists have renamed the plant to, eg, Pelargonium, Hippeastrum, or Corymbia.Rhetorical sensitivity analysis
Fri, 29 Aug 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/08/29/rhetorical-sensitivity-analysis/“The ethanol in alcohol is a group one carcinogen, like asbestos,” Prof. Doug Sellman, Otago University (July 2013)
Professor Sellman is correct, of course. What’s more, alcohol is even an important cause of cancer. From the viewpoint of rhetoric and risk communication it’s still interesting to see how the effect of the sentence changes when other familiar IARC Group I carcinogens are substituted for ‘asbestos’
alcohol is a group one carcinogen, like sunlight
alcohol is a group one carcinogen, like birth-control pills
alcohol is a group one carcinogen, like plutonium
alcohol is a group one carcinogen, like tobacco
alcohol is a group one carcinogen, like arsenic
alcohol is a group one carcinogen, like wood dust
None of these really has quite the same rhetorical impact; the only one that comes close is ‘tobacco’.O necessary sinpi
Wed, 27 Aug 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/08/27/o-necessary-sinpi/The R help page for sin, cos, and tan, mentions functions sinpi, cospi, tanpi, “accurate for x which are multiples of a half.” This struck someone I know as strange. I’ve been thinking about this sort of thing recently while teaching Stat Computing, so here’s some background.
If you’re a mathematician, \(\sin x\) is given by a power series
\[\sin x = x - \frac{x^3}{3!}+\frac{x^5}{5!} -\frac{x^7}{7!} +-\cdots\]
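As a rough numerical illustration (a sketch only, not how any maths library actually computes sine), partial sums of this series agree closely with `math.sin` for small \(x\). Evaluating sine at the floating-point approximation to \(\pi\), on the other hand, does not give exactly zero, which is the kind of issue a function like `sinpi` is meant to address. The helper name `sin_series` is made up for this example.

```python
import math

def sin_series(x, terms=20):
    """Partial sum x - x^3/3! + x^5/5! - ... using the first `terms` terms."""
    total, term = 0.0, x
    for n in range(terms):
        total += term
        # ratio of consecutive terms is -x^2 / ((2n+2)(2n+3))
        term *= -x * x / ((2 * n + 2) * (2 * n + 3))
    return total

print(abs(sin_series(1.0) - math.sin(1.0)))  # tiny for small x
print(math.sin(math.pi))                     # small, but not exactly 0.0
```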
This series converges for all \(x\), and so converges uniformly on any finite interval.Taking meta-analysis heterogeneity seriously
Sun, 24 Aug 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/08/24/taking-meta-analysis-heterogeneity-seriously/In fixed-effects meta-analysis of a set of trials the goal is to find a weighted average of the true treatment effects in those trials (whatever they might be). The results are summarised by the weighted average and a confidence interval reflecting its sampling uncertainty.
In random-effects meta-analysis the trials are modelled as an exchangeable sample, implying that they can be treated as coming independently from some latent distribution of true treatment effects.Survey package update
Fri, 15 Aug 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/08/15/survey-package-update/There’s a new version, 3.30-3, of the ‘survey’ package for R. It’s got quite a lot of new stuff:
AIC and BIC for generalised linear models
Rank tests for more than two groups
Logrank and generalised logrank tests
Since I’m known for a lack of enthusiasm about any of these techniques, why are they in the package? Am I just enabling?
Well, AIC and BIC are interesting, and I’ll say more below.Feynman and the Suck Fairy
Sat, 12 Jul 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/07/12/feynman-and-the-suck-fairy/There’s been a bit of…discussion…about Richard Feynman recently. In one Twitter conversation, Richard Easther said he had been thinking of using Feynman’s commencement address “Cargo Cult Science” with a first-year physics class, and had decided against.
I was a bit surprised. It’s been a long time since I read that piece, but I couldn’t remember anything objectionable in it. So I re-read it. It’s still really good in a lot of ways.Herd Immunity simulations
Sun, 01 Jun 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/06/01/herd-immunity-simulations/Especially for vaccines that are not 100% effective, a large chunk of the benefit comes from ‘herd immunity’, the fact that incomplete vaccination makes it harder for an epidemic to get started and spread. Increasing the proportion of people vaccinated helps those people, and it also helps the people who aren’t vaccinated.
Here’s a set of simulations (code, needs FNN package and R) that show the effect. There is a simulated population of 10,000 people living on a square (actually, a doughnut, since it wraps around).Monotonicity and smoothness
Thu, 22 May 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/05/22/monotonicity-and-smoothness/Andrew Gelman has an interesting discussion of monotonicity as a modelling constraint. I basically agree with what he says, but since my first real statistical research (my M.Sc. thesis) was on order restrictions I thought I’d write about a related aspect of the problem.
Assuming that a relationship is monotone sounds like a very strong assumption, and therefore one that you’d expect to gain a lot by making. Asymptotically, this isn’t true.Anchoring bias
Sun, 18 May 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/05/18/anchoring-bias/Anchoring bias: high school students asked to add up the digits in their phone number and to estimate how many countries there are in Africa.
(phew, it worked)
(I did delete one data point as non-responsive: estimated number of countries in Africa was 1)
(with adults I’d use last two digits of phone number, but with teenage girls I thought a bit more information-hiding was appropriate)Randomisation without consent
Wed, 14 May 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/05/14/randomisation-without-consent/The issue of randomisation without consent has come up in New Zealand. Because I’m on the HRC Data Monitoring Core Committee, which monitors some NZ clinical trials, I don’t want to say much about any current NZ clinical trials, even ones we’re not monitoring. I do want to talk about the principle.
The always-useful NZ Science Media Centre has rounded up a couple of bioethicists on the topic, and you should read what they say.Einstein, Wikiquote, and fact checking
Fri, 14 Mar 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/03/14/einstein-wikiquote-and-fact-checking/It’s not only Pi Day in the USA (3/14, they write dates backwards), it’s Einstein’s 135th birthday. Einstein, like Mark Twain, 孔夫子, Churchill, Disraeli, and the Chinese proverbs, is a quote magnet. He said many quotable things, and even more are attributed to him.
The NZ Herald has a list of ten Einstein quotes. Annoyingly, none of them say where or when they were said. So I did the absolutely minimal level of fact checking.My likelihood depends on your frequency properties
Tue, 04 Mar 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/03/04/my-likelihood-depends-on-your-frequency-properties/The likelihood principle states that given two hypotheses \(H_0\) and \(H_1\) and data \(X\), all the evidence regarding which hypothesis is true is contained in the likelihood ratio \[LR=\frac{P[X|H_1]}{P[X|H_0]}.\]
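For concreteness, here is a small sketch of a likelihood ratio for binomial data. The two hypotheses (success probability 0.5 versus 0.7) and the data (7 successes in 10 trials) are invented for illustration, as are the function names.

```python
from math import comb

def binomial_likelihood(p, successes, trials):
    """P[X | H] for X = number of successes under success probability p."""
    return comb(trials, successes) * p ** successes * (1 - p) ** (trials - successes)

def likelihood_ratio(p1, p0, successes, trials):
    """LR = P[X | H1] / P[X | H0]; the binomial coefficient cancels."""
    return (binomial_likelihood(p1, successes, trials)
            / binomial_likelihood(p0, successes, trials))

# Hypothetical data: 7 successes in 10 trials, H0: p = 0.5, H1: p = 0.7
lr = likelihood_ratio(p1=0.7, p0=0.5, successes=7, trials=10)
print(f"LR = {lr:.3f}")
```

The calculation takes the observed data as given; it has no term for how those data came to be available to the person doing the calculation.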
One of the fundamentals of scientific research is the idea of scientific publication, which allows other researchers to form their own conclusions based on your results and those of others. The data available to other researchers, and thus the likelihood on which they rely for inference, depends on your publication behaviour.Chemical nerdview
Tue, 25 Feb 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/02/25/chemical-nerdview/One of Stephen J. Gould’s essays contains the admission
I confess that I have always been greatly amused by the term primate, used in its ecclesiastical sense as “an archbishop … holding the first place among the bishops of a province.” My merriment must be shared by all zoologists, for primates, to us, are monkeys and apes–members of the order Primates.
…
But this amusement is silly, parochial, and misguided.This is a wug. Now you have two of them.
Sun, 09 Feb 2014 00:00:00 +0000http://notstatschat.netlify.com/2014/02/09/this-is-a-wug.-now-you-have-two-of-them./Three words that used to be plurals, and are changing in three different ways:
Candelabra used to be the plural of candelabrum, a multiple-armed candlestick holder. There are very few other English words ending in ‘brum’, and most of the words ending in ‘bra’ are singular (e.g. vertebra, penumbra, cobra, zebra, sabra, bra). Over time, candelabra has been used more and more often as the singular, perhaps most famously in the biographical movie “Behind the Candelabra” about Liberace; the corresponding plural is candelabras.At risk of vanishing
Sat, 14 Dec 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/12/14/at-risk-of-vanishing/A degree in science, in addition to specific facts about squid, neutrinos, or palladium-catalysed cross-couplings, should teach students what to do with questions about the world. In particular, they should learn to think about what the implications would be of each answer to the question, and know how we might use these implications to rule out some of the answers and reduce our uncertainty about others.
A degree in the humanities, in addition to specific facts about tenses in French, resource-allocation procedures in village societies, or the development of the Sangam literature, should teach students what to do with questions about the world.Moving the goalposts?
Fri, 15 Nov 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/11/15/moving-the-goalposts/There’s a paper in PNAS suggesting that lots of published scientific associations are likely to be false, and that Bayesian considerations imply a p-value threshold of 0.005 instead of 0.05 would be good. It’s had an impact outside the statistical world, eg, with a post on the blog Ars Technica. The motivation for the PNAS paper is a statistics paper showing how to relate p-values to Bayes Factors in some tests.From labhacks: the $25 scrunchable scientific poster
Mon, 04 Nov 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/11/04/from-labhacks-the-25-scrunchable-scientific-poster/labhacks:
Printed on Spoonflower performance knit at 300 dpi. 36” x 56”, vivid colors, no unraveling, and minimal wrinkling, even after being stuffed in a backpack. Hangs straight with about 8 pins. Print cost is $22 with $3 shipping.
A diversity of gifts, but the same spirit
Wed, 30 Oct 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/10/30/a-diversity-of-gifts-but-the-same-spirit/Peter Green used this line (from I Corinthians) for his Royal Statistical Society Presidential Address in 2003, which anyone interested in the future of statistics should read. I’ve been planning to steal it ever since then, and the time seems right.
Roger, Jeff, and Rafa at Simply Statistics are holding an unconference on the future of statistics, some time before dawn tomorrow morning New Zealand time. I probably won’t be attending, but if you’re in a more compatible time zone it promises to be interesting.Interaction: 'real' and statistical
Sun, 27 Oct 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/10/27/interaction-real-and-statistical/Confounding is a model-independent property of nature: if doing A has a particular effect on Y, it is objectively either true or untrue that the conditional distributions of Y given A and not A match that particular effect. Interaction or effect modification is scale-dependent: you ask “is the effect of A on X in the presence of B the same as the effect of A on X in the absence of B.Barren proxies
Sun, 20 Oct 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/10/20/barren-proxies/In causal inference it is often the case that you can’t obtain a confounding variable directly, you can only measure something that it affects. Judea Pearl correctly points out the danger of conditioning on a ‘barren proxy’ for a confounder, in situations like this one:
A confounds the effect of B on C. D is affected by A but does not directly affect either B or C, so it is a ‘barren proxy’ for A.Google completions and sexism
Fri, 18 Oct 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/10/18/google-completions-and-sexism/The new ads produced by UN Women illustrating widespread sexism using Google autocomplete are pretty chilling. The ads are convincing and what they imply is true, but I’m less sure that they are actually good evidence for what they imply.
Typing whole phrases into Google is not how I or people I’ve watched usually search. I type key words. The only reason I would search for a phrase such as “Women should not speak in church” would be to find the source.Do you know where it's been?
Thu, 10 Oct 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/10/10/do-you-know-where-its-been/Again this week on the bus I passed the annoying Phoenix Organics delivery van that says “Don’t drink science, you don’t know where it’s been”. [Phoenix are also notable for their aspartame scare page.]
One of the things they don’t write up as glowingly is that they add the synthetic version of a natural antioxidant to their juices and their (naturally high-fructose) juice drinks, in unnaturally high concentrations. That’s perfectly legal by organic labelling rules, and the compound, ascorbic acid, is the same molecule as vitamin C – but we’re talking about an industry where ‘the same molecule’ doesn’t usually cut it as an excuse.Rock, paper, scissors, Wilcoxon test
Sun, 06 Oct 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/10/06/rock-paper-scissors-wilcoxon-test/Based on my nerdnite talk last week.
Transitivity is a basic property of orderings: if A is better than B and B is better than C, then A must be better than C. For example, if the All Blacks beat Tonga and Tonga beats Japan, we would expect the All Blacks to beat Japan.
Rock-paper-scissors is interesting because it is the opposite: if A beats B and B beats C then A must lose to C.Today we have shaming of prats
Sun, 06 Oct 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/10/06/today-we-have-shaming-of-prats/Background: Revenge, ego, and corruption in Wikipedia
Today we have shaming of prats. Yesterday, we had unclear sources, and tomorrow morning
we may have original research. But today, Today we have shaming of prats. The Flame Robin
is a small passerine bird native to Australia,
and today we have shaming of prats
This is the anonymous editor. And this
is the revenge edit, whose use you will see
when you are snubbed by an author.Auckland's top news story
Thu, 03 Oct 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/10/03/aucklands-top-news-story/Background: The Herald has had a whole sequence of stories, including two on the front page, about the decision of the city council to stop mowing the ‘berms’, the strips of grass between the road and sidewalk.
Additional background: the Mayor of Auckland is one Len Brown.
O Lenny Boy, the berms, the berms need mowing
From Glen to Lynn, and down Mt Eden side
The winter’s gone, and all the grass is growingStatins and the causal Markov property
Mon, 23 Sep 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/09/23/statins-and-the-causal-markov-property/A real example: there is developing uncertainty that the statin class of cholesterol-lowering drugs really works by lowering LDL cholesterol. This is partly because other drugs (eg, ezetimibe) that lower LDL cholesterol don’t have the same impact on heart attacks, and also because statins seem to have beneficial effects on too many other conditions. In principle, you could control for achieved cholesterol levels and see if statin use was then conditionally independent of heart disease [adjust and see if the effect goes away].PBRF consultation response consultation
Sun, 22 Sep 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/09/22/pbrf-consultation-response-consultation/Friday week (October 4) is the deadline for providing input to the consultation on changes in the PBRF process (for foreigners: the national research evaluation program that allocates a chunk of long-term research funding to universities). Here’s the consultation document, if you haven’t read it yet.
This is what I’m planning to say. It’s also open for public feedback.
Background: I was a member of the PBRF MIST review panel, and also submitted a portfolio.An absolutely minimal way to increase invited speaker diversity
Fri, 13 Sep 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/09/13/an-absolutely-minimal-way-to-increase-invited-speaker-diversity/The low proportion of women among invited speakers at conferences has finally become an issue in biology and computing and science fiction (at least for the people in my Twitter feed).
You might worry, if you were running a conference, that having some sort of minimum standard for diversity might lead to suboptimal speakers, or if you had a bunch of small sessions, might be difficult to ensure in each session.What I said on StatsChat only shorter and with more swearing
Wed, 04 Sep 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/09/04/what-i-said-on-statschat-only-shorter-and-with-more-swearing/Stealing a Keith Ng title, to do a post motivated by his criticism of my StatsChat post as a ‘generous interpretation’ of Bill English.
English said that households with income below $110000 collectively paid no net income tax. This assumes that all benefits are paid solely from income tax, not GST, and even then has to lump together people who receive more in benefits than they pay in income taxes with a lot of people who pay much more in income tax than they receive in benefits.On the persistence of variation in horn size among Soay sheep
Fri, 23 Aug 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/08/23/on-the-persistence-of-variation-in-horn-size-among-soay-sheep/(BBC News)
The small-horned rams are fitter, but the big-horned rams are phatter
And though we deemed it sweeter
To dally with the latter,
The small horns still stay with us
To Scottish boffins’ wonder;
In flocks the nerds still pull the birds
When the jocks are six feet under
(After Peacock)A layperson's view of a science communication problem
Tue, 13 Aug 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/08/13/a-laypersons-view-of-a-science-communication-problem/There’s a story in one of the NZ papers saying that Fonterra and the government are completely wrong about the source of the botulism contamination in milk products and about how to fix it.
This is a field I know very little about, so it’s interesting to look at the story just from the point of view of an educated consumer.
There are some stylistic points that make the story look like it could be bogus: the claim that this one guy is right and everyone else is wrong, the reference to “sitting on material that will embarrass Fonterra further”, blaming the problem on glyphosate (evil Monsanto’s evil Roundup herbicide), the lack of any links or details for the research, and the lack of any independent scientific opinion.SPEED sessions at JSM 2013
Fri, 09 Aug 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/08/09/speed-sessions-at-jsm-2013/This year, the Joint Statistical Meetings introduced a combined poster/short presentation session. The sessions took up a half-day, twice as long as the typical session. In the first half, each presenter gave a 5-minute talk, and the second session was an electronic poster session. I signed up for one of these sessions primarily because I think any form of innovation at the Joint Statistical Meetings should be supported. For people who haven’t experienced the JSM, it’s the largest gathering of statisticians in the world, but it is also characterised by rigid and apparently inexplicable rules (eg, if the chair for your session doesn’t show, you are supposed to find a replacement who isn’t chairing any other session in the whole conference) and a significant minority of astoundingly awful talks.In defense of theory
Thu, 08 Aug 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/08/08/in-defense-of-theory/The statistical community, and even more so the statistical curriculum, hasn’t yet adapted fully to the improvements in computing over the past few decades, and so still gives too much priority to mathematical approaches and too little to computational approaches to many problems. That’s one reason for the popularity of the term ‘data science’, and for the mixed feelings about Nate Silver’s comment at his JSM address that ‘data scientist is just a sexed-up term for statistician’.
Some failure modes of statistics research talks
Sun, 04 Aug 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/08/04/some-failure-modes-of-statistics-research-talks/Written before #JSM2013 actually starts, so it’s not about your talk there.
Also, this is about deliberate choices by the presenter, and specifically about statistics research talks.
“The Overgeneralized Beta Distribution”. There is a place for new parametric distributions, but it’s a fairly small place and mostly occupied by distributions derived from underlying substantive knowledge.
“Asymptotics of an uninteresting estimator”. If there were a novel mathematical idea this would be fine, but otherwise we know its asymptotic behavior and roughly why it happens, and we can’t read your notation fast enough anyway.Graphs and counterfactuals
Mon, 15 Jul 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/07/15/graphs-and-counterfactuals/The two main ways of reasoning about cause and effect in statistics are causal graphs and counterfactuals.
With causal graphs, you write down variables and draw arrows representing direct effects of one variable on another, and then work with a set of axioms that summarise what it means for one variable to affect another. With counterfactuals, you talk about the effect of a variable in terms of the difference between the actual outcome with the variable set one way and the ‘potential outcome’ if it had been set another way.Welfare as an addictive drug
Mon, 15 Jul 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/07/15/welfare-as-an-addictive-drug/From the NZ Herald today: Doctors have been told that putting patients on welfare is akin to putting them on “an addictive debilitating drug … not dissimilar to smoking”.
Smoking is a really, really bad analogy here, since doctors would absolutely never recommend a patient starts smoking. It’s hard to imagine how someone with a medical degree could come up with that analogy. Welfare is hard to get off and probably bad for your health, but a better comparison would be something like sleeping pills or opioid analgesics: drugs that are risky, potentially dependence-inducing, and should be taken for the shortest possible time period, but that are absolutely medically necessary at times.Big data linear models
Mon, 08 Jul 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/07/08/big-data-linear-models/The biglm package for R currently uses incremental QR decomposition, which fits linear models to big data in linear time and bounded memory, but doesn’t parallelize.
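To make the incremental idea concrete, here is a minimal pure-Python sketch of updating a QR decomposition one row at a time with Givens rotations. It is not biglm's actual code, and the function names and toy data are assumptions for illustration. Only the p × p triangular factor R and the length-p vector z = QᵀY are kept, so memory use does not grow with the number of rows.

```python
import math

def update_qr(R, z, chunk):
    """Fold a chunk of (x_row, y) pairs into upper-triangular R and z = Q^T y, in place."""
    p = len(z)
    for x_row, y in chunk:
        x = list(x_row)                       # working copy of the new row
        for j in range(p):
            if x[j] == 0.0:
                continue                      # nothing to eliminate in this column
            r = math.hypot(R[j][j], x[j])
            c, s = R[j][j] / r, x[j] / r      # Givens rotation that zeroes x[j]
            for k in range(j, p):
                R[j][k], x[k] = c * R[j][k] + s * x[k], -s * R[j][k] + c * x[k]
            z[j], y = c * z[j] + s * y, -s * z[j] + c * y
    return R, z

def solve_upper(R, z):
    """Back-substitution for R beta = z."""
    p = len(z)
    beta = [0.0] * p
    for j in reversed(range(p)):
        acc = z[j] - sum(R[j][k] * beta[k] for k in range(j + 1, p))
        beta[j] = acc / R[j][j]
    return beta

# Toy data: y = 1 + 2x exactly, with an intercept column, fed in chunks of three rows
rows = [([1.0, float(i)], 1.0 + 2.0 * i) for i in range(10)]
R = [[0.0, 0.0], [0.0, 0.0]]
z = [0.0, 0.0]
for start in range(0, len(rows), 3):
    update_qr(R, z, rows[start:start + 3])    # each chunk can be discarded afterwards
beta = solve_upper(R, z)
print(beta)
```

Each row costs O(p²) work, which is where the linear time in the number of observations comes from.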
It turns out that parallel computation is easy (and has been studied by Dongarra and the LAPACK folks). If you have two data chunks reduced to \(R_1\) and \(Q_1^TY_1\), and \(R_2\) and \(Q_2^TY_2\), just treat each \(R\) as an \(X\) and each \(Q^TY\) as a \(Y\) to merge the QR decompositions.Sparse linear systems and calibration of weights
Mon, 08 Jul 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/07/08/sparse-linear-systems-and-calibration-of-weights/Diego Zardetto (Italian national stats agency) wants to be able to calibrate sampling weights to population totals for regions. This leads to a very large number of calibration variables and solving large linear systems.
Using the Matrix package in R, we can compute sparse QR decompositions instead of the dense ones used in the survey package. Alternatively, using block-diagonal sparse matrices from the bdsmatrix package we can represent the linear system as a set of separate systems for each region.Problems with faithfulness and the causal Markov property (II)
Sat, 06 Jul 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/07/06/problems-with-faithfulness-and-the-causal-markov-property-ii/This one I got from reading Nancy Cartwright’s Hunting Causes and Using Them, though it isn’t exactly the point she’s making. It’s also related to points made by Hofstadter, Dennett, and others about reductionist reasoning. The idea of causal graphs is that you have variables and some prior knowledge of possible causal relationships between them – the prior knowledge could be as weak as ‘future cannot cause past’ or could incorporate a lot of domain-specific knowledge.
Problems with faithfulness and the causal Markov property (I)
Tue, 02 Jul 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/07/02/problems-with-faithfulness-and-the-causal-markov-property-i/The causal Markov property says that you can write down causal relationships between variables in a directed acyclic graph so that each variable is affected only by its parents in the graph. The faithfulness property says that the variables will have exactly the conditional independence properties required by the graph.
The first problem with these properties is measurement error. If the only causal relations are that A affects B and C, then B and C are conditionally independent given A.Upcoming talks and stuff
Sun, 30 Jun 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/06/30/upcoming-talks-and-stuff/Two modules (on intermediate and advanced R) with Ken Rice at the Seattle Summer Institute in Statistical Genetics
Joint Statistical Meetings: analyzing large data with SQL generated by R, with Hannes Muhleisen. MonetDB is a database optimised for analysis tasks, and controlling it from R gives more flexibility and programmability.
Two simple notes on error in regression models
Fri, 28 Jun 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/06/28/two-simple-notes-on-error-in-regression-models/In regression, we often talk about the difference between the population line and the observations as “errors.” In some introductory texts these are even called “measurement errors” in Y. Sometimes they are errors in Y, and sometimes they are even measurement errors in Y, but much more often Y is the truth and the ‘error’ is the error in predicting Y by a straight line. As Dan Davies observed (from memory) “The Great Depression really happened; it wasn’t just an unusually inaccurate observation of an underlying 4% return on equities”When is Bayesian introductory statistics better?
Thu, 27 Jun 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/06/27/when-is-bayesian-introductory-statistics-better/For the sort of statistics taught in introductory courses, competent Bayesian and frequentist analysis are going to agree – point and interval estimates will be similar, and similar conclusions will be drawn. Computation isn’t seriously hard for either approach, though prepackaged pointy-clicky software is more available for frequentist inference. There are going to be pedagogical differences. The big one, in favour of Bayesian statistics, is not having to explain p-values. Another benefit of the Bayesian approach is that it gives you a good reason not to talk about rank tests.My Setup
Sat, 08 Jun 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/06/08/my-setup/For UsesThis.com, prompted by Luis Apiolaza.
Who are you, and what do you do?
I’m a statistics professor at the University of Auckland. I teach, do research in statistics and in epidemiology, and contribute to R.
What hardware do you use?
I’ve been using Mac laptops since 2001. I currently have an aluminium MacBook from 2009 and am waiting on delivery of an 11in MacBook Air. They are nicely solid, have reasonable keyboards, run Unix, and support the Microsoft software that my collaborators use.Hello World
Fri, 07 Jun 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/06/07/hello-world/Hello, world This will be my not-StatsChat blog, for things that are too technical, too political, or simply not relevant to StatsChat.Talks in the near future
Fri, 07 Jun 2013 00:00:00 +0000http://notstatschat.netlify.com/2013/06/07/talks-in-the-near-future/“Filtering for rare variant tests” CHARGE consortium workshop, Rotterdam. Basic message: it’s total count of rare alleles that matters.
“R: an environment for statistical computing and graphics”. CWI, Amsterdam. Talking about the history, design, and applications of R to a scary computer-science audience.
“Testing rare DNA variants in unrelated individuals: experience from the CHARGE consortium”, IARC, Lyon. On unidirectional and omnidirectional ‘burden of mutation’ tests using rare DNA variants.
“Analysing very large surveys with SQL generated from R” ITACOSM conference, Milan.Lorem Ipsum
