We have an existing course at third-year undergraduate level that reprises the second-year regression course using SAS and also does a bit more SAS for data management. We’re going to get rid of it, and this is an early draft of my plan for its replacement (probably in 2022).
The original working title for the replacement was “Ecumenical Data Wrangling”: that is, data management taught in a form that isn’t specific to SAS or to R (or, Lorde help us, Excel). I want to keep SAS, partly because employers here are still interested in it, but also because I want one of the basic ideas to be data models and how they differ between packages.
Other important ideas are reproducible workflow (and its higher-level cousins such as data governance) and the experience that you can’t trust the data just because the computer says so.
The prerequisites will likely be our second-year course in linear regression (with a little glm at the end), and our basic stat computing course, which was based around Paul’s book but is now being revised by Earo Wang. It would be nice not to have a computing prerequisite, but I don’t think that’s feasible at undergraduate level.
STATS 361: Principles of Data Management
This course provides a structured introduction to the science and craft of data management. We will start by considering data representations and their advantages and disadvantages, then cover combining and splitting data sets, data cleaning, the creation of non-trivial summary variables, and the handling of missing data. These principles will be illustrated by data sets of varying size and complexity, and students will implement data-processing steps in at least two of R, SAS, and (hopefully) Python.
Data Models
Behind every wildly successful tool there’s probably a very powerful abstraction. — @JennyBryan tweet
Rather than thinking about file formats and extensions, we should think about the data models that are used by particular tools. The most familiar one in science and commerce is the spreadsheet; in previous statistics courses you will have seen the database table or data frame. Using a tool that doesn’t match your data model will make your life less pleasant – and data model is a separate choice from file format, since .csv files are used for spreadsheets, data tables, and matrices.
- Spreadsheet
- Data table (data frame, pandas DataFrame, database table)
- tidy data table
- Matrix
- List (R lists, JSON, XML)
Data tables add structure to spreadsheets and reduce flexibility: one type per column, all columns the same length, invariant to row reordering. The basic SAS dataset structure has the same constraints on columns, but is strongly row-oriented and row-ordered (at least in the DATA STEP).
Matrices add different structure (not invariant to reordering, all numeric).
Lists have only the recursive structure but may implicitly be or contain data tables. Auckland Transport’s bus location feed is a good example.
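As a concrete sketch of how the same content behaves under different data models, here is a small base-R example: the same .csv text read as a data frame, coerced to a matrix, and wrapped in a list alongside some metadata (roughly the shape a parsed JSON feed arrives in). The bird measurements are invented.

```r
## A sketch (base R only) of the same content under different data models;
## the bird measurements are invented.
csv_text <- "name,height,injured
tui,0.30,FALSE
kea,0.48,TRUE
kiwi,0.45,FALSE"

birds_df <- read.csv(text = csv_text)  # data frame: one type per column
str(birds_df)                          # character, numeric, logical columns

birds_mat <- as.matrix(birds_df)       # matrix: a single type for everything,
str(birds_mat)                         # so every value gets coerced to character

## a list has only recursive structure: it can hold a data frame plus metadata,
## which is roughly how a parsed JSON feed (e.g. bus locations) arrives
birds_list <- list(
  retrieved = Sys.time(),
  source    = "invented example, not a real feed",
  records   = birds_df
)
str(birds_list, max.level = 1)
```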
Special cases: time series, spatial data, sparse matrices
Reading:
- Broman & Woo: ‘Data Organization in Spreadsheets’, TAS, doi:10.1080/00031305.2017.1375989
- Wickham: ‘Tidy Data’ JSS, doi:10.18637/jss.v059.i10
Workflow principles
Your closest collaborator is you six months ago, but you don’t reply to emails. — @gonuke, after @kcranstn tweet quoting @mtholder
Data management provides an opportunity to irreversibly mangle valuable data, or to introduce subtle inconsistencies, or just to set future-you up for days of unnecessary work.
- changes to raw data should be trackable, reversible, and documented: who made them and why (see the sketch after this list)
- never touch the raw data in analysis
- save code to reproduce all analyses
- save intermediate results if they are slow to recreate, otherwise just recreate them
- construct tables and graphs computationally, not by hand
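A minimal sketch of what these principles might look like in a single R script. The directory layout (data-raw/, data-derived/, output/), the file names, and the expensive_cleaning() step are invented placeholders, not a prescribed structure.

```r
## Raw data is read but never overwritten; corrections live in code.
raw <- read.csv("data-raw/survey_2021.csv")

expensive_cleaning <- function(d) {
  # a stand-in for a slow cleaning step: heights entered in cm by mistake
  in_cm <- !is.na(d$height) & d$height > 3
  d$height[in_cm] <- d$height[in_cm] / 100
  d
}

## save intermediate results only because they are slow to recreate
cache_file <- "data-derived/clean_survey.rds"
if (file.exists(cache_file)) {
  clean <- readRDS(cache_file)
} else {
  clean <- expensive_cleaning(raw)
  saveRDS(clean, cache_file)
}

## tables and graphs are constructed computationally, not edited by hand
write.csv(as.data.frame(table(clean$region)),
          "output/region_counts.csv", row.names = FALSE)
```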
Ethics and regulation: confidentiality, data security, data governance, data sovereignty
Some end-to-end examples of high-level workflow: StatsNZ? a clinical trial? industry?
Bulk data transformations
and I’m still pretty sure some of the data is missing, but it could still be here, in this ONE HUNDRED SHEET excel file — @RallidaeRule tweet
Data table operations (see the sketch after this list):
- join: combining data from multiple tables (inner, outer, left, anti)
- filter: subset of rows
- select: subset of columns
- pivot: longer vs wider (and relationship with normalisation/denormalisation)
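A sketch of these operations using dplyr and tidyr; the two toy tables (visits, people) and their columns are invented for illustration.

```r
library(dplyr)
library(tidyr)

visits <- tibble(id = c(1, 1, 2, 3), visit = c(1, 2, 1, 1), sbp = c(128, 131, 115, 142))
people <- tibble(id = c(1, 2, 4), age = c(63, 58, 71))

inner_join(visits, people, by = "id")   # only ids present in both tables
left_join(visits, people, by = "id")    # all visits; age is NA where unknown
anti_join(visits, people, by = "id")    # visits with no matching person

visits %>%
  filter(visit == 1) %>%                # filter: subset of rows
  select(id, sbp)                       # select: subset of columns

## pivot: one row per id and visit (longer) vs one row per id (wider)
visits_wide <- pivot_wider(visits, names_from = visit, values_from = sbp,
                           names_prefix = "sbp_visit")
pivot_longer(visits_wide, starts_with("sbp_visit"),
             names_to = "visit", values_to = "sbp",
             names_prefix = "sbp_visit")   # visit comes back as character here
```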
Conversions between data models (see the sketch after this list), eg
- JSON to data table (Auckland Transport bus API)
- directory full of spreadsheets to data table.
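A sketch of the JSON-to-data-table conversion using jsonlite. The real Auckland Transport feed needs an API key, so this uses a small made-up snippet with a broadly similar shape: a list with some metadata plus an array of records.

```r
library(jsonlite)

json_text <- '{
  "retrieved": "2021-06-01T09:30:00",
  "vehicles": [
    {"route": "27H", "lat": -36.8485, "lon": 174.7633, "delay": 120},
    {"route": "NX1", "lat": -36.7987, "lon": 174.7470, "delay": null}
  ]
}'

feed <- fromJSON(json_text)   # a nested R list...
str(feed, max.level = 1)
feed$vehicles                 # ...whose array of records is simplified to a data frame
```

The directory-full-of-spreadsheets case is the same move in the other direction: read each file into a data frame and stack them, for example with lapply() plus do.call(rbind, ...) or purrr::map_dfr().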
Data cleaning: what else is wrong with your data?
Classroom data are like teddy bears; real data are like a grizzly with salmon blood dripping out its mouth. — @JennyBryan, via @sgrifter tweet
- Univariate summaries (see the sketch after this list)
- Bivariate summaries
- Plausible mistakes (eg wrong units, typos, OCR errors)
- Wrong variable or value labels
- Reasons for wrong data:
- data entry
- automatic conversions (eg Excel dates)
- wrong units
- errors in the measurement process
- lack of construct validity (ie, your measurement is what you asked for, but it doesn’t mean what you think it does)
- Real measurements incorrectly coded as invalid: eg the ozone hole
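A sketch of what a first pass might look like in R, on an invented data frame with a couple of deliberately planted errors.

```r
## invented data with an inconsistent value label and a wrong-units height
people <- data.frame(
  id     = 1:6,
  sex    = c("F", "M", "F", "f", "M", "M"),
  height = c(1.62, 1.80, 164, 1.55, 1.71, 1.68)
)

summary(people$height)                 # univariate summary: the maximum is implausible
table(people$sex)                      # tabulation catches the stray lower-case label
boxplot(height ~ sex, data = people)   # bivariate view: problems that only show up jointly

## correct in code, keeping the raw value, rather than editing the data file
people$height_m <- ifelse(people$height > 3, people$height / 100, people$height)
```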
Reading: Deborah Stone, ‘The Ethics of Counting’, James Madison Award Lecture
Constructing new variables
Of course someone has to write loops. It doesn’t have to be you. — @JennyBryan slides
- Row and column sums, means, medians
- Longitudinal data: operations on contiguous ordered rows, eg, indicator variable for last visit missed, average consumption over past three weeks
- Other subset operations defined by sets of rows (the split-analyse-recombine idiom in R and DATA STEP equivalents in SAS); see the sketch after this list
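A sketch of the longitudinal and split-analyse-recombine cases using dplyr; the diary data, column names, and numbers are all invented.

```r
library(dplyr)

diary <- tibble(
  id     = rep(1:2, each = 4),
  week   = rep(1:4, times = 2),
  drinks = c(3, 5, NA, 4, 0, 2, 1, 1)   # NA = a missed visit
)

diary %>%
  group_by(id) %>%                      # split by person, operate within person, recombine
  arrange(week, .by_group = TRUE) %>%
  mutate(
    missed            = is.na(drinks),
    last_visit_missed = lag(missed, default = FALSE),   # indicator from the previous row
    mean_past3 = (lag(drinks, 1) + lag(drinks, 2) + lag(drinks, 3)) / 3  # past three weeks
  ) %>%
  ungroup()
```

The rough SAS equivalent would be BY-group processing in the DATA STEP with RETAIN and LAG, which is exactly the sort of comparison the course is after.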
Part of the goal here is to get some evidence-based idea of what things, if any, really are easier in SAS than in R.
Text
- Splitting (tokenising), tabulating, summarising (see the sketch after this list)
- Very basic regular expressions
- ASCII, code pages, and Unicode (how to use te reo Māori and NZ placenames)
- Names of people and places
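A sketch of the basics in base R, using the obvious Hamlet line; the grepl() and nchar() calls hint at why regular expressions and encoding matter once macrons are involved.

```r
line <- "To be, or not to be, that is the question"

tokens <- strsplit(tolower(line), "[^a-z']+")[[1]]   # tokenise on non-letters
sort(table(tokens), decreasing = TRUE)               # tabulate word frequencies

grepl("^Ng", c("Ngāruawāhia", "Whangārei", "Ōtāhuhu"))   # a very basic regular expression

## encoding matters: characters vs bytes in UTF-8
nchar("Māori")
nchar("Māori", type = "bytes")
```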
Possible examples:
- Hamlet
- merging different NZ official data sets by name
Readings:
Missing data
- Deliberate vs haphazard missingness
- Item vs unit missing data
- Monotone vs non-monotone missing data
Obvious things to worry about:
- missing data codes (9, -9, -99) [Stata and SPSS can code reason for missing, R doesn’t have low-level support for this]
- censoring/limit of detection vs measurement
- NULL vs NA vs NaN: how do missing values sort? How do they propagate? (see the sketch at the end of this section)
- how is item missingness related to non-missing data? (exploratory graphs when data are missing: R naniar package)
- Ethics of measurement: can’t just give up on people who are harder to measure.
Special things to worry about:
- Country codes: NA is Namibia
- Unknown locations coded to centroid: Null Island; Wayne Dobson does not have your cell phone; Mapping kidnappings in Nigeria
- file truncation, especially at a power of two
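A sketch of how NA, NaN, and NULL behave in R, ending with the country-code trap; all values are invented.

```r
x <- c(2, NA, 5, NaN, 1)

mean(x)                  # missingness propagates through arithmetic
mean(x, na.rm = TRUE)    # unless you explicitly drop it
sort(x)                  # sort() silently drops NA and NaN by default
sort(x, na.last = TRUE)  # or puts them last if asked

is.na(x)                 # TRUE for both NA and NaN
is.nan(x)                # TRUE only for NaN
length(NULL)             # NULL is the absence of an object, not a missing value
c(1, NULL, 3)            # so it simply vanishes when combined

## and the country-code trap: "NA" is Namibia until read.csv gets hold of it
read.csv(text = "code\nNZ\nAU\nNA")$code
```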