3 min read

Factors as factors

With the long-awaited demise of stringsAsFactors=TRUE it’s now easier to use text strings in R. It’s good that strings don’t automatically get turned into factors at read time, but the price is that strings don’t automatically get turned into factors at read time: if you have variables that need to be factors, you have to turn them into factors yourself.

Factors are still a important data type in R. They are R’s enumerated type; a factor knows what its possible levels are. You can turn factors into indicator variables for categories, which is part of constructing tables and in regression models and graphics and other inferential and exploratory statistics. In particular, while a string variable can have one, or two, or 497 copies of “Alabama”, only a factor variable can usefully have zero copies. Only with a factor variable, can you distinguish “this is not a level” from “this is a level but we don’t have any”.

If you attempt to perform factor-based operations on a string variable, R will treat the variable like a factor (and may even run factor or as.factor on it). Since the string variable doesn’t know what levels it has, R will assume that only the levels with at least one copy exist; there are no levels with zero copies. If you don’t have anyone from Puerto Rico in your data vector, Puerto Rico is no longer in America.

Not surprisingly, if you just pass strings into functions that expect factors, you will sometimes get the wrong answer. The most likely reason to get the wrong answer is that your string vector gets split up into subvectors, and these subvectors (which obviously have the same set of valid levels) have different sets of values present. When these subvectors are treated as factors, they end up with different sets of levels and so are incompatible types.

How can you make sure your variable end up with the right factor type? There are two approaches. First, convert to a factor as soon as possible, so all the original levels are still present. There are two complications here: a particular data set might never have had all the levels, or you might have spurious string values that aren’t levels and need data cleaning. It’s still probably the best approach for most cases.

The other approach is to be explicit about levels. Create a vector containing (one copy of) all the levels, and use factor(my_string, levels=all_the_levels). The advantage here is that you can get the full set of levels from some canonical source, if there is one. The disadvantage is that it may be clumsy if there isn’t a canonical source.